TL;DR
ElasticSearch backup (snapshot / restore) on AWS S3
Steps / configurations for ES snapshot / restore
Use elastic curator to manage snapshots (create / remove): davidlu1001/docker-curator
Overview
The purpose of this blog is to investigate possible solutions to back up and restore ES indices, so that in the event of a failure the cluster data can be quickly restored and the business impact minimized.
Backup content
Generally the backup content would include:
Cluster data
State configuration: includes cluster / index / shard settings
```
# index settings
```
P.S.
Transient settings are not considered for backup.
These cluster settings can also be captured in a data backup snapshot by specifying the `include_global_state: true` (default) parameter for the snapshot API.
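The settings block above was truncated; as a sketch, the state configuration can be inspected with the settings APIs before backing it up (a local cluster on port 9200 and the index name `service-a` are assumptions):

```
# persistent / transient cluster settings
curl localhost:9200/_cluster/settings?pretty

# index settings (replace `service-a` with the real index name)
curl localhost:9200/service-a/_settings?pretty
```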
Snapshot Repository
In order to enable backup for ES cluster data, a snapshot repository must be registered before performing snapshot and restore operations.
Snapshots can be stored in:
Local shared filesystem: NFS
Remote repositories: backed by the cloud (AWS / Azure / GCP) or by distributed file systems (HDFS) with repository plugins
If we used a shared filesystem (NFS), additional setup / configuration would be required, and the cluster would need a rolling restart to take effect, since all nodes must have access to the shared storage to store the snapshot data. So from the perspective of complexity and operational safety, we won't consider it (even though the cluster only uses a little disk space).
Luckily we've already got the repository-s3 plugin installed on all nodes in the ES cluster, so a rolling restart is not needed when using S3 as the snapshot repository:
```
$ ES_PATH_CONF=/etc/elasticsearch /usr/share/elasticsearch/bin/elasticsearch-plugin list
```
So we'd prefer to use AWS S3 as a remote snapshot repository, with the existing S3 repository plugin.
Backup retention strategy
Based on backup granularity and retention time, we can choose one of the following strategies depending on the scenario:

- Take daily snapshots (during the hour we specify) and retain up to 14 of them, for 30 days.
- Take hourly snapshots and retain up to 336 of them, for 14 days.

We can carefully start with daily backups (the initial backup may take a relatively long time) and keep monitoring, then try hourly backups if feasible, to provide more granular recovery points.
Incremental snapshot mechanism
Deleting a snapshot only removes the data that was referenced exclusively by that snapshot, and leaves the data still required by other snapshots in the repository.
So it is safe to delete the first initial snapshot without affecting the later snapshots.
Ref: https://www.elastic.co/blog/found-elasticsearch-snapshot-and-restore
Backup cluster’s data
We'll take a snapshot per index instead of backing up the whole cluster, which brings flexibility to the restore step.
The snapshots are taken incrementally. This enables us to take frequent snapshots with minimal overhead (the initial backup may take relatively longer, depending on the amount of data); the more frequently we take snapshots, the less time each backup takes to complete.
The snapshot process is executed in a non-blocking fashion: all indexing and search operations can continue to run against the index that is being snapshotted.
Snapshot Prerequisites
- Check the repository-s3 plugin is installed:

```
$ ES_PATH_CONF=/etc/elasticsearch /usr/share/elasticsearch/bin/elasticsearch-plugin list
```
- Create an S3 bucket to store the snapshots (e.g. `es-backup`)
- Check the current IAM instance profile role on the ES cluster
- Update the existing IAM role and set up proper S3 permissions
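The policy JSON did not survive extraction; a minimal sketch of the S3 permissions the repository-s3 plugin needs (the bucket name `es-backup` is an assumption):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Resource": ["arn:aws:s3:::es-backup"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": ["arn:aws:s3:::es-backup/*"]
    }
  ]
}
```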
- The role permission needs to be updated for both ES master and data nodes

Note: otherwise a `repository_verification_exception` error will be returned when trying to register the snapshot repository.
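The error body was truncated; it looks roughly like this (illustrative only — the exact `reason` depends on the repository name, node, and path):

```
"root_cause" : [
  {
    "type" : "repository_verification_exception",
    "reason" : "[service-a] path is not accessible on master node"
  }
]
```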
- Register a snapshot repository per index, with a `base_path` (prefix) in the S3 bucket for the ES cluster (one-time operation)

```
# use `base_path` under S3 for `service-A`
```
- Register a snapshot repository for each index (with `base_path`: same S3 bucket, but different directory)

```
# e.g. for .elastichq
```
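The registration calls were truncated; a sketch of registering two repositories in the same bucket, separated by `base_path` (repository and bucket names are assumptions):

```
# use `base_path` under S3 for `service-A`
curl -X PUT "localhost:9200/_snapshot/service-a" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "es-backup",
    "base_path": "service-a"
  }
}'

# e.g. for .elastichq
curl -X PUT "localhost:9200/_snapshot/elastichq" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "es-backup",
    "base_path": "elastichq"
  }
}'
```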
- Check or verify the snapshot repository

```
curl localhost:9200/_snapshot?pretty
```
Backup operation
The snapshot lifecycle management (SLM) feature is introduced and natively supported in ES 7.5.0, so on lower versions we need to self-manage the snapshots and retention policy.

We can either use the existing curator tool (with a cronjob) on an EC2 instance (e.g. a gateway / master node), or consider using an ECS cronjob (scheduled task) to schedule the backup and avoid a single point of failure.

For curator version compatibility, it should be safe to choose the latest version, 5.8.3, which supports ES 6.x.
Backup operation with ES snapshot API:
```
# e.g. backup specified index
```
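The call body was truncated; a sketch of snapshotting one index into its own repository (snapshot and repository names are assumptions; `wait_for_completion=false` keeps the call non-blocking):

```
# e.g. backup the specified index into its per-index repository
curl -X PUT "localhost:9200/_snapshot/service-a/snapshot-20240101?wait_for_completion=false" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "service-a",
  "include_global_state": false
}'
```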
Useful Config for Snapshot / Restore Process
- `max_restore_bytes_per_sec`: throttles the per-node restore rate. Defaults to unlimited. Note that restores are also throttled through recovery settings.
- `max_snapshot_bytes_per_sec`: throttles the per-node snapshot rate. Defaults to `40mb` per second.
- `chunk_size`: big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example: 1GB, 10MB, 5KB, 500B. Defaults to `1GB`.
- `buffer_size`: the minimum threshold below which a chunk is uploaded in a single request. Beyond this threshold, the S3 repository uses the AWS Multipart Upload API to split the chunk into parts of `buffer_size` length and upload each part in its own request. Should be between 5mb and 5gb. Defaults to the minimum of 100mb and 5% of the heap size.
Restore cluster’s data
Restore Prerequisites
Confirm the target index has the same number of shards as the index in the snapshot
Close the target index that needs to be restored

By default, the cluster state is not restored. To include the global cluster state, set `include_global_state` to `true` in the restore request body (only meaningful if the cluster state was included in the previous backup operation), but usually we don't want to back up the cluster state. An existing index can only be restored if it's closed and has the same number of shards as the index in the snapshot.
The restore operation automatically opens restored indices if they were closed, and creates new indices if they didn't exist in the cluster.
Restore operation
In order to restore the index from a snapshot:
- Update index settings to speed up the restore process:

```
# before restore
```
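The settings block was truncated; one commonly used speed-up is to temporarily raise the recovery throttle via a transient cluster setting (the setting choice and value here are assumptions, not necessarily what the original post used):

```
# before restore: temporarily raise the per-node recovery rate
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "indices.recovery.max_bytes_per_sec": "200mb" }
}'
```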
- Restore the snapshot from S3 (and set the index settings back at the same time)

```
# e.g. restore from specific snapshot service-A
```
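A sketch of the restore call, overriding index settings back to normal values and skipping aliases (snapshot/repository names and setting values are assumptions):

```
# e.g. restore from a specific snapshot of service-A
curl -X POST "localhost:9200/_snapshot/service-a/snapshot-20240101/_restore" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "service-a",
  "include_aliases": false,
  "index_settings": {
    "index.number_of_replicas": 1,
    "index.refresh_interval": "1s"
  }
}'
```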
P.S.

- Use the `index_settings` parameter to override index settings during the restore process
- Set `include_aliases` to `false` to prevent aliases from being restored together with the associated indices
Curator Usage
use curator_cli
Examples:
Here the variable `${DRY_RUN}` can be either `--dry-run` (for dry-run mode) or `""` (empty, meaning without dry-run).
- create snapshot

```
# create snapshot for index(es) in the same repo
```
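The command body was truncated; a sketch of the create call (the environment variable values are assumptions; `--filter_list` selects the indices to snapshot by prefix, and the `name` supports strftime placeholders):

```
# create snapshot for index(es) in the same repo
/usr/local/bin/curator_cli \
    ${DRY_RUN} \
    --host "${ELASTICSEARCH_HOST}" \
    --port 9200 \
    snapshot \
    --repository "${REPO_NAME}" \
    --name "snapshot-%Y%m%d%H%M%S" \
    --filter_list "[{\"filtertype\":\"pattern\",\"kind\":\"prefix\",\"value\":\"${INDEX_PREFIX}\"}]"
```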
- remove snapshot

```
/usr/local/bin/curator_cli \
    ${DRY_RUN} \
    --host "${ELASTICSEARCH_HOST}" \
    --port 9200 \
    delete_snapshots \
    --repository "${REPO_NAME}" \
    --ignore_empty_list \
    --filter_list "[{\"filtertype\":\"age\",\"source\":\"creation_date\",\"direction\":\"older\",\"unit\":\"${UNIT}\",\"unit_count\":\"${UNIT_COUNT}\"},{\"filtertype\":\"pattern\",\"kind\":\"prefix\",\"value\":\"${INDEX_PREFIX}\"}]"
```

- restore snapshot
```
# close first
```
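The restore block was truncated; a sketch of closing the index first and then restoring the most recent matching snapshot (assuming the `restore` singleton action is available in curator_cli 5.8.3; names are assumptions):

```
# close first
curl -X POST "localhost:9200/${INDEX_PREFIX}/_close"

# then restore the most recent matching snapshot
/usr/local/bin/curator_cli \
    ${DRY_RUN} \
    --host "${ELASTICSEARCH_HOST}" \
    --port 9200 \
    restore \
    --repository "${REPO_NAME}" \
    --filter_list "[{\"filtertype\":\"pattern\",\"kind\":\"prefix\",\"value\":\"${INDEX_PREFIX}\"}]"
```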
use curator
Examples:
Here the variable `${DRY_RUN}` can be either `--dry-run` (for dry-run mode) or `""` (empty, meaning without dry-run).
```
/usr/local/bin/curator --config /etc/curator/config.yml "${DRY_RUN}" /etc/curator/actions.yml
```
actions.yml for snapshot:

```
# snapshot
```
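The action file was truncated; a sketch of a snapshot action in curator's action-file format (repository name and index prefix are assumptions):

```
# snapshot
actions:
  1:
    action: snapshot
    description: "Snapshot selected indices into the S3 repository"
    options:
      repository: service-a
      name: "snapshot-%Y%m%d%H%M%S"
      wait_for_completion: True
      ignore_empty_list: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: service-a
```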
actions.yml for restore:

```
actions:
```
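A sketch of a restore action (leaving `name` blank restores the most recent snapshot; the repository name and snapshot prefix are assumptions):

```
actions:
  1:
    action: restore
    description: "Restore the most recent successful snapshot"
    options:
      repository: service-a
      # leave `name` blank to use the most recent snapshot
      name:
      wait_for_completion: True
    filters:
      - filtertype: pattern
        kind: prefix
        value: snapshot-
      - filtertype: state
        state: SUCCESS
```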
docker-curator
The Docker image for ES Curator (to manage ES snapshots) can be found here: davidlu1001/docker-curator

The image keeps up to date with curator release 5.8.3, and is based on a minimal alpine image.
Features
- Upgrade curator to version `5.8.3`
- Add support for snapshot / restore (use `curator_cli` for the single-index scenario)
- Add support for snapshot / restore of ALL indexes for ES using `curator` with action rules. This would be useful when:
  - there are too many indexes to match with a `prefix / regex` pattern for `curator_cli`
  - a different snapshot repository is used per index
  - ALL indexes need to be restored in an accident-recovery scenario
- Add `DRY_RUN` mode
- Rewrite the Dockerfile and use `alpine` (with `python3`) to reduce image size
Usage
The image `ENTRYPOINT` is set to a customized script, so parameters need to be passed via `CMD`; overriding via `ENV` is also supported.

Default ENV value:

```
TYPE=snapshot
```
e.g.
```
# Snapshot single index with DRY_RUN mode, and delete snapshots older than 14 days
```
Pass ENV:

```
- ELASTICSEARCH_HOST: default is `elasticsearch`
```
e.g.
```
# Snapshot single index without DRY_RUN mode, and delete snapshots older than 1 minute
```
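The command bodies above were truncated; a hypothetical `docker run` invocation assembled from the ENV names used in this post (every value below, and the exact set of variables the image honors, is an assumption):

```
# Snapshot a single index and delete snapshots older than 14 days
docker run --rm \
    -e ELASTICSEARCH_HOST=es.internal \
    -e REPO_NAME=service-a \
    -e INDEX_PREFIX=service-a \
    -e TYPE=snapshot \
    -e UNIT=days \
    -e UNIT_COUNT=14 \
    -e DRY_RUN=--dry-run \
    davidlu1001/docker-curator
```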
References
Elastic - Backup Cluster
S3 repository plugin
Repository settings
ES - Snapshot and Restore (incremental snapshot mechanism)
Elastic Curator