ElasticSearch can be a beast to manage. Knowing the most used endpoints by heart, whether during outages or simple maintenance, can be as challenging as it is time-consuming. Because of this, this How-To article will lay out some handy commands. In theory you could run them from any given node (data, client or master); however, I'd recommend running them from a master node.
SSH into any master node (pro-tip: master instances are the ones within the master autoscaling group).
Elasticsearch nodes can take one or more roles. Here we are using the following node types:
Master
The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node. Any master-eligible node (all nodes by default) may be elected to become the master node by the master election process.
Data
Data nodes hold the shards that contain the documents you have indexed. Data nodes handle data related operations like CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded.
Data nodes can be subdivided into warm and hot nodes when implementing a Hot-Warm-Cold Architecture for ES.
Client
Client nodes can only route requests, handle the search reduce phase, and distribute bulk indexing. Essentially, coordinating-only (a.k.a. client) nodes behave as smart load balancers.
Cluster health
To check the overall cluster health (green, yellow or red) and keep polling for status, just type cluster-status.
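The cluster-status alias is defined earlier in the setup and isn't reproduced here; assuming it simply polls the _cluster/health API, a minimal equivalent would be:

```bash
# poll the cluster health API every 5 seconds (rough equivalent of the cluster-status alias)
watch -n 5 'curl -s localhost:9200/_cluster/health?pretty'
```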
Elasticsearch logs live in /var/log/elasticsearch/<cluster-name>/<cluster-name>.log. See the alias tail-es above.
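tail-es is assumed here to be a thin wrapper around tail, something along the lines of:

```bash
# follow the cluster log on the local node (replace <cluster-name> with your cluster's name)
tail -f /var/log/elasticsearch/<cluster-name>/<cluster-name>.log
```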
Listing allocations
To list the allocations per host, just type cat-allocation. This will show things like:
shards per host
used disk
available disk
index size
total disk size
disk usage percent
This is an important metric. As a rule of thumb, once disk usage reaches 85% things start to go wild: ES will not allocate new shards to nodes once they have more than 85% of their disk used.
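Assuming cat-allocation wraps the _cat/allocation endpoint, the raw call would look roughly like this:

```bash
# disk usage and shard count per data node, with verbose column headers
curl -s 'localhost:9200/_cat/allocation?v'
```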
Sometimes shards can get corrupted or entirely lost. This will put the cluster in a RED/YELLOW state. If no other node has a copy of the unassigned shard and index, then the only option will be to delete them. To do so, run the alias del-shards-unassigned mentioned above or the command below:
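The exact alias definition isn't shown in this section; a hedged sketch of how one might find the unassigned shards and delete an unrecoverable index would be:

```bash
# list the shards that are currently UNASSIGNED, together with their index names
curl -s 'localhost:9200/_cat/shards?v' | grep UNASSIGNED

# delete a broken index once you are sure no node holds a copy of it
# (<index-name> is a placeholder, not a real index)
curl -XDELETE 'localhost:9200/<index-name>'
```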
To list the nodes participating in a cluster, just type cat-nodes. The node with a star (*) is the actual master within that cluster.
```
elastic-master:~$ cat-nodes
host           ip             heap.percent ram.percent load node.role master name
172.21.38.167  172.21.38.167  67           99          8.84 d         -      elastic-data-0b00da1daa6c3e055-production
172.21.5.57    172.21.5.57    22           99          4.27 d         -      elastic-data-096596c27d5ea41b8-production
172.21.39.60   172.21.39.60   48           91          0.00 -         m      elastic-master-033de9aa1c6f03b8a-production
172.21.22.28   172.21.22.28   74           99          6.41 d         -      elastic-data-015b67ccc6632775b-production
172.21.39.147  172.21.39.147  21           99          5.07 d         -      elastic-data-0a7381067cc43e698-production
172.21.36.186  172.21.36.186  25           98          0.06 -         -      elastic-client-0bdfaa8f29b51ee5c-production
172.21.6.250   172.21.6.250   27           87          0.02 -         -      elastic-client-0579bf60b8746b365-production
172.21.20.96   172.21.20.96   9            84          0.01 -         -      elastic-client-0a25e84d941f3446f-production
172.21.5.138   172.21.5.138   62           99          6.83 d         -      elastic-data-07aa318c82d04e61c-production
172.21.7.188   172.21.7.188   20           93          0.00 -         *      elastic-master-049a20b4645a557eb-production
172.21.21.160  172.21.21.160  17           99          7.49 d         -      elastic-data-0abf96b0106bcba98-production
172.21.5.32    172.21.5.32    66           99          6.33 d         -      elastic-data-05a9cd4cea5328486-production
172.21.22.253  172.21.22.253  38           85          0.00 -         m      elastic-master-0644394056fa488a8-production
172.21.21.209  172.21.21.209  19           99          5.47 d         -      elastic-data-02b5fb2c632eacc9f-production
172.21.39.249  172.21.39.249  61           99          4.11 d         -      elastic-data-09512dff9c057e62b-production
```
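Assuming cat-nodes wraps the _cat/nodes endpoint, the underlying call is roughly:

```bash
# nodes with heap, RAM, load and role columns; the elected master is marked with *
curl -s 'localhost:9200/_cat/nodes?v'
```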
Stuck nodes
We have come across an issue where one or more nodes get stuck, preventing the master from reallocating shards properly. This has happened multiple times. To find out which nodes are stuck, you can either search the logs for the exception ReceiveTimeoutTransportException or search in Kibana, if applicable.
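A quick way to do that from a shell, assuming the log path mentioned earlier:

```bash
# surface recent transport timeouts; the offending node names show up in the exception message
grep ReceiveTimeoutTransportException /var/log/elasticsearch/<cluster-name>/<cluster-name>.log | tail -n 20
```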
Decommissioning a node
Node decommission is something you do when you “empty” a node, a.k.a. force shard reallocation. To do that, just use one of the commands below, depending on whether you want to exclude a node by IP or by name. Using the name is the preferred way.
Both attributes, IP or name, can be retrieved from the cat-nodes alias mentioned above. You usually want to do this for one host at a time.
```bash
# exclude by IP
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "172.19.22.9"
  }
}'

# exclude by NAME
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.exclude._name" : "elastic-data-05a4f739b4b5fc4ab"
  }
}'
```
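Once the exclusion is set, the master starts moving shards off that node. One way to confirm it has drained, reusing the allocation endpoint from above, is to check that its shard count drops to zero:

```bash
# the excluded node should end up with 0 shards and (almost) no disk used by indices
curl -s 'localhost:9200/_cat/allocation?v' | grep elastic-data-05a4f739b4b5fc4ab
```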
More info can be found here: ElasticSearch - cluster-level shard allocation filtering.
```yaml
# enable first - in case the shards were disabled before for some reason.
# It's important because the next task will wait for the cluster to become green.
# If the shard allocation is disabled the cluster will stay yellow and the tasks
# will hang until it times out.
- name: Enable shard allocation for the {{ service }} cluster
  uri:
    url: http://localhost:{{ es_http_port }}/_cluster/settings
    method: PUT
    body_format: json
    body: "{{ es_enable_allocation }}"
```
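The es_enable_allocation variable is defined elsewhere in the playbook and isn't shown here. Assuming it simply re-enables routing, the equivalent manual call would be something like:

```bash
# re-enable shard allocation for all shard types
# (assumed to match what es_enable_allocation contains; adjust to your playbook)
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'
```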