another man's ramblings on code and tech

Elasticsearch Performance Tuning

I work as a performance investigation developer at my company, where we use ELK to monitor a 5-server system that keeps a constantly high traffic load on our Elasticsearch cluster. Near the beginning of my tenure I was constantly troubleshooting seemingly random node toppling and incredibly slow retrieval times for ES queries. This forced me to find optimizations that would speed up the cluster with almost no new hardware. My top optimizations were:

1. Always lock your memory

This setting will help increase the speed of query execution. It's a simple setting you can find in Elasticsearch's _elasticsearch.yml_ config file. In ES 5 it looks like _bootstrap.memory_lock: true_, but in older versions it was the _bootstrap.mlockall_ setting. This makes Elasticsearch lock its allocated space in RAM so that the OS won't swap it out or apply any of its other normal memory tricks to optimize usage.
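One way to confirm the lock actually took effect is to check whether every node reports `mlockall` as true. The sketch below assumes you have already fetched the JSON body of `GET _nodes?filter_path=**.mlockall` with whatever HTTP client you prefer, and that it has the `nodes.<id>.process.mlockall` shape. The function name and the sample response are made up for illustration.

```python
def nodes_without_memory_lock(nodes_response: dict) -> list:
    """Return the IDs of nodes that do not report mlockall as true."""
    unlocked = []
    for node_id, info in nodes_response.get("nodes", {}).items():
        # A missing "process" or "mlockall" key is treated as "not locked".
        if not info.get("process", {}).get("mlockall", False):
            unlocked.append(node_id)
    return sorted(unlocked)


# Hypothetical response body; the node IDs are invented for this example.
sample = {
    "nodes": {
        "node-a": {"process": {"mlockall": True}},
        "node-b": {"process": {"mlockall": False}},
    }
}

print(nodes_without_memory_lock(sample))  # ['node-b']
```

If a node shows up in that list, the usual suspect is that the user running Elasticsearch isn't allowed to lock memory on that machine.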

2. Make sure your heap memory size is set right

Your heap size is the amount of memory provided to Elasticsearch's JVM. Out of the box it is a fixed value (2 GB in the stock ES 5 _jvm.options_) rather than a share of your system's memory, but it can be controlled and is critical to handling larger loads. In Elasticsearch 5 you can find the file containing its settings at _/etc/elasticsearch/jvm.options_, on the lines containing "-Xmx" (maximum heap size) and "-Xms" (minimum heap size). Long story short, you want these values to be the same.

You also want them tuned to the roles of the machine you're on. I could write a whole post just on tuning heap memory sizes, but the most important factors are the total memory of the machine, the memory it needs when running under no load, and the worst-case memory usage of everything _other_ than Elasticsearch. For example, if you have an Elasticsearch node with 4 GB of memory whose only role is to act as a non-master node on a CentOS minimal install, I'd say you're safe allocating 3 GB to your heap and leaving 1 GB for everything else. However, if that same machine had to provide Logstash services, or pretty much any other role that could use a large amount of memory, I would adjust the heap size for ES based on the competing needs of the other roles.

A good rule of thumb is to set your heap size to half of the installed memory; just remember that this can go wrong if other services need more than the other half.
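The arithmetic behind that 4 GB example can be sketched in a few lines. This is only an illustration of the reasoning above, not a sizing tool: the function names are mine, and the "worst case for everything else" figure is something you have to measure on your own box.

```python
def suggest_heap_mb(total_mb: int, other_worst_case_mb: int) -> int:
    """Heap = total memory minus the worst-case usage of everything else."""
    if other_worst_case_mb >= total_mb:
        raise ValueError("nothing left over for the Elasticsearch heap")
    return total_mb - other_worst_case_mb


def jvm_heap_flags(heap_mb: int) -> list:
    """Render matching -Xms/-Xmx lines so min and max heap are the same."""
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


# The example from the post: 4 GB machine, 1 GB kept for everything else.
heap = suggest_heap_mb(total_mb=4096, other_worst_case_mb=1024)
print(heap)                  # 3072
print(jvm_heap_flags(heap))  # ['-Xms3072m', '-Xmx3072m']
```

The two rendered lines are what would go into _jvm.options_ in place of the existing "-Xms" and "-Xmx" lines.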

3. Increase Elasticsearch's queue sizes

This is another tuning option for increasing the total load your ES cluster can handle. Elasticsearch draws on thread pools to fulfill queries and requests. Each type of request has an associated thread pool queue which holds its unprocessed messages until a thread from the pool is assigned to each one. The types include _search_, _index_, _bulk_, and many, many more. Increasing a queue's size increases the load your instance will tolerate before toppling (assuming it could handle the load reasonably well in the first place). You can control these settings through simple YAML syntax at the bottom of your _elasticsearch.yml_ configuration file. For example:

        queue_size: 60000
        queue_size: 60000
        queue_size: 60000
        queue_size: 60000

This increases the queue size for each of my thread pools to 60K, meaning 60K messages can be waiting for processing on any of those request types before ES starts rejecting new requests and the originating query fails. If you keep the default settings, you may find that ES fails to complete queries that return large amounts of data.
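The snippet above shows only the `queue_size` lines; in the config file each one belongs to a named thread pool. As a rough sketch, assuming the ES 5 setting name `thread_pool.<type>.queue_size` (older versions spelled the prefix `threadpool`) and using only the three pool types named earlier, the lines could be generated like this. The helper name is hypothetical, and the exact set of pools I tuned may differ.

```python
def queue_size_settings(pool_types, queue_size):
    """Render one flat elasticsearch.yml line per thread pool type."""
    return [f"thread_pool.{name}.queue_size: {queue_size}" for name in pool_types]


# Assumption: only the pool types the post mentions by name.
for line in queue_size_settings(["search", "index", "bulk"], 60000):
    print(line)
# thread_pool.search.queue_size: 60000
# thread_pool.index.queue_size: 60000
# thread_pool.bulk.queue_size: 60000
```

Pasting the printed lines at the bottom of _elasticsearch.yml_ and restarting the node would apply them.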

4. Increase your ephemeral port range

Elasticsearch depends on ports to send requests; many of them, in fact. The other major limiter for your completion of an Elasticsearch query, other than your queue sizes, would be the range of ports your computer has. The more ports available the more requests Elasticsearch can do concurrently (of course, memory, storage space, and CPU influence this as well). It also provides more ports to your computer in general, ensuring that Elasticsearch doesn't starve other services or programs.

In Linux this is pretty easy to achieve with a few commands, although I have no idea if this is the case in Windows. My current setting for ephemeral ports on my CentOS boxes is _net.ipv4.ip_local_port_range = 15000 61000_, or a range of ports from 15000 to 61000.
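To see how many ports a given range actually buys you, here is a small sketch that parses the two-number format the kernel uses and counts the ports, treating both ends as inclusive. It reads the live value from `/proc/sys/net/ipv4/ip_local_port_range` when that file exists (Linux only); the function names are my own.

```python
from pathlib import Path

PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text: str) -> tuple:
    """Parse 'LOW HIGH' (space or tab separated) into a pair of ints."""
    low, high = (int(part) for part in text.split())
    return low, high


def port_count(low: int, high: int) -> int:
    """Number of ports in the range, counting both ends."""
    return high - low + 1


# The range from the post.
print(port_count(*parse_port_range("15000 61000")))  # 46001

# The current value on this machine, if we are on Linux.
if PORT_RANGE_FILE.exists():
    current = parse_port_range(PORT_RANGE_FILE.read_text())
    print(current, port_count(*current))
```

To apply the range itself, `sysctl -w net.ipv4.ip_local_port_range="15000 61000"` changes it until the next reboot, and adding the same setting to `/etc/sysctl.conf` followed by `sysctl -p` makes it stick.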

5. Reduce the _tcp_fin_timeout_ and _tcp_keepalive_time_ TCP settings

Finally, reducing the amount of time it takes for a port to close and refresh after a connection has completed can also increase your overall performance. I set both of these to zero on our internal cluster, and there seems to be little corruption or confusion due to any sort of port mangling.

And that's it! I hope this helps with your future Elasticsearch performance tuning.

Date: February 22nd at 1:14pm