See how TST Media developers keep Ngin running in top-notch shape by measuring changes against live production traffic.
As part of the Ngin Ruby 1.9.3 upgrade effort, the Infrastructure team at TST Media went through a rigorous process of load testing Ngin on Ruby 1.9.3. During this load testing we tweaked a variety of configurations with the end goal of maximizing our performance. Here we explore the following tweaks in depth:
As outlined in “Load Testing Production Traffic With Em-Proxy”, we used em-proxy to duplex our live production traffic so that the load testing results were fully representative of real traffic. By combining this approach with New Relic’s excellent toolset, we had tremendous insight into how changing the above variables affected performance. We fine-tuned our entire setup against production traffic without affecting the real production traffic or our users' experience. We found this technique to be extremely powerful and a gigantic step up from a standard load testing approach such as using a service like loadimpact.com or generating traffic with ApacheBench.
We approached this like a scientific experiment. We had several variables to tweak, many of which were possibly dependent on each other in some way. We followed two basic principles core to any scientific experiment:
As our “control” we used the same setup that we had in place for Ruby Enterprise Edition, Ruby 1.8.7 2011.12 (REE), which was the following:
After tweaking a single variable via Chef, we waited about 30 minutes to collect enough incoming data to work with. We then used New Relic as our measuring tool. Once again, because of em-proxy, we were measuring against live production traffic! We compared the performance change from each variable tweak against the previous configuration as well as against the real production traffic running on REE. We were careful to watch for legitimate changes in traffic that could skew our comparisons.
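As a rough illustration of the comparison step (New Relic did the actual measuring), the idea is simply to compare average response times between the control window and the window after a tweak. The sketch below uses made-up sample numbers, and `mean` and `response_time_delta` are hypothetical helper names, not part of any tool mentioned here.

```ruby
# Average of a list of response times in milliseconds.
def mean(samples)
  samples.inject(0.0) { |sum, s| sum + s } / samples.size
end

# Difference in average response time between two measurement windows.
# A negative result means the candidate configuration is faster.
def response_time_delta(control, candidate)
  mean(candidate) - mean(control)
end

# Hypothetical per-interval averages in ms; these are NOT real Ngin data.
control   = [410, 395, 402, 420]
candidate = [380, 372, 390, 385]

delta = response_time_delta(control, candidate)
puts "Change in average response time: #{delta.round(1)} ms"
```

In practice the two windows should cover comparable traffic, which is why we watched for legitimate traffic changes before trusting a comparison.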
Ruby 1.9 allows for configuring the following variables, much like REE: RUBY_GC_MALLOC_LIMIT, RUBY_HEAP_MIN_SLOTS, and RUBY_FREE_MIN. See Engine Yard's Tuning the Garbage Collector With Ruby 1.9.2 for more details. The most important of these is RUBY_GC_MALLOC_LIMIT, which most Ruby apps of any size can benefit from modifying. It represents the amount of memory, in bytes, that can be allocated for C data structures (via malloc) before a garbage collection run is triggered. It defaults to 8 million, which is rather low for most Ruby apps.
Previously, under REE, we had this set to 80 million, making garbage collection 10 times less frequent than with the default. We tried several different settings of this variable and settled on lowering it to 30 million. We found that garbage collecting more frequently than we did under REE was more performant. We were not surprised to be changing this variable, since the garbage collection algorithm in Ruby 1.9.3 was modified to use a lazy sweep approach.
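One simple way to see the effect of a setting like this is to count garbage collection runs around a fixed piece of work. The sketch below is only an illustration: `gc_runs_during` is a hypothetical helper, and the string-allocating loop stands in for real request handling. The variables are read when the interpreter starts, so RUBY_GC_MALLOC_LIMIT has to be set in the environment before Ruby launches.

```ruby
# Count how many GC runs happen while the given block executes.
def gc_runs_during
  before = GC.count
  yield
  GC.count - before
end

# Stand-in workload: allocate many short-lived strings.
runs = gc_runs_during { 200_000.times { "x" * 100 } }

limit = ENV.fetch("RUBY_GC_MALLOC_LIMIT", "interpreter default")
puts "RUBY_GC_MALLOC_LIMIT=#{limit}, GC runs during workload: #{runs}"
```

Running the same script under different values of the variable gives a rough feel for how the limit changes collection frequency, though only measurements against real traffic tell you which value is best.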
Next we wanted to understand how many Passenger Rails processes we could allow to run per server. Ngin is a very memory-intensive Rails app, as outlined in Upgrading to Ruby 1.9.3: An Ngin Platform Overview. Ngin was memory bound with REE. Running on EC2 High-CPU Extra Large instances with 7 GB of RAM and 8 cores, we were only able to run Passenger with 9 Rails processes, as each Rails process used 400-500 MB of RAM. With Ruby 1.9.3 the memory footprint was reduced by 45%, down to 250-275 MB of RAM!
We knew we could allow up to 18 Rails processes from a memory perspective. Ngin spends half of each request waiting on the database or other web services. Under REE, with 9 single-threaded Rails processes running on an 8-core machine, we were only utilizing 50% of the CPU! With the reduced memory footprint of 1.9.3 we were hoping we could run 16 Rails processes, 2 per core, before hitting a CPU bottleneck.
A common stress testing strategy is to use a service like loadimpact.com or set up a script using ApacheBench to create load on the servers. Instead, we had em-proxy sending over live production traffic. Since we couldn’t adjust the amount of live production traffic coming in, we simply reduced the number of app servers. We started with 11 app servers, which is how many we had running with REE. We had to drop all the way down to 3 app servers to stress the system to the point that all 18 Rails processes were actually running on each server and passenger-status showed that requests were consistently queuing.
At this point CPU utilization was at 100% and our average response time had increased by around 150 ms. It was clear we had reached a CPU bottleneck. With further testing in this manner, we narrowed in on 16 Rails processes as the limit. Any more and each request would begin to take longer due to waiting on CPU. Any fewer and we likely wouldn’t be fully utilizing the CPU. This means we could run with 6 app servers (6*16 = 96 total Rails processes), a huge reduction from the 11 app servers (11*9 = 99 total Rails processes) we were running with REE.
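The arithmetic behind these numbers can be written out as a small sketch. The saturation formula is a rule of thumb and an assumption on our part, not a substitute for the measurements above: if a single-threaded process spends a given fraction of each request waiting, roughly cores / (1 - wait fraction) processes are needed to keep every core busy. The helper names are hypothetical.

```ruby
# Rule-of-thumb estimate (an assumption, not a measured result): number of
# single-threaded processes needed to keep all cores busy when each process
# spends `wait_fraction` of a request waiting on the database or other services.
def processes_to_saturate_cpu(cores, wait_fraction)
  (cores / (1.0 - wait_fraction)).floor
end

# Total memory, in MB, that the Rails processes on one server would need.
def memory_needed_mb(processes, mb_per_process)
  processes * mb_per_process
end

cores         = 8
per_server    = processes_to_saturate_cpu(cores, 0.5) # half of each request is spent waiting
ram_needed    = memory_needed_mb(per_server, 275)     # upper end of the 250-275 MB range
ree_fleet     = 11 * 9                                 # 11 servers, 9 processes each
ruby193_fleet = 6 * per_server                         # 6 servers at the new limit

puts "Processes per server: #{per_server}"
puts "Memory needed per server: #{ram_needed} MB of 7 GB"
puts "Fleet-wide Rails processes: REE #{ree_fleet}, Ruby 1.9.3 #{ruby193_fleet}"
```

The estimate happens to agree with the 16 processes we arrived at by testing, but it was the load test, not the formula, that confirmed the limit.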
One thing that became apparent as soon as we had dropped down to 6 app servers was that we needed to increase the number of Nginx workers. We were following the recommended best practice of running one Nginx worker per core, which in this case was 8 Nginx workers to match the 8 cores on our High-CPU Extra Large EC2 instances. There are several places that recommend this, such as: http://blog.martinfjordvald.com/2011/04/optimizing-nginx-for-high-traffic-loads
We were, however, seeing a large amount of request queuing time, about 80 ms. See our previous article, Rails Middleware Timing with Rack Timer, where we show how we discovered that the majority of our request queuing time was due to SSL traffic. Since SSL traffic was not being proxied through em-proxy, we expected to see little or no request queuing time. When the request queuing time increased upon dropping down to 6 servers while the Passenger Rails processes were not queuing, we became suspicious that the Nginx workers were the culprit. We increased the Nginx workers from 8 to 24, or 3 per core, and this dropped the 80 ms of request queuing time down to 3 ms!
Average response time with 8 Nginx workers (1 per core)
Average response time with 24 Nginx workers (3 per core), showing an 80 ms improvement by reducing the request queuing time.
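The worker change itself comes down to one nginx.conf setting, the worker_processes directive. As a minimal sketch, the helper below (a hypothetical name, not part of our Chef setup) renders that directive for a given worker-to-core ratio.

```ruby
# Render the nginx.conf worker_processes directive for a worker-to-core ratio.
def worker_processes_directive(cores, workers_per_core)
  "worker_processes #{cores * workers_per_core};"
end

puts worker_processes_directive(8, 1) # the common one-worker-per-core advice
puts worker_processes_directive(8, 3) # the ratio that worked for Ngin
```

The right ratio depends on what the workers spend their time doing, so treat 3 per core as a starting point to measure, not a recommendation.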
So what is the lesson learned here? Don't believe anything until you have measured it against your application. Avoid putting trust in general best practice advice, benchmarks, or something you read in a random blog post like this one. Measure everything against your own application, as each application’s needs differ. Measure, measure, measure.
The upgrade to Ruby 1.9.3 went smoothly without downtime. Following the upgrade we had very few surprises, as we knew exactly what to expect. Still, we were a bit disappointed with the performance improvements, which were slight to the point of not being statistically significant.
Over the next few days following the upgrade we made two important changes that helped considerably. The first was turning garbage collection profiling off. We had been running with this on for quite some time as it did not noticeably impact performance with REE. It was nice to have New Relic graphs show the time spent in garbage collection. Turning it off with Ruby 1.9.3 gave us a noticeable performance boost of approximately 100 ms on average per request.
The vertical line represents the deploy where GC profiling was turned off, which gave us roughly 100 ms improvement.
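Assuming the GC profiling in question is Ruby's built-in GC::Profiler (the details of our New Relic setup are not shown here), the cost of garbage collection for a piece of work can be checked directly, and the profiler switched off again afterwards. This is a minimal sketch; `gc_time_for` is a hypothetical helper and the loop is a stand-in workload.

```ruby
# Measure time spent in GC while the given block runs, using Ruby's built-in
# GC::Profiler, and make sure the profiler is switched off again afterwards.
def gc_time_for
  GC::Profiler.enable
  GC::Profiler.clear
  yield
  GC::Profiler.total_time # seconds spent in GC while profiling was enabled
ensure
  GC::Profiler.disable
end

# Stand-in workload: allocate many short-lived strings.
seconds = gc_time_for { 200_000.times { "x" * 100 } }

puts "Time in GC: #{(seconds * 1000).round(2)} ms"
puts "Profiler still enabled? #{GC::Profiler.enabled?}"
```

Enabling the profiler only around a sample of work, as above, keeps the GC timing data available without paying its overhead on every request.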
Next we ran across the kgio gem, which the dalli memcache client gem can take advantage of to see performance improvements of 20-30% on memcache calls! After installing kgio, we saw our Ruby time in New Relic (as opposed to the memcache time) decrease by 20 ms on average per request. It was a drop-in performance boost! If you are using dalli, be sure to install kgio.
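Installing kgio typically means adding it to the Gemfile next to dalli. A quick check that it is actually loadable in the app's environment might look like the sketch below. `kgio_available?` is a hypothetical helper, and whether dalli picks kgio up automatically depends on the dalli version, so check its documentation.

```ruby
# Returns true when the kgio gem can be loaded in this environment.
def kgio_available?
  require "kgio"
  true
rescue LoadError
  false
end

if kgio_available?
  puts "kgio is installed and loadable"
else
  puts "kgio not found; add `gem 'kgio'` to the Gemfile and run bundle install"
end
```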
With these final two tweaks in place we were seeing a final performance boost of 100-125 ms on average per request.
Average response time across a 3 hour span showing a 100-125 ms improvement. Ignore the Yesterday line below which was for a Sunday with reduced traffic, and compare the Today line (Ruby 1.9.3) to the Last Week line (REE).
As an aside, after the upgrade we noticed New Relic was reporting that the number of Memcache gets had doubled with Ruby 1.9.3. We suspected an unintentional application behavior change was responsible, but upon further investigation it turned out that not all of the Memcache gets were being reported with REE, which simply gave the illusion that our Memcache get calls had doubled.
We are quite happy with Ngin running on Ruby 1.9.3. We have reduced the number of servers Ngin needs significantly and increased performance modestly. Our final configuration is:
Each variable above is highly specific to Ngin. To maximize your application’s performance, it is critical to measure your configuration changes against production traffic. The combination of duplexing production traffic with em-proxy and measuring with New Relic is extremely powerful. We highly recommend it!