Social media site Pinterest uses a memcached-based cache-as-a-service platform to reduce application latency across the board, minimize its overall cloud cost footprint, and meet stringent availability goals at the scale of the site.
Pinterest’s memcached-based caching system is massive – over 5,000 Amazon Web Services EC2 Dedicated Instances, which span a variety of instance types optimized for compute, memory, and storage dimensions. As a whole, the fleet processes up to approximately 180 million requests per second and approximately 220 GB/s of network throughput on an active in-memory and on-disk dataset of approximately 460 TB, distributed among approximately 70 separate clusters.
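Some quick arithmetic puts those fleet-wide figures in per-instance terms. The sketch below assumes a perfectly even spread across instances, which real clusters will not have; the inputs are the approximate numbers quoted above.

```python
# Back-of-the-envelope per-instance averages from the fleet-wide figures.
# Assumes an even distribution across instances (an idealization).
instances = 5_000
requests_per_s = 180_000_000   # ~180M req/s fleet-wide
network_gb_per_s = 220         # ~220 GB/s fleet-wide
dataset_tb = 460               # ~460 TB active dataset

rps_per_instance = requests_per_s / instances             # 36,000 req/s
mb_per_s_per_instance = network_gb_per_s * 1_000 / instances  # ~44 MB/s
gb_per_instance = dataset_tb * 1_000 / instances          # ~92 GB

print(rps_per_instance, mb_per_s_per_instance, gb_per_instance)
```

Even under this idealized split, each instance averages tens of thousands of requests per second, which is why small per-instance optimizations compound across the fleet.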
With such a large fleet, any optimization, no matter how small, has a magnified effect. Since each memcached process runs as the exclusive workload on its respective VM, Pinterest engineers were able to tune the VM’s Linux kernel to prioritize CPU time for the distributed memory object cache. A single configuration change reduced latency by up to 40% and smoothed overall performance.
Additionally, by observing data patterns and dividing them into workload classes, Pinterest engineers dedicate EC2 instances to specific workloads, reducing costs and improving performance.
Kevin Lin, a Pinterest software engineer for Storage and Caching, discussed in a blog entry posted last week what the company has learned from running memcached at scale in production. Here are some of the highlights.
Scientific theory at work
In order to identify possible areas of optimization, Pinterest set up a controlled environment that produces structured, repeatable workloads for testing and evaluation. The following practices and metrics have enabled high-signal testing over the years with minimal impact on critical-path production traffic.
- Server-side metrics: request throughput, network throughput, resource utilization, and hardware-level statistics (NIC statistics such as per-queue packet counts and EC2 bandwidth allowance exhaustion, disk response times and in-flight I/O requests, etc.).
- Client-side metrics: cache request percentile latency, timeout and error rates, and per-server availability (SLIs), as well as higher-level application performance metrics such as RPC P99 response time.
- Synthetic load generation: useful for detecting performance improvements or regressions under peak load. Pinterest uses memtier_benchmark, an open source tool that generates synthetic load against a memcached cluster.
- Production shadow traffic: mirroring real production traffic for the purpose of evaluating system performance at scale. Pinterest uses Facebook’s mcrouter, an open source memcached protocol routing proxy deployed as a client-side sidecar throughout the Pinterest fleet.
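A central client-side metric in the list above is percentile latency. As a minimal illustration (not Pinterest’s actual tooling), the nearest-rank P99 of a batch of latency samples can be computed like this; the sample values are hypothetical:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    ordered = sorted(samples)
    # nearest-rank: ceil(pct/100 * n), converted to a 0-based index
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]

# 100 hypothetical samples: mostly ~1 ms, with a 25 ms tail
latencies_ms = [1.1, 0.9, 1.3, 25.0, 1.0] * 20
p99 = percentile(latencies_ms, 99)  # the tail dominates the P99
```

This is why P99/P999 tracking matters for a cache: a small fraction of slow requests is invisible in averages but shows up immediately in tail percentiles.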
The results are in
Pinterest has divided its diverse collection of high-level workloads into the workload classes below, each served by a dedicated pool of EC2 instance types chosen to allow vertical scaling for better cost effectiveness. While horizontal scaling is always available to alleviate bottlenecks as they arise, it is not the most cost-effective approach, so Pinterest was more interested in vertical scalability.
- Throughput (compute)
- Data volume (memory and/or disk capacity)
- Data bandwidth (network and compute)
- Latency requirement (compute)
The following workload profiles were created from the above classes:
- Moderate throughput, moderate data volume | r5
- High throughput, low data volume | c5
- High data volume, relaxed latency requirement | r5d
- Massive data volume, relaxed latency requirement | i3, i3en
When selecting EC2 instances and deciding which instance type each cluster should run on, the choice mainly came down to three criteria: CPU, memory size, and disk speed.
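The profile-to-instance pairings above can be captured as a simple lookup. This is an illustrative sketch, not Pinterest’s actual placement tooling; the profile names and helper function are invented for the example:

```python
# Toy mapping from workload profile to EC2 instance family, mirroring the
# pairings listed in the article. Keys are (throughput/data, constraint)
# labels invented for this sketch.
INSTANCE_BY_PROFILE = {
    ("moderate_throughput", "moderate_data"): "r5",
    ("high_throughput", "low_data"): "c5",
    ("high_data", "relaxed_latency"): "r5d",
    ("massive_data", "relaxed_latency"): "i3/i3en",
}

def pick_instance(dimension, constraint):
    """Return the instance family dedicated to a workload profile."""
    return INSTANCE_BY_PROFILE[(dimension, constraint)]
```

The point of the table is that each cluster lands on hardware sized for its dominant bottleneck (CPU, memory, or disk) rather than a one-size-fits-all instance type.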
Decomposing workloads into classes and dedicating specific EC2 instance types to each class helps reduce costs while improving I/O performance. In addition, adopting memcached’s extstore (external flash storage) improves storage efficiency while proportionally shrinking the cluster’s cloud footprint.
Instances are configured with Linux software RAID at the RAID0 level to combine multiple hardware block devices into a single logical drive for user-space consumption. Pinterest stripes reads and writes evenly across two disks, so RAID0 doubles the maximum theoretical I/O throughput, with a best-case halving of the effective disk response time, at the cost of a halved MTTF (mean time to failure).
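The first-order arithmetic behind that trade-off is straightforward. The disk throughput and MTTF figures below are made up for illustration; only the factor-of-two relationships come from the article:

```python
# First-order RAID0 arithmetic for two striped disks.
# Throughput doubles in the best case; MTTF roughly halves, since a
# failure of either disk loses the whole array.
disks = 2
per_disk_mb_s = 500              # hypothetical single-disk throughput
single_disk_mttf_h = 1_000_000   # hypothetical single-disk MTTF (hours)

array_mb_s = per_disk_mb_s * disks         # best-case: 1000 MB/s
array_mttf_h = single_disk_mttf_h / disks  # roughly 500,000 hours
```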
Pinterest makes it clear that this increased hardware performance for extstore, in exchange for an increased theoretical failure rate, is a very reasonable trade-off: the infrastructure runs on a public cloud, it self-heals, and mcrouter handles server changes immediately.
The memcached docs define “computational efficiency” as the additional rate of requests that can be serviced by a single instance for each one-percentage-point increase in the instance’s CPU utilization, without increasing request latency. By this definition, an improvement in computational efficiency is measurable as memcached serving a higher request rate at lower CPU utilization without changing its latency characteristics.
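That definition translates directly into a ratio. The helper and sample figures below are illustrative, not from the memcached docs or Pinterest:

```python
# Computational efficiency per the definition above: extra requests per
# second served per additional percentage point of CPU utilization,
# holding latency constant. Sample numbers are hypothetical.
def computational_efficiency(rps_before, cpu_before, rps_after, cpu_after):
    return (rps_after - rps_before) / (cpu_after - cpu_before)

# e.g. an instance goes from 50k req/s at 40% CPU to 80k req/s at 60% CPU
eff = computational_efficiency(50_000, 40.0, 80_000, 60.0)  # 1500 req/s per point
```

A tuning change improves computational efficiency when this ratio rises, i.e. each point of CPU buys more throughput at unchanged latency.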
Half of all caching workloads at Pinterest are compute-bound (bound only by request throughput). Pinterest’s goal was to reduce cluster sizes without compromising serving capacity.
With the majority of Pinterest workloads running on dedicated EC2 virtual machines, this opens up a unique opportunity to optimize at the hardware-software boundary, where past optimizations had focused on hardware changes alone.
Memcached is somewhat unique among Pinterest’s stateful data systems in that it is the exclusive core workload, with a static set of long-running worker threads, on every EC2 instance to which it is deployed. For this reason, Pinterest tuned process scheduling to instruct the kernel to prioritize CPU time for memcached, at the expense of deliberately withholding CPU time from other processes on the host, such as watchdog daemons.
This involves running memcached under a real-time scheduling policy, SCHED_FIFO, with high priority – instructing the kernel to effectively allow memcached to hog the CPU by pre-empting (essentially starving) all non-real-time processes whenever a memcached thread becomes runnable. Below is an example of invoking memcached under the SCHED_FIFO real-time scheduling policy:
$ sudo chrt --fifo <priority> memcached <args...>
This one-line change reduced client-side P99 latency by 10% to 40% after deployment to all compute-bound clusters and eliminated spurious P99 and P999 latency spikes across the board. It also raised the steady-state CPU usage ceiling by 20% without introducing latency regressions, and reduced the total fleet-wide cost footprint of memcached by 10%.
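The effect of a `chrt` invocation like the one above can also be inspected programmatically. The Linux-only sketch below reads the current process’s scheduling policy; the commented-out line shows the rough equivalent of `chrt --fifo`, guarded because setting SCHED_FIFO requires root privileges:

```python
# Linux-only sketch: inspect (and, with privileges, change) a process's
# scheduling policy from Python rather than via the chrt CLI.
import os

pid = os.getpid()
policy = os.sched_getscheduler(pid)
# Under the default policy this is SCHED_OTHER, not SCHED_FIFO
print(policy == os.SCHED_FIFO)

# Rough equivalent of `sudo chrt --fifo <priority> ...` for a running
# process; commented out because it requires root:
# os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(1))
```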