Paper review: Scaling Memcache at Facebook

Overview

Memcached is a general-purpose in-memory key-value store. While most users can get by with its generic caching capabilities, some companies, e.g. Facebook (now Meta), need to push it to the limit. This is a review of the "Scaling Memcache at Facebook" paper from 2013 (PDF on USENIX).

UDP to the rescue

You've probably heard that there are two primary protocols for exchanging data between servers: TCP and UDP. TCP is connection-oriented: it re-orders out-of-order packets and retransmits lost ones, fixing the most common network errors, while UDP is purely best-effort. But how bad is "best-effort"?

Facebook wanted to know too, so they implemented the GET operation over UDP while SET and DELETE still went over TCP. The result: 0.25% of GET requests were lost, but the average latency dropped by a whopping 20%, while the 95th-percentile latency stayed almost the same. The bulk of the 20% gain comes from requests in the middle of the distribution. Simply put: the slowest requests got almost no benefit, but the average got faster. Still a good thing.

But you can't just lose 0.25% of requests without any remediation. Facebook addresses this in two ways. First, the client library surfaces out-of-order delivery and packet loss as client errors, so it is the client that decides whether to continue without the data or to re-query. Second, clients re-query most of the time in case of failure. This works well for Facebook: some pages make thousands of cache requests per page load, so a few retries are cheap.
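
Here is a minimal sketch of what such a client could look like, assuming an already-connected UDP socket and glossing over memcached's real UDP frame format; udp_get and get_with_retry are hypothetical helpers for illustration, not Facebook's actual client:

```python
import socket
from typing import Optional

class CacheError(Exception):
    """Raised when a UDP reply is lost, truncated, or out of order."""

def udp_get(sock: socket.socket, key: str, timeout: float = 0.05) -> bytes:
    """One best-effort GET over UDP; any anomaly surfaces as a CacheError."""
    sock.settimeout(timeout)
    sock.send(f"get {key}\r\n".encode())
    try:
        reply = sock.recv(65535)
    except socket.timeout:
        raise CacheError("reply lost")        # dropped packet -> client error
    if not reply.startswith(b"VALUE"):
        raise CacheError("bad reply")         # reordered/garbled -> client error
    return reply

def get_with_retry(sock: socket.socket, key: str, retries: int = 1) -> Optional[bytes]:
    """The caller decides: retry once or twice, then proceed without the data."""
    for _ in range(retries + 1):
        try:
            return udp_get(sock, key)
        except CacheError:
            continue
    return None  # the page render simply continues without this value
```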

Retries and errors

Retrying everything is not an ideal strategy in all cases, though. Sometimes there are grey failures in the network between the client and the cache, and every failed cache request (e.g. a network connection error) turns into a hit on the backend. If enough cache servers fail at the same time, the extra traffic can significantly strain the backend.

To counter this problem, Facebook introduces a separate Memcached pool called Gutter. Functionally, Gutter is just another cluster of Memcached instances, but a tiny one: about 1% of the size of the main pool. Whenever a client gets a connection error while fetching data from the main pool, it retries the request against Gutter. If the data is not in Gutter either, the client calls the backend and puts the result into Gutter.

Now the system behaves differently. Ideally, the client receives the data from the main cache pool. If that is not available, the client tries Gutter. Only if that also fails does the client call the backend and update Gutter. When multiple clients try to access the same key that lives on a dead main-pool instance, all of those requests hit Gutter. Essentially, it is an extra layer of caching.
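
A condensed sketch of that read path, assuming main_pool and gutter_pool behave like memcached clients and backend.fetch hits the database; the 10-second TTL is an illustrative stand-in for the paper's "expire Gutter entries quickly" policy:

```python
def cached_get(key: str, main_pool, gutter_pool, backend) -> bytes:
    """Read path with a Gutter fallback (pool objects are assumed)."""
    try:
        value = main_pool.get(key)
        if value is not None:
            return value
        value = backend.fetch(key)           # ordinary miss: refill the main pool
        main_pool.set(key, value)
        return value
    except ConnectionError:
        value = gutter_pool.get(key)         # a main-pool server is unreachable
        if value is not None:
            return value                     # Gutter absorbs the failed request
        value = backend.fetch(key)
        gutter_pool.set(key, value, ttl=10)  # short TTL so Gutter entries age out fast
        return value
```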

The approach works tremendously well. By sparing only 1% of the total memory pool, Gutter turns 10-25% of failed requests into hits and reduces the number of client-visible connection errors by 99%. There is, however, a downside: the data in Gutter might be slightly outdated.

Updates and deletes

What happens when multiple clients try to change the same key in a heavily distributed system? Most likely, they hold different versions of the data and race to write it back. If the clients are not careful, this can degenerate into a vicious cycle of writes overwriting each other with stale data.

Facebook combats this by introducing lease tokens. Whenever a client gets a cache miss, it also receives a lease token for that key, and only the token holder may write the key back; other clients cannot update the same key for a short period. There is another client-side optimization: when a client requests a key for which a lease token has already been issued, it waits briefly and re-queries instead of returning an error to the caller. This works because the token holder usually needs only a few milliseconds to update the key, so waiting is a good strategy for getting the latest data.
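
A sketch of that flow under assumed APIs: cache.gets returns a (value, lease_token) pair, and cache.set succeeds only while the token is still valid; the 10 ms wait is an illustrative choice, not a number from the paper:

```python
import time

def lease_get(key: str, cache, backend, wait_s: float = 0.01) -> bytes:
    """Lease flow: only the lease-token holder may fill the miss."""
    while True:
        value, lease = cache.gets(key)
        if value is not None:
            return value                  # plain hit
        if lease is not None:
            value = backend.fetch(key)    # we hold the lease: fill the miss
            cache.set(key, value, lease)  # rejected if the lease was invalidated
            return value
        # Another client holds the lease: wait a few milliseconds and
        # re-query instead of returning an error to the caller.
        time.sleep(wait_s)
```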

There are times, however, when the client does not want to wait for the update. For example, when the data was deleted, or when working with stale data does not lead to errors. For these cases, Memcached at Facebook does not delete the data right away but moves it into a special short-lived data structure. When making a GET request, the client can specify whether it is fine with stale data. If it is, instead of receiving a lease token it may get back either recently deleted data or a stale version of the value. Effectively, this approach turns some fraction of cache misses into hits.
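
In sketch form, continuing the assumed cache API from above, with cache.get_stale standing in for a lookup into that transient structure of recently deleted values:

```python
from typing import Optional

def stale_get(key: str, cache, allow_stale: bool = False) -> Optional[bytes]:
    """Stale-tolerant read over an assumed transient-delete structure."""
    value, lease = cache.gets(key)
    if value is not None:
        return value
    if allow_stale:
        stale = cache.get_stale(key)
        if stale is not None:
            return stale   # a would-be miss served as a slightly stale hit
    return None            # otherwise fall back to the lease path above
```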

Conclusion

Designing distributed systems and designing distributed systems at very large scale are two very different things. At scale, ideas that sound crazy, like switching from TCP to UDP, may turn out to be beneficial. However, you should always base such judgments on data and metrics, and test every single heuristic before rolling it out to production.