Lessons from Scaling Facebook's Online Data Infrastructure

July 21, 2020

Venkat Venkataramani

CEO, Rockset

lightbulb

Lessons from scaling facebook's online data infrastructure

There are 3 growth numbers that stand out when I look back at the hyper-growth years of facebook from 2007 until 2015, when I was managing facebook's online data infrastructure team: user growth, team growth and infrastructure growth. Facebook’s user base grew from ~50 million monthly active users to a billion and half during that time, which is about a 30x growth. The size of facebook’s engineering team grew 25x during that time from about ~100 to ~2500. During the same time, the online data infrastructure’s peak workload went up from about 10s of millions of requests per second to 10s of billions of requests per second — which is a 1000x growth.

Scaling facebook’s online infrastructure through that 30x user growth was a huge challenge. But the challenge of keeping pace with facebook’s prolific product development teams and new product launches was the greatest challenge of them all.

There is another dimension to this story and another significant number that always stands out to me when I look back to those years: 2.5 hours. That was how long facebook’s most severe outage lasted during those 8 years. Facebook was down for all users during that outage [1, 2]. The recent Twitter bitcoin hack brought back a lot of those memories to many of us who were at facebook at that time. In fact, there is only one other total outage during that time I recall that lasted about 20-30 mins or so that comes close to the level of disruption this caused. So, during those 8 years when facebook’s online infrastructure scaled 1000x, it was completely down for all users for a few hours in total.

The mandate for facebook’s online infrastructure during that time could simply be captured in 2 parts:

make it easy to build delightful products
make sure facebook stays up and doesn’t go down or lose user data

How did facebook achieve this? Especially when one of facebook’s core value was to MOVE FAST AND BREAK THINGS. In this post, I will share a few key ideas that allowed facebook’s data infrastructure to foster innovation while ensuring very high uptimes.

move-fast-with-stable-infra

Scaling principles:

Build loosely coupled data services.

Monolithic data stacks will hurt you at so many levels. Remember facebook was not the first social network in the world (both myspace and friendster existed before it) but it was the first social network that could scale to a billion active users. With monolithic data stacks:

you will lose your market → since your product teams are moving slow, and you will be late to the market
you will lose money → your product teams will end up over-engineering and over-provisioning the most expensive parts of your infrastructure, and you will also need to hire a large product and operations team for ongoing maintenance.
you will lose your best engineers → good engineers want to get things done and push them to production. When product launches get mired in pre-launch SRE checklist traps, it will kill innovation and your best engineers will leave to other companies where they can actually launch what they build.

Follow good patterns with microservices. When these services are built right, they can address all of these concerns.

Microservices, when done right, will allow parts of your application to scale independently.
Similarly, microservices will also allow parts of your application to fail independently. It will allow you to build your infrastructure in a way that some part of your app could be down for all of your users, or all of your app could be down for some of your users, but all of your application is seldom down for all of your users. This is massive and directly helps you achieve the two goals of moving fast and ensuring high application uptime simultaneously.
And of course, microservices allow for independent software lifecycle + deployment schedules and also allows you to leverage a different programming languages + runtime + libraries than what your main application is built in.

Avoid bad patterns with microservices:

Don’t build a microservice just because you have a well abstracted API in your application code. Having a well-abstracted API is necessary but far from being sufficient to turn that into a microservice. Think about the key reasons mentioned above such as scaling independently, isolating workloads or leveraging a foreign language runtime & libraries.
Avoid accidental complexities — when your microservices start depending on microservices that depend on other microservices, it is time to admit you have a problem, look for a nearest “Microservoholics Anonymous” and laugh at this video while realizing you are not alone with these struggles. [3]

Embrace real-time. Consistency is expensive.

Highly consistent services are highly expensive. Embrace real-time services.
Reactive real-time services are the ones that replicate your application state through change data capture systems or using Kafka or other event streams, so that a particular part of your application can be powered off of a real-time service (imagine facebook’s newsfeed or ad-serving backend) that is built, managed and scaled independently from your main application.
90% of the apps in the world can be built on real-time data services.
90% of the features in your app can be built on real-time data services.
Real-time data services are 100-1000x more scalable than transactional systems. Once you need cross-shard transactions and you hear the words "two", "phase" and "commit" next to each other — go back to the drawing board and see if you can get away with a real-time data service instead.
Identify and separate parts of your application that need highly consistent transactional semantics and build them on a high quality OLTP database. Power the rest of your application using real-time data services with independent scaling and workload isolation.
Move fast. Ensure high application uptimes. Have your cake. Eat it too.

Centralized services are actually awesome.

Especially for meta-data services such as the ones used for service discovery.
Good hygiene around caching can take you a really long way. It is essential to think through what happens when you have a stale cache but with sane stale cache system behavior you can go far.
In your application stack, assume for every level you have in your stack, you will lose one 9 in your application’s reliability. This is why a multi-level microservices stack will always be a disaster when it comes to ensuring uptime.
Metadata services used for service discovery are close to the bottom of that stack and they need to provide 1 or 2 orders of magnitude higher reliability than any service built on top of that. It is very easy to underestimate the amount of work it takes to build a service with such high availability that it can act as the absolute bedrock of your infrastructure. If you have a team working and maintaining such as service, send that team a box of chocolates, flowers and nice bourbon.

Data APIs are better than data dumps.

Data quality, traceability, governance, access control are all superior with data APIs than data dumps.
With data APIs, the quality of the data actually gets better over time while maintaining a stable, well-documented schema, not because of some superior black magic technology but simply because you usually have a team that maintains it.
Data dumps that have gotten rotten over time appear just as pristine as how they looked the day the data set was created. When data APIs rot, they stop working which is a very useful property to have.
More importantly, data APIs naturally allow you to build apps and push for more automation to avoid repetitive work, allowing you to spend more time on more interesting parts of your work that are not going to be replaced by our upcoming AI overlords.

General purpose systems beat special-purpose systems in the long run.

Engineers love building special purpose systems since most of them overvalue machine efficiency and undervalue their own time.
Special purpose systems are always more efficient than general purpose systems the day they are built and always less efficient a year after.
General purpose systems always win in extensibility and hence support you better as your product requirements evolve over time. Extensibility beats hardware efficiency in every TCO analysis that I’ve been part of.
The economies of scale with general purpose systems that power a lot of different use cases allows for dedicated teams to work endlessly on long series of 1% and 2% reliability and performance improvements. The compound effect of that is immense over time. Such small improvements will never make the cut in your special purpose system’s roadmap albeit technically speaking those improvements might be relatively easier to achieve.

I hope some of you find these ideas useful and applicable to your organization and allow you to MOVE FAST WITH STABLE INFRASTRUCTURE [4] instead of moving things and breaking fast [5]. Please leave a comment if you found this useful or you would like me to expand on any of these principles further. If have a question or have more to add to this discussion, I’d love to hear from you.

[1] https://www.facebook.com/notes/facebook-engineering/more-details-on-todays-outage/431441338919

[2] https://techcrunch.com/2010/09/23/facebook-down/?_ga=2.62797868.161849065.1594662703-1320665516.1594662703

[3] https://youtu.be/y8OnoxKotPQ

[4] https://www.businessinsider.com/mark-zuckerberg-on-facebooks-new-motto-2014-5

[5] https://xkcd.com/1428/