Analytics-on-the-fly: from batch to real-time user engagement

August 11, 2020

It was the winter of 2007 when I logged into my newly created Facebook account for the very first time, and I was amazed to see Facebook immediately show me three friends with whom I had lost touch since elementary school. One was working at a multinational bank in London, another was an engineer at Google in its Silicon Valley office, and the third was running a restaurant in my town of Guwahati, a sleepy town on the India-Myanmar border. I was simply stunned that Facebook’s technology had the ‘magic’ to connect me to three people who had been my cricket teammates in elementary school. Facebook’s ‘magic’, then, was powered by the ability to process large amounts of information on a new system called Hadoop and to run batch analytics on it.

Then things started to become more real-time. Facebook created a special team called the ‘growth team’ that was in charge of recommending ‘friends’ to newly signed-up Facebook users. The team would gather a variety of information, both past and recent, on every person, and then build models to show them relevant posts from friends or friends-of-friends to improve their engagement metric. The more the engagement, the higher the value-add to each individual user and the more value to the Facebook network. It was like an online multiplayer game, where each user is a player vying to learn useful tidbits from other people in the network while also contributing their own perspective to it. The recommendation models improved engagement when they had access to users' more recent actions. Data that used to be batch-loaded daily into Hadoop for model serving started to get loaded continuously, at first hourly and then at fifteen-minute intervals. If data feeds were delayed by an hour, that resulted in a double-digit percentage decline in revenue for that hour. No other enterprise was leveraging its most recent data the way the Facebook growth team did at that time, and this was one of the biggest reasons why Facebook was able to beat out other technical rivals on its way. Remember Orkut, FriendFeed, Ning, MySpace and Google+?

Last December, we made a trip to Los Angeles for a family vacation, and the moment I disembarked at LAX and opened my Facebook app, it immediately showed me advertisements for some nearby restaurants. This needed a database that could use a location index to instantaneously find the best ads for me. Facebook also showed me photos from my last trip to that city, in 2017, and this needed a secondary index on all my earlier photos that were taken at that location. No more batch analytics. This is analytics-on-the-fly!
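To make the idea of a secondary index on location concrete, here is a minimal Python sketch. It is not how Facebook implements this; the `PhotoStore` class, the `cell` bucketing function and the 0.1-degree grid size are all assumptions chosen purely for illustration. The point is only that a lookup by location touches one small bucket instead of scanning every photo.

```python
import math
from collections import defaultdict


def cell(lat, lon, size_deg=0.1):
    """Map a coordinate to a coarse grid cell (hypothetical bucketing scheme)."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


class PhotoStore:
    """Toy photo store with a primary index by id and a secondary index by location."""

    def __init__(self):
        self.photos = {}                 # primary index: photo_id -> record
        self.by_cell = defaultdict(set)  # secondary index: grid cell -> photo ids

    def add(self, photo_id, user, lat, lon, year):
        self.photos[photo_id] = {"user": user, "lat": lat, "lon": lon, "year": year}
        # Keep the secondary index up to date on every write.
        self.by_cell[cell(lat, lon)].add(photo_id)

    def near(self, lat, lon, user):
        """Return the user's photo ids in the same grid cell as (lat, lon)."""
        candidates = self.by_cell.get(cell(lat, lon), set())
        return sorted(pid for pid in candidates if self.photos[pid]["user"] == user)


store = PhotoStore()
store.add("p1", "me", 33.94, -118.42, 2017)     # taken near LAX
store.add("p2", "me", 51.50, -0.12, 2019)       # taken in London
store.add("p3", "other", 33.96, -118.45, 2018)  # someone else's photo near LAX
print(store.near(33.95, -118.43, "me"))
```

A real system would also check neighbouring cells and rank the results, but the write-time index maintenance and the read-time bucket lookup are the essence of the technique.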

Building analytical applications on your most recent datasets is a tough challenge. Why is that?

  • Firstly, if you have to make instantaneous decisions on recent data, you do not have time to clean or sanitize it before processing. You need a database that can take in all forms of semi-structured data without cleaning, schematizing or formatting.
  • Secondly, incoming data streams are usually bursty in nature and you have no way to control their velocity. You need a system that auto-scales so you do not have to pre-provision it for peak capacity.
  • And thirdly, and most importantly, you need a system that can process hundreds or thousands of concurrent queries every second.

Facebook tackled these problems by hiring software developers who built on systems like open-source RocksDB, Scribe and TAO.

Facebook was able to address these challenges because it built a multi-petabyte secondary index on all of its users' content. Queries on any dimension are fast because there is always an index that lets the query complete in milliseconds. This data-access enabler still keeps the Facebook juggernaut stomping on all its competition!
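As a rough illustration of why "an index on every dimension" helps, the Python sketch below ingests schemaless, nested documents as-is and indexes every field path it finds. It is a toy under stated assumptions, not Facebook's or any product's actual design: the `FieldIndex` class, its `ingest` and `find` methods and the dotted-path convention are hypothetical names invented for this example.

```python
from collections import defaultdict


class FieldIndex:
    """Toy index over every (field path, value) pair of schemaless documents."""

    def __init__(self):
        self.docs = {}                 # doc_id -> original document
        self.index = defaultdict(set)  # (path, value) -> doc ids

    def _flatten(self, obj, prefix=""):
        # Walk nested dicts and lists, yielding (dotted path, scalar value) pairs.
        if isinstance(obj, dict):
            for key, value in obj.items():
                yield from self._flatten(value, f"{prefix}.{key}" if prefix else key)
        elif isinstance(obj, list):
            for value in obj:
                yield from self._flatten(value, prefix)
        else:
            yield prefix, obj

    def ingest(self, doc_id, doc):
        """Store the document without a predefined schema and index all its fields."""
        self.docs[doc_id] = doc
        for path, value in self._flatten(doc):
            self.index[(path, value)].add(doc_id)

    def find(self, conditions):
        """Return ids of documents matching every path == value condition."""
        sets = [self.index.get(item, set()) for item in conditions.items()]
        matches = set.intersection(*sets) if sets else set(self.docs)
        return sorted(matches)


idx = FieldIndex()
idx.ingest(1, {"user": "a", "geo": {"city": "LA"}, "tags": ["food", "beach"]})
idx.ingest(2, {"user": "b", "geo": {"city": "LA"}})
idx.ingest(3, {"user": "a", "likes": 3})
print(idx.find({"geo.city": "LA"}))
print(idx.find({"user": "a", "tags": "food"}))
```

Each query becomes a handful of set lookups and an intersection, regardless of which fields it filters on. The trade-off is extra work and storage at write time, which is the cost a production system has to engineer around.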

Are you enabling real-time access to all your datasets so that you can trample your competition? If so, great - tell me what your real-time data stack looks like. If not, check out Rockset.