How Twitter and FAANG develop and launch recommendation systems
And why open sourcing "the algorithm" won't benefit anyone
Off the back of Elon’s Twitter takeover, there has been a lot of discussion about the Twitter “algorithm”. It even looks as if someone at Twitter acknowledged the debate by creating a repo for “The Algorithm”. Having worked on highly scalable recommendation systems, it’s interesting to see this debate reach a wider audience. Newfound interest in recommendation systems has come with a large dose of misunderstanding, some of which I hope to clear up here.
Disclaimer: I have never worked at Twitter. I’m extrapolating from discussions with two friends at Twitter and from what I have seen in current and past roles. If you work at Twitter and have worked on the feed, please chime in and correct anywhere this deviates from reality!
To understand the state of the Twitter feed today, you have to understand the context of how it was developed. This is true for all recommendation systems that have been in production for a long period of time. The vast majority of machine learning projects at FAANG+ companies involve iterating on an existing system to improve performance. Sometimes the previous algorithm is just a heuristic, and sometimes it is a complex deep learning model. For an outsider or a new employee onboarding to a team, this is important to understand, since the existing system will define your constraints.
Here’s a toy example in the context of Twitter:
Imagine you are a new member of the Twitter feed ranking team. Your team is responsible for assigning a rank to a Tweet, determining the order that a Tweet appears in the feed. To an outsider, you are responsible for “the algorithm”, everything that occurs in between an author writing a tweet and Twitter assigning a rank. In reality, your team only has ownership over one part of the system, the “recommendation model”.
Let’s go through the components of this system:
Toxicity score: a machine learning model that evaluates a tweet for harmful content, maintained by the “Safety” team
Engagement metrics: real-time counts of likes, retweets, comments, and profile visits
Share metrics: number of shares and share method
Follower embedding: a graph neural network that represents your followers as an embedding vector
Recommendation model: this is your core responsibility, a machine learning model that aggregates and transforms input features and produces an intermediate score used to rank tweets
Filters: after scoring, some tweets are flagged as ineligible due to factors external to your team’s data. For example, the “Safety” team also maintains a list of shadow-banned users whose tweets are not distributed in the feed
Rank: the pipeline aggregates scores for many tweets to produce a ranking to be used to serve tweets to the end user
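The components above can be sketched as a toy pipeline. This is a minimal illustration, not Twitter’s actual code: every name, weight, and the "author:tweet_id" key convention are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class TweetFeatures:
    # Illustrative input features; a real system uses many more.
    toxicity_score: float  # from the safety team's model, in [0, 1]
    likes: int
    retweets: int
    shares: int

def recommendation_score(f: TweetFeatures) -> float:
    # Toy recommendation model: a hand-weighted linear combination,
    # discounted by the toxicity score owned by another team.
    engagement = f.likes + 2 * f.retweets + 3 * f.shares
    return engagement * (1.0 - f.toxicity_score)

def rank_feed(tweets: dict[str, TweetFeatures],
              shadow_banned: set[str]) -> list[str]:
    # Filters run AFTER scoring: tweets from shadow-banned authors are
    # dropped regardless of how well the model scored them.
    eligible = {tid: f for tid, f in tweets.items()
                if tid.split(":")[0] not in shadow_banned}
    # Rank: highest recommendation score first.
    return sorted(eligible,
                  key=lambda tid: recommendation_score(eligible[tid]),
                  reverse=True)
```

Note how a change to `toxicity_score` alone reorders the output even if the recommendation model is untouched, which is exactly the cross-team dependency discussed below.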
Even in this toy example we see strong dependencies and complexity. For example, any change in the toxicity score sourced from the “Safety” team has the potential to change your ranking. Ranking results can also be overridden by filters that largely take effect after your recommendation service outputs results. This is why FAANG+ companies place such a strong emphasis on A/B testing; small changes can have large knock-on effects, and the only certain way to evaluate a change is to test it for statistical significance.
So you’re onboarded now. Your task is to take the system above and improve it. A typical machine learning project follows the steps below:
Define the most important success metrics
You start by defining what you are trying to achieve. Projects fall into a few categories:
Addressing a defect: e.g. users are complaining that Tweets containing German words are down-ranked.
Improving metrics: e.g. increase user engagement
Improving performance: e.g. improve the user experience by decreasing latency for serving rankings
Define the experiment
If the project goal is to increase user engagement, you will focus on a set of metrics that (you hope) represent the goal. For engagement, you might use metrics such as number of likes, comments, shared tweets, and total time spent in the feed. These metrics will determine the success of your experiment or A/B test.
Experiment planning starts by defining your control and treatment arms:
Control: the existing recommendation system
Treatment: the change in the recommendation system you are looking to deploy
Another consideration is how widely you will launch a feature. Very often experiments and features are restricted to a subset of users. For example, you may only have access to an English language toxicity score, meaning that you must use a different recommendation algorithm for non-English language users.
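A common way to implement these arms is deterministic bucketing: hash the user into a bucket so they always see the same arm, and gate eligibility before assignment. This sketch assumes the English-only toxicity score mentioned above; the function name and percentages are illustrative.

```python
import hashlib

def assign_arm(user_id: str, experiment: str, language: str,
               treatment_pct: int = 50) -> str:
    # Gate eligibility first: e.g. if only an English-language toxicity
    # score exists, non-English users never enter the experiment.
    if language != "en":
        return "excluded"
    # Deterministic hash so a given user always lands in the same arm;
    # salting with the experiment name decorrelates separate experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"
```

Salting by experiment name matters: without it, the same users would land in "treatment" for every experiment, confounding results across concurrent tests.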
Do the science and engineering
Translating goals and requirements into hard machine learning work is a topic of its own. My personal guiding principle is: the simplest system that gets the job done. During this stage of the project you research a range of options. A complex flagship machine learning project might require state of the art algorithms. On the other end of the spectrum, an MVP might require shipping a heuristic in place of a machine learning model in order to get 80% of the benefit today rather than 90% of the benefit in 3 months time.
Run the experiment
Run your experiment until the results reach statistical significance or for a predetermined length of time. During this time you monitor user and performance metrics to catch anything bad.
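For a binary engagement metric (e.g. the fraction of users who liked at least one tweet), significance is often assessed with a two-proportion z-test. A minimal sketch using only the standard library; the exact test and thresholds vary by team:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(engaged_a: int, n_a: int,
                          engaged_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in engagement rate
    between arm A (control) and arm B (treatment)."""
    p_a, p_b = engaged_a / n_a, engaged_b / n_b
    # Pooled rate under the null hypothesis that both arms are equal.
    p_pool = (engaged_a + engaged_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Note the caveat in the text: stopping the moment p < 0.05 inflates false positives, which is why experiments are typically run for a predetermined length of time.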
Analyze the experiment
Typically, the success of an experiment in terms of primary metrics is clear. If your experiment goal was to improve user engagement, you look for a statistically significant improvement in the number of user likes and comments.
However, for a complex system such as Twitter, there will be many secondary metrics to examine. If your experiment improved engagement: did this occur evenly across all demographics? What was the impact on ad revenue? Did increased engagement lead to increased reports for harmful content? The answer to many of these questions can be ambiguous. During the course of analysis you make your case and provide your own interpretation of the results.
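Checking whether an improvement occurred evenly usually means slicing the metric by segment. A toy sketch of per-segment lift; the record shape and segment labels are assumptions for illustration:

```python
from collections import defaultdict

def lift_by_segment(records):
    """records: iterable of (segment, arm, engaged) tuples, where
    engaged is 1 if the user engaged and 0 otherwise.
    Returns the treatment-minus-control engagement lift per segment."""
    counts = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for segment, arm, engaged in records:
        counts[segment][arm][0] += engaged  # engaged users
        counts[segment][arm][1] += 1        # total users
    lifts = {}
    for segment, arms in counts.items():
        rates = {arm: e / n for arm, (e, n) in arms.items()}
        lifts[segment] = rates["treatment"] - rates["control"]
    return lifts
```

A positive overall lift can hide a negative lift in one segment, which is exactly the kind of ambiguity that makes experiment analysis a matter of interpretation.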
If your experiment was a success and passed any post-experiment reviews, it’s time to launch! Newly launched features will be monitored for some additional time in case there are any longer-term impacts.
Comments on the process
This iterative process of improvement has occurred dozens of times in order to improve the Twitter feed. There are two common viewpoints on this process: (a) iterative improvement via A/B testing converges on a highly optimized system, or (b) A/B testing traps you in a sub-optimal local minimum and produces unnecessarily complex systems.
After working on recommendation systems, I’m unsure which side I stand on. Across engineering and especially machine learning, there is always the temptation to “start again”. Engineers love to work on greenfield projects. On the other hand, I have seen the power of iterative development, delivering machine learning goals that teams did not think were possible. Iterative development can result in additional complexity, but there are many problems that require that complexity and cannot be designed optimally from scratch.
Open sourcing “The Algorithm”
Finally, I want to discuss one of Elon’s goals post-takeover: to open source “the algorithm”, presumably to improve public trust in how Twitter serves content. The aim is noble but unlikely to yield any tangible benefit.
Following the typical development cycle for machine learning projects, the Twitter ranking algorithm is likely to be layer upon layer of complexity as a result of years of experimentation and dependencies. Even our toy example above is so complex that the concept of open sourcing the algorithm is unclear.
What does open sourcing the algorithm mean? Perhaps it is to be taken literally: open sourcing all the code that produces recommendations. This would require open sourcing all of its dependencies as well; otherwise you can never gain a clear picture of how tweets are ordered.
On the other hand, perhaps open sourcing the algorithm merely means documenting the intention of the feed and releasing this documentation. The problem here is that even Twitter employees don’t know the true behaviour of the system! If they did, there would be no need for experimentation and A/B tests. It is very common for an experiment to fail due to unpredictable downstream effects. Releasing documents describing the intention of “the algorithm” might as well be fiction. There is only loose evidence that the algorithms perform as intended by their creators.
Ultimately I don’t think Twitter critics will be satisfied unless there is a complete overhaul in the way that the feed works. Some have suggested that Twitter should create an API to allow custom feeds. A majority of users would still use the official Twitter feed, but this could shift the narrative away from “the algorithm” and towards the users.