On The Data Engineering Show, we recently had the pleasure of speaking with Steven Moy, a Software Engineer at Yelp.
Yelp really needs no introduction, but in case you don’t know, Yelp is an extremely popular website and app that publishes crowd-sourced reviews about local businesses and provides a table reservation service for restaurants. There are almost 5 million businesses claimed on Yelp. The website and app get around 178 million views combined each month.
As an expert in query engines and performance-related challenges, Steven Moy went on a deep dive with us, talking about how Yelp has handled its huge data growth in the past ten years.
We all know that the restaurant and hospitality industries were hit hard during the pandemic, so naturally Yelp usage declined as well. Now that the vaccine is being distributed, we are all excited to go back to using Yelp to help us find and review all our favorite restaurants and local businesses.
As data sets go, we think Yelp has a pretty cool one. So let’s get into it.
How big is Yelp’s data stack?
One of the coolest features of Yelp is its ability to recommend great local businesses for you to check out, or even the best menu items to order. To provide the best recommendations, Yelp needs a strong understanding of how users interact with the website and app.
Yelp collects and stores data on all of these user events. With millions of visits each day, this adds up to an enormous amount of data.
The user events are streamed through Kafka and then forwarded to multiple data lakes. Data lakes are a very efficient way to store lots of data, but they are not very efficient when you want to use the data. Therefore, Yelp also streams its data to a cloud data warehouse on Amazon Web Services (AWS).
Yelp has been an outspoken AWS partner and has used AWS for its cloud data warehouse for many years.
In 2010, Yelp implemented its first data warehouse solution. They started with a leader database and ran a lot of replicas. As the company grew and hired analysts, the replicas started to take too long to run, which slowed the team down. They needed to find a way to scale up and to build an analytics-specific MySQL replica to serve as a data warehouse.
This is when they switched to an Amazon Elastic MapReduce (EMR) stack. It enabled them to speed up the transient clusters, and it provided good compute compatibility and plenty of power.
This solution continued to work for them until 2013. They wanted to maintain a high-performance data product, so they piloted streaming data into Amazon Redshift. This made a world of difference for speed and productivity. Once they made the switch, analysts were able to run in mere seconds things that used to take an hour.
However, the Yelp data team found that they needed to find another solution yet again in 2017.
Redshift was popular with the team and worked extremely well, so they kept using it more and more. The problem is that Redshift uses a pay-per-use model, so the costs grew along with the usage. The cost was already too high and was not sustainable for future growth.
So in 2017, the team decided to build a data lake solution instead. They now store all of their event stream data as Parquet files on S3, and they use an Amazon data catalog. This solution allows them to ship data to Amazon Redshift and Athena. They also use Spark to mine the Parquet data directly on S3.
Steven reminds us that there is always a push and pull between innovation and cost. You always need to be innovating, but you also have to do it in a cost-conscious way. One way to do this efficiently is to work backwards. When their data usage was too high, they evaluated why they were scanning so many TB and tracked usage by event type. This way, they were able to find an efficient solution, drop costs significantly, and continue providing all the same features to their customers.
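To make the data lake layout a little more concrete, here is a minimal sketch of how event files could be laid out on S3 so that query engines only read the partitions they need. It is an illustration only: the Hive-style `event_type=` and `dt=` partition scheme, the `events` prefix, and the `event_partition_key` helper are all assumptions for this example, not details Steven shared about Yelp's actual setup.

```python
from datetime import date


def event_partition_key(event_type: str, day: date, part: int,
                        prefix: str = "events") -> str:
    """Build a Hive-style partitioned object key for one Parquet file.

    Hypothetical layout: the partition columns (event_type, dt) are an
    assumption for illustration, not Yelp's real schema.
    """
    return (
        f"{prefix}/event_type={event_type}/dt={day.isoformat()}/"
        f"part-{part:05d}.parquet"
    )


# Example: the first file of "photo_view" events for March 1, 2021.
key = event_partition_key("photo_view", date(2021, 3, 1), 0)
print(key)
```

With a layout along these lines, a query that filters on one event type and one day can skip every other prefix, which is one common way to keep the amount of scanned data down.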
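The "work backwards" idea can be sketched in a few lines: take a log of queries, total up the bytes scanned per event type, and look at the biggest consumers first. The record format below (`event_type` and `bytes_scanned` fields) and the sample numbers are made up for illustration; a real query log would come from whatever the warehouse or query engine exposes.

```python
from collections import defaultdict

BYTES_PER_TB = 10 ** 12  # decimal terabytes


def scanned_tb_by_event_type(query_log):
    """Return (event_type, terabytes_scanned) pairs, largest first."""
    totals = defaultdict(int)
    for record in query_log:
        totals[record["event_type"]] += record["bytes_scanned"]
    return sorted(
        ((event_type, scanned / BYTES_PER_TB)
         for event_type, scanned in totals.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical sample log, not real Yelp data.
query_log = [
    {"event_type": "search", "bytes_scanned": 3 * BYTES_PER_TB},
    {"event_type": "photo_view", "bytes_scanned": 1 * BYTES_PER_TB},
    {"event_type": "search", "bytes_scanned": 2 * BYTES_PER_TB},
]

report = scanned_tb_by_event_type(query_log)
for event_type, terabytes in report:
    print(f"{event_type}: {terabytes:.1f} TB scanned")
```

A report like this shows which event types drive most of the scanning, so the team can decide where trimming or restructuring the data would save the most.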
How many people work on data initiatives at Yelp?
Data is huge at Yelp. Almost half of the organization needs to derive information from the data on a daily basis.
They use a lot of different technologies to manage their large data stack, so they have a dedicated team to focus on each technology. They also match data scientists with various other groups at Yelp so teams can work together to explore what is possible.
Much of the data is siloed within the organization because each team needs to use the data in its own way. Every team that needs to track insights within the organization has its own data sets.
At the time of speaking with Steven, Yelp had an analyst team of 5 people.
How does Yelp manage so much user-generated content?
Yelp runs on user-generated content, such as photos and reviews. With millions of visits each day, that’s a lot of contributions! So what does Yelp do with all of it?
For example, Yelp uses machine learning to figure out the most popular dishes at a restaurant based on what gets photographed and written about the most. Pretty cool, right? So how did they come up with this idea?
The idea actually came out of a hackathon. Yelp holds 2–3 hackathons per year. One year, they wanted to focus on enhancing the “popular dishes” feature. Some attendees came up with a great idea to analyze the photos.
They immediately engaged the machine learning engineering team on this project, as well as their content team and data scientists, to evaluate the new feature. It then went through prototyping and A/B testing, and it is now the great feature we use today.