
Machine Learning-Powered Search Ranking of Airbnb Experiences


Airbnb Experiences are handcrafted activities designed and led by expert hosts that offer a unique taste of local scene and culture. Each experience is vetted for quality by a team of editors before it makes its way onto the platform.

We launched Airbnb Experiences in November 2016 with 500 Experiences in 12 cities worldwide. During 2017, we grew the business to 5,000 Experiences in 60 cities. In 2018, the rapid growth continued, and we managed to bring Experiences to more than 1,000 destinations, including unique places like Easter Island, Tasmania, and Iceland. We finished the year strong with more than 20,000 active Experiences.

As the number of Experiences grew, Search & Discoverability, as well as Personalization, became very important factors for the growth and success of the marketplace.

In this blog post, we describe the stages of our Experience Ranking development using machine learning at different growth phases of the marketplace, from small to mid-size and large.


The first three stages of our Search Ranking Machine Learning model

The main take-away is that machine learning-based Search Ranking works at every stage, given that we pick the model and infrastructure with the right level of complexity for the amount of data available and the size of the inventory that needs to be ranked. Very complex models will not work well when trained with small amounts of data, and simple baselines are sub-optimal when large amounts of training data are available.

Stage 1: Build a Strong Baseline

When Airbnb Experiences launched, the number of Experiences that needed to be ranked in Search was small, and we had just started collecting data on user interactions with Experiences (impressions, clicks, and bookings). At that point, the best choice was to simply re-rank Experiences randomly each day, until a small dataset was collected for development of the Stage 1 ML model.

Collecting training data: To train our first machine learning model for ranking Experiences, we collected search logs (i.e., clicks) of users who ended up making bookings.


Training Data Collection: Search session clicks from users who eventually made bookings

Labeling training data: When labeling training data, we were mainly interested in two labels: Experiences that were booked (which we treated as positive labels) and Experiences that were clicked but not booked (which we treated as negative labels). In this manner, we collected a training dataset of 50,000 examples.
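As a rough illustration of this step, here is a minimal sketch in Python, assuming the search-session click logs have already been joined into a pandas DataFrame. The column names (user_id, experience_id, booked) are hypothetical and not the actual log schema.

```python
import pandas as pd

# Hypothetical schema: one row per clicked Experience in a search session,
# with a flag indicating whether that click led to a booking.
clicks = pd.DataFrame({
    "user_id":       [1, 1, 1, 2, 2],
    "experience_id": [11, 12, 13, 21, 22],
    "booked":        [0, 0, 1, 1, 0],
})

# Keep only sessions from users who eventually made a booking, as described above.
booked_users = clicks.loc[clicks["booked"] == 1, "user_id"].unique()
training = clicks[clicks["user_id"].isin(booked_users)].copy()

# Label: booked Experiences are positives, clicked-but-not-booked are negatives.
training["label"] = training["booked"]
```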

Building signals based on which we will rank: In Stage 1 of our ML model, we decided to rank solely based on Experience Features. In total, we built 25 features, some of which are listed below (a small illustrative sketch follows the list):

  • Experience duration (e.g. 1h, 2h, 3h, etc.)
  • Price and Price-per-hour
  • Category (e.g. cooking class, music, surfing, etc.)
  • Reviews (rating, number of reviews)
  • Number of bookings (last 7 days, last 30 days)
  • Occupancy of past and future instances (e.g. 60%)
  • Maximum number of seats (e.g. max 5 people can attend)
  • Click-through rate
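To make the feature list concrete, the following toy sketch assembles a few of these features for two Experiences. All column names and values are illustrative, not the production feature definitions.

```python
import pandas as pd

# Hypothetical per-Experience feature table with a subset of the 25 features.
experiences = pd.DataFrame({
    "experience_id":     [11, 12],
    "duration_hours":    [2.0, 3.0],
    "price":             [60.0, 150.0],
    "category":          ["cooking_class", "surfing"],
    "avg_review_rating": [4.9, 4.7],
    "num_reviews":       [120, 35],
    "bookings_last_7d":  [12, 4],
    "occupancy_rate":    [0.60, 0.35],
    "max_seats":         [5, 8],
    "clicks":            [900, 400],
    "impressions":       [12_000, 8_000],
})

# Derived features from the list above.
experiences["price_per_hour"] = experiences["price"] / experiences["duration_hours"]
experiences["click_through_rate"] = experiences["clicks"] / experiences["impressions"]
```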

Training the ranking model: Given the training data, labels, and features, we used a Gradient Boosted Decision Tree (GBDT) model. At this point, we treated the problem as binary classification with a log-loss loss function.
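The post does not specify which GBDT implementation was used; the sketch below uses LightGBM on synthetic data purely to illustrate training a binary classifier with a log-loss objective and scoring candidates by predicted booking probability.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real training data: each row is a clicked
# Experience with numeric features; label 1 = booked, 0 = clicked but not booked.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 25))           # 25 Experience features
y = (rng.random(50_000) < 0.1).astype(int)  # ~10% positives (illustrative)

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42
)

params = {
    "objective": "binary",         # binary classification with log loss
    "metric": "binary_logloss",
    "learning_rate": 0.05,
    "num_leaves": 31,
}
model = lgb.train(
    params,
    lgb.Dataset(X_train, label=y_train),
    num_boost_round=200,
)

# Ranking score for each clicked Experience: predicted probability of booking.
scores = model.predict(X_holdout)
```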

When using GBDT, one does not need to worry much about scaling the feature values or about missing values (they can be left as-is). However, one important factor to take into account is that, unlike in a linear model, using raw counts as features in a tree-based model to make tree traversal decisions may be problematic when those counts are prone to change rapidly in a fast-growing marketplace. In that case, it is better to use ratios or fractions. For example, instead of using booking counts from the last 7 days (e.g., 10 bookings), it is better to use fractions of bookings relative to the number of eyeballs (e.g., 12 bookings per 1,000 viewers).
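For instance, a ratio feature along these lines could be computed as follows; the function name and the use of impressions as the "eyeballs" denominator are assumptions made for illustration.

```python
# Raw counts grow with the marketplace; normalizing by exposure keeps the
# feature's meaning stable over time. Impressions stand in for "eyeballs".
def bookings_per_1k_viewers(bookings_7d: int, impressions_7d: int) -> float:
    if impressions_7d == 0:
        return 0.0
    return 1000.0 * bookings_7d / impressions_7d

# 12 bookings out of 1,000 impressions -> 12.0, regardless of overall traffic.
print(bookings_per_1k_viewers(12, 1000))
```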

Testing the ranking model: To perform offline hyper-parameter tuning and comparison to random re-ranking in production, we used hold-out data that was not used in training. Our metrics of choice were AUC and NDCG, which are standard ranking metrics. Specifically, we re-ranked the Experiences based on model scores (probabilities of booking) and tested where the booked Experience would rank among all Experiences the user clicked (the higher the better).
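Here is a minimal sketch of this kind of hold-out evaluation, assuming the hold-out data has been grouped into per-session (labels, scores) pairs. It uses scikit-learn's roc_auc_score and ndcg_score as stand-ins for whatever tooling was actually used.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, ndcg_score

def evaluate(sessions):
    """sessions: list of (labels, scores) arrays, one pair per search session.
    Labels are 1 for the booked Experience, 0 for clicked-but-not-booked."""
    # AUC over all clicked Experiences in the hold-out set.
    all_labels = np.concatenate([labels for labels, _ in sessions])
    all_scores = np.concatenate([scores for _, scores in sessions])
    auc = roc_auc_score(all_labels, all_scores)

    # Per-session NDCG: where does the booked Experience land when the
    # session's clicked Experiences are re-ranked by model score?
    ndcgs = [
        ndcg_score(np.asarray([labels]), np.asarray([scores]))
        for labels, scores in sessions
        if len(labels) > 1  # NDCG needs at least two candidates to rank
    ]
    return auc, float(np.mean(ndcgs))

# Toy example: two sessions, each with one booked Experience among the clicks.
sessions = [
    (np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])),
    (np.array([1, 0]),    np.array([0.4, 0.6])),
]
print(evaluate(sessions))
```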

In addition, to get a sense of what the trained model learned, we plotted partial dependence plots for several of the most important Experience features. These plots show what would happen to a specific Experience's ranking score if we fixed the values of all but a single feature (the one being examined). As the plots showed, the model learned to utilize the features in the following manner (a small sketch of producing such plots follows the list):

  • Experiences with more bookings per 1k viewers will rank higher
  • Experiences with higher average review rating will rank higher
  • Experiences with lower prices will rank higher
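As an illustration only, the sketch below fits a small scikit-learn GBDT on synthetic data with three hypothetically named features and draws partial dependence plots for them; it is not the production model or feature set, and the synthetic labels are constructed just to loosely mimic the relationships described above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import PartialDependenceDisplay

# Synthetic feature table; column names are assumptions for illustration.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "bookings_per_1k_viewers": rng.uniform(0, 50, 5_000),
    "avg_review_rating":       rng.uniform(3.0, 5.0, 5_000),
    "price":                   rng.uniform(20, 300, 5_000),
})

# Synthetic booking labels loosely following the learned relationships above.
logits = (0.05 * X["bookings_per_1k_viewers"]
          + 1.2 * (X["avg_review_rating"] - 4.0)
          - 0.01 * X["price"])
y = (rng.random(5_000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

clf = GradientBoostingClassifier().fit(X, y)

# One panel per feature: how the predicted booking probability moves as a
# single feature varies while the other features are held fixed.
PartialDependenceDisplay.from_estimator(
    clf, X, features=["bookings_per_1k_viewers", "avg_review_rating", "price"]
)
plt.show()
```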
