Airbnb Rental-Analysis of New York using Python

 2 years ago
source link: https://towardsdatascience.com/airbnb-rental-analysis-of-new-york-using-python-a6e1b2ecd7dc

1. Introduction to the dataset

Airbnb is a San Francisco-based company with a presence in more than 81,000 cities worldwide and more than 6 million rental listings. From a data-creation standpoint, the company has generated enormous amounts of data from the cities in which it operates, such as user reviews, location descriptions, and rental statistics.

It has emerged as a dominant player in the online-rental marketplace as tourists and business travelers, among others, choose it as a means of renting rooms, apartments, or any residence available for homestay. The company has been a pioneer in establishing homestay as a popular form of hospitality and lodging, whereby visitors share a residence with a local of the city they are traveling to for a pre-arranged length of stay.

The dataset we’re going to analyze can be obtained from my GitHub account. The first step before analyzing a dataset is to preview the information it contains. To process this information easily we’re going to use Pandas, the Python library for data manipulation and analysis that offers data structures and operations for manipulating numerical tables and time series.

2. Data-Exploratory Analysis

Before we jump straight into data cleaning and transformation, let’s set out the questions to be analyzed:

A. What proportion of the rentals correspond to each room type?

B. How are rentals distributed among the five boroughs of New York?

C. What’s the price distribution and what’s the range of fair prices available?

D. Differentiate prices among available room types.

E. Which are the most popular locations to rent a lodge?

F. Rental Recommendation for the patient readers.

We’ll begin to work by importing the necessary libraries.

# Imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, load the data from the .CSV file, which must be located in your working directory, using Pandas’s read_csv function. Then print a preview of the data in the console to get a look at the variables and features involved. The DataFrame method head returns the first five rows of the dataset:

data = pd.read_csv('new_york.csv')
print(data.head())

After this step, you’ll get the following output:

Now, let’s take a look at the basic information of the dataset: its columns, data types, number of rows and non-null counts. We’ll use the DataFrame method info to achieve this.
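
As a minimal sketch of this step, the snippet below calls info on a tiny stand-in DataFrame. The three rows are made up for illustration (they are not taken from new_york.csv); on the real data you would call data.info() directly after read_csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for new_york.csv: made-up rows, a few of its columns.
data = pd.DataFrame({
    'id': [1, 2, 3],
    'neighbourhood_group': ['Manhattan', 'Brooklyn', 'Queens'],
    'price': [150, 89, 60],
    'reviews_per_month': [0.21, None, 1.50],
})

# info() normally prints to the console; writing to a buffer
# also keeps the report available as a string.
buffer = io.StringIO()
data.info(buf=buffer)
report = buffer.getvalue()
print(report)
```

The report lists every column with its non-null count and data type, which is where missing values first become visible.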


As can be seen in the output, the dataset contains 16 columns and 48,895 entries (rows), with several data types: NumPy integers, objects, and NumPy floats. Some values are missing.

Among the features or columns, we can find a listing ID, the listing name, a host ID, the host name, the borough, and other valuable information from which we’ll draw conclusions later on.

Next, replace the missing (null) values in the dataset with zero, since the nulls are mostly concentrated in the “number of reviews” and “last review” columns and have no useful application in our analysis. Be careful at this point: not every missing value you come across should be replaced.
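
A minimal sketch of the replacement, assuming a plain fillna(0) is what's wanted. The toy DataFrame is hypothetical and only holds numeric columns; counting the nulls per column first is a cheap way to see what you're about to overwrite.

```python
import pandas as pd

# Hypothetical toy data with one missing value (not the real dataset).
data = pd.DataFrame({
    'number_of_reviews': [9, 0, 45],
    'reviews_per_month': [0.21, None, 1.50],
})

# Count the nulls per column before touching anything.
nulls_before = data.isnull().sum()
print(nulls_before)

# Replace every null with zero.
data = data.fillna(0)
nulls_after = data.isnull().sum().sum()
print(nulls_after)
```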


Next, check for duplicated entries; the count should be zero.
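
One way to run that check is with duplicated, sketched here on a hypothetical toy DataFrame that contains one repeated row on purpose, so you can see a non-zero count and how drop_duplicates would clear it.

```python
import pandas as pd

# Hypothetical toy data; the last row deliberately repeats the second one.
data = pd.DataFrame({
    'id': [1, 2, 2],
    'name': ['Cozy room', 'Sunny loft', 'Sunny loft'],
    'price': [60, 150, 150],
})

# duplicated() flags rows that repeat an earlier row; sum() counts them.
n_duplicates = data.duplicated().sum()
print(n_duplicates)

# If the count were not zero, drop_duplicates() keeps the first copy of each row.
deduplicated = data.drop_duplicates()
```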


Now that we’ve cleaned the unnecessary features from the dataset, let’s move on to the analysis.

A. What proportion of the rentals correspond to each room type?

[Figure: pie chart, “Room-type Rental Distribution”. Image by Author]

Of all the rentals available in the dataset, 52% are entire homes or apartments, 46% are private rooms, and the remaining 2% are shared rooms.

The following code should enable you to obtain the presented visualization:

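# 1 - Pie chart
room_type = data.groupby('room_type')['latitude'].count().reset_index()
# The counts land in the 'latitude' column; rename it to the name used below
room_type.rename(columns={'latitude': 'n_rooms'}, inplace=True)
plt.pie(room_type['n_rooms'], labels=room_type['room_type'], autopct='%1.2f%%', colors=['darkcyan', 'steelblue', 'powderblue'])
plt.title('Room-type Rental Distribution', fontsize=15, color='b')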
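
If you only need the percentages and not the chart, value_counts with normalize=True gives the same proportions as numbers. This is a sketch on ten made-up listings, so the shares it prints are illustrative and not the dataset's.

```python
import pandas as pd

# Hypothetical toy data: 5 entire homes, 4 private rooms, 1 shared room.
data = pd.DataFrame({
    'room_type': ['Entire home/apt'] * 5 + ['Private room'] * 4 + ['Shared room'],
})

# normalize=True returns fractions of the total; scale them to percentages.
shares = (data['room_type'].value_counts(normalize=True) * 100).round(2)
print(shares)
```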

B. How are rentals distributed among the five boroughs of New York?

[Figure: bar plot, “Rental Distribution by Neighbourhood Group”]

As the visualization shows, Brooklyn and Manhattan concentrate the majority of the rentals listed on Airbnb, adding up to more than 40,000 rentals between the two of them. This means that the bulk of visitors to New York stay in properties, rooms or residences located in these areas.

Take a look at the code to create the plot:

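# 2 - Bar plot with neighbourhood distribution
neighbourhood = data.groupby('neighbourhood_group')['neighbourhood'].count().reset_index()
fig, ax = plt.subplots(figsize=(12, 8))
# Assumed plotting call: one bar per borough, bar height = number of rentals
sns.barplot(x='neighbourhood_group', y='neighbourhood', data=neighbourhood, ax=ax)
plt.xlabel('Neighbourhood Group', fontsize=15)
plt.ylabel('Rentals', fontsize=15)
plt.title('Rental Distribution by Neighbourhood Group', fontsize=15)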
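
To put a number on how much two boroughs dominate, you can count listings per borough and add up the two largest groups. A minimal sketch on ten invented rows (the counts here are not the dataset's):

```python
import pandas as pd

# Hypothetical toy data: made-up borough labels for ten listings.
data = pd.DataFrame({
    'neighbourhood_group': ['Manhattan'] * 4 + ['Brooklyn'] * 3 + ['Queens'] * 2 + ['Bronx'],
})

# Listings per borough, largest first.
counts = data['neighbourhood_group'].value_counts()

# Combined size of the two largest boroughs and their share of all listings.
top_two_total = counts.nlargest(2).sum()
top_two_share = top_two_total / counts.sum()
print(counts)
print(top_two_total, top_two_share)
```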

C. What’s the price distribution and what’s the range of fair prices available?

[Figure: histogram, “New York Price-Rental Distribution”. Image by Author]

The price distribution is concentrated around $300–400, with few observations at higher prices.

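# 3 - Histogram plot with price distribution
price = data.loc[:, ['neighbourhood', 'price']].set_index('neighbourhood')
price_stats = data['price'].describe().reset_index()
price_counts = price.price.value_counts().reset_index()
fig2, ax = plt.subplots(figsize=(12, 8))
# Assumed plotting call: a histogram of nightly prices (the bin count is arbitrary)
ax.hist(price['price'], bins=50, color='steelblue', edgecolor='b')
# Assumed styling: enlarge the x tick labels
ax.tick_params(axis='x', labelsize=12)
plt.xlabel('Price', fontsize=15)
plt.ylabel('Rentals', fontsize=15)
plt.title('New York Price-Rental Distribution', fontsize=15)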
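
The article doesn't pin down what a “fair” price range is, so here is one possible reading, stated as an assumption: take the interquartile range, that is, the band between the 25th and 75th percentiles, which ignores the few very expensive listings. The prices below are invented for the sketch.

```python
import pandas as pd

# Hypothetical nightly prices, including one extreme outlier.
prices = pd.Series([40, 60, 75, 90, 100, 120, 150, 200, 350, 5000], name='price')

# Quartiles are robust to outliers, unlike the mean.
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])

# Assumption: treat listings inside [q1, q3] as the "fair" band.
in_fair_range = prices.between(q1, q3)
print(q1, median, q3)
print(prices[in_fair_range].tolist())
```

On the real data you would pass data['price'] in place of the toy Series.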

D. Differentiate prices among available room types.

[Figure: bar plot, “New York Price-Rental Distribution by Location and Room-type”. Image by Author]

Among the different room types available, the most demanded ones are entire homes, followed by private rooms and lastly shared rooms, a tendency that is replicated from the first analysis.

In this case, we note that in less-demanded boroughs, such as The Bronx or Staten Island, the price gap between private rooms and shared rooms narrows, while entire homes keep their price gap over the other room types.

In more demanded locations, such as Manhattan, private-room rentals have prices similar to those of entire-home rentals in less demanded locations.

# 4 - Bar plot with price to location distribution
loc_price = data.groupby(['neighbourhood_group', 'room_type'])['price'].mean().reset_index()
locations = loc_price.neighbourhood_group.unique()
# x positions: three adjacent bars per borough, one per room type
x_rooms1 = [0.8, 3.8, 6.8, 9.8, 12.8]
x_rooms2 = [1.6, 4.6, 7.6, 10.6, 13.6]
x_rooms3 = [2.4, 5.4, 8.4, 11.4, 14.4]
y_values1 = loc_price[loc_price['room_type'] == 'Entire home/apt']['price'].values
y_values2 = loc_price[loc_price['room_type'] == 'Private room']['price'].values
y_values3 = loc_price[loc_price['room_type'] == 'Shared room']['price'].values
fig3, ax2 = plt.subplots(figsize=(16, 11))
plt.bar(x_rooms1, y_values1, color='purple', edgecolor='b')
plt.bar(x_rooms2, y_values2, color='b', edgecolor='b')
plt.bar(x_rooms3, y_values3, color='yellowgreen', edgecolor='b')
# Put one tick under the middle bar of each group before labelling it
ax2.set_xticks(x_rooms2)
ax2.set_xticklabels(locations, fontsize=12)
plt.ylabel('Prices', fontsize=15)
plt.legend(labels=loc_price.room_type.unique(), loc='best')
plt.title('New York Price-Rental Distribution by Location and Room-type', fontsize=15)
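
If you'd rather compare the numbers behind the bars directly, a pivot table puts the same group means in a grid with boroughs as rows and room types as columns. This is only a sketch: the six rows are invented, so the printed prices are not results from the dataset; only the column names come from the code above.

```python
import pandas as pd

# Hypothetical toy data: made-up prices for two boroughs and two room types.
data = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan', 'Bronx', 'Bronx', 'Bronx'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt',
                  'Entire home/apt', 'Private room', 'Private room'],
    'price': [250, 110, 230, 120, 60, 50],
})

# Mean price for every (borough, room type) pair, laid out as a grid.
mean_prices = data.pivot_table(
    index='neighbourhood_group',
    columns='room_type',
    values='price',
    aggfunc='mean',
)
print(mean_prices)
```

Reading across a row compares room types within a borough; reading down a column compares boroughs for one room type.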

E. Which are the most popular locations to rent a lodge?

[Figure: line plot, “Most-Reviewed Rentals by location”. Image by Author]

Based on Airbnb user reviews, we can deduce which rentals were the most visited or most popular (which does not mean that they’re the best, but to simplify the case, let’s assume that if they were visited more often, it may be because previous visitors left good reviews).

In the image above, we see that the most reviewed locations are in less-demanded boroughs, where the most popular rentals tend to average 550 reviews from users.

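# 5 - Most reviewed spots
review = data.sort_values('number_of_reviews', ascending=False)
# Keep the 20 most-reviewed listings, then average them per neighbourhood
top_reviewed = review.loc[:, ['neighbourhood', 'number_of_reviews']][:20]
top_reviewed = top_reviewed.groupby('neighbourhood').mean().sort_values('number_of_reviews', ascending=False).reset_index()
fig4, ax3 = plt.subplots(figsize=(12, 8))
# Plot against the neighbourhood names so the x axis is labelled by location
plt.plot(top_reviewed['neighbourhood'], top_reviewed['number_of_reviews'], marker='o', color='red', linestyle='--')
plt.ylabel('Reviews', fontsize=15)
# Assumed styling: rotate the x tick labels so long names stay readable
ax3.tick_params(axis='x', labelrotation=90)
plt.title('Most-Reviewed Rentals by location', fontsize=15)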

F. Rental Recommendation for the patient readers

As we’ve seen in the images included above, there’s clearly greater demand for Manhattan and Brooklyn Boroughs.

You might be wondering, how much would it cost me to stay at those locations? Is it possible to find opportunities with this analysis which allow me to pay the least possible for a place in these boroughs while staying in a nice rental? The answer is yes!

One of the most famous places to live in Manhattan is the Upper East Side (UES). The Upper East Side is home to some of the wealthiest individuals and families. There are luxury apartments inhabited by the chicest New Yorkers.

Using Pandas’s filtering, we can select the rentals located in the Upper East Side and pick the cheapest rentals among the most reviewed ones, in order to find the most suitable place to stay!

These are the most suitable rentals for private rooms and homes, based on the criteria mentioned above:

[Figure: Cheapest and popular entire-home rental]
[Figure: Cheapest and popular private-room rental]

As we can see, for just $69 per night we can stay in an entire-home rental in one of the most luxurious locations in the world, and for $49 we could rent a private room if our budget is limited.

If you still have any doubts, consider taking a look at the script I include below:

import numpy as np

# Keep only the listings in the Upper East Side
upper_east = data[data['neighbourhood'] == 'Upper East Side']
# Review-count threshold: the 85th percentile, i.e. the top 15% most-reviewed listings
review_threshold = np.quantile(upper_east['number_of_reviews'], 0.85)
upper_east = upper_east[upper_east['number_of_reviews'] >= review_threshold]
# Sort by price so the cheapest listing comes first within each room type
upper_east = upper_east.sort_values('price', ascending=True)
private_room = upper_east[upper_east['room_type'] == 'Private room'].reset_index()
entire_home = upper_east[upper_east['room_type'] == 'Entire home/apt'].reset_index()
shared_room = upper_east[upper_east['room_type'] == 'Shared room'].reset_index()
# Row 0 of each sorted subset is its cheapest listing
private_cheapest = private_room.loc[0, :].reset_index()
entire_cheapest = entire_home.loc[0, :].reset_index()
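
If you want to repeat the search for other neighbourhoods, the same steps can be wrapped in a small function. The sketch below is one possible way to do it, not part of the original script: the name cheapest_popular and its parameters are hypothetical, the listings are invented, and the usage lines pass quantile=0.5 only because the toy sample is tiny.

```python
import pandas as pd


def cheapest_popular(data, neighbourhood, room_type, quantile=0.85):
    """Return the cheapest listing among the most-reviewed ones, or None."""
    subset = data[data['neighbourhood'] == neighbourhood]
    if subset.empty:
        return None
    # Listings at or above this review count are treated as "popular".
    threshold = subset['number_of_reviews'].quantile(quantile)
    popular = subset[
        (subset['number_of_reviews'] >= threshold)
        & (subset['room_type'] == room_type)
    ]
    if popular.empty:
        return None
    # After sorting by price, the first row is the cheapest listing.
    return popular.sort_values('price').iloc[0]


# Hypothetical toy listings (prices and review counts are made up).
data = pd.DataFrame({
    'neighbourhood': ['Upper East Side'] * 6 + ['Harlem'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt',
                  'Entire home/apt', 'Private room', 'Entire home/apt',
                  'Entire home/apt'],
    'price': [90, 55, 120, 75, 80, 40, 30],
    'number_of_reviews': [5, 220, 200, 250, 300, 8, 400],
})

entire = cheapest_popular(data, 'Upper East Side', 'Entire home/apt', quantile=0.5)
private = cheapest_popular(data, 'Upper East Side', 'Private room', quantile=0.5)
print(entire['price'], private['price'])
```

Returning None when nothing matches avoids the error that indexing row 0 of an empty result would raise, for example when a neighbourhood has no popular shared rooms.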


There is no doubt that Airbnb and the online-rental marketplace industry are here to help us find better home or room rentals and improve the traveling and tourism experience. My aim with this article was to give you a tool to make that search easier and, why not, to help you decide where you’ll stay the next time you visit New York.

If you liked the information included in this article don’t hesitate to contact me to share your thoughts. It motivates me to keep on sharing!

Thanks for taking the time to read my article! Any question, suggestion or comment, feel free to contact me: [email protected]l.com
