
A demonstration of carrying data analysis (New York City Airbnb Open Data)

In this article, I will perform data analysis by using Python and record down every finding and thought during the analysis. Practicing with real data can not only help me have more experience in analyzing different data but also help me discover more techniques and skills when compared with others’ works.


Photo by Oliver Niblett on Unsplash

The dataset used in this article is “New York City Airbnb Open Data” from Kaggle. Here is the download link: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

After importing the modules and the CSV file, my first step is normally to explore the dataset: studying its overall format, the types of all variables, and checking and cleaning the data.

data = pd.read_csv('AB_NYC_2019.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
id 48895 non-null int64
name 48879 non-null object
host_id 48895 non-null int64
host_name 48874 non-null object
neighbourhood_group 48895 non-null object
neighbourhood 48895 non-null object
latitude 48895 non-null float64
longitude 48895 non-null float64
room_type 48895 non-null object
price 48895 non-null int64
minimum_nights 48895 non-null int64
number_of_reviews 48895 non-null int64
last_review 38843 non-null object
reviews_per_month 38843 non-null float64
calculated_host_listings_count 48895 non-null int64
availability_365 48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB

So there are 16 columns in total, and some of them have missing values.

data.isnull().sum()

id 0
name 16
host_id 0
host_name 21
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 10052
reviews_per_month 10052
calculated_host_listings_count 0
availability_365 0
dtype: int64

Both last_review and reviews_per_month have over 10k missing values, and their counts are identical. Based on the column names and the matching counts, I suspect these two columns are related, which would explain why they are missing together.

(data['last_review'].isnull()==data['reviews_per_month'].isnull()).all()

True

So they should be handled together.

At this moment I am not sure how to handle the missing values in the last_review column, since it normally holds a date string indicating the date of the most recent review. So for now I will just leave it as it is.

However, for the missing values in reviews_per_month, I will fill them with zero, since a listing with no recorded reviews receives zero reviews per month.

data.loc[data['reviews_per_month'].isnull(),'reviews_per_month']=0

The remaining columns with missing values are name and host_name. However, I don’t think either carries any predictive power, so instead of imputing their missing values I will remove both columns from the dataset.
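Dropping the two free-text columns can be done in one line. A minimal sketch, using a tiny hypothetical frame standing in for the Airbnb data:

```python
import pandas as pd

# Toy frame standing in for the Airbnb data (hypothetical values)
data = pd.DataFrame({
    'id': [1, 2],
    'name': ['Cozy loft', None],
    'host_name': ['Ann', None],
    'price': [100, 150],
})

# Drop the two free-text columns rather than imputing their missing values
data = data.drop(columns=['name', 'host_name'])
print(list(data.columns))  # ['id', 'price']
```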

There are two columns, id and host_id, identifying the listing and the host. After checking, id is unique in the dataset but host_id is not. Therefore id can be removed, but host_id cannot. Also, it could be meaningful to see whether the number of listings per host relates to price, so host_id will be kept.
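The uniqueness check described above can be sketched as follows, again with hypothetical values: each listing id appears once, while one host owns several listings.

```python
import pandas as pd

# Toy frame: each listing id appears once, but a host can own several listings
data = pd.DataFrame({
    'id': [101, 102, 103],
    'host_id': [7, 7, 9],
    'price': [80, 120, 60],
})

print(data['id'].is_unique)       # True  -> a plain row label, safe to drop
print(data['host_id'].is_unique)  # False -> keep, it carries information

data = data.drop(columns=['id'])
```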

The next step is to check the numeric variables by using

data.describe()

Using describe() helps me understand the range and distribution of possible values for each variable. It also makes it possible to spot unrealistic records in the dataset.

For example, this dataset contains some abnormalities. The first is that there are records with “price” = 0, which is unrealistic, as it would mean staying for free. These records should be removed.

The second is that the maximum of “minimum_nights” is 1250, which I believe is unrealistic (is this still an Airbnb, or just another rental contract?). I set a threshold of 365 days: any record with “minimum_nights” larger than 365 will also be removed.
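The two filters above can be combined into one boolean selection. A minimal sketch with hypothetical rows (the actual thresholds match the text: price must be positive, minimum_nights at most 365):

```python
import pandas as pd

# Hypothetical rows illustrating the two abnormalities
data = pd.DataFrame({
    'price': [0, 150, 80, 200],
    'minimum_nights': [2, 1250, 3, 30],
})

# Keep only realistic rows: positive price, at most one year minimum stay
data = data[(data['price'] > 0) & (data['minimum_nights'] <= 365)]
print(len(data))  # 2
```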

Dataset after cleaning:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48870 entries, 0 to 48869
Data columns (total 12 columns):
host_id 48870 non-null int64
neighbourhood_group 48870 non-null object
neighbourhood 48870 non-null object
latitude 48870 non-null float64
longitude 48870 non-null float64
room_type 48870 non-null object
price 48870 non-null int64
minimum_nights 48870 non-null int64
number_of_reviews 48870 non-null int64
reviews_per_month 48870 non-null float64
calculated_host_listings_count 48870 non-null int64
availability_365 48870 non-null int64
dtypes: float64(3), int64(6), object(3)
memory usage: 4.5+ MB

48870 rows and 12 columns

Now I can have a look at the dataset, one column at a time.

The first is neighbourhood_group

set(data['neighbourhood_group'])

{'Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'}

data.groupby('neighbourhood_group')['price'].describe()

[Table: price summary statistics by neighbourhood_group]

This table shows the distribution of price within each neighbourhood_group and across all of them. The listings are not evenly spread: most are in either Manhattan or Brooklyn. These two boroughs also have the highest prices of the five. A possible reason is that demand in these areas is high, encouraging more hosts to rent out their rooms or apartments.

To describe prices more accurately, a further breakdown by room type is carried out.

data.pivot_table(index='neighbourhood_group',columns='room_type',values='price',aggfunc='mean')

[Table: mean price by neighbourhood_group and room_type]

On average, the price of a private room in Brooklyn is lower than that of even a shared room in Manhattan.

Next is the location of the listings. As already seen, most of them are in Manhattan and Brooklyn; here I use a heat map to present them.

import folium
from folium import plugins

ny_map = folium.Map(location=[40.7, -74], zoom_start=10)
data_loc = data[['latitude', 'longitude']].values.tolist()
hm = plugins.HeatMap(data_loc)
hm.add_to(ny_map)
ny_map

[Heat map of listing locations in New York City]

There is a large red area in Manhattan and Brooklyn.

It’s time to break things down further into each small neighbourhood.

for gp in set(data['neighbourhood_group']):
    print(data.loc[data['neighbourhood_group'] == gp]
              .groupby(['neighbourhood_group', 'neighbourhood'])
              .agg({'price': ['count', 'mean']})
              .sort_values(by=('price', 'mean'), ascending=False)
              .head())
    print()

[Tables: top 5 most expensive neighbourhoods in each group]

The above tables show the top 5 most expensive neighbourhoods in each group. Except for Staten Island, where some of the top-5 neighbourhoods contain only a single record, each neighbourhood has enough records to represent its average price reasonably well. Riverdale in the Bronx is as expensive as Manhattan.

The next variable to be studied is “minimum_nights”. The first step is to break it down by “neighbourhood_group”:

data.groupby('neighbourhood_group')['minimum_nights'].describe()

[Table: minimum_nights summary statistics by neighbourhood_group]

As all records with ‘minimum_nights’ larger than 365 were already removed, the maximum is only 365. As shown, over 25% of listings require only 1 night, and over half require only 2 or 3 nights, which fits the original principle of the Airbnb service: short-term accommodation.

Now we can move to the next variable, reviews. Instead of number_of_reviews, I will use reviews_per_month, since this removes the effect of how long a listing has been on Airbnb.

for gp in set(data['neighbourhood_group']):
    print(gp)
    print(data.loc[data['neighbourhood_group'] == gp,
                   ['price', 'reviews_per_month']].corr())
    print()

[Correlation of price and reviews_per_month in each group]

Apparently there is no relationship between price and number of reviews per month.

[Distribution of reviews_per_month]

Instead, almost all listings get only a few reviews per month.
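This skew can also be quantified rather than only read off the histogram, for example as the share of listings receiving at most two reviews per month. A sketch with hypothetical values (the real numbers would come from the cleaned dataset):

```python
import pandas as pd

# Hypothetical reviews_per_month values illustrating the skewed distribution
rpm = pd.Series([0.0, 0.1, 0.3, 0.5, 1.2, 2.0, 9.8])

# Share of listings with at most two reviews per month
share = (rpm <= 2).mean()
print(round(share, 2))  # 0.86
```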

