
A demonstration of carrying data analysis (New York City Airbnb Open Data)

In this article, I will perform data analysis by using Python and record down every finding and thought during the analysis. Practicing with real data can not only help me have more experience in analyzing different data but also help me discover more techniques and skills when compared with others’ works.


Photo by Oliver Niblett on Unsplash

The dataset used in this article is “New York City Airbnb Open Data” from Kaggle. Here is the download link: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data

After importing the modules and the CSV file, my first step is normally to explore the dataset: studying its overall format, the types of all variables, and checking and cleaning the data.

data = pd.read_csv('AB_NYC_2019.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
id 48895 non-null int64
name 48879 non-null object
host_id 48895 non-null int64
host_name 48874 non-null object
neighbourhood_group 48895 non-null object
neighbourhood 48895 non-null object
latitude 48895 non-null float64
longitude 48895 non-null float64
room_type 48895 non-null object
price 48895 non-null int64
minimum_nights 48895 non-null int64
number_of_reviews 48895 non-null int64
last_review 38843 non-null object
reviews_per_month 38843 non-null float64
calculated_host_listings_count 48895 non-null int64
availability_365 48895 non-null int64
dtypes: float64(3), int64(7), object(6)
memory usage: 6.0+ MB

So there are 16 columns in total, and some of them have missing values.

data.isnull().sum()

id 0
name 16
host_id 0
host_name 21
neighbourhood_group 0
neighbourhood 0
latitude 0
longitude 0
room_type 0
price 0
minimum_nights 0
number_of_reviews 0
last_review 10052
reviews_per_month 10052
calculated_host_listings_count 0
availability_365 0
dtype: int64

Both last_review and reviews_per_month have over 10k missing values, and their counts are identical. Based on the column names and the matching counts, I suspect these two columns are related, which would explain why they are missing together.

(data['last_review'].isnull()==data['reviews_per_month'].isnull()).all()

True

So they should be handled together.

At this moment I am not sure how to handle the missing values in the last_review column, since it normally holds a date string indicating the date of the most recent review. So for now I will just leave it as it is.

However, for the missing values in reviews_per_month, I will fill them with zero, since a listing with no recorded reviews receives zero reviews per month.

data.loc[data['reviews_per_month'].isnull(),'reviews_per_month']=0

The remaining columns with missing values are name and host_name. However, I don’t think either carries any predictive power, so instead of imputing their missing values I will remove both columns from the dataset.
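Dropping the two free-text columns can be done in one line. A minimal sketch, using a tiny hypothetical frame standing in for the Airbnb data:

```python
import pandas as pd

# Toy frame standing in for the Airbnb data (hypothetical values)
data = pd.DataFrame({
    'id': [1, 2],
    'name': ['Cozy loft', None],
    'host_name': ['Ann', None],
    'price': [100, 150],
})

# Drop the two free-text columns rather than imputing their missing values
data = data.drop(columns=['name', 'host_name'])
print(list(data.columns))  # ['id', 'price']
```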

There are two columns, id and host_id, identifying the listing and the host. After checking, id is unique in the dataset but host_id is not. Therefore id can be removed, but host_id cannot. Also, it could be meaningful to see whether the number of listings per host relates to price, so host_id will be kept.
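The uniqueness check described above can be sketched as follows, again with hypothetical values: each listing id appears once, while one host owns several listings.

```python
import pandas as pd

# Toy frame: each listing id appears once, but a host can own several listings
data = pd.DataFrame({
    'id': [101, 102, 103],
    'host_id': [7, 7, 9],
    'price': [80, 120, 60],
})

print(data['id'].is_unique)       # True  -> a plain row label, safe to drop
print(data['host_id'].is_unique)  # False -> keep, it carries information

data = data.drop(columns=['id'])
```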

The next step is to check the numeric variables by using

data.describe()

Using describe() helps me understand the range and distribution of possible values for each variable. It also makes it possible to spot unrealistic records in the dataset.

For example, this dataset contains some abnormalities. The first is that there are records with “price” = 0, which is unrealistic, as it would mean staying for free. These records should be removed.

The second is that the maximum of “minimum_nights” is 1250, which I believe is unrealistic (is this still an Airbnb, or just another rental contract?). I set a threshold of 365 days: any record with “minimum_nights” larger than 365 will also be removed.
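The two filters above can be combined into one boolean selection. A minimal sketch with hypothetical rows (the actual thresholds match the text: price must be positive, minimum_nights at most 365):

```python
import pandas as pd

# Hypothetical rows illustrating the two abnormalities
data = pd.DataFrame({
    'price': [0, 150, 80, 200],
    'minimum_nights': [2, 1250, 3, 30],
})

# Keep only realistic rows: positive price, at most one year minimum stay
data = data[(data['price'] > 0) & (data['minimum_nights'] <= 365)]
print(len(data))  # 2
```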

Dataset after cleaning:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48870 entries, 0 to 48869
Data columns (total 12 columns):
host_id 48870 non-null int64
neighbourhood_group 48870 non-null object
neighbourhood 48870 non-null object
latitude 48870 non-null float64
longitude 48870 non-null float64
room_type 48870 non-null object
price 48870 non-null int64
minimum_nights 48870 non-null int64
number_of_reviews 48870 non-null int64
reviews_per_month 48870 non-null float64
calculated_host_listings_count 48870 non-null int64
availability_365 48870 non-null int64
dtypes: float64(3), int64(6), object(3)
memory usage: 4.5+ MB

48870 rows and 12 columns

Now I can have a look at the dataset, one column at a time.

The first is neighbourhood_group

set(data['neighbourhood_group'])

{'Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'}

data.groupby('neighbourhood_group')['price'].describe()

[Table: price summary statistics by neighbourhood_group]

This table shows the distribution of price within each neighbourhood_group and across all of them. The listings are not evenly spread: most are in either Manhattan or Brooklyn. These two boroughs also have the highest prices of the five. A possible reason is that demand in these areas is high, encouraging more hosts to rent out their rooms or apartments.

To describe prices more accurately, a further breakdown by room type is carried out.

data.pivot_table(index='neighbourhood_group',columns='room_type',values='price',aggfunc='mean')

[Table: mean price by neighbourhood_group and room_type]

On average, the price of a private room in Brooklyn is lower than that of even a shared room in Manhattan.

Next is the location of the listings. As already seen, most of them are in Manhattan and Brooklyn; here I use a heat map to present them.

import folium
from folium import plugins

ny_map = folium.Map(location=[40.7, -74], zoom_start=10)
data_loc = data[['latitude', 'longitude']].values.tolist()
hm = plugins.HeatMap(data_loc)
hm.add_to(ny_map)
ny_map

[Heat map of listing locations in New York City]

There is a large red area in Manhattan and Brooklyn.

It’s time to break things down further into each small neighbourhood.

for gp in set(data['neighbourhood_group']):
    print(data.loc[data['neighbourhood_group'] == gp]
              .groupby(['neighbourhood_group', 'neighbourhood'])
              .agg({'price': ['count', 'mean']})
              .sort_values(by=('price', 'mean'), ascending=False)
              .head())
    print()

[Tables: top 5 most expensive neighbourhoods in each group]

The above tables show the top 5 most expensive neighbourhoods in each group. Except for Staten Island, where some of the top-5 neighbourhoods contain only a single record, each neighbourhood has enough records to represent its average price reasonably well. Riverdale in the Bronx is as expensive as Manhattan.

The next variable to be studied is “minimum_nights”. The first step is to break it down by “neighbourhood_group”:

data.groupby('neighbourhood_group')['minimum_nights'].describe()

[Table: minimum_nights summary statistics by neighbourhood_group]

As all records with ‘minimum_nights’ larger than 365 were already removed, the maximum is only 365. As shown, over 25% of listings require only 1 night, and over half require only 2 or 3 nights, which fits the original principle of the Airbnb service: short-term accommodation.

Now we can move to the next variable, reviews. Instead of number_of_reviews, I will use reviews_per_month, since this removes the effect of how long a listing has been on Airbnb.

for gp in set(data['neighbourhood_group']):
    print(gp)
    print(data.loc[data['neighbourhood_group'] == gp,
                   ['price', 'reviews_per_month']].corr())
    print()

[Correlation of price and reviews_per_month in each group]

Apparently there is no relationship between price and number of reviews per month.

[Distribution of reviews_per_month]

Instead, almost all listings get only a few reviews per month.
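This skew can also be quantified rather than only read off the histogram, for example as the share of listings receiving at most two reviews per month. A sketch with hypothetical values (the real numbers would come from the cleaned dataset):

```python
import pandas as pd

# Hypothetical reviews_per_month values illustrating the skewed distribution
rpm = pd.Series([0.0, 0.1, 0.3, 0.5, 1.2, 2.0, 9.8])

# Share of listings with at most two reviews per month
share = (rpm <= 2).mean()
print(round(share, 2))  # 0.86
```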

