Going Dutch, Part 2: Improving a Machine Learning Model Using Geographical Data

Where it all Started

In my previous post, I described the process of hunting for an apartment in Amsterdam using Data Science and Machine Learning. Using apartment rental data obtained from the internet, I was able to explore and visualize this data. As part of the visualization, I created the map below:

Ultimately I was able to use this data to build, train and test a predictive model using Random Forests. It achieved an R2 score of 0.70, which is a good result for a baseline model. The predictions versus actual values for the test set can be seen in the plot below:

The idea of creating a predictive model out of this data was to have a good reference point for judging whether a rental listing had a fair price or not. This would allow us to find some bargains or distortions. The reasoning was that if I came across an apartment with a rental price much lower than the value predicted by our model, this could indicate a good deal. In the end, this proved to be an efficient way of house hunting in Amsterdam: I was able to focus on specific areas, detect some good deals and find an apartment within my first day in the city.

After finding an apartment in AMS

Now, back to our model. Being a baseline model, it has room for improvement in terms of prediction quality. The pipeline below has become my favorite approach for tackling data problems:

1. Start out with a baseline model
2. Check the results
3. Improve the model
4. Repeat until the results are satisfactory

So here we are now, at step three. We need to improve our model. In the last post, I listed the reasons why I like Random Forests so much, one of them being the fact that you don’t need to spend a lot of time tuning hyperparameters (which is the case for Neural Networks, for example). While this is a good thing, it also limits our options for improving the model. We are pretty much left with working on and improving our data rather than trying to improve our predictive model by tweaking its parameters.

This is one of the beauties of Data Science. Sometimes it feels like an investigation job: you need to look for leads and connect the dots. It’s almost like the truth is out there.

Look! An empty apartment in Amsterdam!

So now we know that we need to work on our data and make it better. But how?

In our original dataset, we applied some feature engineering in order to create variables related to the apartment location. We created dummy variables for some categorical features, such as address and district. This way we ended up with one variable for each address and district category, each taking a value of either 0 or 1. But there is one aspect that we missed in this analysis.

Venice of the North

Amsterdam has more than one hundred kilometers of grachten (canals), about 90 islands and 1,500 bridges. Alongside the main canals are 1,550 monumental buildings. The 17th-century canal ring area, including the Prinsengracht, Keizersgracht, Herengracht and Jordaan, was listed as a UNESCO World Heritage Site in 2010, contributing to Amsterdam’s fame as the “Venice of the North”.

Addresses in Amsterdam usually contain information that tells you whether a place sits in front of a canal or not (the so-called “canal houses”). An apartment located at Leidsegracht most likely has a view of the canal, while the same cannot be said for an apartment located at Leidsestraat, for instance. There are also cases where buildings are located on a square, for example some buildings at Leidseplein.

Needless to say, canal houses have an extra appeal due to the beautiful view they provide. In Facebook groups it is possible to see canal houses being rented in a matter of hours after being listed.

Who wants to live in a Canal House?

We will extract this information from the apartments’ addresses in order to create three more variables: gracht, straat and plein, with 0 and 1 as possible values. The reason for creating separate variables instead of a single one with different possible values (e.g. 1, 2, 3) is that a single variable would be treated as continuous, tricking our model into reading that scale as an ordering of importance. We will hopefully find out if canal houses are really that sought after.
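As a rough illustration of this step, here is a minimal sketch of how the three flags could be derived from an address string. The helper name `address_flags` and the rule "any word ending in the suffix" are assumptions made for this example, not the code used in the project.

```python
# Hypothetical helper: derive gracht / straat / plein flags from an address.
# Assumption: a word ending in one of these suffixes identifies the street type.
SUFFIXES = ("gracht", "straat", "plein")


def address_flags(address: str) -> dict:
    """Return one 0/1 flag per suffix, set when any word ends with that suffix."""
    words = address.lower().split()
    return {
        suffix: int(any(word.endswith(suffix) for word in words))
        for suffix in SUFFIXES
    }


# Example addresses built from the street names mentioned above
flags_canal = address_flags("Leidsegracht 12")
flags_street = address_flags("Leidsestraat 5")
flags_square = address_flags("Leidseplein")
```

Addresses that match none of the suffixes simply get three zeros, so no category is forced onto them.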

Location, Location, Location

Looking at the top 15 features in our model’s Feature Importance ranking, we can see that besides latitude and longitude, many of the dummy variables related to address and district are valuable to the model’s predictions. So we can say that location data is definitely promising.

Top 15 Most Important Features for our baseline model.

We will expand on that. But again, the question is, how?

Restaurants, bars and cafes near Leidseplein, Amsterdam.

Going Social

Despite having only around 800K inhabitants, a small number for a Western European capital, Amsterdam is packed with bars, cafes and restaurants, and has one of the best public transportation systems in Europe. However, it is important to think about our target subject: people. Our main objective here is understanding how people behave when looking for an apartment. We need to segment our target into different groups, so that we can understand what they want in terms of housing and location. One could argue that being close to bars, cafes and restaurants is attractive for some kinds of people. Other people, couples with small kids for example, might prefer to live closer to parks and schools. Both types of people could also be interested in living close to public transportation, such as tram and bus stops.

Amsterdam Centraal Station

This gives us some hints on formulating our hypothesis. Would proximity to these types of places impact apartment rental prices?

In order to test this hypothesis, we need to feed this data into our model.

Yelp is a social platform that advertises its purpose as “to connect people with great local businesses”. It lists spots such as bars, restaurants, schools and lots of other types of POI (points of interest) throughout the world, allowing users to write reviews for places. In Q1 2018, Yelp had a monthly average of 30 million unique visitors via the Yelp app and 70 million unique visitors via mobile web. Through Yelp’s Fusion API one can easily extract POI information by passing latitude and longitude as parameters.

Yelp has some categories for places listed in its database (unfortunately, there is no category for coffeeshop). We are interested in the following:

Active Life: parks, gyms, tennis courts, basketball courts

Bar: bars and pubs

Cafe: self-explanatory

Education: kindergarten, high schools and universities

Hotels/Travel: hotels, car rental shops, touristic information points

Transportation: tram/bus stops and metro stations

For each of these categories, Yelp lists POI containing their latitude and longitude.

Our approach here will be:

Querying the Yelp Fusion API in order to get data on POI for the categories above in Amsterdam.
Calculating the distance in meters between each apartment and each POI.
Counting how many POI from each category are within a 250-meter radius of each apartment. These numbers will become variables in our dataset.

By using Yelp’s Fusion API we have been able to grab geographic data on 50 POI for each of the categories that are within our target.
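The query step could look roughly like the sketch below. It assumes the Fusion API business search endpoint at `https://api.yelp.com/v3/businesses/search`, Bearer-token authentication, and a response whose `businesses` entries carry a `coordinates` object. The category alias `bars`, the placeholder API key and the sample response body are illustrative assumptions, not values taken from the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed Fusion API business search endpoint
YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(api_key, latitude, longitude, category, limit=50):
    """Build (but do not send) a business search request for one category."""
    query = urlencode({
        "latitude": latitude,
        "longitude": longitude,
        "categories": category,
        "limit": limit,
    })
    return Request(
        f"{YELP_SEARCH_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def extract_coordinates(response_body):
    """Pull (latitude, longitude) pairs out of a search response body."""
    payload = json.loads(response_body)
    points = []
    for business in payload.get("businesses", []):
        coords = business.get("coordinates") or {}
        lat, lon = coords.get("latitude"), coords.get("longitude")
        # Skip listings that come back without usable coordinates
        if lat is not None and lon is not None:
            points.append((lat, lon))
    return points


# Made-up response body, only to show the shape the parser expects
sample_body = json.dumps({
    "businesses": [
        {"name": "Example Bar",
         "coordinates": {"latitude": 52.3641, "longitude": 4.8835}},
        {"name": "Listing Without Coordinates",
         "coordinates": {"latitude": None, "longitude": None}},
    ]
})

request = build_search_request("YOUR_API_KEY", 52.3702, 4.8952, "bars")
bar_points = extract_coordinates(sample_body)
```

Sending the request (for instance with `urllib.request.urlopen`) is left out so the sketch runs without network access or a real key.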

Before we proceed to calculating the distance between each POI and each apartment, remember: Latitude and Longitude are angle measures.

Latitude is measured as the degrees to the north or south of the Equator. Longitude is measured as the degrees to the east or west of the Prime Meridian (or Greenwich) line. The combination of these two angles can be used to pinpoint an exact location on the surface of the earth.

As shown in the image above, the shortest route between two points on the surface of the earth is a “great circle path”: in other words, a path that is part of the largest circle you could draw around the globe passing through the two points. And, since this is a circular path on a sphere using coordinates expressed in angles, all of the properties of the distance will be given by trigonometric formulas.

Haversine Formula.
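Written out, the formula is usually stated as follows, with φ for latitude and λ for longitude (both in radians), Δφ and Δλ for their differences between the two points, R for the radius of the earth and d for the resulting distance:

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c
\end{aligned}
```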

The shortest distance between two points on the globe can be calculated using the Haversine formula. In Python, it would look like this:

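from math import sin, cos, sqrt, atan2, radians

# approximate radius of earth in km
R = 6373.0

# example coordinates in decimal degrees, converted to radians
lat1 = radians(52.2296756)
lon1 = radians(21.0122287)
lat2 = radians(52.406374)
lon2 = radians(16.9251681)

# differences in longitude and latitude
dlon = lon2 - lon1
dlat = lat2 - lat1

# haversine term for the central angle between the two points
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
# central angle in radians
c = 2 * atan2(sqrt(a), sqrt(1 - a))

# great-circle distance in km (multiply by 1000 for meters)
distance = R * c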

Now that we know how to calculate the distance between two points, for each apartment we will count how many POI from each of the categories are within a 250-meter radius.
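The counting step can be sketched as below. The function names, the coordinate tuples and the sample points are made up for illustration; the earth radius is the same approximate value as in the snippet above, converted to meters.

```python
from math import radians, sin, cos, sqrt, atan2

# Same approximate earth radius as above, expressed in meters
EARTH_RADIUS_M = 6_373_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return EARTH_RADIUS_M * 2 * atan2(sqrt(a), sqrt(1 - a))


def count_within_radius(apartment, pois, radius_m=250.0):
    """Count the POI located at most radius_m meters from the apartment."""
    lat, lon = apartment
    return sum(
        1
        for poi_lat, poi_lon in pois
        if haversine_m(lat, lon, poi_lat, poi_lon) <= radius_m
    )


# Made-up coordinates: two nearby points and one roughly a kilometer away
apartment = (52.3640, 4.8830)
bars = [(52.3641, 4.8835), (52.3650, 4.8830), (52.3740, 4.8830)]
bars_nearby = count_within_radius(apartment, bars)
```

Running this once per category gives one count column per category for each apartment. With a few hundred apartments and 50 POI per category, the plain double loop is fast enough that no spatial index is needed.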

After concatenating this data into our previous dataset, let’s have a glimpse at how it looks — notice the last columns on the right side:

Putting it all Together

Let’s take a look at our dataset after adding these variables.

We’ll start out by getting some measures for descriptive statistics:

Apartments have an average of 0.7828 bars within a 250-meter radius. Good news for those who like to go for a beer without walking (or tramming) too much.
Cafes also seem to be widely spread across the city, with an average of 0.6703 POI within walking distance of each apartment.
The average quantity of transportation POI within 250 meters of apartments is close to zero.

We’ll now generate some boxplots for these variables and see how they influence normalized_price.

It looks like these three variables indeed have significant influence over normalized_price.

What about gracht, straat and plein?

As expected, prices are slightly higher for canal houses (gracht) and also for apartments located in squares (plein). For apartments located in regular streets, prices are slightly lower.

Now we’ll wrap everything up and generate a heatmap from the Pearson correlation matrix between the new variables we introduced in the model and our target variable, normalized_price.

Unfortunately, our heatmap does not provide any indication of significant correlation between our new variables and normalized_price. But again, that doesn’t mean there is no relationship at all between them. It just means there is no significant linear relationship.
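To make that last point concrete, here is a small sketch with toy numbers (not the apartment data): a hand-written Pearson coefficient scores a perfectly linear relationship at 1, while a perfectly predictable but symmetric, curved relationship scores 0.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy data: y depends on x in both cases, but only one dependence is linear
xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]
curved = [x ** 2 for x in xs]

r_linear = pearson(xs, linear)
r_curved = pearson(xs, curved)
```

A tree-based model such as a Random Forest can still pick up on relationships like the second one, which is why a flat heatmap does not rule the new variables out.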

Going Green, Part 2

Now that we enriched our dataset, it’s time to train our model with the new data and see how it performs.

How our new predictions perform. Predicted values in orange, actual values in blue.

We were able to increase our R2 score from 0.70 to 0.75, a relative improvement of around 7.14% over our baseline model.
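For reference, the arithmetic behind these numbers can be sketched as follows. The `r2_score` function here is a hand-written stand-in using the standard definition (one minus the ratio of residual to total sum of squares), and the toy values are made up; only the 0.70 and 0.75 scores come from the results above.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def relative_improvement(baseline, improved):
    """Improvement of a score over its baseline, in percent."""
    return (improved - baseline) / baseline * 100


# Toy check: always predicting the mean scores 0, perfect predictions score 1
r2_mean_guess = r2_score([1, 2, 3], [2, 2, 2])
r2_perfect = r2_score([1, 2, 3], [1, 2, 3])

# Scores reported above: baseline 0.70, new model 0.75
gain = relative_improvement(0.70, 0.75)
```

The gain is relative to the baseline score; in absolute terms the R2 score rose by 0.05.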

The plot above depicts the new predicted values we obtained in comparison with the actual ones. It is possible to see a small improvement, especially for predicting values close to the maximum and minimum prices.

Top 15 Most Important Features for the second version of our model.

In terms of feature importances, something interesting occurred. Some of the new variables that we introduced gained substantial importance, taking importance away from other variables. Notice how the dummy variables generated from the address variable lost importance. The district variables are not even part of the top 15 most important variables anymore. We probably wouldn’t see much difference in our results if we removed these variables from our model.

It is also interesting to note that the quantity of transportation POI within a 250-meter radius is not as important as the quantity of cafes within this distance. One possible guess is that transportation is more homogeneously dispersed through the city (most apartments would be close to tram, bus or metro stops), while cafes might be concentrated in more central areas.

We explored much of the geographic data available. Perhaps a way to make our model even better would be getting information such as building construction date, apartment conditions and other characteristics. We could even play a bit with the geographic data and base our variables on a radius greater than 250 meters for POI. It is also possible to explore other Yelp categories, such as shops and grocery stores among many others, and see how they affect rental prices.

Key Takeaways

Improving a model’s prediction capacity is not a trivial task and may require a bit of creativity to find ways of making our data richer and more comprehensive
Sometimes a small model improvement requires a decent amount of work
In the predictive analytics pipeline, obtaining, understanding, cleaning and enriching your data is a critical step which is sometimes overlooked; it is also the most time-consuming task, and in this case it was also the most fun part

I hope to share the code for this project shortly. In the meantime, you can:

Follow me here on Medium to get up to date with my thoughts and insights on data
Take a look at the previous post from this series, Going Dutch: How I Used Data Science and Machine Learning to Find an Apartment in Amsterdam, Part 1
Connect with me on LinkedIn

