What is Exploratory Spatial Data Analysis (ESDA)?

Hint: Not the usual EDA. A Guide on how to get insights from your data using Spatial Exploratory Data Analysis (Spatial Autocorrelation)

Abdishakur

Nov 5 · 7 min read

Photo by fabio on Unsplash

What do you do when you want to explore patterns in data based on location? How do you know that your location data is not random? Is correlation enough, or are there other statistical methods for this kind of exploratory data analysis?

In this tutorial, I show you how to perform exploratory data analysis on your location data in a few simple steps using Python. The code for this tutorial is also available on GitHub.

Exploratory Spatial Data Analysis

In Data Science, we tend to explore and investigate data before doing any modeling or processing task. This helps you identify patterns, summarize the main characteristics of the data, or test a hypothesis. The conventional Exploratory data analysis does not investigate the location component of the dataset explicitly but instead deals with the relationship between variables and how they affect each other. Correlation statistical methods are often used to explore the relationship between variables.

In contrast, Exploratory Spatial Data Analysis (ESDA) correlates a specific variable to a location, taking into account the values of the same variable in the neighborhood. The methods used for this purpose are called Spatial Autocorrelation.

Spatial autocorrelation describes the presence (or absence) of spatial variation in a given variable. Like conventional correlation, spatial autocorrelation can be positive or negative. Positive spatial autocorrelation means that areas close to each other have similar values (high-high or low-low). Negative spatial autocorrelation, on the other hand, means that neighboring areas have dissimilar values (low values next to high values).

Spatial Autocorrelation: Source

There are two main kinds of Exploratory Spatial Data Analysis (ESDA) methods: global and local spatial autocorrelation. Global spatial autocorrelation focuses on the overall trend and tells us the degree of clustering in the dataset. In contrast, local spatial autocorrelation detects variability and divergence within the dataset, which helps us identify hot spots and cold spots in the data.

Getting the data

We use the Airbnb dataset (point dataset) and the Lower Layer Super Output Areas (LSOA) neighborhoods (polygon dataset) in London for this tutorial. We do a spatial join to connect each Airbnb listing point to its neighborhood area. If you would like to understand and use the powerful spatial join tool in your workflow, I have a tutorial here:

How to easily join data by location in Python — Spatial join

towardsdatascience.com

The dataset we use contains the spatially joined Airbnb properties in London, with the average price of properties in each local area (neighbourhood).

For this tutorial, we use Pandas, Geopandas, and Python Spatial Analysis Library (Pysal) libraries. So let us import these libraries.

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

# On PySAL 2.x these two modules are exposed as pysal.explore.esda and pysal.lib.weights
import pysal
from pysal import esda, weights
from esda.moran import Moran, Moran_Local

import splot
from splot.esda import moran_scatterplot, plot_moran, lisa_cluster

We can read the data in Geopandas.

avrg_price_airbnb = gpd.read_file("london-airbnb-avrgprice.shp")
avrg_price_airbnb.head()

Here are the first 5 rows of the Average prices of Airbnb properties in London.

Since we have a geometry column (the neighborhood polygons), we can map the data. Here is a choropleth map of the average prices per neighborhood.

choropleth map — average prices of Airbnb properties in London.

With this choropleth map we can see binned price ranges, but it gives us no statistic with which to determine whether there is spatial autocorrelation (positive or negative), or where the hot spots and cold spots are. That is what we do next.

Spatial Weights and Spatial Lag

Before we perform any spatial autocorrelation, we first need to determine the spatial weights and spatial lag.

Spatial weights are how we determine an area’s neighborhood. There are several methods for determining spatial weights, and an in-depth explanation of each is beyond the scope of this article. One of the most commonly used is the Queen contiguity matrix, which we use here. Under Queen contiguity, two areas are neighbors if they share an edge or a corner; under rook contiguity, they must share an edge. Here is a diagram explaining how the Queen contiguity matrix works (the rook contiguity matrix is included as well).

Contiguity Matrix Source
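
To make the difference between the two rules concrete, here is a minimal sketch in plain Python. It assumes a regular grid of square cells numbered row by row, which is much simpler than the irregular LSOA polygons, and the helper name grid_neighbors is made up for this illustration; it is not part of Pysal.

```python
def grid_neighbors(n_rows, n_cols, rule="queen"):
    """Return {cell_id: [neighbor ids]} for a regular grid, cells numbered row by row."""
    if rule not in ("queen", "rook"):
        raise ValueError("rule must be 'queen' or 'rook'")
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            found = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue  # a cell is not its own neighbor
                    if rule == "rook" and dr != 0 and dc != 0:
                        continue  # rook ignores cells that touch only at a corner
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        found.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = found
    return neighbors


queen = grid_neighbors(3, 3, "queen")
rook = grid_neighbors(3, 3, "rook")
print(queen[4])  # centre cell of a 3x3 grid: [0, 1, 2, 3, 5, 6, 7, 8]
print(rook[4])   # centre cell of a 3x3 grid: [1, 3, 5, 7]
```

On the 3x3 grid the centre cell has eight Queen neighbors but only four rook neighbors, because the four corner cells touch it at a single point.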

To calculate Queen contiguity spatial weights, we use Pysal.

w = weights.Queen.from_dataframe(avrg_price_airbnb, idVariable="LSOA_CODE")
w.transform = "R"  # row-standardize the weights

The spatial lag, on the other hand, is the product of the spatial weights matrix and a given variable (in our case, the price). Because we row-standardized the weights, the spatial lag of an area is the average price of its neighbors.

avrg_price_airbnb["w_price"] = weights.lag_spatial(w, avrg_price_airbnb["price"])

We have now created a new column in our table that holds the spatially lagged price of each neighborhood, that is, the average price of its neighbors.
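
As a rough illustration of what row standardization and the spatial lag do, here is a small sketch in plain Python. The four areas, their prices, and the helper names row_standardize and spatial_lag are hypothetical and only serve this example; in the tutorial itself Pysal does this work for us.

```python
# Hypothetical toy data: four areas in a row (0-1-2-3), each touching its adjacent areas.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
price = [100.0, 80.0, 60.0, 40.0]


def row_standardize(neighbors):
    """Give each neighbor of area i the weight 1/k_i, so every row sums to 1."""
    return {i: {j: 1.0 / len(js) for j in js} for i, js in neighbors.items()}


def spatial_lag(weights, values):
    """Weighted sum of the neighbors' values for each area."""
    return [sum(w_ij * values[j] for j, w_ij in weights[i].items())
            for i in sorted(weights)]


w_toy = row_standardize(neighbors)
lag = spatial_lag(w_toy, price)
print(lag)  # [80.0, 80.0, 60.0, 60.0]
```

Area 1, for example, has neighbors priced at 100 and 60, so its spatial lag is their average, 80.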

Global Spatial Autocorrelation

Global spatial autocorrelation describes the overall pattern in the dataset. Here we can calculate whether there is a trend and summarize the variable of interest. The Moran’s I statistic is typically used to measure global spatial autocorrelation, so let us calculate that.

y = avrg_price_airbnb["price"]
moran = Moran(y, w)
moran.I

For this dataset we get 0.54. What does this number mean? It summarises the dataset in a single statistic, just like the mean does for non-spatial data. Moran’s I values typically range from -1 to 1. In our case, the value indicates positive spatial autocorrelation in this dataset. Remember that Moran’s I only measures global autocorrelation. It does not tell us where this positive spatial autocorrelation occurs (we do that next).
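
To see where such a number comes from, here is a minimal sketch of the usual Moran’s I formula, I = (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2), where z_i is the deviation of area i from the mean and S0 is the sum of all weights. The function name morans_i and the toy data are made up for this illustration, the weights are row-standardized, and the sketch assumes every area has at least one neighbor. It is not how Pysal implements the statistic internally.

```python
def morans_i(values, neighbors):
    """Global Moran's I with row-standardized weights built from a neighbor dict."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]  # deviations from the mean
    num = 0.0  # sum of w_ij * z_i * z_j over all neighbor pairs
    s0 = 0.0   # sum of all weights
    for i, js in neighbors.items():
        w_ij = 1.0 / len(js)  # row-standardized: each row of weights sums to 1
        for j in js:
            num += w_ij * z[i] * z[j]
            s0 += w_ij
    den = sum(z_i * z_i for z_i in z)
    return (n / s0) * (num / den)


# Hypothetical toy data: four areas in a row (0-1-2-3)
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
smooth = morans_i([100.0, 80.0, 60.0, 40.0], chain)        # similar values sit side by side
alternating = morans_i([100.0, 40.0, 100.0, 40.0], chain)  # high and low values alternate
print(smooth, alternating)
```

The smoothly decreasing prices give a positive value (about 0.4), while the alternating prices give -1, the pattern we would call negative spatial autocorrelation.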

We use the Moran’s I scatter plot to visualize the global spatial autocorrelation. It is like any other scatter plot, with a linear fit that shows the relationship between the two variables, here the price and its spatial lag.

fig, ax = moran_scatterplot(moran, aspect_equal=True)
plt.show()

Moran’s I Scatter Plot

Both Moran’s I and Moran’s I Scatter plot show positively correlated observations by location in the dataset. Let us see where we have spatial variations in the dataset.

Local Spatial Autocorrelation

So far, we have only determined that there is a positive spatial autocorrelation between the price of properties in neighborhoods and their locations. But we have not detected where the clusters are. Local Indicators of Spatial Association (LISA) are used to do that. LISA classifies areas into four groups: high values near high values (HH), low values near low values (LL), low values with high values in their neighborhood (LH), and high values with low values in their neighborhood (HL).
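
The four groups can be sketched in plain Python as follows. This is only an illustration of the quadrant logic under simple assumptions: the six areas and their prices are hypothetical, the helper name lisa_quadrants is made up, and the sketch leaves out the significance test that decides which areas are reported as clusters.

```python
def lisa_quadrants(values, neighbors):
    """Label each area HH, LL, LH or HL from its own deviation and its neighbors' mean deviation."""
    mean = sum(values) / len(values)
    z = [v - mean for v in values]  # deviation from the overall mean
    labels = []
    for i in range(len(values)):
        # Row-standardized spatial lag: the average deviation of the neighbors
        lag_i = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        own = "H" if z[i] > 0 else "L"       # values exactly at the mean count as low here
        around = "H" if lag_i > 0 else "L"
        labels.append(own + around)
    return labels


# Hypothetical toy data: six areas in a row (0-1-2-3-4-5)
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
prices = [90.0, 95.0, 20.0, 85.0, 15.0, 10.0]
labels = lisa_quadrants(prices, chain)
print(labels)  # ['HH', 'HH', 'LH', 'HL', 'LL', 'LL']
```

The first letter describes the area itself and the second its neighborhood, so area 2 (a cheap area between two expensive ones) is LH and area 3 (an expensive area between two cheap ones) is HL.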

We have already calculated the weights (w) and chosen the price as our variable of interest (y). To calculate the Local Moran statistic, we use Pysal’s functionality.

# calculate Moran Local 
m_local = Moran_Local(y, w)

And plot Moran’s Local Scatter Plot.

# Plot
fig, ax = moran_scatterplot(m_local, p=0.05)
ax.set_xlabel("Price")
ax.set_ylabel("Spatial Lag of Price")
plt.text(1.95, 0.5, "HH", fontsize=25)
plt.text(1.95, -1.5, "HL", fontsize=25)
plt.text(-2, 1, "LH", fontsize=25)
plt.text(-1, -1, "LL", fontsize=25)
plt.show()

The scatter plot divides the areas into the four groups, as we mentioned.

Moran Local Scatter Plot — LISA

Now, this is cool, and we can see all values classified into four groups, but the exciting part is to see where these values cluster together on a map. Again, there is a function in Pysal (splot) to plot a map of the LISA results.

LISA Cluster Map: Airbnb average price per neighborhood.

The map above shows the variation in the average price of Airbnb properties. The red areas are clustered neighborhoods with high prices that are surrounded by high prices as well (mostly the center of the city). The blue areas are neighborhoods with low prices that are also surrounded by low prices (mostly the peripheries). Equally interesting are the concentrations of low-high and high-low areas.

Compared to the choropleth map we started this tutorial with, the LISA cluster map is much less cluttered and provides a clearer picture of the dataset. Exploratory Spatial Data Analysis (ESDA) techniques are powerful tools that help you identify spatial autocorrelation and local clusters, and you can apply them to any given variable.

Conclusion

In this tutorial, we have explored how to perform Exploratory Spatial Data Analysis (ESDA) on spatial data. The code for this tutorial is available in this GitHub repository, along with the notebooks and the data.

shakasom/esda

github.com

You can also directly run a Google Colab Notebook from here:

Google Colaboratory

colab.research.google.com