
Querying OSM places data in python



For a quick post here, Simon Willison has a recent post on using duckdb to query OSM places data made available by the Overture Maps Foundation. I have written in the past about scraping places data (gas stations, stores, etc., often used in spatial crime analysis) from Google (or other sources), so this is a potentially free alternative.

So I was interested in checking out the data, and it is quite easy to download given that you can query the parquet files directly online. Simon said in his post it was taking a while though, and this example downloading data for Raleigh took around 25 minutes. So no good for a live API, but fine for a batch job.

This Python is simple enough to just embed in a blog post:

import duckdb
import pandas as pd
from datetime import datetime
import requests

# setting up in memory duckdb + extensions
db = duckdb.connect()
db.execute("INSTALL spatial")
db.execute("INSTALL httpfs")
db.execute("""LOAD spatial;
LOAD httpfs;
SET s3_region='us-west-2';""")

# Raleigh bounding box from ESRI API
ral_esri = "https://maps.wakegov.com/arcgis/rest/services/Jurisdictions/Jurisdictions/MapServer/1/query?where=JURISDICTION+IN+%28%27RALEIGH%27%29&returnExtentOnly=true&outSR=4326&f=json"
bbox = requests.get(ral_esri).json()['extent']

# check out https://overturemaps.org/download/ for new releases
places_url = "s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*"
query = f"""
SELECT
  *
FROM read_parquet('{places_url}')
WHERE
  bbox.minx > {bbox['xmin']}
  AND bbox.maxx < {bbox['xmax']} 
  AND bbox.miny > {bbox['ymin']} 
  AND bbox.maxy < {bbox['ymax']}
"""

# Took me around 25 minutes
print(datetime.now())
res = pd.read_sql(query,db)
print(datetime.now())
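
Since the pull takes that long, it probably makes sense to cache the result locally so the 25 minute query only has to run once. A minimal sketch (the filename is just a placeholder, and pickle sidesteps any issues with the nested struct columns):

# cache the result locally so the long S3 query only has to run once
# (pickle handles the nested dict/list columns without any fuss)
res.to_pickle("raleigh_places.pkl")

# later sessions can reload the cached copy instead of re-querying
res = pd.read_pickle("raleigh_places.pkl")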

And this query currently returns around 29k places in Raleigh. Places can have multiple categories, so here I just slice out the main category and check it out:

# the categories column comes back as a dict, grab out the main category
def extract_main(x):
    return x['main']

res['main_cat'] = res['categories'].apply(extract_main)

res['main_cat'].value_counts().head(30)

And this returns

>>> res['main_cat'].value_counts().head(30)
beauty_salon                         1291
real_estate_agent                     657
landmark_and_historical_building      626
church_cathedral                      567
community_services_non_profits        538
professional_services                 502
real_estate                           452
hospital                              405
automotive_repair                     396
dentist                               350
lawyer                                316
park                                  308
insurance_agency                      298
public_and_government_association     288
spas                                  265
financial_service                     261
gym                                   260
counseling_and_mental_health          256
religious_organization                240
car_dealer                            214
college_university                    185
gas_station                           179
hotel                                 170
contractor                            169
pizza_restaurant                      167
barber                                161
shopping                              160
grocery_store                         160
fast_food_restaurant                  160
school                                158
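
The categories field can carry alternate categories alongside the main one as well. A quick sketch for pulling those out too, assuming the struct has an 'alternate' key holding a list (worth double checking against the release schema):

# categories may also carry an 'alternate' list of extra categories
# (key name assumed from the release schema, verify before relying on it)
def extract_alt(x):
    alt = x.get('alternate')
    return alt if alt is not None else []

res['alt_cats'] = res['categories'].apply(extract_alt)

# how many places list more than one category
print((res['alt_cats'].apply(len) > 0).sum())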

I can’t say anything about the coverage of this data. Looking nearby my house, it appears pretty well filled in. There are additional pieces of info in the places data as well, such as a confidence score.
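
A quick sketch of peeking at that score, assuming it comes through as a top-level numeric column named confidence:

# distribution of the confidence score
# (assumes a top-level numeric column named 'confidence')
print(res['confidence'].describe())

# flag lower confidence places for extra scrutiny, the cutoff is arbitrary
low_conf = res[res['confidence'] < 0.5]
print(low_conf.shape[0])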

So definitely a ton of potential to use this as a nice source for reproducible crime analysis (it probably has the major types of places most people are interested in looking at). But I would do some local checks on your data before wholesale recommending the OpenStreetMap data over an official business directory (if available, although that may not include things like ATMs) or Google Places API data (but this is free!).
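
For those local checks, one simple approach is to pull out a category you know well and eyeball it against a source you trust (an official directory, or just places you know are there). A sketch using the main_cat column built above, the output filename is just an example:

# grab a category that is easy to spot check locally
gas = res[res['main_cat'] == 'gas_station'].copy()
print(gas.shape[0])  # was 179 in this Raleigh pull

# dump to csv to compare against a local directory or a map
# (nested columns just get stringified, which is fine for eyeballing)
gas.to_csv("raleigh_gas_stations.csv", index=False)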

