Querying OSM places data in python
source link: https://andrewpwheeler.com/2023/08/14/querying-osm-places-data-in-python/
For updates on other blogs, we have:
- CRIME De-Coder, Using data to establish reasonableness in premises liability cases, I go over ways to make premises liability claims more empirically rigorous at the reasonableness stage.
- ASEBP, Cost-benefit analysis of Gun Shot Detection Tech, I estimate that GSD will save 1 life per 100 gun shot victims, and other crime reduction benefits of GSD are pretty weak sauce.
For a quick post here: Simon Willison has a recent post on using duckdb to query OSM places data made available by the Overture Maps Foundation. I have written in the past about scraping places data (gas stations, stores, etc., often used in spatial crime analysis) from Google (and other sources), so this is a potentially free alternative.
So I was interested to check out the data. It is quite easy to download, given duckdb's ability to query parquet files online. Simon said in his post it was taking a while though, and this example downloading data for Raleigh took around 25 minutes. So no good for a live API, but fine for a batch job.
The Python is simple enough to just embed in a blog post:
import duckdb
import pandas as pd
from datetime import datetime
import requests
# setting up in memory duckdb + extensions
db = duckdb.connect()
db.execute("INSTALL spatial")
db.execute("INSTALL httpfs")
db.execute("""LOAD spatial;
LOAD httpfs;
SET s3_region='us-west-2';""")
# Raleigh bound box from ESRI API
ral_esri = "https://maps.wakegov.com/arcgis/rest/services/Jurisdictions/Jurisdictions/MapServer/1/query?where=JURISDICTION+IN+%28%27RALEIGH%27%29&returnExtentOnly=true&outSR=4326&f=json"
bbox = requests.get(ral_esri).json()['extent']
# check out https://overturemaps.org/download/ for new releases
places_url = "s3://overturemaps-us-west-2/release/2023-07-26-alpha.0/theme=places/type=*/*"
query = f"""
SELECT
*
FROM read_parquet('{places_url}')
WHERE
bbox.minx > {bbox['xmin']}
AND bbox.maxx < {bbox['xmax']}
AND bbox.miny > {bbox['ymin']}
AND bbox.maxy < {bbox['ymax']}
"""
# Took me around 25 minutes
print(datetime.now())
res = pd.read_sql(query,db)
print(datetime.now())
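Since the query takes around 25 minutes, it is worth caching the result locally so re-running the downstream analysis does not hit S3 again. A minimal sketch (the cache file name, pickle approach, and helper function are my own, not from the post):

```python
import os

import pandas as pd

CACHE = "raleigh_places.pkl"  # hypothetical local cache file


def get_places(run_query, cache=CACHE):
    """Return the places DataFrame, only running the slow query if no cache exists."""
    if os.path.exists(cache):
        return pd.read_pickle(cache)
    df = run_query()  # e.g. lambda: pd.read_sql(query, db)
    df.to_pickle(cache)
    return df
```

The first call pays the full query cost; later calls just read the pickle.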
And this currently queries 29k places in Raleigh. Places can have multiple categories, so here I just slice out the main category and check it out:
def extract_main(x):
    return x['main']

res['main_cat'] = res['categories'].apply(extract_main)
res['main_cat'].value_counts().head(30)
And this returns
>>> res['main_cat'].value_counts().head(30)
beauty_salon 1291
real_estate_agent 657
landmark_and_historical_building 626
church_cathedral 567
community_services_non_profits 538
professional_services 502
real_estate 452
hospital 405
automotive_repair 396
dentist 350
lawyer 316
park 308
insurance_agency 298
public_and_government_association 288
spas 265
financial_service 261
gym 260
counseling_and_mental_health 256
religious_organization 240
car_dealer 214
college_university 185
gas_station 179
hotel 170
contractor 169
pizza_restaurant 167
barber 161
shopping 160
grocery_store 160
fast_food_restaurant 160
school 158
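Beyond the main category, the categories struct in this release also carries alternate categories; something similar can pull those out. A sketch, assuming the column arrives as dicts with an 'alternate' key (some rows may have none):

```python
def extract_alternate(x):
    # return the list of alternate categories, or an empty
    # list when the key is missing or the value is None
    if isinstance(x, dict) and x.get('alternate') is not None:
        return list(x['alternate'])
    return []

# usage (on the DataFrame from the query above):
# res['alt_cats'] = res['categories'].apply(extract_alternate)
```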
I can’t say anything systematic about the coverage of this data. Looking near my house it appears pretty well filled in. There are additional pieces of info in the data as well, such as a confidence score.
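For example, you could keep only higher-confidence rows before doing analysis. A sketch (the 0.7 threshold is my own arbitrary choice; the `confidence` column name follows the Overture places schema):

```python
import pandas as pd


def filter_confident(df: pd.DataFrame, thresh: float = 0.7) -> pd.DataFrame:
    # drop places whose confidence score falls below the threshold
    return df[df['confidence'] >= thresh].copy()
```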
So definitely a ton of potential to use this as a nice source for reproducible crime analysis (it probably has the major types of places most people are interested in looking at). But I would do some local checks on your data before wholesale recommending the OpenStreetMap data over an official business directory (if available, though that may not include things like ATMs) or Google Places API data (but this is free!).
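One way to do those local checks is to turn each row's bbox struct into a point and plot it over your study area. A sketch, assuming the bbox arrives as a dict with the minx/maxx/miny/maxy fields used in the query above:

```python
def bbox_center(b):
    # midpoint of the per-row bounding box; for point places the
    # box is tiny, so this is effectively the place location
    return ((b['minx'] + b['maxx']) / 2, (b['miny'] + b['maxy']) / 2)

# usage (on the DataFrame from the query above):
# res[['lon', 'lat']] = res['bbox'].apply(bbox_center).apply(pd.Series)
```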