Using Wikipedia's API to find inconsistently hyphenated French names – Arthur O'...

Using Wikipedia’s API to find inconsistently hyphenated French names

The other day I noticed that a lot of English Wikipedia articles about French people for some reason use a space between parts of the given name where primary sources and/or the French Wikipedia use a hyphen; for example, the English Wikipedia (as of May 2024) has “Marie Thérèse Geoffrin” where both the Bibliothèque nationale and Encyclopaedia Britannica have “Marie-Thérèse Geoffrin.”

I wrote a Python script that uses the MediaWiki API to enumerate the articles with this inconsistency and generates a requested move for each page where the French article’s title begins with e.g. “Marie-Thérèse” and the English article’s title begins with “Marie Thérèse.” Wikipedia’s API documentation is a little scattered, but really great, as documentation goes. It includes lots of runnable examples. My script ended up using two API modules: the allpages list query on French Wikipedia, and the wbgetentities action on Wikidata.

The first of these lets us query for all French Wikipedia articles whose titles begin with e.g. "Jean-":

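import requests

def frtitles_for_name(name):
  # Yield the title of every non-redirect French Wikipedia article
  # whose title begins with name + '-'.
  S = requests.Session()
  PARAMS = {
    "action": "query",
    "format": "json",
    "list": "allpages",
    "aplimit": "500",
    "apprefix": (name + '-'),
    "apfilterredir": "nonredirects",
  }
  while True:
    R = S.get(url="https://fr.wikipedia.org/w/api.php", params=PARAMS)
    DATA = R.json()
    for r in DATA["query"]["allpages"]:
      yield r["title"]
    # A "continue" member means more results remain; merging it into
    # the request parameters asks for the next batch.
    if "continue" not in DATA:
      break
    PARAMS.update(DATA["continue"])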
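
The continuation loop is the only subtle part, so here is a minimal offline sketch of it. The names iter_titles and fake_fetch are hypothetical, and the two canned responses are hand-written, simplified imitations of what the API returns, not real output.

```python
def iter_titles(fetch, params):
    """Yield every title, following MediaWiki-style "continue" tokens.

    `fetch` is any callable that takes a params dict and returns the decoded
    JSON response; in the real script it would wrap a requests call.
    """
    params = dict(params)  # copy so the caller's dict is not mutated
    while True:
        data = fetch(params)
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break
        # Merge the continuation token into the next request.
        params.update(data["continue"])


def fake_fetch(params):
    """Hypothetical stand-in for the API: serves two canned batches."""
    if "apcontinue" not in params:
        return {
            "continue": {"apcontinue": "Jean-Luc_Godard", "continue": "-||"},
            "query": {"allpages": [{"title": "Jean-Jacques Rousseau"}]},
        }
    return {"query": {"allpages": [{"title": "Jean-Luc Godard"}]}}


titles = list(iter_titles(fake_fetch, {"list": "allpages", "apprefix": "Jean-"}))
print(titles)
```

Passing the fetch function in as a parameter is just a convenience for trying the loop without touching the network.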

The second lets us find the corresponding English Wikipedia article and check whether its title begins with "Jean-" or with "Jean ":

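def entitles_for_frtitle(frtitle):
  # Look up the Wikidata item linked to this French Wikipedia title and
  # yield the title of its English Wikipedia article, if it has one.
  S = requests.Session()
  R = S.get(url="https://www.wikidata.org/w/api.php", params={
    "action": "wbgetentities",
    "sites": "frwiki",
    "titles": frtitle,
    "props": "sitelinks",
    "format": "json",
  })
  DATA = R.json()
  for r in DATA["entities"].values():
    # The chained .get calls tolerate items with no sitelinks or no enwiki link.
    entitle = r.get("sitelinks", {}).get("enwiki", {}).get("title", None)
    if entitle is not None:
      yield entitle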
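
To make the parsing step concrete, here is a small sketch that runs the same chained lookups over hand-written sample responses. The helper name enwiki_titles is hypothetical, the entity id Q0 is a placeholder, and the dictionaries are simplified guesses at the response shape rather than captured API output.

```python
def enwiki_titles(data):
    """Yield the enwiki sitelink title of each entity in a decoded response."""
    for entity in data.get("entities", {}).values():
        title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
        if title is not None:
            yield title


# Item linked from both wikis (illustrative titles, placeholder id).
linked = {"entities": {"Q0": {"sitelinks": {
    "frwiki": {"title": "Marie-Thérèse Exemple"},
    "enwiki": {"title": "Marie Thérèse Exemple"},
}}}}

# Item that has a French article but no English one.
french_only = {"entities": {"Q0": {"sitelinks": {
    "frwiki": {"title": "Marie-Thérèse Exemple"},
}}}}

# Title with no Wikidata item at all: the entry carries no sitelinks.
unknown = {"entities": {"-1": {"site": "frwiki", "title": "Exemple", "missing": ""}}}

results = [list(enwiki_titles(d)) for d in (linked, french_only, unknown)]
print(results)
```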

I used yield to make both functions into generators, even though we expect entitles_for_frtitle to return only a single result (or zero results), simply because that lets me write a nice clean main function like this:

for name in ['Anne', 'Claude', 'Guy', 'Jean']:
  for frtitle in frtitles_for_name(name):
    for entitle in entitles_for_frtitle(frtitle):
      # The first word of the French title is the hyphenated given name.
      hyphenated_name = frtitle.split()[0].strip(',')
      spaced_name = hyphenated_name.replace('-', ' ')
      if entitle.startswith(spaced_name):
        # Both forms have the same length, so the rest of the title lines up.
        new_entitle = hyphenated_name + entitle[len(hyphenated_name):]
        print('* [[:%s]] → [[:%s]]' % (entitle, new_entitle))
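
The title-rewriting logic in the loop body can be pulled out into a pure function, which makes it easy to try without any network access. This is only a sketch: the name proposed_move is hypothetical, and the sample French title for Geoffrin is made up for illustration.

```python
def proposed_move(frtitle, entitle):
    """Return the hyphenated English title to propose, or None if no move applies."""
    hyphenated_name = frtitle.split()[0].strip(',')
    spaced_name = hyphenated_name.replace('-', ' ')
    if entitle.startswith(spaced_name):
        # Swapping '-' for ' ' keeps the length, so the tail can be reused as-is.
        return hyphenated_name + entitle[len(hyphenated_name):]
    return None


print(proposed_move("Marie-Thérèse Exemple Geoffrin", "Marie Thérèse Geoffrin"))
print(proposed_move("Jean-Jacques Rousseau", "Jean-Jacques Rousseau"))
```

The first call proposes a move; the second returns None because the English title is already hyphenated.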

This does miss a few cases, e.g. when the English title is missing not only the hyphen but also an accent (e.g. “Henri Evrard” for “Henri-Évrard”). And it has false positives for cases that aren’t personal names, for example the name of the 1956 film Marie Antoinette Queen of France (French: Marie-Antoinette reine de France). But it’s pretty darn good as a first pass.
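
One way to catch the missing-accent case would be to compare the names with accents stripped. The sketch below is not part of my script; strip_accents and loose_move are hypothetical helpers, the sample titles are only illustrative, and it assumes every accented letter is a single code point after NFC normalization so that the two titles still line up character for character.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks, so 'Évrard' compares equal to 'Evrard'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_move(frtitle, entitle):
    """Like the exact check, but ignore accents when matching the given name."""
    # Normalize to NFC so slicing by length stays aligned between the titles.
    frtitle = unicodedata.normalize("NFC", frtitle)
    entitle = unicodedata.normalize("NFC", entitle)
    hyphenated_name = frtitle.split()[0].strip(",")
    spaced_name = hyphenated_name.replace("-", " ")
    if strip_accents(entitle).startswith(strip_accents(spaced_name)):
        return hyphenated_name + entitle[len(hyphenated_name):]
    return None


print(loose_move("Henri-Évrard, marquis de Dreux-Brézé",
                 "Henri Evrard, marquis de Dreux-Brézé"))
```

This would do nothing about the false positives, which still need a human to look at them.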

See the complete Python script here, and look for those Wikipedia pages to get moved at some point in the near future.
