0

Using Wikipedia's API to find inconsistently hyphenated French names – Arthur O'...

 1 week ago
source link: https://quuxplusone.github.io/blog/2024/05/04/wikipedia-api/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Using Wikipedia’s API to find inconsistently hyphenated French names

The other day I noticed that a lot of English Wikipedia articles about French people for some reason use a space between parts of the given name where primary sources and/or the French Wikipedia use a hyphen; for example, the English Wikipedia (as of May 2024) has “Marie Thérèse Geoffrin” where both the Bibliothèque nationale and Encyclopaedia Britannica have “Marie-Thérèse Geoffrin.”

I wrote a Python script that uses the Mediawiki API to enumerate the articles with this inconsistency, and generate requested moves for each page where the French article’s title begins with e.g. “Marie-Thérèse” and the English article’s title begins with “Marie Thérèse.” Wikipedia’s API documentation is a little scattered, but really great, as documentation goes. It includes lots of runnable examples. My script ended up using:

The first of these lets us query for all French Wikipedia articles whose titles begin with e.g. "Jean-":

def frtitles_for_name(name):
  S = requests.Session()
  PARAMS = {
    "action": "query",
    "format": "json",
    "list": "allpages",
    "aplimit": "500",
    "apprefix": (name + '-'),
    "apfilterredir": "nonredirects",
  }
  while True:
    R = S.get(url="https://fr.wikipedia.org/w/api.php", params=PARAMS)
    DATA = R.json()
    for r in DATA["query"]["allpages"]:
      yield r["title"]
    if "continue" not in DATA:
      break
    PARAMS.update(DATA["continue"])

The second lets us find the corresponding English Wikipedia article and check whether its title begins with "Jean-" or with "Jean ":

def entitles_for_frtitle(frtitle):
  S = requests.Session()
  R = S.get(url="https://www.wikidata.org/w/api.php", params={
    "action": "wbgetentities",
    "sites": "frwiki",
    "titles": frtitle,
    "props": "sitelinks",
    "format": "json",
  })
  DATA = R.json()
  for r in DATA["entities"].values():
    entitle = r.get("sitelinks", {}).get("enwiki", {}).get("title", None)
    if entitle is not None:
      yield entitle

I used yield to make both functions into generators, even though we expect entitles_for_frtitle to return only a single result (or zero results), simply because that lets me write a nice clean main function like this:

for name in ['Anne', 'Claude', 'Guy', 'Jean']:
  for frtitle in frtitles_for_name(name):
    for entitle in entitles_for_frtitle(frtitle):
      hyphenated_name = frtitle.split()[0].strip(',')
      spaced_name = hyphenated_name.replace('-', ' ')
      if entitle.startswith(spaced_name):
        new_entitle = hyphenated_name + entitle[len(hyphenated_name):]
        print('* [[:%s]] → ' % (entitle, new_entitle))

This does miss a few cases, e.g. when the English title is missing not only the hyphen but also an accent (e.g. “Henri Evrard” for “Henri-Évrard”). And it has false positives for cases that aren’t personal names, for example the name of the 1956 film Marie Antoinette Queen of France (French: Marie-Antoinette reine de France). But it’s pretty darn good as a first pass.

See the complete Python script here, and look for those Wikipedia pages to get moved at some point in the near future.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK