A small DOCUMERICA Twitter bot

Oct 25, 2021

Tags: art, data, devblog, python


TL;DR: I’ve written a Twitter bot that posts pictures from the DOCUMERICA program. The code for getting the DOCUMERICA photos, building a DB, and the Twitter bot itself is all here.


I’m taking a break this month from toiling in the LLVM mines to do something a little bit lighter.

From 1972 to 1977, the U.S.’s Environmental Protection Agency ran the DOCUMERICA program: dozens of freelance photographers across the country were paid to “photographically document subjects of environmental concern.”

I’ve loved the DOCUMERICA photos ever since I first saw them. Most of the photographers paid through the program interpreted the task broadly, leaving us an incredible archive of 1970s American society and its environs:

[a selection of DOCUMERICA photographs]

Over 80,000 photos were taken through DOCUMERICA, of which approximately 22,000 were curated for storage at the National Archives. From those 22,000 or so, 15,992 have been digitized and are available as an online collection.

The National Archives has done a fantastic job of keeping the DOCUMERICA records, but I’ve always wanted to (try to) expose them to a wider audience, one that might understandably be less interested in crawling through metadata to find pieces of Americana. I’ve already written a few Twitter bots, and the DOCUMERICA photos seemed like a good target for yet another, so why not?

Getting the data

To the best of my knowledge, the National Archives is the only (official) repository for the photographs taken through the DOCUMERICA program. They also appear to have uploaded a more curated collection to Flickr, but it’s not nearly as complete.

Fortunately, the National Archives have a Catalog API that includes searching and bulk exporting. Even more fortunately1 (and unusually for library/archive APIs), it actually supports JSON!

The Catalog API allows us to retrieve up to 10,000 results per request, and there are only around 16,000 “top-level” records in the DOCUMERICA volume, so we only need two requests to get all of the metadata for the entire program:

base="https://catalog.archives.gov/api/v1"
curl "${base}/?description.item.parentSeries.naId=542493&rows=10000&sort=naId%20asc" > 1.sort.json
curl "${base}/?description.item.parentSeries.naId=542493&rows=10000&offset=10000&sort=naId%20asc" > 2.sort.json

A few funny things with this:

  • The National Archives are structured hierarchically: every entity has a National Archives ID (NAID), which in turn can be used to query its descendants. That’s what the crazy filter parameter (description.item.parentSeries.naId) is doing here: it’s telling the API to limit results to only those entities whose parent series is 542493, i.e. the top-level identifier for the DOCUMERICA program.
  • Sorting on the NAIDs is critical: without that, the API will return not just the children of 542493 but also their children (i.e., individual little blobs of JSON for every variant of every file). Those children fortunately (and coincidentally?) don’t have sequential NAIDs while the first-level child DOCUMERICA records do, so adding the sort gets us exactly what we want.
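
If you’d rather script this than run curl by hand, the same two requests are easy to make from Python. Here’s a minimal sketch using the requests library; the query parameters are exactly the ones from the curl commands above, and passing offset=0 on the first request is an assumption that it matches the API’s default:

# Sketch: fetch the ~16,000 top-level DOCUMERICA records in two requests.
import requests

BASE = "https://catalog.archives.gov/api/v1"
params = {
    "description.item.parentSeries.naId": 542493,  # the DOCUMERICA series
    "rows": 10000,
    "sort": "naId asc",
}

results = []
for offset in (0, 10000):
    resp = requests.get(f"{BASE}/", params={**params, "offset": offset})
    resp.raise_for_status()
    results.extend(resp.json()["opaResponse"]["results"]["result"])

print(len(results))  # should be 15992, matching the jq check below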

From here, we can confirm that we got as many records as we expected:

$ jq -c '.opaResponse.results.result | .[]' 1.sort.json 2.sort.json | wc -l
15992

Normalization

Structuring archival data is a complicated, unenviable task. For reasons that escape my hobbyist understanding, the layout returned by the National Archives API has some…unusual features:

  • The results key is not an array, but a dictionary. The actual array of results is in results.result.
  • There are quite a few keys that end with Array (e.g. geographicReferenceArray) that are actually dictionaries. I don’t know if this is an Archives-specific terminology choice, or whether it implies that these values can be arrays if more than one datapoint is available.
  • Similarly to the above: the object and file keys use the same mixed dictionary-array pattern. By way of example, here’s the dictionary form of object from just a single record:

      "objects":{
        "@created":"2015-01-01T00:00:00Z",
        "@version":"OPA-OBJECTS-1.0",
        "object":[
          {
            "@id":"14676552",
            "@objectSortNum":"1",
            "technicalMetadata":{
              "size":"137930",
              "mime":"image\/gif",
              "Chroma_BlackIsZero":"true",
              "Chroma_ColorSpaceType":"RGB",
              "Chroma_NumChannels":"3",
              "Compression_CompressionTypeName":"lzw",
              "Compression_Lossless":"true",
              "Compression_NumProgressiveScans":"4",
              "height":"600",
              "width":"405"
            },
            "file":{
              "@mime":"image\/gif",
              "@name":"01-0237a.gif",
              "@path":"content\/arcmedia\/media\/images\/1\/3\/01-0237a.gif",
              "@type":"primary",
              "@url":"https:\/\/catalog.archives.gov\/OpaAPI\/media\/542495\/content\/arcmedia\/media\/images\/1\/3\/01-0237a.gif"
            },
            "thumbnail":{
              "@mime":"image\/jpeg",
              "@path":"opa-renditions\/thumbnails\/01-0237a.gif-thumb.jpg",
              "@url":"https:\/\/catalog.archives.gov\/OpaAPI\/media\/542495\/opa-renditions\/thumbnails\/01-0237a.gif-thumb.jpg"
            },
            "imageTiles":{
              "@path":"opa-renditions\/image-tiles\/01-0237a.gif.dzi",
              "@url":"https:\/\/catalog.archives.gov\/OpaAPI\/media\/542495\/opa-renditions\/image-tiles\/01-0237a.gif.dzi"
            }
          },
          {
            "@id":"209221188",
            "@objectSortNum":"2",
            "file":[
              {
                "@mime":"image\/jpeg",
                "@name":"412-DA-00002_01-0237M.jpg",
                "@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg",
                "@type":"primary",
                "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg"
              },
              {
                "@mime":"image\/tiff",
                "@name":"412-DA-00002_01-0237M.TIF",
                "@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF",
                "@type":"archival",
                "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF"
              }
            ],
            "thumbnail":{
              "@mime":"image\/jpeg",
              "@path":"opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg",
              "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg"
            },
            "imageTiles":{
              "@path":"opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi",
              "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi"
            }
          },
          {
            "@id":"209439452",
            "@objectSortNum":"3",
            "file":[
              {
                "@mime":"image\/jpeg",
                "@name":"412-DA-00002_01-0237M.jpg",
                "@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg",
                "@type":"primary",
                "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg"
              },
              {
                "@mime":"image\/tiff",
                "@name":"412-DA-00002_01-0237M.TIF",
                "@path":"\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF",
                "@type":"archival",
                "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/lz\/stillpix\/412-da\/412-DA-00002_01-0237M.TIF"
              }
            ],
            "thumbnail":{
              "@mime":"image\/jpeg",
              "@path":"opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg",
              "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/thumbnails\/412-DA-00002_01-0237M.jpg-thumb.jpg"
            },
            "imageTiles":{
              "@path":"opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi",
              "@url":"https:\/\/catalog.archives.gov\/catalogmedia\/live\/stillpix\/412-da\/412-DA-00002_01-0237M.jpg\/opa-renditions\/image-tiles\/412-DA-00002_01-0237M.jpg.dzi"
            }
          }
        ]
      }
    

I took this as a challenge to practice my jq skills, and came up with this mess:

jq -c \
  '.opaResponse.results.result |
   .[] |
   {
      naid: .naId,
      title: .description.item.title,
      author: .description.item.personalContributorArray.personalContributor.contributor.termName,
      date: .description.item.productionDateArray.proposableQualifiableDate.logicalDate,
      files: [
        .objects.object | if type == "array" then . else [.] end |
        .[] |
        (.file | if type == "array" then . else [.] end)
      ] | flatten | map(select(.))
   }' 1.sort.json 2.sort.json > documerica.jsonl

The files filter is the only really confusing one here. To break it down:

  1. We take the value of objects.object, and normalize it into an array if it isn’t already one
  2. For each object, we get its file, normalizing that into an array if it isn’t already one
  3. We flatten our newly minted array-of-arrays, and filter down to non-null values
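
The same dance is a little easier to read in Python, if jq isn’t your thing. Here’s a rough sketch of just the files portion, operating on a single record parsed from the API response (the as_list helper is my own, not part of any library):

# Rough Python equivalent of the files filter above: coerce the API's
# dict-or-list values into lists, then flatten and drop missing entries.
def as_list(value):
    """The API returns either a dict or a list here; always hand back a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def files_for(record):
    files = []
    for obj in as_list(record.get("objects", {}).get("object")):
        files.extend(as_list(obj.get("file")))
    return files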

The end result is a JSONL stream, each record of which looks like:

{
  "naid": "542494",
  "title": "DISCARDED PESTICIDE CANS",
  "author": "Daniels, Gene, photographer",
  "date": "1972-05-01T00:00:00",
  "files": [
    {
      "@mime": "image/gif",
      "@name": "01-0236a.gif",
      "@path": "content/arcmedia/media/images/1/3/01-0236a.gif",
      "@type": "primary",
      "@url": "https://catalog.archives.gov/OpaAPI/media/542494/content/arcmedia/media/images/1/3/01-0236a.gif"
    },
    {
      "@mime": "image/jpeg",
      "@name": "412-DA-00001_01-0236M.jpg",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg",
      "@type": "primary",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg"
    },
    {
      "@mime": "image/tiff",
      "@name": "412-DA-00001_01-0236M.TIF",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF",
      "@type": "archival",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF"
    },
    {
      "@mime": "image/jpeg",
      "@name": "412-DA-00001_01-0236M.jpg",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg",
      "@type": "primary",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.jpg"
    },
    {
      "@mime": "image/tiff",
      "@name": "412-DA-00001_01-0236M.TIF",
      "@path": "/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF",
      "@type": "archival",
      "@url": "https://catalog.archives.gov/catalogmedia/lz/stillpix/412-da/412-DA-00001_01-0236M.TIF"
    }
  ]
}

In the process, I discovered that 69 of the 15,992 records made available online don’t have any photographic scans associated with them. I figure that this is probably a human error that happened during archiving/digitization, so I sent an email to NARA asking them to check. Hopefully they respond!
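
If you want to reproduce that count, one way is to scan documerica.jsonl for records whose files array came out empty; a quick sketch:

# Sketch: find the DOCUMERICA records with no digitized files attached,
# using the documerica.jsonl produced by the jq filter above.
import json

missing = []
with open("documerica.jsonl") as jsonl:
    for line in jsonl:
        record = json.loads(line)
        if not record["files"]:
            missing.append(record["naid"])

print(len(missing))  # 69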

For the curious, here are the NAIDs for the records that are missing photographic scans:

    545393 551930 552926
552939 552954 553031
553033 553864 556371
558412 558413 558415
558416 558417 558419
558420 558421 558422
558423 558424 558425
558426 558427 558428
558429 558430 558431
558432 558433 558434
558435 558436 558437
558438 558439 558440
558441 558442 558443
558444 558445 558446
558447 558448 558449
558450 558451 558452
558453 558454 558455
558456 558457 558458
558459 558460 558461
558462 558463 558464
558465 558466 558467
558468 558469 558470
558472 558473 558474
  

Database-ification

I love JSONL as a format, but it isn’t ideal for keeping track of a Twitter bot’s state (e.g., making sure we don’t tweet the same picture twice).

Despite doing hobby and open-source programming for over a decade at this point, I haven’t used a (relational) database in a personal project ever2. So now seemed like a good time to do so.

The schema ended up being nice and simple:

CREATE TABLE documerica (
    naid INTEGER UNIQUE NOT NULL, -- the National Archives ID for this photo
    title TEXT,                   -- the photo's title/description
    author TEXT NOT NULL,         -- the photo's author
    created TEXT,                 -- the ISO 8601 date for the photo's creation
    url TEXT NOT NULL,            -- a URL to the photo, as a JPEG
    tweeted INTEGER NOT NULL      -- whether or not the photo has already been tweeted
)

(It’s SQLite, so there are no proper date/time types to use for created).

The full script that builds the DB from documerica.jsonl is here.
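
That script boils down to something like the sketch below: read documerica.jsonl, pick a JPEG rendition for each record, and insert a row. This is my own condensed approximation rather than the actual script, and it quietly skips the scan-less records mentioned above:

# Sketch: build documerica.db from documerica.jsonl using the schema above.
import json
import sqlite3

db = sqlite3.connect("documerica.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS documerica (
           naid INTEGER UNIQUE NOT NULL,
           title TEXT,
           author TEXT NOT NULL,
           created TEXT,
           url TEXT NOT NULL,
           tweeted INTEGER NOT NULL
       )"""
)

with open("documerica.jsonl") as jsonl:
    for line in jsonl:
        record = json.loads(line)
        # Prefer a JPEG rendition; INSERT OR IGNORE skips any row that would
        # violate the schema's constraints (e.g. a missing author).
        jpegs = [f["@url"] for f in record["files"] if f["@mime"] == "image/jpeg"]
        if not jpegs:
            continue
        db.execute(
            "INSERT OR IGNORE INTO documerica VALUES (?, ?, ?, ?, ?, 0)",
            (record["naid"], record["title"], record["author"], record["date"], jpegs[0]),
        )

db.commit()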

The bot

Technology-wise, there’s nothing too interesting about the bot: it uses python-twitter for Twitter API access.

I ran into only one hiccup with the default media upload technique: it seems as though python-twitter’s PostUpdate API runs afoul of Twitter’s media chunking expectations somewhere. My hacky fix is to use UploadMediaSimple to upload the picture first, and then add it to the tweet. As a bonus, this makes it much easier to add alt text!

# Upload the image bytes first and grab the resulting media ID...
media_id = api.UploadMediaSimple(io)
# ...attach the photo's title as alt text...
api.PostMediaMetadata(media_id, photo["title"])
# ...and finally post the tweet, with a link back to the National Archives record.
api.PostUpdate(
    f"{tweet}\n\n{archives_url(photo['naid'])}",
    media=media_id,
)

That’s just about the only thing interesting about the bot’s source. You can, of course, read it for yourself.
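
The state tracking mentioned earlier is similarly unexciting: grab a random photo that hasn’t been tweeted yet, and flip its tweeted flag once the tweet goes out. Roughly (again, my own sketch rather than the bot’s exact code):

# Sketch: the "never tweet the same photo twice" bookkeeping, via the
# tweeted column from the schema above.
import sqlite3

db = sqlite3.connect("documerica.db")
naid, title, author, created, url = db.execute(
    "SELECT naid, title, author, created, url FROM documerica"
    " WHERE tweeted = 0 ORDER BY RANDOM() LIMIT 1"
).fetchone()

# ...download url, upload it, and post the tweet as shown above...

db.execute("UPDATE documerica SET tweeted = 1 WHERE naid = ?", (naid,))
db.commit()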

Wrapup

At the end of the day, here’s what it looks like:

[embedded example tweet from the bot]

This was a nice short project that exercised a few skills that I’ve used less often in the last few years. Some key takeaways for me:

  • My aversion to databases in personal projects is largely unfounded: turning the DOCUMERICA dataset into a small SQLite DB gave me exactly the kind of query and state control that I needed. No more hacky state files, unlike my other Twitter bots.
  • I can write small jq filters from memory, but my understanding of the filter language breaks down at anything more complicated than map and select. It’s a fantastic tool that saved me from writing a much messier transformation script in a more general-purpose language, so I should make an effort to become even more familiar with it.
  • Writing a Twitter bot continues to be easy; signing up for one continues to be painful: Twitter goes out of their way to make it difficult to honestly create multiple accounts, and doesn’t provide a separate account flow for bots despite explicitly acknowledging that bots improve the Twitter ecosystem3.
  • The National Archives is an incredible resource, and their public APIs are remarkably good and well-featured for an archival institution. Someone whose skills are better aligned towards this than me could probably find (and automate) some pretty incredible things using their APIs.

  1. Or perhaps not, as the subsequent jq wrangling in this post will demonstrate. 

  2. In case this surprises you: I use them regularly at work. 

  3. I acknowledge that it’s hard to prevent abuse if you encourage more people to write bots. But there are lots of technical solutions to this, like having the bot signup flow require connection to a human’s Twitter account. 

