

How I Built a Faster and More Reliable APOD API
Astronomy Picture of the Day (APOD) is like the universe’s Instagram account. It’s a website where a new awe-inspiring image of the universe has been posted every day since 1995.
As I was building a project using APOD’s official API, I found that requests would periodically time out, or take a surprisingly long time to return.
Curious and a bit confused (the data being returned was simple, shouldn't have required much computation, and should have been easy to cache), I decided to poke around the API's repo to see if I could find the cause, and perhaps even fix it.
Website as a Database
I was fascinated to find that there was no database. The API was parsing data out of the APOD website’s HTML using BeautifulSoup, live per request.
Then I remembered: this website was created in 1995. MySQL had been released mere weeks before the first APOD photo on June 16th of that year.

This wasn't great for performance, though: each day's worth of data the API returned required an additional network request to fetch.
It also looked like requests for date ranges were made serially rather than in parallel, so asking for even a month of data took a long time to come back. A year's worth of data took over half a minute, when the request didn't simply time out or return a server error instead.

The official API also didn’t seem to do any caching – a request that took 30 seconds to load the first time would take another 30s to load the second time.
I believed that we could do better.
A faster and more reliable APOD API
Since I'm using the APOD API to power a portfolio project (yes, I'm job hunting 😛), I really need it to be reliable and to load quickly. So I decided to implement my own API.
You can find all the code in this GitHub repo if you want to look through it in detail as you read.
Here are the approaches I took:
1. Avoid on-demand scraping
One of the main reasons NASA's API responses are slow is that the data scraping and parsing happen live, adding significant overhead to each request. We can separate the data extraction step from the handling of API requests.
I ended up writing a script to dump the website’s data into a single 12MB JSON file. Pretty chunky for a JSON file, but given that a free tier Vercel function can have an unzipped size of 250MB and has 1024MB of memory, it’s still small enough to be directly loaded without needing to bother with a database.
The script consists of two parts:
- getDataByDate(date: DateTime) is a function that, given a particular date, fetches the corresponding APOD webpage for that day, parses pieces of data out of the HTML using cheerio (JavaScript's equivalent to BeautifulSoup), and returns structured data in the form of a JavaScript object.
- extractData.ts is the script that calls getDataByDate for every day in a date range (initially "every day between today and June 16th, 1995"), using the async library's eachLimit method to make multiple requests in parallel. It stores each day's result as a separate JSON file on the filesystem, and finally combines all the daily JSON data into one single data.json.
You might wonder: why not fetch all the data first and save just one file at the end? When making 9,000+ network requests, some of them are bound to fail, and you really don't want to have to start over from zero. Saving each day's data as the script runs lets us resume from wherever a failure happened.
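Here's a rough sketch of what those two parts could look like. The CSS selectors, concurrency limit, and file layout are illustrative assumptions rather than the repo's exact code; only the overall shape (cheerio parsing, eachLimit for parallel requests, one JSON file per day) follows the description above.

// Rough sketch of the extraction script (TypeScript, Node 18+ for the built-in fetch).
// Selectors, paths, and the concurrency limit are assumptions for illustration only.
import * as cheerio from 'cheerio';
import { eachLimit } from 'async';
import { DateTime } from 'luxon';
import * as fs from 'fs/promises';

interface ApodEntry {
  date: string;
  title: string;
  explanation: string;
}

// Fetch and parse a single day's APOD page (pages live at apod.nasa.gov/apod/apYYMMDD.html).
export async function getDataByDate(date: DateTime): Promise<ApodEntry> {
  const url = `https://apod.nasa.gov/apod/ap${date.toFormat('yyMMdd')}.html`;
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  return {
    date: date.toISODate()!,
    // Hypothetical selectors: the real pages need more careful parsing.
    title: $('center b').first().text().trim(),
    explanation: $('p').first().text().trim(),
  };
}

// extractData: walk every day from the first APOD (1995-06-16) to today,
// saving each day's result separately so a failed run can resume where it left off.
export async function extractData() {
  const firstApod = DateTime.fromISO('1995-06-16');
  const days: DateTime[] = [];
  for (let d = firstApod; d.toMillis() <= DateTime.now().toMillis(); d = d.plus({ days: 1 })) {
    days.push(d);
  }

  await fs.mkdir('data', { recursive: true });
  await eachLimit(days, 10, async (day) => {
    const file = `data/${day.toISODate()}.json`;
    try {
      await fs.access(file); // already fetched on a previous run, skip it
    } catch {
      await fs.writeFile(file, JSON.stringify(await getDataByDate(day)));
    }
  });
  // Finally, combine all the per-day files into a single data.json (omitted here).
}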
Here's a comparison of timings with and without on-demand scraping:
| Arguments | My APOD-API: Average TTFB* (n=20) | My APOD-API: Standard Deviation | NASA's APOD API: Average TTFB (n=20) | NASA's APOD API: Standard Deviation |
| --- | --- | --- | --- | --- |
| no argument | 110 ms | | 105 ms | |
| start_date=2021-01-01&end_date=2022-01-01 | 151 ms | | 35,358 ms | 2,891 ms |
| count=100 | | | 9,701 ms | 1,198 ms |

*TTFB: time to first byte (https://en.wikipedia.org/wiki/Time_to_first_byte)
2. Fallback to on-demand data extraction
The extracted JSON only has data up to the time the extraction was run. This means that sometimes there'll be a new APOD that's missing from our JSON. For those situations, it'd still be nice to fall back to live requests as a supplementary source of data.
In our API request handler, we check the extracted data.json to find the last date we have data for. If the number of days between that date and today is greater than 1, we fetch data for any missing dates in parallel, once again using getDataByDate, the same function we used when building the JSON file.
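A minimal sketch of that fallback, assuming data.json is an array of entries sorted by date and reusing the getDataByDate helper from the earlier sketch (the module path is a placeholder):

// Sketch of the fallback path inside the request handler.
import * as fs from 'fs';
import { DateTime } from 'luxon';
import { getDataByDate } from './getDataByDate'; // the scraper from the earlier sketch

// Assumed shape: an array of entries sorted by date, as produced by extractData.
const apodData: { date: string }[] = JSON.parse(fs.readFileSync('./data.json', 'utf8'));

// Fetch, in parallel, any days published after the extracted data was last updated.
export async function getMissingEntries() {
  const lastExtracted = DateTime.fromISO(apodData[apodData.length - 1].date);
  const today = DateTime.now().startOf('day');

  const missingDates: DateTime[] = [];
  for (let d = lastExtracted.plus({ days: 1 }); d.toMillis() <= today.toMillis(); d = d.plus({ days: 1 })) {
    missingDates.push(d);
  }

  // No gap means nothing to fetch; otherwise scrape the missing days concurrently.
  return Promise.all(missingDates.map((date) => getDataByDate(date)));
}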
3. Aggressively cache requests
The bulk of the time spent waiting on APOD's official API goes to waiting for the server to send the first byte. Since historical data doesn't change, and new entries are added only once a day, the actual application server doesn't need to be hit for most requests.
We can use headers to tell the Content Delivery Network (CDN) to aggressively cache the response of our cloud function. I’m hosting on Vercel, but this should work with Netlify and Cloudflare as well.
The code for the specific headers we want to send from the function handler is:
response
  .status(200)
  .setHeader(
    'Cache-Control',
    'max-age=0, ' +
      `s-maxage=${cacheDurationSeconds}, ` +
      `stale-while-revalidate=${cacheDurationSeconds}`
  )
The above is a reformatted paraphrasing of the actual handler. Breaking that down:
- max-age tells browsers how long to cache a response. If a request for a resource comes within max-age, the browser's cached response is used instead. We set max-age to 0, following Vercel's advice, to prevent browsers from caching the API response locally. That way, clients still get new data as soon as it updates.
- s-maxage tells servers how long to cache a response. When a request for a resource comes within s-maxage, the server (in our case, Vercel's CDN) sends the cached response. This is really powerful since this cache is shared across all users and devices.
- We set s-maxage to a variable amount of time (sketched below). For requests that ask for dates relative to today ("today's data", or "the last 10 days' data"), we only ask the CDN to cache for roughly an hour, since the answer may change when the next APOD comes out. For requests that ask for specific dates (for example, between "2001-01-01" and "2002-01-01"), we can ask the CDN to cache for a lot longer, since that data isn't expected to change.
- Finally, we set a stale-while-revalidate directive. When the specified cache time expires, instead of making the next user wait for fresh data, the CDN serves the stale cached data to that user, and at the same time hits our API endpoint for fresh data to cache for the next request.
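As a concrete illustration, here's roughly how the handler might pick cacheDurationSeconds. The one-hour and one-year durations and the isRelativeQuery helper are assumptions made for this sketch, not the repo's exact logic:

import type { VercelRequest, VercelResponse } from '@vercel/node';

// Hypothetical helper: a query is "relative" when its answer depends on today's date.
const isRelativeQuery = (req: VercelRequest): boolean =>
  Object.keys(req.query).length === 0 ||                         // no arguments: "today's APOD"
  'count' in req.query ||                                        // random sample: the pool grows every day
  ('start_date' in req.query && !('end_date' in req.query));     // open-ended range up to today

export default function handler(req: VercelRequest, res: VercelResponse) {
  // Relative queries change when tomorrow's APOD is published; fixed date ranges don't.
  const cacheDurationSeconds = isRelativeQuery(req)
    ? 60 * 60             // roughly an hour
    : 60 * 60 * 24 * 365; // effectively "a long time"

  res.setHeader(
    'Cache-Control',
    `max-age=0, s-maxage=${cacheDurationSeconds}, stale-while-revalidate=${cacheDurationSeconds}`
  );
  res.status(200).json({ cached_for: cacheDurationSeconds }); // the real handler returns the APOD data here
}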
Since our API was loading all the data into memory already, the performance difference between cached vs. uncached requests shouldn’t be too noticeable, but faster is always better.
The main goal with caching is to avoid running the cloud function, since Vercel’s free tier has a quota of 100 GB-hours (not sure what that means, but whatever it is, I don’t want to hit it).
Comparison of timings before and after caching requests:
| Arguments | My APOD-API: Average TTFB (n=20) | My APOD-API: Standard Deviation | NASA's APOD API: Average TTFB (n=20) | NASA's APOD API: Standard Deviation |
| --- | --- | --- | --- | --- |
| no argument | | | 105 ms | |
| start_date=2021-01-01&end_date=2022-01-01 | | | 35,358 ms | 2,891 ms |
| count=100 | | | 9,701 ms | 1,198 ms |
4. (Bonus) Automated daily updates
We want to keep our data file in sync with NASA’s APOD website as much as possible, since reading data from our JSON file is much faster than falling back to fetching data over network.
Automating this doesn't improve performance by itself; I could just set an alarm to run the extraction script every night at midnight, commit any changes, and push to trigger a new deploy.
Thankfully, I won't have to, since GitHub Actions aren't limited to running on pull requests: you can schedule them too.
name: Update Data Every 3 Hours
on:
  schedule:
    # At minute 15 past every 3rd hour.
    - cron: '15 */3 * * *'
  workflow_dispatch:
jobs:
  update-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: '16'
      - run: npm install
      - run: npm run update-data
      - name: Commit changes
        run: |
          if [ -n "$(git status --porcelain)" ]; then
            git config --global user.name 'your_username'
            git config --global user.email '[email protected]'
            git add .
            git commit -m "Automated data update"
            git push
          else
            echo "no changes";
          fi
Tip: you can decode the cron expression at https://crontab.guru/#15_*/3_*_*_*
Conclusion
In summary, where possible and sensible:
- Extract data before requests are received and try to keep it up-to-date
- Make fallback requests in parallel
- Cache responses on the CDN
The code for all this is a bit too lengthy to fit into an article, but I believe these principles should be more broadly applicable for public-facing APIs (plenty more just on api.nasa.gov!). Feel free to peruse the repo to see how it all fits together.
Thank you for reading! I’d love to hear any feedback you may have. You can find me on Twitter @ellanan_ or on LinkedIn.