11

Downloading all the crates on crates.io

 4 years ago
source link: https://www.pietroalbini.org/blog/downloading-crates-io/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

There are a lot of reasons you might want to down­load all the crates ever uploaded to crates.io , Rust’s pack­age registry: code analy­sis across the whole public ecosys­tem, host­ing a mirror for your compa­ny, or count­less other ideas and projects.

The team behind crates.io receives a lot of support request asking what’s the best and least impact­ful way to do this, so here is a little guide on how to do that!

Getting a list of all the crates

crates.io offers multi­ple way to inter­act with its data: the crates.io-index GitHub repos­i­to­ry, exper­i­men­tal daily data­base dumps and the crates.io API.

The way I recom­mend to get the list of all the crates is to rely on the index: the exper­i­men­tal data­base dumps are more heavy­weight and are only updated daily, while usage of the API is governed by the crawlers policy (lim­it­ing you to one API call per second). If you absolutely need to use the API please talk with us by email­ing [email protected] , and we’ll figure out a solution.

The index is a git repository , and the format of its content is defined by RFC 2141 . There are crates such as crates-index that allow you to easily query its contents, and I recom­mend using them when­ever possible.

Downloading the packages

The best way to down­load the pack­ages is to fetch them directly from our CDN. Compared to call­ing the crates.io API, the CDN does not have rate limits and is faster (as the API redi­rects you to the CDN after updat­ing the down­load coun­t). The CDN URLs follow this pattern:

https://static.crates.io/crates/{name}/{name}-{version}.crate

For exam­ple, here is the link to down­load Serde 1.0.0 . Pack­ages are tar.gz files.

If you want to ensure the contents of the CDN were not tampered with you can verify the SHA256 check­sum of the file you down­loaded by compar­ing it with the cksum field in the index.

Keeping your local copy up to date

The best way to keep your local copy up to date is to fetch a fresh list of crates avail­able on crates.io and check if all of them are present in the local system, down­load­ing the ones you’re miss­ing. I person­ally recom­mend this approach as it’s less error-prone, and it heals your copy auto­mat­i­cally if for what­ever reason some of the changes are lost during a previ­ous update.

Another inter­est­ing approach you could imple­ment is to get the differ­ence since the last update of the index with git diff , pars­ing its output to get the list of crates that were added. There are also third-­party crates such as crates-index-diff that auto­mate this process for you. This approach is more frag­ile and error-prone, but it might be the only sensi­ble solu­tion if check­ing whether you down­loaded a crate or not is slow or expensive.

Common issues to be aware of

While the basics of down­load­ing the contents of crates.io are simple, there are a couple of issues to be aware of when imple­ment­ing such tooling:

  • The crates.io team strives to keep the registry as immutable as possi­ble, but we can’t always keep that promise. The tech­nol­ogy world does­n’t exist in a bubble, and there are laws every­one needs to abide to. Occa­sion­ally we receive take­down requests due to trade­mark or copy­right issues, and we have to remove the crates both from the registry and the CDN: your tool­ing should handle exist­ing crates disappearing.

  • To reduce the down­load size for cargo users we regu­larly squash the index repos­i­tory into a single commit, and start the git history from scratch. The previ­ous history is kept in a sepa­rate branch. To account for this we recom­mend running these commands to update the index:

    git fetch
    git reset --hard origin/master

Recommend

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK