
RFC: Serve crates-io registry over HTTP as static files

Source: https://github.com/rust-lang/rfcs/pull/2789

Conversation

Contributor

kornelski commented on Oct 19, 2019

edited

The existing crate index format is good enough to be served over HTTP as-is, and still fetched and updated incrementally.

This can be faster than a git clone of the whole index ahead of time, and will spare clients from unbounded growth of the index.
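For reference, this is the existing index layout that would be served unchanged over HTTP; a minimal sketch of the name-to-path mapping (the host name in the comment is a made-up placeholder):

```rust
/// Map a crate name to its path in the existing index layout, which this
/// RFC proposes to serve verbatim over HTTP
/// (e.g. `https://index.example.com/se/rd/serde`, placeholder host).
fn index_path(name: &str) -> String {
    let name = name.to_lowercase();
    match name.len() {
        0 => panic!("crate names are never empty"),
        1 => format!("1/{name}"),
        2 => format!("2/{name}"),
        3 => format!("3/{}/{name}", &name[..1]),
        _ => format!("{}/{}/{name}", &name[..2], &name[2..4]),
    }
}

fn main() {
    assert_eq!(index_path("serde"), "se/rd/serde");
    assert_eq!(index_path("rand"), "ra/nd/rand");
    assert_eq!(index_path("syn"), "3/s/syn");
}
```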

Originating forum thread.

Rendered RFC

Contributor

djc commented on Oct 19, 2019

It would be good to touch on offline use scenarios and how these (as well as slow links) would be affected by these changes.

Contributor

Author

kornelski commented on Oct 19, 2019

edited

I've added more info about network speed. To sum it up:

  • It's a significant saving of bandwidth compared to a full clone, even if it ends up downloading all crates anyway. For CI that runs once and doesn't keep the index, it'll be a huge reduction in bandwidth.

  • For a first run with no cache and no lockfile, the worst case is as slow as the max depth of the dependency tree (but in practice less than that thanks to crates appearing in multiple places in the tree). In my projects I've seen deps 12 levels deep, so on a high-latency connection it may take several seconds, but that still is likely way faster than a full index clone.

  • For subsequent runs, with some cache or a lockfile, the update can be speculatively parallelized to reduce latency to one or two roundtrips, because Cargo will already know about all or almost all dependencies-of-dependencies that it needs to check.

  • In terms of the number of update requests, the worst-case scenario without the incremental changelog is a few hundred requests. It may seem like a lot, but thanks to web bloat, web servers regularly handle that much for a typical page load without breaking a sweat :) These requests can be made conditional (answered with HTTP 304 Not Modified when nothing changed), so updates use very little bandwidth; see the sketch after this list. The incremental changelog can efficiently mark crates as fresh and not needing updates, so that in the best case an update is just one HTTP request for the log itself.

  • Cargo has a fast path for GitHub that checks if the index is fresh in one HTTP request. This can be supported too for the HTTP index, so the fast path is just as fast.
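To make the conditional-request point concrete, here is a minimal revalidation sketch; the index host is a placeholder and the `ureq` client is just an illustrative choice, not something the RFC mandates:

```rust
use std::error::Error;

/// Revalidate one cached index file with a conditional GET.
/// Returns Ok(None) when the cached copy is still fresh (HTTP 304),
/// which costs almost no bandwidth.
fn fetch_if_changed(path: &str, cached_etag: Option<&str>) -> Result<Option<String>, Box<dyn Error>> {
    let url = format!("https://index.example.com/{path}"); // placeholder host
    let mut req = ureq::get(&url);
    if let Some(etag) = cached_etag {
        // Ask the server to answer 304 Not Modified if the file is unchanged.
        req = req.set("If-None-Match", etag);
    }
    let resp = req.call()?;
    if resp.status() == 304 {
        Ok(None) // cache is fresh, reuse the local copy
    } else {
        Ok(Some(resp.into_string()?)) // updated index contents
    }
}
```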

Note that cost of the git index is proportional to total size of crates-io, but cost of this solution is proportional to the project size. At some point git clone will become too large to be usable, but this solution will remain functional (even large Rust projects don't use all crates-io dependencies at the same time, so they won't grow as big as fast).

Member

Nemo157 commented on Oct 19, 2019

One third-party tool that could not support a feature with this design is cargo add's - vs _ automatic distinction:

> cargo add serde-json
WARN: Added `serde_json` instead of `serde-json`

well, it still could be implemented by fetching every permutation. That would be O(2^n) in the number of `-`/`_` separators, but since n is generally going to be 1-2, that could be ok.

Member

matklad commented on Oct 21, 2019

I think it might make sense to spell out how Cargo would use a new index in some more details. Currently, it works as follows.

For every command that can change the lockfile (most commonly, the first cargo build after a Cargo.toml change, or cargo update), the first thing Cargo does is update the registry (that is, pull new changes from the git repo). After that, Cargo has a fresh offline copy of the registry in a bare git checkout, and uses it for the actual dependency resolution.

The two interesting properties of status quo are:

  • if you change lockfile, you always use the most recent version of the index, so that security patches and other minor updates propagate as fast as possible
  • index changes are globally atomic: Cargo works with a set of packages, identified by a commit hash

What would we do in the new implementation? I think there's no explicit registry update anymore. Instead, we update registry on a per-crate basis lazily, when we do version resolution?

Contributor

Author

kornelski commented on Oct 21, 2019

edited

@matklad This is covered in the "greedy fetch" section. It works as follows:

  1. Cargo still "updates the registry" as a separate step, but instead of fetching everything, it fetches the subset of the registry relevant to a specific workspace. At this stage it's not an actual precise dependency resolution, but a "greedy" dependency resolution whose job is to discover and update all dependencies that may be needed. The greedy fetch guarantees that every crate the actual dependency resolution algorithm might look at will be up to date (a sketch follows after this list).

  2. After the incremental update, Cargo has a fresh offline copy of the registry, except crates that were not relevant for the update. It can run the regular dependency resolution as usual, and it won't know the difference.
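A non-normative sketch of that greedy discovery step, heavily simplified (versions and features are ignored, and the HTTP layer is stubbed out):

```rust
use std::collections::HashSet;

/// Sketch of the "greedy fetch": starting from the workspace's direct
/// dependencies, fetch index files wave by wave until no new crate names
/// appear. Each wave can go out as parallel HTTP/2 requests, so round
/// trips are bounded by the depth of the dependency tree. This
/// over-approximates on purpose: it covers every crate the precise
/// resolver *might* look at.
fn greedy_fetch(direct_deps: &[String]) -> HashSet<String> {
    let mut seen: HashSet<String> = direct_deps.iter().cloned().collect();
    let mut wave: Vec<String> = direct_deps.to_vec();
    while !wave.is_empty() {
        let mut next_wave = Vec::new();
        // A real implementation would fetch this whole wave concurrently.
        for name in &wave {
            for dep in deps_listed_in_index_file(name) {
                if seen.insert(dep.clone()) {
                    next_wave.push(dep);
                }
            }
        }
        wave = next_wave;
    }
    seen // every index file the precise resolver could possibly need
}

/// Stand-in for the HTTP + cache layer: fetch the crate's index file and
/// return the union of dependency names across all its published versions.
fn deps_listed_in_index_file(name: &str) -> Vec<String> {
    unimplemented!("GET the index file for {name} and parse its JSON lines")
}
```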

The registry storage format would change from a bare git repo to a directory structure similar to a git checkout. Theoretically it'd be possible to mix the two and add "fake" commits to a local git repo to keep using it as the storage format, but I don't think that's worth the complexity.

A possible alternative that wouldn't require HTTP/2 support: download the index as a ZIP, which could use GitHub's download-as-ZIP functionality.

Contributor

Author

kornelski commented on Oct 24, 2019

edited

The ZIP would be good only for the initial download (you wouldn't want to download 20MB+ on every cargo update), so it sounds like a variation of the "Initial index from rustup" alternative in the RFC.

ZIPs don't have to be compressed (the "store" compression method), and if GitHub supports range requests, one could fetch only the headers and the parts that changed, i.e. do an incremental download.

Contributor

Author

kornelski commented on Oct 24, 2019

edited

Oh, that's clever! So yes, a ZIP could be used to combine multiple requests into two or three range requests (one to find the offset of the ZIP central directory, one to get the directory, and one to get the ranges of the desired files), as sketched after this list.

  • GitHub unfortunately doesn't support Range on their ZIPs, so crates-io would have to store the ZIP elsewhere.
  • Care would have to be taken to recover from race conditions (in case the ZIP changed after the client read the central directory).
  • If crates-io ever moved to storing crate metadata in a database, it'd be difficult to maintain a ZIP-like "API" on top of that. Plain HTTP/2 requests don't dictate how the index is stored server-side.
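A sketch of how those range requests could work, under stated assumptions (no ZIP64, no trailing archive comment, a server that honors Range; `ureq` is again just an example client):

```rust
use std::error::Error;
use std::io::Read;

/// Locate a ZIP's central directory using HTTP range requests, so that
/// individual files could later be pulled out of the archive without
/// downloading it whole. Illustrative only.
fn central_directory_range(url: &str) -> Result<(u64, u64), Box<dyn Error>> {
    // 1. The End Of Central Directory record is 22 bytes (without a
    //    comment) and sits at the very end of the archive, so ask for
    //    just the final 22 bytes with a suffix range request.
    let resp = ureq::get(url).set("Range", "bytes=-22").call()?;
    let mut eocd = Vec::new();
    resp.into_reader().read_to_end(&mut eocd)?;
    assert_eq!(&eocd[0..4], b"PK\x05\x06", "not an EOCD record");

    // 2. Bytes 12..16 hold the central directory's size, and bytes
    //    16..20 its offset from the start of the archive.
    let size = u32::from_le_bytes(eocd[12..16].try_into()?) as u64;
    let offset = u32::from_le_bytes(eocd[16..20].try_into()?) as u64;

    // 3. A second range request for `bytes=offset..offset+size-1` fetches
    //    the directory itself; a third fetches the desired file entries.
    Ok((offset, size))
}
```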

Currently the index requires registries to list all of their crates ahead of time. In a model where individual crates are requested directly by name, a registry could even create crates on demand (lazily). That may be interesting for setups where a Cargo registry is a facade for other code repositories.

remram44 commented on Oct 24, 2019

edited

You are describing something akin to zsync (or librsync), which, interestingly, I'm in the process of re-implementing in Rust.

If Cargo hits static files over HTTP rather than a git repo, wouldn't that make life a lot easier for private registries such as Artifactory to support Rust? This seems like it would lower the barrier to supporting Rust in an enterprise setting, so a thumbs up from me.

Personal opinions, not speaking for the team.

This feels like a direction that could actually work! I have participated in many "just use HTTP" conversations in the past; this is the first time it has sounded plausible. (Sorry if you explained it to me before and I was too dense to get it.)

At some point between now and npm scale, this will be faster. That being said, my sense is that the tipping point is far off. Additionally, it is very important to read the line "Since alternative registries are stable, the git-based protocol is stable, and can't be removed." As such I will be spending my time on ways to improve the git-based protocol. This RFC is convincing enough that I would not object to someone spending their time on developing it.

On Performance:
This RFC assumes that a robust parallel HTTP/2 client will be maintainable and faster than a git fetch. I am skeptical by default, and would love to see examples showing that to be true. Async and await make things much easier than they were last time I tried, and things continue to improve.

For Crates.io:
On CDN:
The Index is generously hosted by GitHub with a good CDN. We do not do things that would get it kicked out (such as shallow clones). Which side of that line is using raw.githubusercontent.com? I would like proactive assurance that switching will be ok. (I miss lolbench.rs and don't want that to happen to the Index.) Do we know someone at GitHub who can provide such assurance? If not, can we get an estimate of the CDN bill? Switching may have to wait until a bill of that size feels small.

For Alternative Registries:
I really like that this reduces the set of infrastructure required to set up an Alternative Registry to a static file server!

Should compression support be mandatory for clients, i.e. must they send an Accept-Encoding: ... header? gzip, br, zstd, ...?

The Cargo team discussed this RFC over the past few weeks, and we wanted to leave some thoughts of what we discussed and what we're thinking about this RFC.

Overall none of us are opposed to the principles of this RFC or the idea it's chasing after. If there's a better solution for hosting and managing the index, that'd be great to implement! If it's better after all then it's better :)

Our main discussion point, though, was that implementing this style of index fetching is likely to be a very significant technical undertaking in Cargo. There are a ton of concerns to take care of, and it's not quite as simple as "just throw HTTP/2 at the problem and let it solve everything". We'd ideally like to see a prototype implementation to prove out the possible gains of an architecture like this, but unfortunately a prototype is itself likely such a significant undertaking that there needs to be more buy-in before even that happens.

The conclusion we reached is that we'd be willing to accept this under the idea that it's an "eRFC" or "experimental RFC". It would basically be a blessing that the Cargo team is interested in seeing research and development around this area of hosting the index and fetching it on the client. We aren't wed to any particular details yet, and the actual details of this RFC as-written are highly likely to change while a solution is being developed.

Contributor

Author

kornelski commented on Dec 9, 2019

edited

I've implemented a proof-of-concept with greedy resolution that takes into account features and version ranges.

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept

https://github.com/kornelski/cargo-static-registry-rfc-proof-of-concept/blob/master/src/main.rs

Tested with the Rocket framework, which required 122 requests:

  • With a full second of added latency per request, over HTTP/1.1, it fetched all deps in 11 seconds.
  • With 200ms of added latency per request, over HTTP/1.1: 3.5 seconds.
  • Without extra latency, over HTTP/2 on a fast connection: 1.6-3 seconds.

Tested with the 144 crates.io dependencies used by Servo, on a fast connection. It required 344 requests:

  • over HTTP/1.1: 1.2-2s.
  • over HTTP/2: 0.7-1.3s.

As predicted, more dependencies make lookup faster, because lookup starts with more parallelism.

Just a single small crate - 8 requests:

  • over HTTP/1.1: 0.3s
  • over HTTP/2: 0.2s

All these tests simulated the worst case: a blank slate with no cache.

Contributor

Author

kornelski commented on Dec 9, 2019

edited

I've added cache and revalidation support. The cache flattens the check's critical path to a depth of 1, and revalidation keeps bandwidth use minimal, so all checks (regardless of which crates are checked) are almost equally fast.

With disk cache and a fast connection:

  • HTTP/1: 0.8s
  • HTTP/2: 0.4s

With disk cache and a 3G connection:

  • HTTP/1: 1.6-2.4s
  • HTTP/2: 0.7s-2s

A problem this RFC doesn't seem to address with its local cache is deleting entries from the index: it's something we do very rarely, but sometimes we're legally required to do so (DMCA, GDPR...), and in those cases we can't leave the deleted entries in users' local caches. The git index solves this cleanly (it's just yet another commit).

> The Index is generously hosted by GitHub with a good CDN. We do not do things that would get it kicked out (such as shallow clones). Which side of that line is using raw.githubusercontent.com? I would like proactive assurance that switching will be ok. (I miss lolbench.rs and don't want that to happen to the Index.) Do we know someone at GitHub who can provide such assurance? If not, can we get an estimate of the CDN bill? Switching may have to wait until a bill of that size feels small.

If we do this we should just host the content on Rust infrastructure. We should still be on the $0.04/GB CloudFront pricing tier, which is sustainable for such small files.

Contributor

Author

kornelski commented on Dec 9, 2019

Note that Cargo currently does not delete crate tarballs and checked out copies of crates even after they have been removed from the index. Deleted metadata files continue to be distributed as part of git history until the index branch is squashed.

With a static file solution, when a client checks the freshness of a deleted crate, it will make a request to the server and see a 404/410/451 HTTP status. It can then act accordingly and clean up local data (even the tarball and source checkout).

If the client is not interested in a deleted crate, it won't check it, but then chances are it never downloaded it in the first place. If the ability to immediately erase deleted data is important, the "incremental changelog" feature can be extended to proactively announce deletions.
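A sketch of that client-side cleanup; the host, the `ureq` client, and both cache helpers are hypothetical stand-ins, not real Cargo APIs:

```rust
/// Handle deletions during revalidation: a static index simply stops
/// serving the file, unlike git history, which keeps it until a squash.
fn revalidate(crate_name: &str) -> Result<(), Box<dyn std::error::Error>> {
    let url = format!("https://index.example.com/{crate_name}"); // placeholder
    match ureq::get(&url).call() {
        Ok(resp) => store_index_file(crate_name, &resp.into_string()?),
        // 404 Not Found / 410 Gone / 451 Unavailable For Legal Reasons:
        // the crate was deleted, so erase local data too, including the
        // cached tarball and the source checkout.
        Err(ureq::Error::Status(404 | 410 | 451, _)) => purge_crate(crate_name),
        Err(e) => return Err(e.into()),
    }
    Ok(())
}

// Hypothetical cache helpers, not real Cargo APIs.
fn store_index_file(_name: &str, _contents: &str) { /* write to local cache */ }
fn purge_crate(_name: &str) { /* delete index entry, tarball, checkout */ }
```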

Member

alexcrichton commented 22 days ago

Thanks for taking the time to update this @kornelski! And thanks as well for all the implementation and proof-of-concept work you've put into this @jonhoo; it's been invaluable for evaluating this!

I'm personally almost ready to start the FCP to merge this, but before doing so I wanted to try to have the RFC specify less of the technical design, describing the high-level design instead, with the details given only as examples. From my perspective, the purpose of this RFC is acknowledging that the git index will not scale to the future of crates.io and needs replacement. The replacement, and this RFC, is "let's serve each crate's index file over HTTP". The rest of the RFC is largely along the lines of "here's how this is practical".

@jonhoo's data shows that the git index is really good at what it was designed for: after you do an initial clone, everything is blazingly fast, and incremental updates are very fast too. I think it's relatively obvious that if you fetch only a small fraction of the index, HTTP clearly beats git for the initial fetch, but the really tough part about HTTP (and what we wanted data for) is how it performs on other metrics. I think the majority of the design section of the RFC should explain why this tradeoff is worth it.

Given the initial data, I think the conclusion is that fetching the entire crate graph for a project over HTTP is so fast that we can just do it every time. The speed of the fetch, however, relies on requests per second and on compression. Compression is a given with HTTP, but the achievable request rate is, practically speaking, heavily reliant on caching and a CDN right now.

Overall I basically don't want to paint ourselves into a corner with how we implement an HTTP index. I would like Cargo to retain design freedom at this stage to figure out what works best for it and how best to implement the HTTP index. I'm pretty confident we will find a design that works (since the current PR is already very close to "just works"), but this is why I don't want to specify too many technical details in this RFC. For example, it's not clear to me why we'd use /se/rd/serde as a fetch path instead of just /crates/serde (or similar). Similarly, I don't think specifying a greedy fetch is what we'll want to do long-term, despite it being sufficient for now.

Eh2406 commented 22 days ago

I want to reiterate Alex's point of thanking you both, you have both done a lot of work to get us here!

> Similarly, I don't think specifying a greedy fetch is what we'll want to do long-term, despite it being sufficient for now.

I think the greedy fetch was important to show that this can be implemented in Cargo without massively rewriting everything; I would not have been willing to consider the RFC/PR without it. But as this code path gets more use, I am very willing to work on rewriting whatever is needed to make it more efficient. Alex has suggested a trick to integrate it into the existing resolver that may be only weeks of work! Even if that trick does not work out, I would expect us to have something better within, say, the next 2 years.

Contributor

Author

kornelski commented 20 days ago

edited

@alexcrichton I've removed implementation details from the RFC and changed a bunch of "must"/"should" to "may".

I don't want to reduce the RFC to the purely high-level idea of an HTTP index, because a lot of the RFC answers concerns people have raised about the feasibility of this solution.

Contributor

jonhoo commented 20 days ago

@Eh2406 I almost feel like Greedy Fetch should be in its own subsection called "Feasible Implementation Proof". I agree with @alexcrichton that I don't think we want the RFC to introduce a normative greedy fetch algorithm, but it is a useful data-point in demonstrating that the RFC can perform well. Maybe an "evidence" section that points to the performance of my experimental PR + benchmark results would be sufficient?

Member

alexcrichton commented 19 days ago

Ok thanks @kornelski! And sorry yeah I don't mean to say the RFC should be devoid of details, only that if the RFC is considered canon I don't want Cargo to accidentally be locked into implementing something we figure out a better solution to later.

I do think we'll need to lock down at least some details during stabilization (e.g. paths, configuration, etc), but details like greedy fetch, caching, fetch order, and such can probably never be fully defined. We'd then ideally have a set of guidelines for "here's how to be a fast index" and that could be a living document to get updated over time (and presumably crates.io would always implement the recommendations).

Contributor

jonhoo commented 14 days ago

@kornelski I submitted a PR that removes the greedy fetch, since I don't think we'd want that to be normative, and instead replaces it with a note on feasibility: kornelski#2. With that I think I'm happy with the RFC!

https://andre.arko.net/2014/03/28/the-new-rubygems-index-format/

Bundler used to have a full index fetched ahead of time, similar to Cargo's, until it grew too large. Then it used a centralized query API, until that became too problematic to support. Then it switched to an incrementally downloaded flat file index format similar to the solution proposed here.

joshtriplett 14 days ago

Member

More prior art: Debian has "pdiffs", where it provides incremental deltas from the complete "Packages" file for a certain period of time, so that apt update can download those incremental deltas if it isn't too far out of date. There have been several revisions to that approach, to minimize the number of diffs that an update must download.

kornelski 14 days ago

Author

Contributor

Is pdiff similar to zsync?

Member

alexcrichton commented 13 days ago

@rfcbot fcp merge

I think this is in a good state to merge now personally, thanks again @jonhoo and @kornelski. To recap for cc'd members here: this RFC declares that the git index has a finite amount of time left before its scalability becomes a large problem. The RFC proposes using HTTP instead, serving each crate's index file individually, but details of how exactly this all works are not specified in the RFC (though examples are given). @jonhoo's prototype work shows that this should be a very viable solution for the registry's index on crates.io, and while more implementation work in Cargo remains, we're already in a pretty good state.

I have also removed T-crates-io from this RFC since this is currently just related to Cargo really. We've discussed with the crates-io team that work may be upcoming in the future, but while this is implemented in Cargo and tested in the ecosystem the goal is to avoid any changes to crates.io itself.

rfcbot commented 13 days ago

edited by ehuss

Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged team members:

No concerns currently listed.

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

rfcbot commented 10 days ago

🔔 This is now entering its final comment period, as per the review above. 🔔

rfcbot commented 10 hours ago

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

The RFC will be merged soon.
