
Github Deprecate the authors field by pietroalbini · Pull Request #3052 · rust-l...

 3 years ago
source link: https://github.com/rust-lang/rfcs/pull/3052

Member

pietroalbini commented 9 days ago

This RFC proposes to deprecate the package.authors field of Cargo.toml. This also implies preventing Cargo from auto-filling it, allowing crates to be published to crates.io without the field being present, and avoiding displaying its contents on the crates.io and docs.rs UI.

Rendered

`cargo init` will stop pre-populating the field when running the command, and

it will not include the field at all in the default `Cargo.toml`. Crate authors

will still be able to manually include the field before publishing if they so

choose, even though Cargo will warn when trying to publish those crates.

rylev 9 days ago

Contributor

I'm not fully sold that a warning is necessary. If it's not populated by default, what's the issue with someone adding it if they want to?

pietroalbini 9 days ago

edited

Author

Member

The main purpose I see for a warning is telling people who ran cargo init/cargo new before this RFC that they can actually remove the field if they so choose. Also, if the goal is to deprecate the field we should eventually have people stop using it.

Lokathor 9 days ago

Contributor

I like the authors field and would not appreciate it going away.

It's fair to make it not required, but it shouldn't actually be removed entirely.

XAMPPRocky 9 days ago

Member

Repeating from the Zulip. I'm not sure I see the need for the deprecation. This field has valid uses in many areas as people have pointed out, if crates.io doesn't want to display it they don't have to, but I don't see why that requires deprecation on Cargo's side, rather than solely making it optional. crates.io could stop displaying the information already even without this RFC.

pietroalbini 9 days ago

Author

Member

The [package] table of Cargo.toml only contains metadata used by Cargo or registries, and I'd like to see it remain that way (by deprecating and eventually removing through editions). Ultimately that's not my call to make though, and I'm curious what the Cargo team thinks about it.

ashleygwilliams 9 days ago

edited

Member

I am a fan of deprecation. The fields of a manifest file in a package management system are precious, and their numbers tend to trend towards infinite if there's not a clear effort to keep them tidy and minimal (see also package.json). The harm here may not be obvious; many fields can all have many useful reasons to exist. But for new and existing users alike, reading existing manifest files and creating new ones, an excess of optional fields is at best overwhelming and at worst actively confusing. The optionality of a field is not local to the individual manifest file and a name like "author" sounds... authoritative. In my opinion, the benefit of keeping this field doesn't outweigh the cost of the long-term maintenance (which will be costly on several axes). I also think this is a great, low-risk opportunity to develop a workflow and practice for deprecating manifest fields, which will be critical for the future ergonomics of crates.io, cargo, and devtools that leverage Cargo.toml.

kennytm 7 days ago

Member

the benefit of keeping this field doesn't outweigh the cost of the long-term maintenance (which will be costly on several axes).

The action with least maintenance is to entirely remove all code involving the authors field, and treating it as an unknown field (impossible due to $CARGO_PKG_AUTHORS but let's put this aside).

Currently, both cargo and crates.io ignore these unknown fields in [package]. So, at least on the axis of code maintenance, actively throwing a deprecation warning is more costly than doing nothing, unless rust-lang/cargo#3576 is implemented.

Aloso commented 9 days ago

What's the reason why authors of a crate can't be renamed or removed after the crate was published? Is the reason technical (e.g. because of changing hashes) or social?

Member

Author

pietroalbini commented 9 days ago

What's the reason why authors of a crate can't be renamed or removed after the crate was published? Is the reason technical (e.g. because of changing hashes) or social?

The reason is technical: that metadata is stored in the Cargo.toml, which is part of the crate tarball. Updating it would change the hash, breaking immutability and more importantly breaking all the projects with a Cargo.lock.

SOF3 left a comment

You're taking two steps here. Making the authors field unnecessary and deprecating it are totally different things. Is it possible to do this step by step?

I understand that you want to remove authors completely to avoid crates.io maintenance issues in the future, a problem that cannot be addressed merely by making the field unnecessary. But this change is too aggressive without sufficient compensating benefit, at least in my opinion.

their name from the Internet, and the crates.io team doesn't have any way to

address that at the moment except for deleting the affected crates or versions

altogether. We don't do that lightly, but there were a few cases where we were

forced to do so.

SOF3 9 days ago

Is it really justified that we conduct a major change just for a minor use case that happens very rarely?

SOF3 9 days ago

This also only removes their name from the internet, but not the contents they created. Is this really meaningful in that sense? In particular, what if, for example, their names for some reason got into the code section of another person's crate?

pietroalbini 9 days ago

Author

Member

Is it really justified that we conduct a major change just for a minor use case that happens very rarely?

One of the things I value the most is the personal safety of every Rust user. I strongly believe changes like this are justified if they can prevent people from being harmed.

This also only removes their name from the internet, but not the contents they created. Is this really meaningful in that sense? In particular, what if, for example, their names for some reason got into the code section of another person's crate?

This is anecdotal evidence, but I have had access to [email protected] for almost two years, and all of the cases where personal information needed to be deleted were related to package.authors, not the source code of the crates. Of course we can't prevent people from intentionally adding their name in the source code, but not forcing them to do so will address most of the issues.

The contents of the field also tend to scale poorly as the size of a project

grows, with projects either making the field useless by just stating "The

$PROJECT developers" or only naming the original authors without mentioning

other major contributors.

SOF3 9 days ago

What if we look at it from another way? Authors is not for accreditation, but for contacting a maintainer. In that case what if we just rename authors to maintainer/contact?

pietroalbini 9 days ago

Author

Member

That's not the main reason why I'd like for this RFC to land. It's another effect that I personally think is positive, but it's more of a collateral benefit.

Nemo157 3 days ago

Member

If the authors want to be contactable they can still provide contact details in the description/readme/homepage (I assume most maintainers will want to be contacted via their project's issue tracker, not random emails).

published versions: this is highly desirable to ensure working builds don't

break in the future, but it also has the unfortunate side-effect of preventing

people from updating the list of crate authors defined in `Cargo.toml`'s

`package.authors` field.

SOF3 9 days ago

Is it really not possible to redact their names from existing packages? The only real use case for package.authors is env!("CARGO_PKG_AUTHORS"). Could anyone study how often this is actually used? Even if it is used, redacting a field from an existing package is unlikely to cause any issues unless, for some reason, a certain crate fails to compile without having a : in $CARGO_PKG_AUTHORS, or unless the crate tries to encode some logic inside the authors field. (This is hilarious, but I have actually seen the latter done in another community by someone who doesn't want his software to be "stolen" by forking + changing author name)

pietroalbini 9 days ago

Author

Member

Changing the contents of a crate will invalidate its hash, which will prevent any person depending on the crate from building their code.

nagisa 3 days ago

Contributor

The only other use-case I know is for listing maintainers of separate crates in large internal workspaces, but that can be easily achieved in some other way.

Cargo currently provides author information to the crate via

`CARGO_PKG_AUTHORS`, and some crates (such as `clap`) use this information.

Deprecating the authors field will require crates currently using it to change,

such as by inlining the author data.

SOF3 9 days ago

What is the expected impact in the long term? If it is eventually removed, will backward compatibility be broken for current packages using $CARGO_PKG_AUTHORS?

If we don't intend to remove it in the long term, why deprecate (instead of remove) it at all?

pietroalbini 9 days ago

Author

Member

How to remove the field in the future is left as a future possibility. An approach I could see working is using the edition mechanism, but I think that's out of scope for this RFC.

The API will continue returning the `authors` field in every endpoint which

currently includes it, but the field will always be empty (even if the crate

author manually adds data to it). The database dumps will also stop including

the field.

SOF3 9 days ago

This seems to cause a superset of the problems caused by redacting authors in existing versions upon author's explicit request. Are you sure this is justified?

pietroalbini 9 days ago

Author

Member

As far as I'm aware there is no documented API endpoint that exposes the authorship information, and the database dumps are clearly marked as "experimental". Removing the information from there will mean we can delete it from the crates.io database.

`cargo init`, and it will not include the field in the default template for

`Cargo.toml`. Cargo will also treat the field as deprecated, eventually

displaying a deprecation warning when someone tries to publish a crate with the

field set.

SOF3 9 days ago

Plus, this no longer requires the $USER variable to be set in cargo new. This is actually good news for docker image maintainers.

SOF3 commented 9 days ago

What's the reason why authors of a crate can't be renamed or removed after the crate was published? Is the reason technical (e.g. because of changing hashes) or social?

The reason is technical: that metadata is stored in the Cargo.toml, which is part of the crate tarball. Updating it would change the hash, breaking immutability and more importantly breaking all the projects with a Cargo.lock.

Where is the hash used? Is it guaranteed to be stable such that external tools may depend on it?

Member

Author

pietroalbini commented 9 days ago

Where is the hash used? Is it guaranteed to be stable such that external tools may depend on it?

The hash of the crate is used by Cargo to ensure dependencies were not tampered with. If any hash in Cargo.lock does not match, Cargo will refuse to start any build.
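The check described above can be sketched roughly as follows. This is a minimal illustration, not Cargo's implementation: `checksum` and `verify` are hypothetical names, and the standard library's `DefaultHasher` stands in for the cryptographic checksum that Cargo actually records, purely to keep the sketch dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in checksum over the bytes of a crate archive.
/// Assumption: a real registry client would use a cryptographic hash here.
fn checksum(archive: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    archive.hash(&mut hasher);
    hasher.finish()
}

/// Mimics the lockfile check: the build may proceed only if the bytes
/// received from the registry hash to the value recorded earlier.
fn verify(archive: &[u8], locked: u64) -> Result<(), String> {
    let actual = checksum(archive);
    if actual == locked {
        Ok(())
    } else {
        Err(format!(
            "checksum mismatch: locked {:x}, got {:x}",
            locked, actual
        ))
    }
}
```

Because `Cargo.toml` is part of the hashed archive, editing only the `authors` line of an already-published version is enough to make `verify` fail for everyone who locked the original bytes.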

SOF3 commented 9 days ago

Have you considered integrating this with the edition system, such that we specify that the hash may be mutable for edition 2021 crates? Since edition 2021 crates cannot be compiled by older rust toolchains anyway, this is unlikely to cause issues.

I heard we're not continuing with editions though, so this may not be a good idea.

Contributor

clarfonthey commented 9 days ago

I'm very for this change and folks who really want it can add an AUTHORS.md file.

But, one potential alternative could be to categorise certain metadata as being excluded from package versions entirely and updated separately, and I see that as a valid extension of this. For example, it might be nice to be able to update the maintenance status badge without pushing a new version, or fix a typo in the description.

Member

Author

pietroalbini commented 9 days ago

Have you considered integrating this with the edition system, such that we specify that the hash may be mutable for edition 2021 crates? Since edition 2021 crates cannot be compiled by older rust toolchains anyway, this is unlikely to cause issues.

Allowing the registry to alter the contents of the published crates without Cargo preventing builds would remove the immutability guarantee we currently have, and it would make reproducible builds way harder if not impossible to achieve. To me it seems like that approach would cause much more fallout than this RFC.

But, one potential alternative could be to categorise certain metadata as being excluded from package versions entirely and updated separately, and I see that as a valid extension of this. For example, it might be nice to be able to update the maintenance status badge without pushing a new version, or fix a typo in the description.

That's a really interesting idea! I definitely see the appeal of storing the metadata somewhere else, but that is going to require a lot of design work to get it right. As an example, I could see the maintenance badge being "versionless" metadata while the description remains tied to each individual version.

Even if we implement that, we'll need this RFC or an equivalent of it to land in order to remove that metadata from the Cargo.toml, so I see it more like a future possibility. I can add it there if you want!

Contributor

Lokathor commented 9 days ago

Yes, please add that as a future possibility.

Member

joshtriplett commented 9 days ago

With my Cargo team hat on (though not speaking for the rest of the Cargo team), this seems reasonable to me, and thank you for clearly laying out the rationale.

I suspect the most notable transition difficulty will be for crates that are currently relying on CARGO_PKG_AUTHORS, such as those using clap or similar.

I'd like to see a note in the RFC proposing guidance for how such crates should proceed. Crates that currently read CARGO_PKG_AUTHORS will need to handle it not being present. Crates relying on dependencies to read CARGO_PKG_AUTHORS will need to make use of some other mechanism to specify the authors (such as directly in the source), assuming they still wish to do so.

Otherwise, this looks good to me. Nominating for discussion in the next Cargo meeting.
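One way a crate could handle the variable being absent or empty is sketched below. It assumes only what Cargo documents about `CARGO_PKG_AUTHORS` (a colon-separated list); `parse_authors`, `authors`, and `FALLBACK_AUTHORS` are hypothetical names for this illustration, and the fallback string is a placeholder.

```rust
/// Hypothetical list maintained directly in the source, used when the
/// manifest provides no authors.
const FALLBACK_AUTHORS: &[&str] = &["The Example Project Developers"];

/// Splits the colon-separated value Cargo exposes as `CARGO_PKG_AUTHORS`.
/// A missing variable and an empty string both yield an empty list.
fn parse_authors(raw: Option<&str>) -> Vec<String> {
    raw.unwrap_or("")
        .split(':')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(String::from)
        .collect()
}

/// Authors for this build: the manifest value if Cargo provided one,
/// otherwise the list inlined above.
fn authors() -> Vec<String> {
    // `option_env!` is resolved at compile time and is `None` when the
    // variable is not set, so the crate still builds without the field.
    let from_manifest = parse_authors(option_env!("CARGO_PKG_AUTHORS"));
    if from_manifest.is_empty() {
        FALLBACK_AUTHORS.iter().map(|s| s.to_string()).collect()
    } else {
        from_manifest
    }
}
```

Using `option_env!` rather than `env!` is the design choice that matters here: `env!` fails the build when the variable is missing, while `option_env!` lets the crate fall back gracefully.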

lu-zero commented 9 days ago

edited

I think it would be better to make author default to "Project Authors" and defer to the version control or other means to identify the authors.

Please note that most licenses expect the information to be present in one way or another.

Contributor

kornelski commented 9 days ago

edited

I agree with the problem it causes and would like to see a fix for this.

There is another problem with this field: it has no clear connection with crate owners, so reconciliation of authors and owners (to display both as a single list) is difficult and error-prone. It requires having a database of email-GitHub mappings, and for team-owned crates it's not even possible to cross-check the two sets.

However, this field has some uses that don't have a replacement (yet?):

  1. It allows giving credit to collaborators and previous authors without giving them ownership of the crate and publishing rights. It allows crediting people and orgs that aren't GitHub entities. The field is ordered and filtered, unlike the set of crate owners (e.g. a CI publishing bot or the rust-bus backup account is not an author). npm has a collaborators field that is ideal for this. Cargo's authors field is the closest equivalent.

  2. It can outlive GitHub accounts, so it can work as a backup contact information.

  3. Crates have author's name even when their GitHub account doesn't.

  4. It works as a historical record. crates-io database dumps contain ownership information, but it's only the latest state, not a changelog.

Of course in all these cases people should still have control over their personal info. So a solution that moves where this data is stored to make it mutable to me seems better than complete removal.

I see no issue with making the authors field optional. I am strongly opposed to the rest of this proposal. I see several serious problems with it, the following being among them:

  1. Making changes to cargo because of the limitations of a website somewhere makes zero sense to me. The functioning of the website has nothing to do with cargo itself. The appropriate place to propose a change is with the maintainers of the website.
  2. The checksum SHOULD change if the content changes. This is by design. Violating this design has serious security and compatibility implications, as infrastructure can no longer trust in the current guarantees assumed of the checksum. Changing the content of the list of authors is by definition a change to the crate, and changing this semantics has significant technical consequences.
  3. If you change the crate, you should bump the version number. Retroactively changing a crate for any reason and leaving the version number and/or the checksum the same is a violation of the social contract that has long been established in the software community. I feel that there is a serious ethical problem in violating this social contract.
  4. If there is a problem with how the checksum is used, the appropriate fix is to change how you use the checksum. Maybe you should be using a different identification mechanism. An RFC proposing to incorporate such a mechanism should be welcome, in my view, if it is helpful to the community.
  5. The concerns of an author regarding the content of a crate are not and should not be the purview of the cargo project to address. There may be a wide range of reasons someone would object to the content of a crate. Accounting for them all by modifying the features and requirements of cargo serves to complicate the manifest, not simplify it as is suggested by another poster. If an author no longer wants content to be available somewhere, the correct course of action is to request the content to be removed. Yes, this means the older version of the crate will no longer be available. That is precisely what the author is requesting. If crates.io wants to implement a mechanism that redirects requests to an alternative crate, or any other mechanism to accommodate the content change, then they have the freedom to do so. They should not have the freedom to make changes to the content of a crate and tell people it is the same crate. It is not the same crate.
  6. The author field serves the very important role of attribution. Providing proper attribution is a strongly respected value within the scientific community which serves a variety of purposes both ethical and utilitarian. The challenges of accurately recognizing authorship are not unique to software. They exist in other intellectual domains as well. I see no barrier to implementing additional ways of serving attribution (under the constraints of my comments above), but deprecating the author field is not the right way forward.

I am personally very sympathetic to the motivations for this proposal and am happy to discuss my POV on them out of channel. Despite my strong objections above, they should not in any way be construed as a judgment of the RFC author. Indeed, I thank the author for this valuable contribution and wish to encourage them to continue to participate. Likewise, I am grateful for the opportunity to express my opinion here and to be able to read the alternative points of view of others.

Contributor

kornelski commented 8 days ago

edited

@rljacobson You're missing that the crates-io website displays what is published and checksummed in the registry. While it could override what is in the registry, this wouldn't change the data for others. For example, https://lib.rs displays data from the registry, bypassing crates-io website wherever possible. The registry is promised to be immutable, and this is relied on by various tools, caches, and lockfiles, so the registry can't change. People's names are mutable, therefore they can't be in the immutable part of the registry.

After reading the comments of others above, I want to point out that we already have a mechanism for dealing with errors or mistakes in software, including those that may be very sensitive or ethically significant. What do we do when, for example, a serious security flaw is discovered in a widely used piece of software? What do we do when someone accidentally publishes proprietary information in violation of copyright with their code? If someone publishes the private information of hundreds of their company's customers? I do not see a compelling reason to make authorship a distinguished special case of these kinds of bugs.

Contributor

kornelski commented 8 days ago

Currently the registry has yanking for hiding insecure software, but it doesn't delete or change it. The crates-io registry sometimes deletes crates that are spam or subject to legal complaints, but this is meant to be a rare occurrence.

@rljacobson You're missing that the crates-io website displays what is published and checksummed in the registry. While it could override what is in the registry, this wouldn't change the data for others.

This is a brute fact of reality regardless of the features we do or do not provide users. Once you decide you want to make a change to a crate, the original crate will not magically change on everyone's computer. And again, you are describing a problem with how crates.io functions, not a problem with cargo.

People's names are mutable, therefore they can't be in the immutable part of the registry.

But the data they published in the past is NOT mutable. And there is nothing anyone can do about that regardless of their desire.

What you CAN do is remove the affected crate. This is what we do with every other instance of content that has been published that we wish to no longer be available. That's what should happen here.

Contributor

kornelski commented 8 days ago

edited

But the data they published in the past is NOT mutable. And there is nothing anyone can do about that regardless of their desire.

Well, no. This is under discussion. There are people who desire to change their names in already-published crates, and we're discussing how to accommodate that.

I mean for crates that have already been published using existing metadata+checksum this is hard, but we can change Cargo/crates-io/registry so that names will be mutable going forward.

Like many things in computer science, this could be fixed with a layer of indirection: the immutable packages could contain numeric identifiers for each author (so that there's an immutable record of who was credited), alongside a separate mutable identifier->name mapping.
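The indirection suggested above might look something like the following sketch. Nothing like this exists in Cargo or crates.io; `PublishedVersion`, `AuthorDirectory`, and their methods are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// Immutable part: what would be checksummed and published with a version.
/// It records only opaque author ids, never names.
#[derive(Debug, Clone, PartialEq)]
struct PublishedVersion {
    name: String,
    version: String,
    author_ids: Vec<u64>,
}

/// Mutable part: kept by the registry outside any checksummed archive.
#[derive(Default)]
struct AuthorDirectory {
    names: HashMap<u64, String>,
}

impl AuthorDirectory {
    /// Sets or changes the display name for an id.
    fn set_name(&mut self, id: u64, name: &str) {
        self.names.insert(id, name.to_string());
    }

    /// Drops the name for an id, e.g. on a removal request.
    fn remove(&mut self, id: u64) {
        self.names.remove(&id);
    }

    /// Resolves the ids of a published version to current display names,
    /// skipping ids that no longer have a name.
    fn display_authors(&self, version: &PublishedVersion) -> Vec<String> {
        version
            .author_ids
            .iter()
            .filter_map(|id| self.names.get(id).cloned())
            .collect()
    }
}
```

Renaming or removing a person touches only `AuthorDirectory`; the `PublishedVersion` value, and therefore anything hashed from it, stays byte-for-byte the same.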

But the data they published in the past is NOT mutable. And there is nothing anyone can do about that regardless of their desire.

Well, no. This is under discussion.

The fact that the crate was previously published is a fact of history. You can only change the present/future, not the past. Unless you invent a time machine, in which case I would like to invite you and your time machine over to dinner sometime soon. There are a few recent events of the last year I would like to address. :)

What is under discussion is whether we should allow an existing crate to be modified and still be considered the same crate. I answer with a strong and passionate no. I think we can and should accommodate name changes in another way.

Contributor

Lokathor commented 8 days ago

If there was, say, a security flaw in an old version of a crate: you'd yank the bad version and publish a new version. In the case of a dire enough problem a crate could even be deleted entirely, as was mentioned.

I'm sympathetic to a person wanting to change their name in the authors field, but I don't follow why they can't also use the "yank and publish an update" system we already have for other problems with published crate content.

Combined with removing the requirements for an authors field (which it seems that everyone so far agrees with) that should be satisfactory.

It occurs to me that the proposal does not do a good job of solving the problem it is meant to and may even make it worse.

The goal is to enable, hypothetically, Robert Jacobson (me) to change his name in all of his crates to Robert Lawrence. We can imagine there is a reason of personal security for me to want to do so. The best case scenario is if “Robert Jacobson” is removed from public repositories hosting my code and from projects depending on my code, the number of which might be significant (hypothetically). Changing “Robert Jacobson” to “Robert Lawrence” on crates.io while not changing the identity of the crate will only provide the new name to future downloads of the crate. Existing projects will not see the change.

On the other hand, pulling the crate and bumping the version number for the modified crate is likely to result in more of the projects that depend on the crate getting the updated name, either because the crate isn’t cached or because the Cargo.toml always fetches the latest minor version (or whatever), forcing the download of the modified crate. This is the more desirable outcome.

As for removing the authors field, this will undoubtedly have the effect of a nonstandard AUTHORS.md file or something similar becoming common practice. Now my name is in some AUTHORS.md file, and I want to change it. What do I do? I am certainly not better off than if I was listed in the authors field in the Cargo.toml. In fact, I am arguably worse off, because there is not and cannot be a formal mechanism incorporated into the tooling to help me.

Member

Author

pietroalbini commented 8 days ago

edited

This is a lot of feedback, thanks y'all for commenting!

Please note that most licenses expect the information to be present in one way or another.

@lu-zero I don't think the authors field has any impact on licensing. Most of the licenses I'm aware of require the license text to be reproduced, so my non-lawyer understanding is that license = "SPDX" or the authors list don't have legal meaning.

It works as a historical record. crates-io database dumps contain ownership information, but it's only the latest state, not a changelog.

@kornelski crates.io actually stores ownership changes in its database, and I could see that being useful to track who can publish new releases. If you have a use for that kind of data we can eventually discuss including it in the database dumps.

Making changes to cargo because of the limitations of a website somewhere makes zero sense to me. The functioning of the website has nothing to do with cargo itself. The appropriate place to propose a change is with the maintainers of the website.

@rljacobson crates.io is part of the Rust project, and closely collaborates with the Cargo team. All of the features where Cargo interacts with a registry are developed in collaboration with the crates.io team, and this is one of those features!

The checksum SHOULD change if the content changes. This is by design. Violating this design has serious security and compatibility implications, as infrastructure can no longer trust in the current guarantees assumed of the checksum. Changing the content of the list of authors is by definition a change to the crate, and changing this semantics has significant technical consequences.
If you change the crate, you should bump the version number. Retroactively changing a crate for any reason and leaving the version number and/or the checksum the same is a violation of the social contract that has long been established in the software community. I feel that there is a serious ethical problem in violating this social contract.

I totally agree that the contents of published crates (and their checksum) should never change, that's out of the question. What I'm proposing in this RFC is that we deprecate including structured authorship information in the Cargo.toml, so that when changes on this inevitably happen we don't have to delete the crate and break every single user of that crate.

Changes to the authorship information don't affect reproducibility of builds, and don't change any security property of the crate. The contents of the field are basically freeform text, and there is nothing preventing me from publishing a crate under your name. The kind of authorship information that matters from a reliability and security perspective (who's allowed to publish new releases) is already tracked outside of Cargo.toml (and thus outside the checksum).

The concerns of an author regarding the content of a crate are not and should not be the purview of the cargo project to address. There may be a wide range of reasons someone would object to the content of a crate. Accounting for them all by modifying the features and requirements of cargo serves to complicate the manifest, not simplify it as is suggested by another poster. If an author no longer wants content to be available somewhere, the correct course of action is to request the content to be removed. Yes, this means the older version of the crate will no longer be available. That is precisely what the author is requesting. If crates.io wants to implement a mechanism that redirects requests to an alternative crate, or any other mechanism to accommodate the content change, then they have the freedom to do so. They should not have the freedom to make changes to the content of a crate and tell people it is the same crate. It is not the same crate.

Deleting crates can have an enormous impact on the ecosystem (as everyone saw when packages were deleted in other ecosystems). Thankfully we have never had to delete a popular crate, but what would happen if we had to remove a crate a big chunk of the ecosystem depends on? That's surely going to become a problem in the future and I want to do everything I can to minimize the disruption it's going to cause. Lowering the number of cases where we have to do that is the best tool at our disposal.

The author field serves the very important role of attribution. Providing proper attribution is a strongly respected value within the scientific community which serves a variety of purposes both ethical and utilitarian. The challenges of accurately recognizing authorship are not unique to software. They exist in other intellectual domains as well. I see no barrier to implementing additional ways of serving attribution (under the constraints of my comments above), but deprecating the author field is not the right way forward.

In my experience the authors field is practical only if there is one or a couple maintainers for a crate. It's not feasible, for example, to set authorship information for the Rust compiler in its Cargo.toml, as there are way too many contributors. That results in the field either being outdated (only listing the original maintainer) or defaulting to something like "The Rust Project Developers". Both of those approaches render the field useless in my opinion.

After reading the comments of others above, I want to point out that we already have a mechanism for dealing with errors or mistakes in software, including those that may be very sensitive or ethically significant. What do we do when, for example, a serious security flaw is discovered in a widely used piece of software? What do we do when someone accidentally publishes proprietary information in violation of copyright with their code? If someone publishes the private information of hundreds of their company's customers? I do not see a compelling reason to make authorship a distinguished special case of these kinds of bugs.

We have yanking for security vulnerabilities, which prevents new uses of the crate. For the other cases we unfortunately have to delete the crates, but that doesn't mean we shouldn't strive to delete as few crates as possible.

What is under discussion is whether we should allow an existing crate to be modified and still be considered the same crate. I answer with a strong and passionate no. I think we can and should accommodate name changes in another way.

What I'm proposing here is not to modify an existing crate, but to avoid including information that could need to be changed in the future.

As for removing the authors field, this will undoubtedly have the effect of a nonstandard AUTHORS.md file or something similar becoming common practice. Now my name is in some AUTHORS.md file, and I want to change it. What do I do? I am certainly not better off than if I was listed in the authors field in the Cargo.toml. In fact, I am arguably worse off, because there is not and cannot be a formal mechanism incorporated into the tooling to help me.

We can't prevent people from including their name in the source code of their crate, but as Ashley pointed out in another comment having the field in the Cargo.toml encourages people to fill it in when they're publishing a new crate and they're looking at what fields are available.

I'm sympathetic to a person wanting to change their name in the authors field, but I don't follow why they can't also use the "yank and publish an update" system we already have for other problems with published crate content.

@Lokathor unfortunately yanking can't prevent people from being harassed or doxxed. Both of those things have already happened multiple times, and I want to avoid crates.io becoming a tool for harassment as much as possible.

Member

Kixiron commented 8 days ago

I feel like the vast majority of people publishing crates (and therefore consciously making their work public for all the world to see) either don't care about the name they put into the authors slot or actively want attribution for their work, so removing the authors field entirely seems to me a knee-jerk reaction to combat a niche desire. I have no problem with making the field optional for publishing to accommodate those who don't want their name on anything, but automatically filling it out is what I think most users desire, and if it's simply made optional, those who care can remove the field and therefore not expose their information. This really seems like a genuinely good middle ground: those who want privacy can get it by simply deleting a line from their project, while anyone who doesn't care has cargo keep functioning for them as normal (entering their names automatically, which is presumably what they prefer). To further that, Cargo could even offer an option in the config.toml file that toggles auto-authorship in the Cargo.toml: with it set to true (the default) it acts as it does now and fills in the authors field from the various assorted sources, and if it's set to false it omits that particular line from the template that's generated with cargo new.

Contributor

Diggsey commented 8 days ago

@Kixiron the problem is that doesn't actually help with the main issue raised in the RFC.

I think it would be useful to have a separate file to store data which is not part of the crate, i.e. metadata. This could include authors, description, and other information that is not code, nor related to the functioning of that code.

This information would be stored in the registry when you publish a crate, but could also be updated/removed at any time, via eg. cargo publish --metadata-only. We can then populate this metadata file by default, as it can easily be amended later if needed.

Member

Kixiron commented 8 days ago

That's not actually what the RFC is about though, that would be a tangential thing. The title of the RFC is Deprecate the authors field, and that seems to be too large of a reaction to a niche need: that kind of sweeping change affects everyone (the vast majority of whom want their names on their crates) for a minority's (very valid) concern.

tl;dr: We should create a way to express authorship. Authorship is data that is mutable and cross-version. A per version, immutable, manifest file is not the correct way to express this data.

I've commented previously that I support deprecation, but I'd like to add that my support of deprecation is not a statement about wanting to deny expression of authorship.

Fundamentally, as others have stated previously in this thread, authorship, as is expressed by a name and contact information for a particular person or set of persons, is mutable data that exists, often, across versions of a crate, not a single version. As a result, I do not think that the immutable, per version, manifest file is the correct place to express this information. The simple fact that we made this mistake early and have existed with it for a long time doesn't convince me that we should suffer the mistake indefinitely into the future.

Will people respond poorly to this warning? Yes, I'd expect that. In general people default to reacting negatively to change. That being said, now is the time to get the community prepared and used to transitions like this. These types of changes only get harder from here as Rust explodes in popularity. Avoiding change is a type of strategy, but I think it is much wiser for the Rust community to learn how to manage change instead of hoping to avoid it, as ultimately change is inevitable. Long term, avoiding change makes the ecosystem harder for new folks to join as the complexity of the system grows without bound.

Should we hold off on this deprecation until a new way to express authorship is included as part of the RFC? Maybe. I'd be open to that for sure. But should we simply settle for making optional those fields that we find are mismatched with their technical expression, because it is the path of least resistance? No; I think such a decision will set precedent and haunt us long into the future and ultimately cost us in an area we care deeply about: learning curve.

Contributor

Lokathor commented 8 days ago

but if the author list of foocrate 0.6 is different from the author list of foocrate 0.5, then the author list of foocrate 0.5 is still the same list. if you add or remove an author for the next version, they didn't retroactively work on the previous version.

but if the author list of foocrate 0.6 is different from the author list of foocrate 0.5, then the author list of foocrate 0.5 is still the same list. if you add or remove an author for the next version, they didn't retroactively work on the previous version.

@Lokathor in my experience it is not idiomatic or common to use the authors field as a way to acknowledge per release/version contributors.

Contributor

Lokathor commented 8 days ago

Well, that's how i actually use the authors field, and that's why it upsets me that you're proposing it be taken away from us.

Member

ashleygwilliams commented 8 days ago

edited

Well, that's how i actually use the authors field, and that's why it upsets me that you're proposing it be taken away from us.

@Lokathor I tried to do some research to see that usage in action because I would be interested in the details of that workflow but couldn't find anything. Could you share an example?

lahwran commented 7 days ago

edited

Chiming in to say i feel that this is an important change and that deprecation, if it happens, should be made from a philosophy of making backwards compatible changes when doing deprecation in fundamental data representations in which the deprecation is not due to security implications. i think that my biggest concern with this besides privacy is the fact that the authors field encourages misrepresentation of the authors of large projects. I think that is enough to motivate deprecation but that that motivation is not necessarily particularly urgent compared to the privacy and security issue of crate removal due to unexpectedly leaked data. my initial reaction on seeing this was to be quite concerned about the removal of an important form of metadata, but i've been convinced by reading this thread that the current representation is inappropriate enough to deprecate because of misrepresentation.

my ideal representation, which is probably too complex to actually use due to the external dependency on version control, would be a mapping from contributor id in the version control to a real life identity on the internet such as a github username - such that the actual commits could have relatively little user metadata and no author metadata is present in the file tree. that provides the mutability motivating this rfc, moves the mapping to an external system while still providing consistent identities, and more accurately represents to users how public they are making information when they make it available to the internet. However, I expect it to end up being concluded by the community that it is inappropriate to rely on version control for the author list due to not wanting to rely on any specific version control system, and I don't know what i would suggest to do instead, because even an external structured authors.{toml,json,etc} file has the issue of contributor overhead keeping it up-to-date.

Member

Author

pietroalbini commented 7 days ago

Pushed some minor updates to the RFC text!

  • Clarified in the guide-level explanation that the deprecation warning will not happen right away. This was already mentioned in the reference-level explanation, but while re-reading the text I noticed it might not have been clear by reading just the guide-level explanation.
  • Mentioned the steps needed to update existing crates, as @joshtriplett requested.
  • Added storing mutable metadata separately as a future possibility, as initially proposed by @clarfonthey and reiterated by others.

Perhaps Cargo could remove the authors field on cargo publish to crates-io? It would probably have to remove it from Cargo.toml.orig too.

@kornelski that'd actually be a breaking change due to code like env!("CARGO_PKG_AUTHORS"): this RFC does not break existing code, as the variable would still be present until the maintainer actively removes the field from existing crates. If Cargo or crates.io were to strip the field automatically, code that worked before would stop working without any action on the maintainer's part.
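To illustrate the kind of code that reads this variable, here is a minimal sketch. It uses `option_env!` rather than `env!` so that it still compiles when the variable is not set at all; the helper names `parse_authors`, `author_line` and `crate_author_line` are made up for this example. Cargo documents `CARGO_PKG_AUTHORS` as a colon-separated list, and the sketch assumes an absent field shows up as an empty or unset value.

```rust
/// Splits the raw value of `CARGO_PKG_AUTHORS` into individual entries.
/// Cargo documents this variable as a colon-separated list of authors.
fn parse_authors(raw: &str) -> Vec<&str> {
    raw.split(':')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .collect()
}

/// Returns a printable author line, or `None` when no authors are available
/// (variable unset, or set to an empty string).
fn author_line(raw: Option<&str>) -> Option<String> {
    let authors = parse_authors(raw.unwrap_or(""));
    if authors.is_empty() {
        None
    } else {
        Some(authors.join(", "))
    }
}

/// Reads the variable at compile time. `option_env!` yields `None` instead of
/// a compile error when the variable is not set, unlike `env!`.
fn crate_author_line() -> Option<String> {
    author_line(option_env!("CARGO_PKG_AUTHORS"))
}
```

A binary that prints `crate_author_line()` in its help output keeps building whether or not the maintainer has removed the field; it simply has nothing to print in the second case.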

Member

jtgeibel commented 7 days ago

edited

My biggest concern (motivating my support for some form of deprecation) is that currently, for the vast majority of the users of cargo who have ever published a crate (to crates.io or an alternative registry), they have had personal information automatically obtained from their environment and then incorporated into an immutable package. This is a liability for the entire ecosystem, not just the website. Some users will notice this when modifying Cargo.toml but few will take the time to consider potential future implications. We really should, I would argue must, make this opt-in.

An example of good opt-in behavior is git, which requires a name and email to be configured when building a commit. If the fields aren't configured, this is an error and git will auto-detect the user's information and even tell them exactly what command to run to fix things. I think cargo could do something similar. When publishing a crate, if auto-detected info is found in Cargo.toml, and the user has not yet opted in, then the situation is explained and they must decide to opt-in, remove, or modify the field.

For new crates, cargo new should stop including the field. (Addressing this mistake is what makes this complicated.)

This scheme shouldn't break CI for users, as the info on CI is unlikely to be listed in the authors field. Unfortunately, if someone on a team rarely publishes new releases, they won't be shown the prompt. Maybe this means we need to store the approvals server side instead of in the user's home directory, and eventually disallow publishing crates where an entry in authors has not been explicitly approved by someone during the deprecation period. And I say "someone" because I don't really see how we can verify that the right human approves each entry in the list. I do hope that we can rely on the community to remove unapproved entries from their Cargo.toml rather than opting-in for another person.

Contributor

Lokathor commented 7 days ago

@ashleygwilliams There's no specific workflow. I just sometimes remember to ask people if they want me to put their name in, and sometimes they say yes when I ask. However, I haven't collaborated with anyone who cared in quite a long time.

tinyvec is surely at least 50% not-me at this point because of all the people in the #black-magic community discord channel who have helped with huge swaths of that project. A true community effort. randomize in version 3 is 100% me, I blanked the repo and restarted it between v2 and v3. wide is like 95% me and then there was one other person who kept adding math functions they wanted for a while. bytemuck is like 95% me and then thomcc added an extra API they wanted that sorta vaguely fit the crate's theme. If any of those people asked to get on the authors list I'd of course put them in.

Older examples of crates of mine with more than one author listed would be gba, voladdress, and learn-gfx-hal.

Contributor

Diggsey commented 7 days ago

@jtgeibel that seems like a lot of extra complexity, and will break if the auto-detected author differs from what was populated for some unrelated reason.

How about this:

  • Cargo stops populating the authors field automatically on project creation.
  • authors field is still required for publishing, or at least it's a warning to omit it. The warning will detail how to opt-in/out of specifying authors.
  • authors field may be specified as the empty list to opt-out of publishing personal info.

This makes it no different than eg. the license field, where you have to make a decision before publishing, so we'll no longer be surprising people: if your personal information is on crates.io it's because you wanted it to be there.
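The three bullets above can be sketched as a small decision function. This is only an illustration of the proposal, taking the "warning to omit it" variant; `AuthorsField`, `PublishCheck` and `check_authors` are hypothetical names and not part of Cargo.

```rust
/// What the manifest says about `authors` (hypothetical model, not Cargo's types).
#[derive(Debug, PartialEq)]
enum AuthorsField {
    /// The key is absent from `Cargo.toml`: no decision has been made yet.
    Missing,
    /// `authors = []`: an explicit opt-out of publishing personal info.
    Empty,
    /// One or more entries the maintainer chose to publish.
    Listed(Vec<String>),
}

impl AuthorsField {
    /// Builds the model from an optional list as it might be read from the manifest.
    fn from_manifest(value: Option<Vec<String>>) -> Self {
        match value {
            None => AuthorsField::Missing,
            Some(list) if list.is_empty() => AuthorsField::Empty,
            Some(list) => AuthorsField::Listed(list),
        }
    }
}

/// Outcome of the pre-publish check in this sketch.
#[derive(Debug, PartialEq)]
enum PublishCheck {
    Proceed,
    Warn(&'static str),
}

/// Only a missing field triggers the warning; both an explicit list and an
/// explicit empty list count as a deliberate choice by the maintainer.
fn check_authors(field: &AuthorsField) -> PublishCheck {
    match field {
        AuthorsField::Missing => PublishCheck::Warn(
            "`authors` is not set: list the authors to publish them, or set `authors = []` to opt out",
        ),
        AuthorsField::Empty | AuthorsField::Listed(_) => PublishCheck::Proceed,
    }
}
```

The point of the three-way split is that an empty list and a missing key mean different things: the first is a recorded decision, the second is the absence of one.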

Member

BurntSushi commented 7 days ago

@Diggsey I like that idea and it seems like it solves the main problem that motivated this RFC? It also seems very close to what git does.

@ashleygwilliams

Fundamentally, as others have stated previously in this thread, authorship, as is expressed by a name and contact information for a particular person or set of persons, is mutable data that exists, often, across versions of a crate, not a single version. As a result, I do not think that the immutable, per version, manifest file is the correct place to express this information.

I think that author information can be seen as both mutable (names and email addresses change, albeit rarely) and an artifact of specific versions of crates (new authors introduced on 0.(x+1) shouldn't be labeled as authors for 0.x). I don't know that one is more correct than the other to be honest.

The simple fact that we made this mistake early and have existed with it for a long time doesn't convince me that we should suffer the mistake indefinitely into the future.

I just think that if we're going to reverse course on an important piece of metadata then we should have a higher bar than "it was a mistake." In this case, the mistake is leading to real world problems, but there are other ways to solve those problems that don't involve deprecation and possible future removal of the authors field. If we resolved those problems in some other way, then I personally don't see a strong motivation for deprecating authors. There's an argument that it is semantically incorrect, but I don't actually agree with that completely.

Or perhaps we are not in agreement that, say, making the field optional/opt-in solves the main problem motivating the RFC? Maybe I'm missing something there.

Member

ashleygwilliams commented 7 days ago

edited

I think that author information can be seen as both mutable (names and email addresses change, albeit rarely) and an artifact of specific versions of crates (new authors introduced on 0.(x+1) shouldn't be labeled as authors for 0.x). I don't know that one is more correct than the other to be honest.

@BurntSushi I think even if this is true, the fact that this is both attribution as well as contact information means that it needs to be mutable and therefore is not well expressed by an immutable document. I may be using version 2 of a tool and need to contact the author, who may have updated their email address several versions later. Here the authors field's technical constraints harm both parties.

I just think that if we're going to reverse course on an important piece of metadata then we should have a higher bar than "it was a mistake."

I agree that we should have a higher bar (and think that we do have a higher bar in this case). Perhaps, I am so convinced because of my privileged knowledge of having been on the crates.io support email for several years and having dealt with fields like this in my time at npm.

If we really wanted hard data, I think there's an opportunity to do analysis of crates.io to see usage of the field. I believe strongly that we'll see that most authors fields are never updated after initial commit/publish, or are updated exactly once to remove and generalize the information in the field. I'd need to write some scripts and run some queries to indicate this but that work seems useful for longterm analysis and future changes of the manifest document structure. I'd be happy to do some of that work but I do think that the crates.io team should likely own any of those research programs.

Member

Author

pietroalbini commented 7 days ago

If we really wanted hard data, I think there's an opportunity to do analysis of crates.io to see usage of the field. I believe strongly that we'll see that most authors fields are never updated after initial commit/publish, or are updated exactly once to remove and generalize the information in the field. I'd need to write some scripts and run some queries to indicate this but that work seems useful for longterm analysis and future changes of the manifest document structure. I'd be happy to do some of that work but I do think that the crates.io team should likely own any of those research programs.

I did some analysis (source code) on today's database dump, and 92.6% of the crates never changed the contents of the authors field after being published. 6.4% of the crates changed it only once, 0.8% of the crates changed it twice, and 0.2% of the crates changed it three or more times. cargo-update is the outlier, being the sole crate with 17 changes to the field.

That script also did some analysis to check whether the crates that changed the authors field only once made it "more general". I considered the field to be "general" if it contained author, developer, project or team. Out of the 3330 crates with a single change to the field, 291 (8.7%) made it more general, and 124 (3.7%) were already general.
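For readers who want to see the shape of that heuristic, here is a rough sketch. It is not the linked script: the function names are invented for this example, and matching case-insensitively on substrings is an assumption about how "contained" was applied.

```rust
/// Keywords that mark an `authors` entry as "general" in the analysis above.
const GENERAL_KEYWORDS: [&str; 4] = ["author", "developer", "project", "team"];

/// Returns true if the field contains any of the keywords.
/// Assumption: the match is a case-insensitive substring check.
fn is_general(authors_field: &str) -> bool {
    let lowered = authors_field.to_lowercase();
    GENERAL_KEYWORDS.iter().any(|keyword| lowered.contains(keyword))
}

/// A single change "made the field more general" if the old value was not
/// general and the new one is.
fn became_general(before: &str, after: &str) -> bool {
    !is_general(before) && is_general(after)
}

/// Share of `part` in `total`, as a percentage.
fn percent(part: u32, total: u32) -> f64 {
    f64::from(part) * 100.0 / f64::from(total)
}
```

With the counts quoted above, `percent(291, 3330)` and `percent(124, 3330)` round to the reported 8.7% and 3.7%.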

Raw data of the analysis

Member

jtgeibel commented 6 days ago

edited

Awesome analysis @pietroalbini. This has me thinking of a few things:

  • cargo new is run only once per package. We can look at the first release (in chronological order) for each package. For any authors listed there, their opt-in status is unclear.
  • Any authors added after the first publish chose how they wanted to represent their authorship. This isn't strictly an opt-in to store the information immutably in the registry, but it is far better than automatically including the information.
  • If an original author is not listed in the latest stable release, then their entry has been removed or modified.
    • Patch releases to older versions may occasionally include their info, but for most future releases we don't need to worry about obtaining their opt-in for this crate.
    • If we care to, we can try to address this patch release edge case.
  • Your data shows a vast majority of packages have never changed this field. I expect most of them also have only a single author listed, and that it was likely auto-generated.
  • For packages maintained by multiple people, we could focus on an opt-in for just the original authors and do so only if they are still in the latest stable releases.

Maybe we can leverage this in some way to provide a better opt-in story for existing crates that wish to retain the authors field. In the single author scenario, cargo obtains their opt-in and this is stored server side. Users would verify or modify/remove their info once per crate.

For team projects, the team would be expected to obtain an opt-in from a small group of authors, or remove them from future releases. (Exact details of what this might look like TBD.)

Aside: I've used "crate" several times here where the correct term is package. It is tough because we use the wrong term many places across the community and in code (such as our database schema and URL scheme). Is there value in changing crate -> package in this RFC to be correct? In many contexts using the right word may be more confusing to readers, but maybe we should try to be more careful here.

Contributor

Lokathor commented 6 days ago

Aside Reply: I'd say that "package" is actually the wrong term, or at least the far less used term. This actually came up once in the community discord. No one at all knew what the newbie who had just read The Book was talking about when they kept saying "package". Not a single person used that term in their normal rust life. I did an informal ask around on Zulip and almost no one there used package either. As far as I know, The Book is essentially the only place you can find the term, and even then only like the once. So we should call stuff crates and just forget about the term "package".

Member

jtgeibel commented 6 days ago

edited

even if a user opts in, if they change their email address or their name, we'll have an issue.

I definitely agree with this. I think authors should be made optional and that eventually we should provide a migration path for all optional package-level fields to be managed by the website (and possibly a CLI similar to cargo owners). I've tried to decouple that into rust-lang/crates.io#3167, but maybe we do need to take a more holistic approach here.

I think we should fast track the parts that don't seem to be controversial, changing cargo init and making authors optional. We stop digging ourselves into a hole and then figure out how to manage crates created before that point. We will still have the problem that for many existing crates, users will continue to publish new releases with their PII by default.

But my proposal's focus on opt-in only addresses the "by default" side of that. Even PII explicitly added to Cargo.toml could become a risk in the future. I guess I've been trying to find a way to make sure we notify affected users without eventually warning on every crate publish when authors is present. There has been some resistance to that aspect, and even with a deprecation there are legit uses of authors. When it comes to the "general" descriptions, these shouldn't be an issue, and some crates will want to continue supporting CARGO_PKG_AUTHORS.

Taking a step back, I think we need to provide 2 off ramps. The preferred migration path is for users to move the field out of Cargo.toml (to the website where it can be modified or removed). The other ramp might be a replacement field/file (so that we can disallow authors in a future edition) or a new field to silence the warning and indicate the user is aware this method of providing the data is immutable.

Eh2406 commented 3 days ago

The Cargo Team discussed this briefly in our meeting today. None of us had read this thread before the meeting, tho I caught up before posting. We were supportive of the "non controversial subset" of Cargo's part of this. Making authors optional and not filled by cargo init is very convincing. Adding a warning on publish (and --dry-run) that everything is immutable and permanent seems reasonable, as is calling out the authors field. We felt somewhat hesitant about deprecating/removing it. After some discussion it was mostly connotation rather than denotation. We felt comfortable with it not being in the title of the RFC and having a line like "the Cargo and Crates.io teams can arrange for how to remove it at some point in the future without a follow-up RFC."

This is my memory of our discussion, and may not be what we believe when we have read all the conversation.

Cargo will stop fetching the current user's name and email address when running

`cargo init`, and it will not include the field in the default template for

`Cargo.toml`. Cargo will also treat the field as deprecated, eventually

nagisa 3 days ago

Contributor

Does cargo currently warn when a non-existent field is present? Like say [package] does_not_exist = 124? If not, I don't think we should warn for author either. Alternatively we should start warning for all unknown or unused fields (which would probably be a change independent of this RFC)

Manishearth 3 days ago

Member

Cargo currently errors for nonexistent fields IIRC

tshepang 3 days ago

Contributor

cargo check emits a warning, but cargo publish just goes silent

pietroalbini 3 days ago

Author

Member

@nagisa if we decide to deprecate the field I still think cargo should warn when it's present, regardless of whether other fields are issuing warnings. The main goal of the warning would be to let the authors know that they can and probably should remove the authorship information they left in Cargo.toml.

lessless commented 3 days ago

edited

Removing the author field will rob people of being recognized for their input into open-source. And every contributor deserves it.

We should create a way to express authorship. Authorship is data that is mutable and cross-version. A per version, immutable, manifest file is not the correct way to express this data.

that's a nice idea and might be the most advanced in the field.

authors and maintainers can be different people. previous maintainers can step off and still can be authors because they contributed XX% to a project.

including both those groups would be really inclusive

lf- commented 3 days ago

edited

One piece that absolutely must be considered in this is that the people who want to change their names and scrub all the traces of the old one off the internet do not know they are going to want to do that in the future when they put the authorship info in a Cargo.toml.

I am speaking from personal experience, being someone who has done such a change; denial and missing signs for years are a very common experience. I accidentally made it easier for myself by being uncomfortable with my birth name for a very long time (see "missing signs"), so I put it in very few places. It still is in some repos' commit histories due to some maintainers pedantic about the Signed-off-by tag, unfortunately, which is another problem effectively equivalent to the one discussed here. I don't think my experience is typical: many are more willing to put their name on things, so we must assume that they will do that and minimize the problems if they choose to retract it.

If we want to make a tool that is kind to trans people and other groups who have good reason to change their name and erase all traces of the previous one from the internet, with the constraint that people don't know they are in this group until they do, the only way to really accomplish this is to keep this information exclusively in mutable storage, out of git and out of an immutable package tarball.

Given this constraint, I would prefer if the field were at minimum made optional, not included by default in any automatically generated configs, and an equivalent mechanism offered, for example, managing this state server side in the database and downloading it alongside the crate tarball and caching it client side in tandem with the crate code itself. It then can be envisioned that a function could even be offered to erase/change one's own name on all crates.

One last thing, I imagine that if we still support CARGO_PKG_AUTHORS in any new scheme, that would lead to nonreproducible builds since it would make builds depend on this mutable data. I'm not sure what should be done about this however.

Member

Author

pietroalbini commented 3 days ago

Thanks for reporting the outcome of the Cargo team meeting @Eh2406!

We felt somewhat hesitant about deprecating/removing it. After some discussion it was mostly connotation rather than denotation. We felt comfortable with it not being in the title of the RFC and having a line like "the Cargo and Crates.io teams can arrange for how to remove it at some point in the future without a follow-up RFC."

Hmm, my main worry with this approach is that saying "cargo and crates.io can arrange how to remove the field in the future" when both teams seem to have an intention to eventually do so is practically a deprecation, and if we're doing one we should be explicit about it. Since that's the most controversial part of this RFC I'd also personally prefer to just make a decision on it here, to avoid having the discussion again when we actually remove the fields.

One piece that absolutely must be considered in this is that the people who want to change their names and scrub all the traces of the old one off the internet do not know they are going to want to do that in the future when they put the authorship info in a Cargo.toml.

I am speaking from personal experience, being someone who has done such a change; denial and missing signs for years are a very common experience. I accidentally made it easier for myself by being uncomfortable with my birth name for a very long time (see "missing signs"), so I put it in very few places. It still is in some repos' commit histories due to some maintainers pedantic about the Signed-off-by tag, unfortunately, which is another problem effectively equivalent to the one discussed here. I don't think my experience is typical: many are more willing to put their name on things, so we must assume that they will do that and minimize the problems if they choose to retract it.

If we want to make a tool that is kind to trans people and other groups who have good reason to change their name and erase all traces of the previous one from the internet, with the constraint that people don't know they are in this group until they do, the only way to really accomplish this is to keep this information exclusively in mutable storage, out of git and out of an immutable package tarball.

I wholeheartedly agree with this statement.

Given this constraint, I would prefer if the field were at minimum made optional, not included by default in any automatically generated configs, and an equivalent mechanism offered, for example, managing this state server side in the database and downloading it alongside the crate tarball and caching it client side in tandem with the crate code itself. It then can be envisioned that a function could even be offered to erase/change one's own name on all crates.

What if instead of a name and email address, each author was identified by a randomly-generated unique ID? crates.io could then store a mapping from this to human-readable data.
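A minimal sketch of what such a mapping could look like, purely as an illustration of the idea: `AuthorId` and `AuthorRegistry` are hypothetical names, the registry is an in-memory map standing in for whatever crates.io would actually store, and generating the random IDs is left out.

```rust
use std::collections::HashMap;

/// Opaque identifier that would be the only thing stored in the immutable
/// package. Hypothetical type; ID generation is out of scope here.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct AuthorId(u64);

/// Mutable, registry-side mapping from opaque IDs to human-readable names.
#[derive(Default)]
struct AuthorRegistry {
    display_names: HashMap<AuthorId, String>,
}

impl AuthorRegistry {
    /// Sets or changes the name shown for an ID, without touching any package.
    fn set_name(&mut self, id: AuthorId, name: &str) {
        self.display_names.insert(id, name.to_string());
    }

    /// Removes the human-readable data; published packages keep only the ID.
    fn forget(&mut self, id: AuthorId) {
        self.display_names.remove(&id);
    }

    /// Resolves the IDs recorded in a published version to current names,
    /// skipping IDs whose data has been removed.
    fn render(&self, published: &[AuthorId]) -> Vec<&str> {
        published
            .iter()
            .filter_map(|id| self.display_names.get(id).map(String::as_str))
            .collect()
    }
}
```

Because every published version refers to the same ID, a single `set_name` or `forget` call changes what is displayed for all of them at once.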

One last thing, I imagine that if we still support CARGO_PKG_AUTHORS in any new scheme, that would lead to nonreproducible builds since it would make builds depend on this mutable data. I'm not sure what should be done about this however.

I think CARGO_PKG_AUTHORS should be deprecated. For compatibility, we can replace it with an empty string.

lf- commented yesterday

edited

What if instead of a name and email address, each author was identified by a randomly-generated unique ID? crates.io could then store a mapping from this to human-readable data.

This also works. I'm not too worried about the technical aspects as I know the Rust team can figure this out.

One last thing: I imagine that if we still support CARGO_PKG_AUTHORS in any new scheme, that would lead to nonreproducible builds, since it would make builds depend on this mutable data. I'm not sure what should be done about this, however.

I think CARGO_PKG_AUTHORS should be deprecated. For compatibility, we can replace it with an empty string.

For the use cases where it is presently used, such as help output from clap, having Cargo not provide this function will most likely mean that names get put into application source code and checked into git instead (as this is a feature they want to offer). This unfortunately keeps names in immutable storage; however, far fewer of them would be in immutable storage owned by the Rust project. Admittedly, if the name needs to be added manually to clap(...), that will significantly decrease the number of apps including author names, but it will still be a problem, and author names will still end up on crates.io and in git in the cases where people elect to add them.

I would argue that, in terms of the metric introduced above, removing the environment variable entirely would be a regression over a solution where Cargo provides mutable author name handling as a service to the tools that use it and adds a big warning that the builds are not reproducible if the feature is used.

I think the idea of having author information inside executables in the first place produces a fundamental reproducibility problem because it is mutable data and we are trying to make sure the executable output is immutable, and I do not know how we can achieve reproducibility while this information is there. I sincerely hope there is a "have your cake and eat it too" solution for this but I am not sure if it is possible.

What if instead of a name and email address, each author was identified by a randomly-generated unique ID? crates.io could then store a mapping from this to human-readable data.

This also works. I'm not too worried about the technical aspects as I know the Rust team can figure this out.

One last thing: I imagine that if we still support CARGO_PKG_AUTHORS in any new scheme, that would lead to nonreproducible builds, since it would make builds depend on this mutable data. I'm not sure what should be done about this, however.

I think CARGO_PKG_AUTHORS should be deprecated. For compatibility, we can replace it with an empty string.

For the use cases where it is presently used, such as help output from clap, having Cargo not provide this function will most likely mean that names get put into application source code and checked into git instead (as this is a feature they want to offer). This unfortunately keeps names in immutable storage; however, far fewer of them would be in immutable storage owned by the Rust project. Admittedly, if the name needs to be added manually to clap(...), that will significantly decrease the number of apps including author names, but it will still be a problem, and author names will still end up on crates.io and in git in the cases where people elect to add them.

I would argue that, in terms of the metric introduced above, removing the environment variable entirely would be a regression over a solution where Cargo provides mutable author name handling as a service to the tools that use it and adds a big warning that the builds are not reproducible if the feature is used.

Good point.

I think the idea of having author information inside executables in the first place produces a fundamental reproducibility problem because it is mutable data and we are trying to make sure the executable output is immutable, and I do not know how we can achieve reproducibility while this information is there. I sincerely hope there is a "have your cake and eat it too" solution for this but I am not sure if it is possible.

The only one I can think of is for the executable to obtain the author name via a network request at run time, which isn't exactly practical.

