4

doc: guarantee call order for sort_by_cached_key by digama0 · Pull Request #8962...

 2 years ago
source link: https://github.com/rust-lang/rust/pull/89621
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Copy link

Contributor

digama0 commented on Oct 7, 2021

edited

slice::sort_by_cached_key takes a caching function f: impl FnMut(&T) -> K, which means that the order that calls to the caching function are made is user-visible. This adds a clause to the documentation to promise the current behavior, which is that f is called on all elements of the slice from left to right, unless the slice has len < 2 in which case f is not called.

For example, this can be used to ensure that the following code is a correct way to involve the index of the element in the sort key:

let mut index = 0;
slice.sort_by_cached_key(|x| (my_key(index, x), index += 1).0);

Copy link

Collaborator

rust-highfive commented on Oct 7, 2021

r? @yaahc

(rust-highfive has picked a reviewer for you, use r? to override)

That code with index += 1 is something I hope to not find/see when I work on code written by other people. So if this change allows code like that, then I'd like to avoid this specification.

Copy link

Member

yaahc commented on Oct 7, 2021

That code with index += 1 is something I hope to not find/see when I work on code written by other people. So if this change allows code like that, then I'd like to avoid this specification.

Can you elaborate further on why that pattern concerns you?

Copy link

Member

yaahc commented on Oct 7, 2021

Marking this as needs-fcp since this documentation affects our stable API guarantees.

Copy link

leonardo-m commented on Oct 7, 2021

edited

Can you elaborate further on why that pattern concerns you?

In the last 10-15 years or so in the programming world people are finding that a more functional style leads to cleaner and less buggy code. While going total-pure-functional as in Haskell is still controversial (and with several disadvantages for system programming), reducing the amount of mutation (especially in situations like this, I'd say), is a healthy compromise between messy mutation-everywhere soup code and crystalline pure functional code, that leads to reasonably correct and reasonably handy to write code :)

Copy link

Member

yaahc commented on Oct 7, 2021

edited

Can you elaborate further on why that pattern concerns you?

In the last 10-15 years or so in the programming world people are finding that a more functional style leads to cleaner and less buggy code. While going total-pure-functional as in Haskell is still controversial (and with several disadvantages for system programming), reducing the amount of mutation (especially in situations like this, I'd say), is a healthy compromise between messy mutation-everywhere soup code and crystalline pure functional code, that leads to reasonably correct and reasonably handy to write code :)

While I don't disagree that functional-style code tends to be easier to reason about than imperative-style code and that code which is easier to reason about tends to be less buggy, I don't think that concern is specific enough that we can act upon it. One of Rust's core strengths is it's ability to help users write imperative style code to achieve performance that would otherwise be impossible with functional style programming via safe borrow-checked mutability. This function seems like a prime example of one that someone would reach for when they need to optimize performance.

If you have specific examples of how adding this guarantee could introduce major footguns then those will absolutely be considered, but I do not think we can base API decisions on a general desire to discourage imperative-style code.

Copy link

Contributor

Author

digama0 commented on Oct 7, 2021

I also want to note that if we really cared to discourage mutation in the closure, it should have been declared as Fn(&T) -> K instead of FnMut(&T) -> K. To me that signature says that mutation in the closure is an expected and intended use case, so documenting what happens when you do so seems like a natural part of the API here.

Obviously, the ship has sailed on changing the signature, Rust has stability guarantees and this is in the stable API.

Copy link

Member

yaahc commented on Oct 8, 2021

I cant think of any reasons we'd want to change the iteration order for how we setup the keys, so I see no reason not to commit to a stable iteration order and am happy to start the FCP.

@rfcbot merge

Copy link

rfcbot commented on Oct 8, 2021

edited

Team member @yaahc has proposed to merge this. The next step is review by the rest of the tagged team members:

Concerns:

Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

Copy link

Member

BurntSushi commented on Oct 12, 2021

I'm a little wary of committing to this change here. Not because I have any particular knowledge or insight that would lead me to believe we might not want this guarantee in the future, but rather, because it isn't totally obvious to me that this won't become restrictive in the future. I'm no expert on sorting algorithms, but every once in a while, there is some advancement there where someone comes a long and improves the state of the art. Are we absolutely positively sure that a guarantee like this won't inhibit anything there? I don't think I am...

I would also like to see some use cases for this guarantee. The example provided in this issue doesn't quite cut it for me. What is the actual problem being solved? What is the cost of not adding this guarantee?

Copy link

Contributor

Author

digama0 commented on Oct 12, 2021

edited

I think there are a few ways to weaken the guarantee while still keeping it useful. For example, it's not necessary to mention len < 2, and instead say that if f is called at all then it is called in order.

Still weaker, but relevant to the name sort_by_cached_key, is that f will never be called twice on the same key. I think that users expect this from the function since that's the defining difference from sort_by_key, and I can imagine initialization type setups that require this guarantee for correctness.

The main difference between sort_by_key and sort_by_cached_key is in the number and order of calls to the key function, so this is very much front and center of the API. sort_by_key is already able to be more vague about when and how many times the key function is called, but sort_by_cached_key could provide a more useful guarantee to users that want to use stateful key functions.

Regarding potential future sorting algorithms: If we grant that the key function must not be called more than once per key, then it is easy to prove that for len >= 2 a correct implementation must call f exactly once on every key, but it is still possible for the function to be called in reverse order, or in some permutation. The algorithm would have to choose the permutation in advance of seeing (most of) the keys, so it is difficult to imagine a genuine use for this outside some adversarial shuffling.

What is the cost of not adding this guarantee?

Without the guarantee, stateful key functions become a footgun, even though the function signature allows them. If the user doesn't know when or how many times the key function is called, the state of the closure can be in a large (exponential) number of possible states, at least according to the API, while actual testing will reveal that the key function is called only left-to-right, so it is easy to accidentally depend on this and not have the issue show up in testing or fuzzing, only to get bitten when the next version of std comes out. (I recall some cryptocurrency had an outage issue not too long ago with accidental dependence on binary_search API non-guarantees, so this sort of thing can happen in real life.)

Copy link

Member

yaahc commented on Oct 12, 2021

edited

Regarding potential future sorting algorithms: If we grant that the key function must not be called more than once per key, then it is easy to prove that for len >= 2 a correct implementation must call f exactly once on every key, but it is still possible for the function to be called in reverse order, or in some permutation. The algorithm would have to choose the permutation in advance of seeing (most of) the keys, so it is difficult to imagine a genuine use for this outside some adversarial shuffling.

Similarly, I'm having difficulty imagining any scenario where we optimize this function by somehow changing how we create the keys used by the sort. The way we cache the keys seems orthogonal to how the sort is then executed. Though I absolutely agree with @BurntSushi that I would like to have some better example usecases for this guarantee.

@rfcbot concern example-usecases

Copy link

Member

BurntSushi commented on Oct 12, 2021

edited

Yeah, basically, if we add this guarantee and something in the future comes up that wants more flexibility, we'll be able to say, "no we can't do that, because that will not only invalidate documented behavior, but that guarantee was made so that x, y and z use case would be able to work." Like, I think I'm convinced on a philosophical level---and maybe that is enough---but it would be good to make this concrete.

Copy link

Contributor

Author

digama0 commented on Oct 12, 2021

I think that, more or less by necessity, most use cases will have a similar flavor to the original example, and so may not be particularly compelling. Some more examples:

  • The cache key is generated by a cursor into a file (e.g. a line iterator). Seeking may either be unavailable, or available but expensive or difficult to implement, so the implementation just asserts if accessing an element out of order.
  • The slice itself is trivial (e.g. 0..n), but the cache key calculates a function of the iterates of a map g(f^i(x)) for some large object x that we don't want to keep many copies of. This allows us to sort an object "over time", without having to build all the copies up front. (For example, if we want to find the most populated time slices of an evolution of Conway's game of life.)
  • We want to involve A[i-1] in the sort key for A[i], for some kind of adjacent pair sorting problem. sort_by_cached_key isn't a perfect match for this problem, but it uses fewer extra allocations than a more purist approach involving an array of pairs.

Copy link

Member

BurntSushi commented on Oct 13, 2021

@digama0 So to clarify, here's what I originally asked for:

I would also like to see some use cases for this guarantee. The example provided in this issue doesn't quite cut it for me. What is the actual problem being solved? What is the cost of not adding this guarantee?

To elaborate a bit more on this, basically, what I would expect to see here are some approximately real code samples that show the use of this guarantee to solve a problem. The first example in your initial issue is phrased more-or-less abstractly: "this guarantee is useful when you want to use the guarantee for using the index while sorting." But it doesn't actually connect "using the index" to some real world problem. That's ideally what I'd like to see personally. Ideally, this would be something that came up in a real program. For example, what provoked you to open this issue in the first place?

Another important aspect of this is, once you have an example of using this guarantee, how would you solve the problem if the guarantee didn't exist? Sometimes the cost of absence is really small, and so maybe adding this guarantee doesn't buy you much. But maybe the cost of absence is really great (e.g., you'd have to write your own sort implementation), in which case, the guarantee would likely be more justified.

Basically, I always try to connect stuff like this to something in the "real" world.

Copy link

Contributor

Author

digama0 commented on Oct 13, 2021

edited

Ideally, this would be something that came up in a real program. For example, what provoked you to open this issue in the first place?

I'm going to have to disappoint you on this point; although I have known about sort_by_cached_key for a while, I've never found a place to use it in my own code. The real world issue that lead me to this PR was my quest for a stable slice partition; asking about it on zulip lead me to using sort_by with a boolean key, and so I started looking at the available variations on sorting a slice and noticed this wrinkle in the API documentation. In my case the sort key requires a lookup in an external data structure, so not quite as cheap as a field access, but in the end I decided against using caching.

Another important aspect of this is, once you have an example of using this guarantee, how would you solve the problem if the guarantee didn't exist? Sometimes the cost of absence is really small, and so maybe adding this guarantee doesn't buy you much. But maybe the cost of absence is really great (e.g., you'd have to write your own sort implementation), in which case, the guarantee would likely be more justified.

The cost of not using sort_by_cached_key is pretty clearly upper bounded by collecting the keys into a vec and sorting that, which can be done in about 2 lines, so I think this puts it in the category of a convenience function, and a tight API guarantee seems in keeping with this. I don't think the function really does anything that you can't do yourself almost as good, unless I missed some specialization magic in the standard library.

My point of view is that the primary advantage of this guarantee is not so much in the positives, where we gain the ability to do new things with mutating key functions, but rather in the removal of a sharp edge, where the API says that mutating key functions are allowed but also makes them unpredictable (but also reliable in practice, so as to lull you into a false sense of security). Even if we don't want the guarantee in this PR, the documentation should be edited to explicitly non-guarantee it. Right now the documentation says nothing one way or another and I had to pull up the code to take a guess at what actually happens.


Here is a (non-exhaustive) list of uses of sort_by_cached_key with a non-pure key caching function that I could find by eyeballing github.

  • [1], [2]: RNG. These types of uses will probably not notice if you switch out the order, but some uses of deterministic RNGs (e.g. in game replay) would like to avoid the order of calls getting switched around in new std versions.
  • [3]: calls a mutating trait method, import unclear
  • [4]: needs to mutate because it uses get_or_insert in a map
  • [5]: exposes an API that allows the caller to mutate
  • [6]: This one is interesting, because it ignores the array element, it is a pure mutating closure which looks a lot like the OP example. It might just be a cheap RNG though
  • [7]: the board estimator is allowed to mutate the board, although I didn't dig too deeply. It could be an instance of my game of life example
  • [8], [9]: not an FnMut, but the function does IO so the order of calls is externally visible. Calls to fs::metadata like this seem to be common.
  • [10]: The IO here is the log::debug! call; out of order calls to the key function will mix up the debug log. This is probably fine for this application, but it's probably easier to follow the log if calls are in order.

Copy link

Member

yaahc commented on Oct 22, 2021

@BurntSushi are you satisfied with this explanation? I feel like I am personally. I'd probably prefer to have at least one of these examples added as a ui test before proceeding, though it's not a big enough issue for me to block this over. But I want to make sure you're satisified before resolving my concern.

Copy link

Contributor

the8472 commented on Oct 27, 2021

edited

[4]: needs to mutate because it uses get_or_insert in a map

It doesn't seem to be order-dependent or have a problem with the method being called more than once since subsequent calls will just return the same value from the map.

Still weaker, but relevant to the name sort_by_cached_key, is that f will never be called twice on the same key.

That already is guaranteed. Assuming "same" here means by index, not by value equality.


A very hypothetical and handwavy optimization that I can come up with is computing keys lazily on the assumptions that
a) the key function is really expensive, which is why we're using cached_key in the first place
b) it is idempotent

With these assumptions something like [&foo, &foo, &foo, &bar, &foo].sort_by_cached_key(...) could conceivably be optimized by first sorting the raw values (e.g. by bitwise comparisons) and then doing a second pass with long runs of identical elements which (via idempotency) all share the same key and thus only need to be computed once.
Due to the first pass it would appear as if the key function were called out of order.

But since it's already being used for randomization assuming idempotency would already be problematic.

Copy link

Contributor

Author

digama0 commented on Oct 27, 2021

edited

Still weaker, but relevant to the name sort_by_cached_key, is that f will never be called twice on the same key.

That already is guaranteed. Assuming "same" here means by index, not by value equality.

Ah, so it is, my mistake. Although I would still prefer "at most once" over the more conversational but slightly ambiguous wording "only once" in the doc.

I share @BurntSushi's concern here about potential future restrictions of such a guarantee. I'd like to see if we can guarantee less and still be useful; notably, I'd be in favor of the "at most once" guarantee.

Copy link

Member

yaahc commented 13 days ago

Sorry for the late update on this, but we reviewed this PR during our libs-api team meeting on 2021/12/22 and concluded that we'd like to unblock this PR changing to the "at most once" guarantee mentioned by @joshtriplett.

Copy link

Contributor

Author

digama0 commented 13 days ago

Wording has been updated to remove the ordering guarantee and clarify the "at most once" part.

Copy link

Member

yaahc commented 12 days ago

@rfcbot resolve example-usecases

Copy link

rfcbot commented 12 days ago

bellThis is now entering its final comment period, as per the review above. bell

Copy link

rfcbot commented 2 days ago

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK