Improve Duration::try_from_secs_f32/64 accuracy by directly processing exponent...
source link: https://github.com/rust-lang/rust/pull/90247
Conversation
Closes: #90225
The methods now implement direct processing of exponent and mantissa, which should result in the best possible conversion accuracy (modulo truncation, i.e. float value of 19.995 ns will be represented as 19 ns).
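As background for the discussion below, the direct approach starts from the IEEE 754 bit layout. A minimal sketch (not the PR's actual code) of splitting an f64 into sign, unbiased exponent, and mantissa with the implicit leading bit restored:

```rust
/// Sketch only: split a normal, finite `f64` into (sign, unbiased exponent,
/// 53-bit mantissa with the implicit leading 1), so that
/// `x == mant * 2^(exp - 52)`.
fn decompose(x: f64) -> (bool, i32, u64) {
    let bits = x.to_bits();
    let sign = bits >> 63 == 1;
    let exp = ((bits >> 52) & 0x7ff) as i32 - 1023; // remove the exponent bias
    let mant = (bits & ((1u64 << 52) - 1)) | (1u64 << 52); // restore implicit 1
    (sign, exp, mant)
}

fn main() {
    let (sign, exp, mant) = decompose(2.7);
    assert!(!sign);
    assert_eq!(exp, 1); // 2 <= 2.7 < 4
    // Reconstruction is exact: any conversion error lives entirely in how
    // `mant * 2^(exp - 52)` is later turned into secs + nanos.
    assert_eq!(mant as f64 * 2f64.powi(exp - 52), 2.7);
}
```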
(rust-highfive has picked a reviewer for you, use r? to override)
@@ -777,7 +799,7 @@ impl Duration {
 /// use std::time::Duration;
 ///
 /// let dur = Duration::from_secs_f32(2.7);
-/// assert_eq!(dur, Duration::new(2, 700_000_000));
+/// assert_eq!(dur, Duration::new(2, 700_000_047));
Out of curiosity: are the extra 47 ns because 2.7 does not have an exact f32 representation? The example now looks a bit unfortunate.

Yes, the closest f32 to 2.7 is 2.700000047683715... I think it's worth having a clear demonstration of floats being "lossy", especially 32-bit ones.
Note for the author and reviewers that this proposed code generates worse machine code; not by much, but not negligible either.
EDIT: Although the code is theoretically slightly worse, in practice it's way better due to the removal of libm calls. See #90247 (comment) for benchmarks.
> Note for the author and reviewers that this proposed code generates worse machine code; not by much, but not negligible either.

Too bad we can't use f32::trunc/f64::trunc: Godbolt. Huh, the total cycles are higher than the hand-rolled trunc?
@Urgau If I understand your link correctly, then the new version took fewer cycles to complete (1723 vs 3993). It's possible to improve performance by using u64 instead of u128 for the nanos part, but it degrades conversion accuracy a bit.
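As an aside on the "hand-rolled trunc" mentioned above: truncation toward zero can be done purely with bit operations, which is why core (which cannot rely on libm) can avoid f64::trunc entirely. A hedged sketch, not the PR's code:

```rust
/// Sketch: truncate an `f64` toward zero using only bit manipulation
/// (no libm call). Inf and NaN are simply passed through.
fn trunc_bits(x: f64) -> f64 {
    let bits = x.to_bits();
    let exp = ((bits >> 52) & 0x7ff) as i32 - 1023; // unbiased exponent
    if exp < 0 {
        // |x| < 1 truncates to (signed) zero
        return f64::from_bits(bits & (1u64 << 63));
    }
    if exp >= 52 {
        return x; // already integral (or inf/NaN)
    }
    // Clear the fractional mantissa bits.
    f64::from_bits(bits & !((1u64 << (52 - exp)) - 1))
}

fn main() {
    assert_eq!(trunc_bits(2.7), 2.0);
    assert_eq!(trunc_bits(-3.9), -3.0);
    assert_eq!(trunc_bits(0.5), 0.0);
    assert_eq!(trunc_bits(123.0), 123.0);
}
```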
Given that the original version involves a non-strength-reduced u128 division (cf. the call to __udivti3), it's going to be very hard to make the new implementation worse. The benchmarks should reflect a significant speed increase in practice (though you may want to carefully pick an input which triggers the worst cases for either implementation).
> @Urgau If I understand your link correctly, then the new version took fewer cycles to complete (1723 vs 3993). It's possible to improve performance by using u64 instead of u128 for the nanos part, but it degrades conversion accuracy a bit.
Yeah, you're right. I had just seen that the number of instructions was worse, and because there were more branches I assumed the code would be worse; but since the "Total Cycles" figure is way better, I expect better timings.
> Given that the original version involves a non-strength-reduced u128 division (cf. the call to __udivti3), it's going to be very hard to make the new implementation worse. The benchmarks should reflect a significant speed increase in practice (though you may want to carefully pick an input which triggers the worst cases for either implementation).
True. I did see them but didn't realize.
I've done some benchmarks with the 3 implementations (before with libm, new, and new with reordered match arms). Here are the results:

Impl \ Number   | -500 | 0.0 | 1.0 | 3.14e5 | 30.0 | 8000.0 | 85623.8745 | 99995412.2554122554
libm            |   26 | 154 | 231 |    293 |  272 |    254 |        361 |                 452
frac            |   16 |  24 |  46 |     46 |   46 |     46 |         46 |                  46
frac_reordered  |   17 |  30 |  46 |     46 |   46 |     46 |         46 |                  46

Lower is better | Unit: ns/iter | Measured on an armv7 with hard-float
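The table above can be reproduced in spirit with a very crude timing loop (criterion or a proper bench harness would be more trustworthy; the input set below mirrors the table's, and try_from_secs_f64 was only stabilized later, in Rust 1.66):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

fn main() {
    // Crude micro-benchmark sketch: time repeated conversions of each input.
    let inputs = [-500.0_f64, 0.0, 1.0, 3.14e5, 30.0, 8000.0, 85623.8745];
    const ITERS: u32 = 100_000;
    for &x in &inputs {
        let start = Instant::now();
        let mut ok = 0u32;
        for _ in 0..ITERS {
            // Negative inputs (like -500) return Err rather than panic here.
            if Duration::try_from_secs_f64(black_box(x)).is_ok() {
                ok += 1;
            }
        }
        let ns_per_iter = start.elapsed().as_nanos() / ITERS as u128;
        println!("{x:>12}: ~{ns_per_iter} ns/iter ({ok} Ok)");
    }
}
```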
As @nagisa predicted, we can see a clear speedup with the new implementation. @newpavlov I would suggest you revert your last commit ("reorder match arms"), as it seems slightly worse than the version without it in two of the benchmarks.
@@ -708,14 +708,28 @@ impl Duration {
 /// as `f64`.
 ///
 /// # Panics
-/// This constructor will panic if `secs` is not finite, negative or overflows `Duration`.
+/// This constructor will panic if `secs` is negative, overflows `Duration` or not finite.
Note that this change reflects the fact that negativity is now checked before anything else, meaning that -inf and -NaN now produce a Negative error instead of a NonFinite error.

(The wording also implies that +inf should produce an Overflow error instead of a NonFinite one, leaving only NaN as a NonFinite error.)
The error kinds are not exposed publicly and are only used for generating error descriptions, so I think the order of checks is fairly unimportant. The main reason I changed the wording was to fix an ambiguity: the previous variant can be read both as "is not (finite, negative or overflows)" and as "(is not finite), is negative, or overflows".
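Since the kinds aren't public, only the Ok/Err split is observable from outside; a small check (using the non-panicking try_from_secs_f64, stabilized later in Rust 1.66):

```rust
use std::time::Duration;

fn main() {
    // All three "bad" classes surface as Err; the internal kind
    // (Negative / Overflow / NonFinite) is only visible in the error message.
    assert!(Duration::try_from_secs_f64(f64::NEG_INFINITY).is_err());
    assert!(Duration::try_from_secs_f64(f64::INFINITY).is_err());
    assert!(Duration::try_from_secs_f64(f64::NAN).is_err());
    assert!(Duration::try_from_secs_f64(-0.5).is_err());
    assert!(Duration::try_from_secs_f64(1.5).is_ok());
}
```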
changed the title Process fractional part separately in Duration::try_from_secs_f32/64
Improve Duration::try_from_secs_f32/64 precision by directly processing exponent and mantissa
Hi. Sorry for asking here directly, but I am not quite sure about the Rust development workflow... is it possible to tell when this will be merged, e.g. into nightly?
Once the review is finished and it gets an approval.
changed the title Improve Duration::try_from_secs_f32/64 precision by directly processing exponent and mantissa
Improve Duration::try_from_secs_f32/64 accuracy by directly processing exponent and mantissa
@nagisa I have updated the code, and it now uses the try_from_secs macro as you proposed. Instead of match I've decided to use if/else branches, since that makes the code more compact, removes duplication of the range edges, and makes the exclusive_range_pattern unstable feature unnecessary.
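The shape of that branching can be illustrated with a hypothetical classifier on the unbiased exponent (the thresholds here are made up for illustration; the real macro's cut-offs differ per float type):

```rust
/// Hypothetical sketch of the `if`/`else` style described above.
/// Expressed as a `match`, the same half-open ranges would have needed
/// the then-unstable `exclusive_range_pattern` feature (e.g. `-31..0 => ...`).
fn classify(exp: i32) -> &'static str {
    if exp < -31 {
        "rounds to zero (too small for 1 ns)"
    } else if exp < 0 {
        "sub-second: only nanos"
    } else if exp < 64 {
        "seconds fit in u64"
    } else {
        "overflows Duration"
    }
}

fn main() {
    assert_eq!(classify(-40), "rounds to zero (too small for 1 ns)");
    assert_eq!(classify(-1), "sub-second: only nanos");
    assert_eq!(classify(10), "seconds fit in u64");
    assert_eq!(classify(80), "overflows Duration");
}
```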
Seems reasonable enough to me. Can you please squash away the fix commits and merges? I'll take another chance to read through the algorithm this evening and r+ then.
        (0u64, 0u32)
    } else if exp < 0 {
        // the input is less than 1 second
        let t = <$double_ty>::from(mant) << ($offset + exp);
On the choice of the offset here: would I be right in saying that the $offset could be as low as 30 (and shared between the implementations) without affecting the result of this computation in any way?

Just to clarify: I'm not suggesting changing anything here, just trying to confirm my understanding. I think I understand how the current offset values were arrived at:

- for f32: offset = $double_ty::BITS - $mant_bits;
- for f64: offset = $double_ty::BITS - type_of(NANOS_PER_SEC)::BITS - $mant_bits

effectively making the current offset values the largest possible such that no intermediate operation can overflow the types, correct?
The offsets are specifically chosen so that the final shifts will be equal to 64 and 96 bits respectively, allowing the compiler to perform additional optimizations.
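To make the fixed-point idea concrete, here is a simplified, hedged sketch of the sub-second case for f64. This is not the PR's code (it uses no $offset and no carefully chosen shifts); it only shows how mantissa and exponent turn into truncated nanoseconds via exact integer arithmetic:

```rust
/// Sketch: for a sub-second value `x = mant * 2^(exp - 52)` with `exp < 0`,
/// compute `floor(x * 1e9)` exactly, with truncation rather than rounding.
fn nanos_sub_second(mant: u64, exp: i32) -> u32 {
    debug_assert!(exp < 0);
    let shift = (52 - exp) as u32; // total fractional bits of `mant * 1e9`
    if shift >= 128 {
        return 0; // too small to contribute even 1 ns
    }
    // mant < 2^53 and 1e9 < 2^30, so the product fits comfortably in u128.
    ((mant as u128 * 1_000_000_000) >> shift) as u32
}

fn main() {
    // 0.5 = 2^52 * 2^(-1 - 52)
    assert_eq!(nanos_sub_second(1 << 52, -1), 500_000_000);
    // 0.25 = 2^52 * 2^(-2 - 52)
    assert_eq!(nanos_sub_second(1 << 52, -2), 250_000_000);
}
```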
@bors r+
Thanks for seeing this through @newpavlov!
Commit e0bcf77 has been approved by nagisa