
Improved Map Performance

 3 years ago
source link: https://github.com/fsharp/fslang-suggestions/issues/940

Improved Map Performance #940

krauthaufen opened this issue 10 days ago · 44 comments

Comments

krauthaufen commented 10 days ago

edited

Improved Map Performance

I recently started a new Map<'Key, 'Value> implementation for FSharp.Data.Adaptive that uses subtyping and virtual calls instead of discriminated unions and pattern matching for representing the tree.

After a few hours it became clear that this implementation performs much better, so I decided to start a discussion here.

So here's my repo containing the implementation, several tests, and benchmarks. Note that I will likely add more combinators there, and I don't expect all of it to be merged into FSharp.Core, but I think F# shouldn't miss out on the potential speedup.

repo

Also note that the same technique should be applied to Set when merging this.

I'd also like to mention that the abstract data-type is still the AVL-like tree as in FSharp.Core today. The only difference is really how operations are implemented.
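
To illustrate the difference, here is a minimal sketch with invented names (not the actual MapNew code): the current FSharp.Core tree is a discriminated union that every operation pattern-matches on, while the proposed representation makes each node kind a subclass with virtual methods:

```fsharp
open System.Collections.Generic

// Roughly how the current FSharp.Core MapTree looks: a discriminated union
// that every operation pattern-matches on.
type MapTree<'K, 'V> =
    | MapEmpty
    | MapOne of 'K * 'V
    | MapNode of 'K * 'V * MapTree<'K, 'V> * MapTree<'K, 'V> * int

// The proposed representation: each node kind is a subclass and operations
// are virtual methods, so dispatch is a single virtual call instead of a match.
[<AbstractClass>]
type Node<'K, 'V>() =
    abstract member Count : int
    abstract member ContainsKey : IComparer<'K> * 'K -> bool

type Empty<'K, 'V>() =
    inherit Node<'K, 'V>()
    override _.Count = 0
    override _.ContainsKey(_, _) = false

type Leaf<'K, 'V>(k : 'K, v : 'V) =
    inherit Node<'K, 'V>()
    member _.Value = v
    override _.Count = 1
    override _.ContainsKey(cmp, key) = cmp.Compare(key, k) = 0

type Inner<'K, 'V>(l : Node<'K, 'V>, k : 'K, v : 'V, r : Node<'K, 'V>) =
    inherit Node<'K, 'V>()
    member _.Value = v
    override _.Count = l.Count + 1 + r.Count
    override _.ContainsKey(cmp, key) =
        let c = cmp.Compare(key, k)
        if c < 0 then l.ContainsKey(cmp, key)
        elif c > 0 then r.ContainsKey(cmp, key)
        else true
```

The shape of the tree is the same in both cases; only the dispatch mechanism changes.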

I would be very happy to help with integrating this into FSharp.Core and I'm of course open to suggestions.

Pros and Cons

The advantage of making this adjustment to F# is a significant performance improvement.

The disadvantage of making this adjustment to F# is slightly less readable code, although the original Map implementation isn't exactly readable either.

Extra information

Estimated cost (XS, S, M, L, XL, XXL): L

Related suggestions: (put links to related suggestions here)

Affidavit (please submit!)

Please tick this by placing a cross in the box:

  • This is not a question (e.g. like one you might ask on stackoverflow) and I have searched stackoverflow for discussions of this issue
  • I have searched both open and closed suggestions on this site and believe this is not a duplicate
  • This is not something which has obviously "already been decided" in previous versions of F#. If you're questioning a fundamental design decision that has obviously already been taken (e.g. "Make F# untyped") then please don't submit it.

Please tick all that apply:

  • This is not a breaking change to the F# language design
  • I or my company would be willing to help implement and/or test this

For Readers

If you would like to see this issue implemented, please click the +1 emoji on this issue. These counts are used to generally order the suggestions by engagement.

Collaborator

dsyme commented 9 days ago

Was this measured against FSharp.Core 5.0.0 or 4.7.2?

thanks!

Author

krauthaufen commented 9 days ago

TBH I'm not entirely sure which version these benchmarks ran on, so I did a quick rerun on my notebook (different CPU, only for count = 10) ensuring that FSharp.Core >= 5.0.0 is used, and the general behaviour seems to be similar. (The benchmark with all counts is running on the other machine right now.)

| Method | Count | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 10 | 765.045 ns | 15.2483 ns | 22.3507 ns | 0.1974 | - | - | 1240 B |
| MapNew_ofArray | 10 | 645.144 ns | 11.9210 ns | 16.7116 ns | 0.1574 | - | - | 992 B |
| Map_toArray | 10 | 194.027 ns | 3.8896 ns | 4.4792 ns | 0.1056 | 0.0002 | - | 664 B |
| MapNew_toArray | 10 | 76.341 ns | 0.5347 ns | 0.4465 ns | 0.0548 | 0.0001 | - | 344 B |
| Map_enumerate | 10 | 298.305 ns | 3.7254 ns | 3.4848 ns | 0.1144 | - | - | 720 B |
| MapNew_enumerate | 10 | 189.360 ns | 2.4233 ns | 2.2668 ns | 0.0637 | - | - | 400 B |
| Map_containsKey_all | 10 | 214.089 ns | 2.5381 ns | 2.3742 ns | - | - | - | - |
| MapNew_containsKey_all | 10 | 106.402 ns | 0.9764 ns | 0.9134 ns | - | - | - | - |
| Map_containsKey_nonexisting | 10 | 26.962 ns | 0.5622 ns | 0.5259 ns | - | - | - | - |
| MapNew_containsKey_nonexisting | 10 | 7.740 ns | 0.0657 ns | 0.0614 ns | - | - | - | - |
| Map_remove_all | 10 | 697.105 ns | 10.8650 ns | 9.0727 ns | 0.2060 | - | - | 1296 B |
| MapNew_remove_all | 10 | 486.468 ns | 8.3427 ns | 14.1665 ns | 0.1612 | - | - | 1016 B |

Author

krauthaufen commented 9 days ago

Hey, benchmarks are still running, but I managed to optimize ofArray/ofSeq/ofList dramatically (up to 5x) with the following procedure:

  1. Copy the data to a struct('Key * 'Value)[]
  2. Array.sortInPlace by key
  3. Handle duplicates with the correct override semantics
  4. Build the tree
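
The procedure above can be sketched roughly like this (illustrative only; a plain DU tree stands in for the real node representation, and the actual MapNew code is far more optimized):

```fsharp
// Illustrative sketch of the ofArray procedure; all names here are invented.
type Tree<'K, 'V> =
    | E
    | N of Tree<'K, 'V> * 'K * 'V * Tree<'K, 'V>

let ofArray (input : ('K * 'V)[]) : Tree<'K, 'V> =
    // 1. copy the data into a struct('Key * 'Value)[]
    // 2. sort by key; a *stable* sort is needed so that later duplicates win
    //    (Array.sortBy is documented as stable in FSharp.Core)
    let arr =
        input
        |> Array.map (fun (k, v) -> struct (k, v))
        |> Array.sortBy (fun (struct (k, _)) -> k)
    // 3. compact duplicate keys, keeping the last occurrence (override semantics)
    let mutable cnt = 0
    for i in 0 .. arr.Length - 1 do
        let (struct (k, _)) = arr.[i]
        if cnt > 0 then
            let (struct (pk, _)) = arr.[cnt - 1]
            if pk = k then arr.[cnt - 1] <- arr.[i]
            else
                arr.[cnt] <- arr.[i]
                cnt <- cnt + 1
        else
            arr.[0] <- arr.[i]
            cnt <- 1
    // 4. visit the compacted range in a binary-search-like way to build the tree
    let rec build lo hi =
        if lo > hi then E
        else
            let m = (lo + hi) / 2
            let (struct (k, v)) = arr.[m]
            N(build lo (m - 1), k, v, build (m + 1) hi)
    build 0 (cnt - 1)
```

Because the array is sorted once up front, the tree can be built without any rebalancing.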

Again these numbers are from my Notebook:

| Method | Count | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 1 | 14.59 ns | 0.075 ns | 0.062 ns | 0.0102 | - | - | 64 B |
| MapNew_ofArray | 1 | 10.95 ns | 0.068 ns | 0.063 ns | 0.0089 | - | - | 56 B |
| MapNew_ofArray_optimized | 1 | 10.37 ns | 0.069 ns | 0.065 ns | 0.0089 | - | - | 56 B |
| Map_ofArray | 2 | 26.58 ns | 0.154 ns | 0.129 ns | 0.0178 | - | - | 112 B |
| MapNew_ofArray | 2 | 32.52 ns | 0.259 ns | 0.229 ns | 0.0166 | - | - | 104 B |
| MapNew_ofArray_optimized | 2 | 78.19 ns | 0.430 ns | 0.402 ns | 0.0318 | - | - | 200 B |
| Map_ofArray | 10 | 750.98 ns | 3.208 ns | 3.000 ns | 0.2050 | - | - | 1288 B |
| MapNew_ofArray | 10 | 578.95 ns | 3.925 ns | 3.479 ns | 0.1421 | - | - | 896 B |
| MapNew_ofArray_optimized | 10 | 298.51 ns | 1.078 ns | 0.900 ns | 0.0968 | - | - | 608 B |
| Map_ofArray | 100 | 21,386.32 ns | 74.178 ns | 69.387 ns | 4.4250 | 0.1221 | - | 27856 B |
| MapNew_ofArray | 100 | 11,883.74 ns | 78.727 ns | 73.641 ns | 1.6479 | 0.0458 | - | 10424 B |
| MapNew_ofArray_optimized | 100 | 2,757.47 ns | 8.749 ns | 8.183 ns | 0.8278 | 0.0229 | - | 5216 B |
| Map_ofArray | 1000 | 407,528.56 ns | 6,147.472 ns | 5,449.573 ns | 70.8008 | 16.1133 | - | 445456 B |
| MapNew_ofArray | 1000 | 227,403.18 ns | 973.302 ns | 812.751 ns | 22.4609 | 4.8828 | - | 142232 B |
| MapNew_ofArray_optimized | 1000 | 41,938.29 ns | 223.936 ns | 209.470 ns | 7.6904 | 1.5259 | - | 48368 B |

Author

krauthaufen commented 9 days ago

Cool, I'll definitely investigate that. I'm also working on improving things for very small counts, since the array seems to be slower for approx. count < 5. I'll be busy with other things over the weekend but I'll continue there next week.

kerams commented 9 days ago

edited

Fingers crossed you persist and this gets in. I recall seeing PRs with significant collection optimizations that have gradually lost momentum and fizzled out.

Author

krauthaufen commented 9 days ago

Update: the ofArray benchmarks were executed on a real machine (notebooks tend to get hot and therefore slow down):
https://github.com/krauthaufen/MapNew/wiki/Optimized-ofArray

Author

krauthaufen commented 8 days ago

Hey @NinoFloris, I tried using the BCL sort you posted, but it's actually a tiny bit slower, maybe because I need to maintain two arrays instead of one and lookups might be less cache-local.

It's interesting to see that the BCL code somehow manages to allocate less garbage but is nonetheless slower.
I will try using the BCL sort on struct('Key * 'Value) next.

| Method | Count | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 100 | 21.772 us | 0.1405 us | 0.1246 us | 4.5776 | 0.1221 | - | 28.14 KB |
| MapNew_ofArray | 100 | 12.866 us | 0.0318 us | 0.0248 us | 2.0294 | 0.0610 | - | 12.52 KB |
| MapNew_ofArray_optimized | 100 | 2.765 us | 0.0125 us | 0.0117 us | 0.8278 | 0.0229 | - | 5.09 KB |
| MapNew_ofArray_bcl | 100 | 3.720 us | 0.0174 us | 0.0163 us | 0.7629 | 0.0229 | - | 4.68 KB |
| Map_ofArray | 1000 | 382.301 us | 2.8486 us | 2.6645 us | 70.8008 | 16.6016 | - | 436.05 KB |
| MapNew_ofArray | 1000 | 232.351 us | 0.9637 us | 0.8047 us | 24.6582 | 5.8594 | - | 151.3 KB |
| MapNew_ofArray_optimized | 1000 | 42.696 us | 0.4129 us | 0.3862 us | 7.6904 | 1.5259 | - | 47.23 KB |
| MapNew_ofArray_bcl | 1000 | 63.367 us | 0.5040 us | 0.4714 us | 6.9580 | 1.2207 | - | 43.3 KB |

Author

krauthaufen commented 8 days ago

I just found that all of these sorts are unstable, which means that my logic for duplicates won't work (something my tests sadly missed).

I'm currently thinking about ways to work around that problem.

A quick test with Aardvark's TimSort shows that a stable sort might be a little slower, but nonetheless the speedup will be huge.

yatli commented 7 days ago

edited

| Method | Count | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 10000 | 9,227,999.61 ns | 183,278.495 ns | 180,003.989 ns | 968.7500 | 265.6250 | 62.5000 | 6088782 B |
| MapNew_ofArray | 10000 | 5,473,464.24 ns | 105,595.415 ns | 112,985.911 ns | 296.8750 | 148.4375 | - | 1899776 B |
| MapNew_ofArray_optimized | 10000 | 888,288.21 ns | 7,726.569 ns | 6,452.035 ns | 79.1016 | 39.0625 | - | 501801 B |

That's huge.

goswinr commented 7 days ago

@krauthaufen
I guess you have seen the recent changes to Map.

I also wonder whether a Concestor dictionary could replace Map? Apparently it was at least 4 times faster (a few years ago). Is that something worth considering?

Author

krauthaufen commented 7 days ago

edited

Hey, yep, I've seen those, and afaik they're included in FSharp.Core >= 5.0.0, right?

The Concestor dictionary could be worth considering, but my experience shows that in the end a simple balanced binary tree always outperforms other comparison-based data structures. I'm definitely not against considering it; it's just a different kind of monster with other problems. I think that for such a low-level data structure there is simply no super-elegant, super-fast solution.

Regarding the problem from above (duplicates in ofArray, etc.), I implemented a workaround using a stable mergeSort implementation.

Current State

I'd like to take a little more time to improve that, but the speedup is still huge for ofArray, ofSeq and ofList (see table below), and all the other combinators (see benchmarks) also significantly outperform the current implementation.

Override problem

Maybe someone has a better idea? Any ideas are welcome.

ofArray currently works as follows:

  1. The input is an array of 'Key * 'Value tuples, and it is ensured (via copy, etc.) that the sort may mutate the array contents
  2. Sort the entries by their key (in a stable way)
  3. "Compact" identical keys by iterating over the array again (maintaining a new count)
  4. Visit the array in a binary-search-like way to create the nodes

The stable property is needed for correctly resolving conflicts: [|(0,1); (0,3)|] should produce a map holding (0,3).
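
For reference, a minimal bottom-up stable merge sort of the kind used here might look like this (illustrative only; the actual Sorting.fs also folds the duplicate handling into the final merge):

```fsharp
// Bottom-up stable merge sort by key; a simplified, invented sketch.
let mergeSortBy (key : 'T -> 'K) (src : 'T[]) : 'T[] =
    let n = src.Length
    let mutable a = Array.copy src
    let mutable b = Array.copy src      // auxiliary buffer
    let mutable width = 1
    while width < n do
        let mutable lo = 0
        while lo < n do
            let mid = min (lo + width) n
            let hi = min (lo + 2 * width) n
            let mutable i = lo
            let mutable j = mid
            let mutable k = lo
            while i < mid && j < hi do
                // '<=' keeps equal keys in their original order => stable
                if compare (key a.[i]) (key a.[j]) <= 0 then
                    b.[k] <- a.[i]; i <- i + 1
                else
                    b.[k] <- a.[j]; j <- j + 1
                k <- k + 1
            while i < mid do b.[k] <- a.[i]; i <- i + 1; k <- k + 1
            while j < hi do b.[k] <- a.[j]; j <- j + 1; k <- k + 1
            lo <- lo + 2 * width
        // swap the buffers for the next pass
        let t = a
        a <- b
        b <- t
        width <- width * 2
    a
```

With a stable sort, all duplicates of a key end up adjacent and in input order, so keeping the last element of each run gives the required override semantics.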

The problem is that stable sorts (such as my simple mergesort) tend to be slower than e.g. quicksort. This can be seen here:

| Method | Count | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 10 | 797.8 ns | 15.18 ns | 14.20 ns | 0.2165 | - | - | 1360 B |
| MapNew_ofArray_optimized | 10 | 514.7 ns | 10.36 ns | 13.83 ns | 0.0992 | - | - | 624 B |
| MapNew_ofArray_WRONG | 10 | 429.1 ns | 8.00 ns | 7.10 ns | 0.0968 | - | - | 608 B |
| Map_ofArray | 100 | 21,382.7 ns | 129.49 ns | 114.79 ns | 4.4250 | 0.1221 | - | 27856 B |
| MapNew_ofArray_optimized | 100 | 6,307.1 ns | 105.20 ns | 93.25 ns | 0.8850 | 0.0229 | - | 5592 B |
| MapNew_ofArray_WRONG | 100 | 3,859.6 ns | 73.73 ns | 81.95 ns | 0.8240 | 0.0229 | - | 5216 B |
| Map_ofArray | 1000 | 399,811.7 ns | 5,783.82 ns | 5,410.19 ns | 70.8008 | 16.1133 | - | 445144 B |
| MapNew_ofArray_optimized | 1000 | 102,484.6 ns | 1,988.17 ns | 1,762.46 ns | 8.3008 | 1.5869 | - | 52344 B |
| MapNew_ofArray_WRONG | 1000 | 59,636.1 ns | 824.82 ns | 731.18 ns | 7.6904 | 1.4648 | - | 48368 B |

Things I tried so far:

  1. Using Array.sort with struct('Key * int) (the int being the source index) ended up being quite slow
  2. Tweaking a quicksort to remove duplicates inline (didn't manage to get my head around that)

Things that could be done to improve the stable-sorting a little:

  1. Inline the duplicate-removal in the last merge step (saves iterating the array once)
  2. Optimize the mergeSort
  3. Consider using something more sophisticated (e.g. TimSort), which is hard to implement

Author

krauthaufen commented 6 days ago

Hey, I just updated the benchmarks.
Everything got a little faster again, so almost all combinators are now 1.2x - 3x faster than the original Map.
ofArray for 10k elements is even 5x faster.

yatli commented 6 days ago

@krauthaufen have you tried running https://github.com/buybackoff/fsharp-benchmarks/blob/master/FSharpCoreOptim/Bench.fs?

I'd also like to test small key/value pairs (int * int) vs. larger ones (int * some_reftype, int * large_valuetype), etc.

Wrt ofArray: one idea is to create the map nodes during a partitioned sort so you don't have to go over the result again later; the resulting tree can be balanced fairly easily.

Author

krauthaufen commented 6 days ago

Hey, I will run the benchmarks on other key/value types soon, but the ones you mentioned seem to be "unfair" since they only test adding/removing sorted keys. Mine actually look quite similar but use randomized data. Obviously this creates a little jitter in the runtimes, but I nonetheless think it is "fairer".

yatli commented 6 days ago

Fair point. I brought it up because it contains somewhat richer patterns (for all entries in the map, access by key and then ignore it, etc.).

Your trees are mostly balanced, so it'll probably do a good job in those tests.

yatli commented 6 days ago

I just found that all of these sorts are non-stable which means that my logic for duplicates won't work (which my tests sadly missed).

Hmm, I don't understand: do you want to mimic the behavior of keeping the last entry for a given key? Otherwise stability doesn't affect dedup.

Another idea is to allocate the temp array as actual map nodes so you don't have to create the nodes again after the sort. Not sure if it's a good idea because of the extra level of indirection, though (memory efficiency ++, speed --).

yatli commented 6 days ago

edited

I'd also like to mention the runtime profiles: I'm running the benchmark now and it looks like it only tests the workstation concurrent GC scenario. We should perhaps also test with the config option <gcServer>true</gcServer>.
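
For reference, server GC can be switched on in the benchmark project file; this is the MSBuild equivalent of the <gcServer>true</gcServer> runtimeconfig setting:

```xml
<!-- In the benchmark .csproj / .fsproj -->
<PropertyGroup>
  <ServerGarbageCollection>true</ServerGarbageCollection>
</PropertyGroup>
```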

yatli commented 6 days ago

edited

Here's the workstation GC result for count = 100 on my machine:

// * Summary *
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.685 (2004/?/20H1)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.101
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT DEBUG
  DefaultJob : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT
| Method | Count | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 100 | 21,540.72 ns | 130.719 ns | 109.156 ns | 6.5918 | - | - | 27616 B |
| MapNew_ofArray | 100 | 6,030.92 ns | 72.043 ns | 60.159 ns | 1.3351 | - | - | 5592 B |
| Map_ofList | 100 | 22,660.64 ns | 441.946 ns | 526.105 ns | 6.6223 | - | - | 27712 B |
| MapNew_ofList | 100 | 6,572.73 ns | 71.084 ns | 59.358 ns | 1.8158 | - | - | 7608 B |
| Map_ofSeq | 100 | 23,373.67 ns | 133.495 ns | 111.475 ns | 6.8665 | - | - | 28744 B |
| MapNew_ofSeq | 100 | 7,197.75 ns | 95.528 ns | 84.683 ns | 1.8234 | - | - | 7648 B |
| Map_toArray | 100 | 1,821.67 ns | 21.711 ns | 20.308 ns | 1.5354 | - | - | 6424 B |
| MapNew_toArray | 100 | 784.71 ns | 6.124 ns | 5.429 ns | 0.7763 | - | - | 3248 B |
| Map_toList | 100 | 1,445.31 ns | 24.479 ns | 25.138 ns | 1.3390 | - | - | 5600 B |
| MapNew_toList | 100 | 1,031.56 ns | 5.313 ns | 4.710 ns | 1.3390 | - | - | 5600 B |
| Map_enumerate | 100 | 3,570.17 ns | 21.298 ns | 19.922 ns | 1.8616 | - | - | 7800 B |
| MapNew_enumerate | 100 | 1,764.33 ns | 33.628 ns | 35.981 ns | 0.9556 | - | - | 4000 B |
| Map_toSeq_enum | 100 | 5,274.36 ns | 10.658 ns | 8.900 ns | 2.3270 | - | - | 9752 B |
| MapNew_toSeq_enum | 100 | 4,716.17 ns | 81.952 ns | 100.644 ns | 1.5717 | - | - | 6592 B |
| Map_containsKey_all | 100 | 5,530.57 ns | 93.365 ns | 99.900 ns | - | - | - | - |
| MapNew_containsKey_all | 100 | 2,424.88 ns | 22.708 ns | 20.130 ns | - | - | - | - |
| Map_containsKey_nonexisting | 100 | 53.51 ns | 0.570 ns | 0.445 ns | - | - | - | - |
| MapNew_containsKey_nonexisting | 100 | 24.84 ns | 0.525 ns | 0.491 ns | - | - | - | - |
| Map_tryFind | 100 | 50.59 ns | 0.159 ns | 0.132 ns | 0.0057 | - | - | 24 B |
| MapNew_tryFind | 100 | 27.06 ns | 0.065 ns | 0.060 ns | 0.0057 | - | - | 24 B |
| Map_tryFind_nonexisting | 100 | 41.01 ns | 0.745 ns | 0.697 ns | - | - | - | - |
| MapNew_tryFind_nonexisting | 100 | 22.09 ns | 0.131 ns | 0.116 ns | - | - | - | - |
| Map_remove_all | 100 | 18,913.17 ns | 72.131 ns | 63.942 ns | 6.4697 | - | - | 27104 B |
| MapNew_remove_all | 100 | 15,602.63 ns | 219.163 ns | 183.011 ns | 5.6763 | - | - | 23816 B |
| Map_exists | 100 | 704.56 ns | 9.028 ns | 8.003 ns | 0.0057 | - | - | 24 B |
| MapNew_exists | 100 | 353.35 ns | 2.403 ns | 2.248 ns | 0.0057 | - | - | 24 B |
| Map_fold | 100 | 639.10 ns | 2.954 ns | 2.467 ns | 0.0057 | - | - | 24 B |
| MapNew_fold | 100 | 393.61 ns | 7.322 ns | 6.849 ns | 0.0057 | - | - | 24 B |
| Map_foldBack | 100 | 637.56 ns | 6.142 ns | 5.445 ns | 0.0057 | - | - | 24 B |
| MapNew_foldBack | 100 | 383.18 ns | 2.343 ns | 1.957 ns | 0.0057 | - | - | 24 B |

Looks like we've got similar specs :)

Author

krauthaufen commented 6 days ago

Nice to see that I was not hallucinating my benchmark results :yum:

hmm I don't understand, do you want to mimic the behavior of keeping the last entry of the same key?

Precisely. That's why stable sorting is crucial here, and my simple mergeSort was (so far) faster than anything else I tried.

another idea is to allocate the temp array of actual map nodes so you don't have to create the nodes again after the sort -- not sure if it's a good idea because of another level of indirection, though. (memory efficiency ++, speed --)

Not sure if this can help; I'd need to allocate leaves for everything and then combine them into inner nodes, etc.
The current implementation just sorts the tuples (without allocating new ones) and then allocates precisely the output nodes.

During this process I need to allocate 2 auxiliary 'Key * 'Value arrays of length n for the mergeSort, but I think that's reasonable, and they become unreachable directly after the tree build. Of course 0/1 aux arrays would be better, but then we would need an in-place stable sort (like TimSort) that is equally fast.

yatli commented 6 days ago

edited

The gcServer profile:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.685 (2004/?/20H1)
Intel Core i7-6700K CPU 4.00GHz (Skylake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.101
  [Host]     : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT DEBUG
  Job-HPVMQI : .NET Core 3.1.9 (CoreCLR 4.700.20.47201, CoreFX 4.700.20.47203), X64 RyuJIT

Server=True  

| Method | Count | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 100 | 21,309.23 ns | 92.137 ns | 86.185 ns | 1.00 | 0.00 | 0.7324 | - | - | 27400 B |
| MapNew_ofArray | 100 | 5,874.02 ns | 113.713 ns | 166.680 ns | 0.28 | 0.01 | 0.1373 | - | - | 5592 B |
| Map_ofList | 100 | 23,646.63 ns | 396.236 ns | 486.613 ns | 1.00 | 0.00 | 0.7019 | - | - | 28768 B |
| MapNew_ofList | 100 | 6,447.32 ns | 97.939 ns | 91.612 ns | 0.27 | 0.01 | 0.1907 | - | - | 7608 B |
| Map_ofSeq | 100 | 23,871.15 ns | 386.093 ns | 396.489 ns | 1.00 | 0.00 | 0.7019 | - | - | 28552 B |
| MapNew_ofSeq | 100 | 6,790.94 ns | 46.358 ns | 36.193 ns | 0.28 | 0.01 | 0.1907 | - | - | 7648 B |
| Map_toArray | 100 | 1,862.72 ns | 31.103 ns | 25.973 ns | ? | ? | 0.1659 | - | - | 6424 B |
| MapNew_toArray | 100 | 850.00 ns | 16.036 ns | 15.750 ns | ? | ? | 0.0849 | - | - | 3248 B |
| Map_toList | 100 | 1,486.52 ns | 3.886 ns | 3.245 ns | ? | ? | 0.1411 | - | - | 5600 B |
| MapNew_toList | 100 | 1,138.65 ns | 4.410 ns | 3.683 ns | ? | ? | 0.1431 | - | - | 5600 B |
| Map_enumerate | 100 | 3,757.96 ns | 74.561 ns | 111.599 ns | 1.00 | 0.00 | 0.1907 | - | - | 7680 B |
| MapNew_enumerate | 100 | 1,625.09 ns | 6.558 ns | 6.135 ns | 0.42 | 0.01 | 0.1030 | - | - | 4000 B |
| Map_toSeq_enum | 100 | 5,496.65 ns | 72.550 ns | 67.863 ns | 1.00 | 0.00 | 0.2594 | - | - | 10112 B |
| MapNew_toSeq_enum | 100 | 4,813.82 ns | 95.631 ns | 146.039 ns | 0.89 | 0.02 | 0.1602 | - | - | 6592 B |
| Map_containsKey_all | 100 | 5,308.16 ns | 28.346 ns | 26.515 ns | 1.00 | 0.00 | - | - | - | - |
| MapNew_containsKey_all | 100 | 2,206.53 ns | 27.926 ns | 21.803 ns | 0.42 | 0.00 | - | - | - | - |
| Map_containsKey_nonexisting | 100 | 53.82 ns | 0.841 ns | 0.826 ns | 1.00 | 0.00 | - | - | - | - |
| MapNew_containsKey_nonexisting | 100 | 22.35 ns | 0.120 ns | 0.101 ns | 0.41 | 0.01 | - | - | - | - |
| Map_tryFind | 100 | 54.65 ns | 0.321 ns | 0.268 ns | 1.00 | 0.00 | 0.0006 | - | - | 24 B |
| MapNew_tryFind | 100 | 21.80 ns | 0.478 ns | 0.424 ns | 0.40 | 0.01 | 0.0006 | - | - | 24 B |
| Map_tryFind_nonexisting | 100 | 55.36 ns | 0.601 ns | 0.562 ns | 1.00 | 0.00 | - | - | - | - |
| MapNew_tryFind_nonexisting | 100 | 21.59 ns | 0.407 ns | 0.381 ns | 0.39 | 0.01 | - | - | - | - |
| Map_remove_all | 100 | 20,126.44 ns | 150.161 ns | 133.114 ns | 1.00 | 0.00 | 0.7019 | - | - | 28136 B |
| MapNew_remove_all | 100 | 15,826.98 ns | 294.046 ns | 275.051 ns | 0.79 | 0.01 | 0.6104 | - | - | 23288 B |
| Map_exists | 100 | 691.36 ns | 3.394 ns | 3.175 ns | 1.00 | 0.00 | - | - | - | 24 B |
| MapNew_exists | 100 | 367.96 ns | 5.424 ns | 4.808 ns | 0.53 | 0.01 | 0.0005 | - | - | 24 B |
| Map_fold | 100 | 610.81 ns | 3.087 ns | 2.736 ns | 1.00 | 0.00 | - | - | - | 24 B |
| MapNew_fold | 100 | 386.97 ns | 0.978 ns | 0.867 ns | 0.63 | 0.00 | 0.0005 | - | - | 24 B |
| Map_foldBack | 100 | 594.96 ns | 2.724 ns | 2.548 ns | 1.00 | 0.00 | - | - | - | 24 B |
| MapNew_foldBack | 100 | 383.56 ns | 7.311 ns | 6.839 ns | 0.64 | 0.01 | 0.0005 | - | - | 24 B |

Conclusion: even more improvements with server profile!

yatli commented 6 days ago

Sorting.fs,L221:

        while sortedLengthDbl <= src.Length do

Consider changing it to while sortedLength < src.Length? In doing so we can remove:

Sorting.fs,L224:

        if sortedLength < src.Length then
            let cnt = mergeSeqHandleDuplicates cmp 0 sortedLength sortedLength src dst
            struct(dst, cnt)
        else

Author

krauthaufen commented 6 days ago

This was actually intentional: it avoids iterating the array one extra time, which saves a few percent.

Author

krauthaufen commented 6 days ago

I'm currently focusing on getting ofList/ofSeq/ofArray faster for very small counts (benchmarks currently running). The code was a bit painful to write but I think it might be worth it.

Author

krauthaufen commented 6 days ago

Hey, my optimizations for super-small counts seem to work decently.

For counts in [1..8] the new implementation is always at least as fast (within error margins) as the original one and gets a whole lot faster for large inputs.

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-9750H CPU 2.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.200-preview.20601.7
  [Host]     : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT DEBUG
  Job-DPNHNG : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT

Server=True

| Method | Count | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Map_ofArray | 1 | 17.07 ns | 0.113 ns | 0.106 ns | 17.03 ns | 1.00 | 0.00 | 0.0007 | - | - | 64 B |
| MapNew_ofArray | 1 | 14.48 ns | 0.355 ns | 0.816 ns | 14.79 ns | 0.79 | 0.07 | 0.0006 | - | - | 56 B |
| Map_ofList | 1 | 27.63 ns | 0.136 ns | 0.127 ns | 27.66 ns | 1.00 | 0.00 | 0.0010 | - | - | 88 B |
| MapNew_ofList | 1 | 11.73 ns | 0.121 ns | 0.113 ns | 11.70 ns | 0.42 | 0.00 | 0.0006 | - | - | 56 B |
| Map_ofSeq | 1 | 30.38 ns | 0.188 ns | 0.176 ns | 30.38 ns | 1.00 | 0.00 | 0.0010 | - | - | 88 B |
| MapNew_ofSeq | 1 | 15.98 ns | 0.051 ns | 0.040 ns | 15.99 ns | 0.53 | 0.00 | 0.0006 | - | - | 56 B |
| Map_ofArray | 2 | 31.17 ns | 0.141 ns | 0.132 ns | 31.19 ns | 1.00 | 0.00 | 0.0013 | - | - | 112 B |
| MapNew_ofArray | 2 | 32.12 ns | 0.164 ns | 0.154 ns | 32.18 ns | 1.03 | 0.01 | 0.0011 | - | - | 104 B |
| Map_ofList | 2 | 41.19 ns | 0.166 ns | 0.147 ns | 41.18 ns | 1.00 | 0.00 | 0.0015 | - | - | 136 B |
| MapNew_ofList | 2 | 31.41 ns | 0.199 ns | 0.186 ns | 31.37 ns | 0.76 | 0.00 | 0.0011 | - | - | 104 B |
| Map_ofSeq | 2 | 43.78 ns | 0.164 ns | 0.137 ns | 43.78 ns | 1.00 | 0.00 | 0.0015 | - | - | 136 B |
| MapNew_ofSeq | 2 | 37.03 ns | 0.448 ns | 0.397 ns | 36.94 ns | 0.85 | 0.01 | 0.0011 | - | - | 104 B |
| Map_ofArray | 3 | 84.56 ns | 0.516 ns | 0.482 ns | 84.45 ns | 1.00 | 0.00 | 0.0023 | - | - | 208 B |
| MapNew_ofArray | 3 | 90.89 ns | 0.630 ns | 0.589 ns | 90.69 ns | 1.07 | 0.01 | 0.0021 | - | - | 200 B |
| Map_ofList | 3 | 79.10 ns | 0.253 ns | 0.224 ns | 79.11 ns | 1.00 | 0.00 | 0.0023 | - | - | 208 B |
| MapNew_ofList | 3 | 54.52 ns | 0.347 ns | 0.324 ns | 54.42 ns | 0.69 | 0.00 | 0.0014 | - | - | 128 B |
| Map_ofSeq | 3 | 98.81 ns | 0.500 ns | 0.443 ns | 98.70 ns | 1.00 | 0.00 | 0.0025 | - | - | 232 B |
| MapNew_ofSeq | 3 | 90.07 ns | 0.668 ns | 0.592 ns | 90.21 ns | 0.91 | 0.01 | 0.0023 | - | - | 200 B |
| Map_ofArray | 4 | 182.91 ns | 0.601 ns | 0.563 ns | 182.89 ns | 1.00 | 0.00 | 0.0041 | - | - | 376 B |
| MapNew_ofArray | 4 | 94.31 ns | 1.924 ns | 2.760 ns | 92.69 ns | 0.52 | 0.02 | 0.0019 | - | - | 176 B |
| Map_ofList | 4 | 218.61 ns | 0.670 ns | 0.627 ns | 218.55 ns | 1.00 | 0.00 | 0.0050 | - | - | 448 B |
| MapNew_ofList | 4 | 105.21 ns | 0.439 ns | 0.390 ns | 105.22 ns | 0.48 | 0.00 | 0.0025 | - | - | 224 B |
| Map_ofSeq | 4 | 176.96 ns | 0.594 ns | 0.526 ns | 176.78 ns | 1.00 | 0.00 | 0.0038 | - | - | 352 B |
| MapNew_ofSeq | 4 | 140.27 ns | 1.661 ns | 1.387 ns | 140.14 ns | 0.79 | 0.01 | 0.0029 | - | - | 272 B |
| Map_ofArray | 5 | 255.16 ns | 1.817 ns | 1.517 ns | 254.98 ns | 1.00 | 0.00 | 0.0052 | - | - | 496 B |
| MapNew_ofArray | 5 | 126.41 ns | 0.737 ns | 0.690 ns | 126.46 ns | 0.50 | 0.00 | 0.0024 | - | - | 224 B |
| Map_ofList | 5 | 188.18 ns | 0.677 ns | 0.566 ns | 188.12 ns | 1.00 | 0.00 | 0.0043 | - | - | 400 B |
| MapNew_ofList | 5 | 122.11 ns | 0.233 ns | 0.207 ns | 122.14 ns | 0.65 | 0.00 | 0.0021 | - | - | 200 B |
| Map_ofSeq | 5 | 223.41 ns | 1.351 ns | 1.264 ns | 223.52 ns | 1.00 | 0.00 | 0.0050 | - | - | 448 B |
| MapNew_ofSeq | 5 | 276.21 ns | 2.683 ns | 2.509 ns | 275.32 ns | 1.24 | 0.01 | 0.0057 | - | - | 512 B |
| Map_ofArray | 6 | 265.37 ns | 1.009 ns | 0.895 ns | 265.47 ns | 1.00 | 0.00 | 0.0057 | - | - | 520 B |
| MapNew_ofArray | 6 | 207.73 ns | 0.587 ns | 0.520 ns | 207.74 ns | 0.78 | 0.00 | 0.0043 | - | - | 392 B |
| Map_ofList | 6 | 272.02 ns | 1.520 ns | 1.422 ns | 271.67 ns | 1.00 | 0.00 | 0.0057 | - | - | 544 B |
| MapNew_ofList | 6 | 221.65 ns | 0.863 ns | 0.765 ns | 221.63 ns | 0.81 | 0.00 | 0.0052 | - | - | 472 B |
| Map_ofSeq | 6 | 275.55 ns | 0.819 ns | 0.726 ns | 275.70 ns | 1.00 | 0.00 | 0.0057 | - | - | 544 B |
| MapNew_ofSeq | 6 | 232.76 ns | 0.823 ns | 0.770 ns | 232.49 ns | 0.84 | 0.00 | 0.0052 | - | - | 472 B |
| Map_ofArray | 7 | 451.17 ns | 1.268 ns | 1.124 ns | 451.04 ns | 1.00 | 0.00 | 0.0086 | - | - | 784 B |
| MapNew_ofArray | 7 | 237.82 ns | 1.106 ns | 0.980 ns | 237.71 ns | 0.53 | 0.00 | 0.0048 | - | - | 432 B |
| Map_ofList | 7 | 362.27 ns | 1.705 ns | 1.595 ns | 362.23 ns | 1.00 | 0.00 | 0.0072 | - | - | 664 B |
| MapNew_ofList | 7 | 261.09 ns | 1.908 ns | 1.692 ns | 261.01 ns | 0.72 | 0.00 | 0.0052 | - | - | 504 B |
| Map_ofSeq | 7 | 506.84 ns | 2.857 ns | 2.533 ns | 506.19 ns | 1.00 | 0.00 | 0.0095 | - | - | 904 B |
| MapNew_ofSeq | 7 | 259.65 ns | 1.002 ns | 0.937 ns | 259.80 ns | 0.51 | 0.00 | 0.0052 | - | - | 504 B |
| Map_ofArray | 8 | 574.78 ns | 3.576 ns | 3.345 ns | 574.83 ns | 1.00 | 0.00 | 0.0105 | - | - | 1000 B |
| MapNew_ofArray | 8 | 282.62 ns | 0.973 ns | 0.863 ns | 282.76 ns | 0.49 | 0.00 | 0.0052 | - | - | 496 B |
| Map_ofList | 8 | 613.85 ns | 2.704 ns | 2.397 ns | 613.49 ns | 1.00 | 0.00 | 0.0124 | - | - | 1096 B |
| MapNew_ofList | 8 | 307.42 ns | 1.513 ns | 1.415 ns | 307.66 ns | 0.50 | 0.00 | 0.0062 | - | - | 560 B |
| Map_ofSeq | 8 | 606.14 ns | 2.841 ns | 2.657 ns | 606.27 ns | 1.00 | 0.00 | 0.0114 | - | - | 1048 B |
| MapNew_ofSeq | 8 | 311.47 ns | 1.205 ns | 1.068 ns | 311.31 ns | 0.51 | 0.00 | 0.0062 | - | - | 560 B |

yatli commented 6 days ago

This was actually intentional since it saves iterating the array once and saves a few percent.

Hmm, if so, why not force the condition sortedLength < src.Length? Currently arrays of length 2^n do not benefit from this last merge run.

Author

krauthaufen commented 6 days ago

Good idea, thanks.
It was a little late when I coded that :sweat_smile:

yatli commented 6 days ago

How do we maintain compatibility with the serialization libraries?
Like Json.Net etc.

Author

krauthaufen commented 6 days ago

Hey, I just did some benchmarks with decimal as key and/or value and a reference type (a custom wrapped decimal), which all show more or less identical behaviour to the int*int case.

However, LanguagePrimitives.FastGenericComparer boxes custom structs (also Guid, for example), and when that happens the new implementation performs more or less identically to the old one. I should note that in that case the overall runtime of almost all operations increases drastically for both implementations. (I think there's an old issue covering that.)

A possible explanation for this is that, given an expensive (boxing) comparison, the new implementation of ofArray may perform slightly more comparisons than the old one.

The serialization thing is really a little tricky, since I have no idea what these serializations look like internally.
Another question is whether or not they even need to be compatible.
FsPickler, for example, has a special case for Map (using toArray/ofArray afaik).

Author

krauthaufen commented 6 days ago

A quick check revealed what Json.Net does:

    let m = MapNew.ofArray [| 1,2; 3,4; 5,6 |]
    let json = Json.Net.JsonNet.Serialize(m)
    printfn "%s" json // => ["1":2,"3":4,"5":6]

Maybe due to my IDictionary<'Key, 'Value> implementation?

yatli commented 6 days ago

["1":2,"3":4,"5":6]

Quite good then! It matches the results of the old map.

Collaborator

dsyme commented 5 days ago

edited

Amazing work here!

For serialization I believe the contract is a serializedData field that gets populated on demand; here's the current implementation.

Is this feasible?

    [<System.NonSerialized>]
    // This type is logically immutable. This field is only mutated during deserialization.
    let mutable comparer = comparer
 
    [<System.NonSerialized>]
    // This type is logically immutable. This field is only mutated during deserialization.
    let mutable tree = tree

    // WARNING: The compiled name of this field may never be changed because it is part of the logical
    // WARNING: permanent serialization format for this type.
    let mutable serializedData = null

    [<System.Runtime.Serialization.OnSerializingAttribute>]
    member __.OnSerializing(context: System.Runtime.Serialization.StreamingContext) =
        ignore context
        serializedData <- MapTree.toArray tree |> Array.map (fun (k, v) -> KeyValuePair(k, v))

    [<System.Runtime.Serialization.OnDeserializedAttribute>]
    member __.OnDeserialized(context: System.Runtime.Serialization.StreamingContext) =
        ignore context
        comparer <- LanguagePrimitives.FastGenericComparer<'Key>
        tree <- serializedData |> Array.map (fun kvp -> kvp.Key, kvp.Value) |> MapTree.ofArray comparer
        serializedData <- null

Author

krauthaufen commented 5 days ago

Hey, we could certainly do that.

Btw, I finally started creating a Fable app for visualizing benchmark results, and it's just amazing how simple that is :grin:

I'll add the serialization stuff once I'm done with the visualization.

yatli commented 5 days ago

edited

However the LanguagePrimitives.FastGenericComparer boxes custom structs (also Guid for example) and when that happens the new implementation performs more or less identical to the old one.

The optimizer is aware of this path and will try to de-virtualize the call (see generic_comparison_inner_vref),
but it only works on F#-generated structs (those with CompareTo).

  • prim-types.fs,L1095,GenericComparisonIntrinsic
  • Optimizer.fs,L2602, ... when CanDevirtualizeApplication cenv v cenv.g.generic_comparison_inner_vref ty args ...

I think we can definitely do better than the most generic comparer in some cases (e.g. structs, non-null stuff) -- we can add them to the optimizer as separate comparison de-virtualization cases.

Edit: for well-known C# types (Guid etc.) we can directly add them to the fast generic comparer table; see prim-types.fs,L2084, type FastGenericEqualityComparerTable<'T> ...

Author

krauthaufen commented 4 days ago

Hey, the virtual call is only one part of the problem here, I think.

here is a :? IComparable cast which works but still boxes the struct value and therefore allocates garbage. Then the virtual CompareTo(o : obj) gets invoked, which boxes the second argument again.

So for structs we effectively pay 2 boxings and 1 virtual call (and maybe a tuple allocation if the compiler doesn't optimize it away).

I have no idea how this is avoided in System.Collections.Generic.Comparer<'T>.Default, but it somehow seems to be (at least my tests are way faster when using it).

Btw the same goes for custom structs and Equals (when structs override Equals): the struct gets boxed and the virtual Equals is invoked.

But I think that's a different issue.
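
A sketch of the two comparison paths described above (MyStruct is an invented example type, not from the repo): the IComparable route boxes both values, while Comparer<'T>.Default picks up IComparable<'T> and compares without boxing:

```fsharp
open System
open System.Collections.Generic

// 'MyStruct' is an invented example type implementing both comparison interfaces.
[<Struct>]
type MyStruct(value : int) =
    member _.Value = value
    interface IComparable<MyStruct> with
        member _.CompareTo(other : MyStruct) = compare value other.Value
    interface IComparable with
        member _.CompareTo(o : obj) =
            match o with
            | :? MyStruct as other -> compare value other.Value
            | _ -> invalidArg "o" "wrong type"

// The ':? IComparable' route: 'a' is boxed by the upcast, and 'b' is boxed
// again when passed to the virtual CompareTo(obj).
let compareBoxed (a : MyStruct) (b : MyStruct) =
    (a :> IComparable).CompareTo(box b)

// Comparer<'T>.Default detects IComparable<'T> once (at type initialization)
// and then calls the generic CompareTo without boxing either argument.
let compareFast (a : MyStruct) (b : MyStruct) =
    Comparer<MyStruct>.Default.Compare(a, b)
```

Both functions return the same ordering; only the allocation behaviour differs.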

yatli commented 4 days ago

here is a :? IComparable which works but still boxes the struct value and therefore allocates garbage. Then the virtual CompareTo(o : obj) gets invoked which again boxes the second argument.

If properly optimized it won't go into this path. The optimizer will rewrite the AST.

Author

krauthaufen commented 4 days ago

Regarding the Map implementation:

  • I added several new things: union/unionWith/GetSlice/etc.
  • All benchmarks for the current state are running atm with better iteration counts, etc. (may take a while)
  • I think I'm done optimizing the code, and profiles indicate that there is very little left to optimize. (I'm also out of ideas :rofl:)
  • So if you really want this in FSharp.Core, I could create a Set implementation and put together a PR for FSharp.Core

The preliminary benchmark results (small iteration counts) can be seen here

yatli commented 4 days ago

Fantastic visualization :smile:

This is absolutely fantastic, thank you for all that work @krauthaufen!

yatli commented 3 days ago

@krauthaufen maybe I can help with the optimizer if this suggestion is approved. We can then set up a branch and work with FSharp.Core and fsc directly.

JustinWick commented 3 hours ago

edited

@krauthaufen If this pans out, it will be absurdly useful. Thanks for all of your hard work, this is exciting, and educational for all of us here on the thread!

