
Memory Copy Optimization (3): Deeper Optimization

 2 years ago
source link: https://www.skywind.me/blog/archives/1587

Today I continued optimizing the memory copy code from the earlier posts:

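1. Reworked the small-copy path: the size threshold grew from 64 bytes to 128 bytes, and copies now go through xmm registers instead of int, which makes small copies about 80% faster.
2. Reworked the medium-copy path: instead of copying through 4 xmm registers in parallel, it now copies through 8 in parallel and adds prefetch, for a gain of roughly 20%.
3. Removed the branch that checked alignment at the head of the destination address; a single xmm copy now brings the destination into alignment, improving performance by about 10%.
4. Added a test case: to be closer to real-world use, there is now a random-access test that copies random lengths at random positions within a 10 MB region (definitely larger than the L2 cache).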
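
Items 1 to 3 can be sketched in C with SSE2 intrinsics. This is a minimal sketch, not the actual FastMemcpy code: the function name `memcpy_sketch`, the fallback to `memcpy` below 16 bytes, the prefetch distance of one block, and the `_MM_HINT_NTA` hint are all assumptions made for illustration. It only shows the shape of the technique: one unaligned xmm copy for the head instead of an alignment branch, then 128-byte blocks moved through eight xmm values with a prefetch.

```c
#include <xmmintrin.h>  /* _mm_prefetch */
#include <emmintrin.h>  /* SSE2: __m128i loads and stores */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch, not the FastMemcpy source. Buffers must not overlap. */
static void *memcpy_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (n < 16)
        return memcpy(dst, src, n);  /* tiny sizes are out of scope here */

    /* Load the first and last 16 bytes before storing anything. */
    __m128i head = _mm_loadu_si128((const __m128i *)s);
    __m128i tail = _mm_loadu_si128((const __m128i *)(s + n - 16));
    unsigned char *tail_dst = d + n - 16;

    /* One unaligned store covers the head, so no branch on alignment. */
    _mm_storeu_si128((__m128i *)d, head);
    size_t pad = 16 - ((uintptr_t)d & 15);  /* 1..16 bytes */
    d += pad; s += pad; n -= pad;           /* d is now 16-byte aligned */

    /* Medium path: 8 unaligned loads, then 8 aligned stores per block. */
    while (n >= 128) {
        /* Assumed prefetch distance: the start of the next block. */
        _mm_prefetch((const char *)(s + 128), _MM_HINT_NTA);
        __m128i x0 = _mm_loadu_si128((const __m128i *)(s + 0));
        __m128i x1 = _mm_loadu_si128((const __m128i *)(s + 16));
        __m128i x2 = _mm_loadu_si128((const __m128i *)(s + 32));
        __m128i x3 = _mm_loadu_si128((const __m128i *)(s + 48));
        __m128i x4 = _mm_loadu_si128((const __m128i *)(s + 64));
        __m128i x5 = _mm_loadu_si128((const __m128i *)(s + 80));
        __m128i x6 = _mm_loadu_si128((const __m128i *)(s + 96));
        __m128i x7 = _mm_loadu_si128((const __m128i *)(s + 112));
        _mm_store_si128((__m128i *)(d + 0), x0);
        _mm_store_si128((__m128i *)(d + 16), x1);
        _mm_store_si128((__m128i *)(d + 32), x2);
        _mm_store_si128((__m128i *)(d + 48), x3);
        _mm_store_si128((__m128i *)(d + 64), x4);
        _mm_store_si128((__m128i *)(d + 80), x5);
        _mm_store_si128((__m128i *)(d + 96), x6);
        _mm_store_si128((__m128i *)(d + 112), x7);
        d += 128; s += 128; n -= 128;
    }

    /* Remaining whole 16-byte chunks. */
    while (n >= 16) {
        _mm_store_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
        d += 16; s += 16; n -= 16;
    }

    /* The saved tail overlaps whatever is left (fewer than 16 bytes). */
    _mm_storeu_si128((__m128i *)tail_dst, tail);
    return dst;
}
```

For non-overlapping buffers, `memcpy_sketch(dst, src, n)` should behave like `memcpy`. The head and tail stores may overlap bytes that the main loops also write, which is harmless and is what removes the per-byte head and tail handling.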

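To keep random number generation from skewing the results, the random numbers are generated in advance. In the end, average performance comes out at more than twice that of the standard library shipped with gcc 4.9: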

https://github.com/skywind3000/FastMemcpy
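
The pre-generation step mentioned above might look like the following sketch. It is not the benchmark from the repository: the `copy_job` struct, the xorshift generator, the helper names `make_jobs` and `run_jobs`, and the `max_len` parameter are assumptions chosen for illustration. The point is only that every offset and length is drawn before the clock starts, so the timed loop contains nothing but copies.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical record of one pre-generated copy. */
typedef struct {
    size_t src_off;
    size_t dst_off;
    size_t len;
} copy_job;

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Small xorshift32 generator; the state must never be zero. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill jobs[] with random positions and lengths inside a region of
 * `area` bytes. Requires 1 <= max_len <= area. */
static void make_jobs(copy_job *jobs, size_t count,
                      size_t area, size_t max_len, uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < count; i++) {
        size_t len = 1 + next_random(&state) % max_len;
        jobs[i].len = len;
        jobs[i].src_off = next_random(&state) % (area - len + 1);
        jobs[i].dst_off = next_random(&state) % (area - len + 1);
    }
}

/* Timed loop: only copies, no random number generation.
 * Returns elapsed CPU time in milliseconds. */
static double run_jobs(copy_fn fn, unsigned char *dst,
                       const unsigned char *src,
                       const copy_job *jobs, size_t count)
{
    clock_t start = clock();
    for (size_t i = 0; i < count; i++)
        fn(dst + jobs[i].dst_off, src + jobs[i].src_off, jobs[i].len);
    return 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With two separate 10 MB buffers, the same `jobs` array can be passed to `run_jobs` once with `memcpy` and once with the function under test, so both see an identical sequence of positions and lengths.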

Benchmark results for the latest code (compare them with the older table further down to see how much the new version improved):

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=78ms memcpy=260 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=250 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=266 ms
result(dst unalign, src unalign): memcpy_fast=78ms memcpy=234 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=281 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=109ms memcpy=343 ms
result(dst unalign, src unalign): memcpy_fast=93ms memcpy=344 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=125ms memcpy=218 ms
result(dst aligned, src unalign): memcpy_fast=156ms memcpy=484 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=546 ms
result(dst unalign, src unalign): memcpy_fast=172ms memcpy=515 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=109ms memcpy=172 ms
result(dst aligned, src unalign): memcpy_fast=187ms memcpy=453 ms
result(dst unalign, src aligned): memcpy_fast=172ms memcpy=437 ms
result(dst unalign, src unalign): memcpy_fast=156ms memcpy=452 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=109ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=218 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=62ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=78ms memcpy=202 ms
result(dst unalign, src aligned): memcpy_fast=78ms memcpy=203 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=203 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=203ms memcpy=191 ms
result(dst aligned, src unalign): memcpy_fast=219ms memcpy=281 ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=328 ms
result(dst unalign, src unalign): memcpy_fast=218ms memcpy=312 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=312ms memcpy=406 ms
result(dst aligned, src unalign): memcpy_fast=296ms memcpy=421 ms
result(dst unalign, src aligned): memcpy_fast=312ms memcpy=468 ms
result(dst unalign, src unalign): memcpy_fast=297ms memcpy=452 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=452 ms
result(dst aligned, src unalign): memcpy_fast=280ms memcpy=468 ms
result(dst unalign, src aligned): memcpy_fast=298ms memcpy=514 ms
result(dst unalign, src unalign): memcpy_fast=344ms memcpy=472 ms

benchmark random access:
memcpy_fast=515ms memcpy=1014ms

The old benchmark results were:

result: gcc4.9 (msvc 2012 got a similar result):

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=180ms memcpy=249 ms
result(dst aligned, src unalign): memcpy_fast=170ms memcpy=271 ms
result(dst unalign, src aligned): memcpy_fast=179ms memcpy=269 ms
result(dst unalign, src unalign): memcpy_fast=180ms memcpy=260 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=162ms memcpy=300 ms
result(dst aligned, src unalign): memcpy_fast=199ms memcpy=328 ms
result(dst unalign, src aligned): memcpy_fast=410ms memcpy=339 ms
result(dst unalign, src unalign): memcpy_fast=390ms memcpy=361 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=160ms memcpy=241 ms
result(dst aligned, src unalign): memcpy_fast=200ms memcpy=519 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=509 ms
result(dst unalign, src unalign): memcpy_fast=311ms memcpy=520 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=145ms memcpy=179 ms
result(dst aligned, src unalign): memcpy_fast=180ms memcpy=430 ms
result(dst unalign, src aligned): memcpy_fast=245ms memcpy=430 ms
result(dst unalign, src unalign): memcpy_fast=230ms memcpy=455 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=80ms memcpy=80 ms
result(dst aligned, src unalign): memcpy_fast=110ms memcpy=205 ms
result(dst unalign, src aligned): memcpy_fast=110ms memcpy=224 ms
result(dst unalign, src unalign): memcpy_fast=110ms memcpy=200 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=78 ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=222 ms
result(dst unalign, src aligned): memcpy_fast=100ms memcpy=210 ms
result(dst unalign, src unalign): memcpy_fast=100ms memcpy=230 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=200ms memcpy=201 ms
result(dst aligned, src unalign): memcpy_fast=260ms memcpy=270 ms
result(dst unalign, src aligned): memcpy_fast=263ms memcpy=361 ms
result(dst unalign, src unalign): memcpy_fast=267ms memcpy=321 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=281ms memcpy=391 ms
result(dst aligned, src unalign): memcpy_fast=265ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=313ms memcpy=453 ms
result(dst unalign, src unalign): memcpy_fast=282ms memcpy=439 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=266ms memcpy=422 ms
result(dst aligned, src unalign): memcpy_fast=250ms memcpy=407 ms
result(dst unalign, src aligned): memcpy_fast=297ms memcpy=516 ms
result(dst unalign, src unalign): memcpy_fast=281ms memcpy=436 ms

benchmark random access:
memcpy_fast=594ms memcpy=1161ms

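Earlier posts in this series:

Memory Copy Optimization (1): Small Memory Copy Optimization

Memory Copy Optimization (2): Full-Size Copy Optimization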
