5

字符串压缩(三)之短字符串压缩 - T-BARBARIANS

 1 year ago
source link: https://www.cnblogs.com/t-bar/p/16506289.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

字符串压缩(三)之短字符串压缩

  本文来自博客园,作者:T-BARBARIANS,转载请注明原文链接:https://www.cnblogs.com/t-bar/p/16506289.html 谢谢!

 前言

  上一篇探索了LZ4的压缩和解压性能,以及对LZ4和ZSTD的压缩、解压性能进行了横向对比。文末的最后也给了一个彩蛋:任意长度的字符串都可以被ZSTD、LZ4之类的压缩算压缩得很好吗?

  本篇我们就来一探究竟。

一、通用算法的短字符压缩

  开门见山,我们使用一段比较短的文本:Narrator: It is raining today. So, Peppa and George cannot  play outside.Peppa: Daddy, it's stopped raining.

使用ZSTD与LZ4分别压缩一下上面这段短文本。下面分别是它们的压缩结果。

  ZSTD:

  

1485868-20220809174316064-1067885451.png

   LZ4:

  1485868-20220809174413887-1557165441.png

  对短文本的压缩,zstd的压缩率很低,lz4压缩后的文本长度尽然超过了原有字符串的长度。这是为什么?说实话在这之前我也没想到。

  引用两位大佬的名言:

  Are you ok?  

What's your problem?

二、短字符串压缩

  从上面的结果可以得知,任何压缩算法都有它的使用场景,并不是所有长度的字符串都适合被某种算法压缩。一般原因是通用压缩算法维护了被压缩字符串的,用于字符串还原的相关数据结构,而这些数据结构的长度超过了被压缩短字符串的自身长度。

  那么问题来了,“我真的有压缩短字符串的需求,我想体验压缩的极致感,怎么办?”。

  短字符压缩算法它来了。这里挑选了3种比较优异的短字符压缩算法,分别是smaz,shoco,以及压轴的unisox2。跟前两章一样,还是从压缩率,压缩和解压缩性能的角度,一起看看他们在短字符压缩场景的各自表现吧。

(1)Smaz

1、Smaz的压缩和解压缩

 1 #include <stdio.h>
 2 #include <string.h>
 3 #include <iostream>
 4 #include "smaz.h"
 5 
 6 using namespace std;
 7 
 8 int main()
 9 {
10     int buf_len;
11     int com_size;
12     int decom_size;
13     
14     char com_buf[4096] = {0};
15     char decom_buf[4096] = {0};
16 
17     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
18 
19     buf_len = strlen(str_buf);
20     com_size = smaz_compress(str_buf, buf_len, com_buf, 4096);
21     
22     cout << "text size:" << buf_len << endl;
23     cout << "compress text size:" << com_size << endl;
24     cout << "compress ratio:" << (float)buf_len / (float)com_size << endl << endl;
25 
26     decom_size = smaz_decompress(com_buf, com_size, decom_buf, 4096);
27     cout << "decompress text size:" << decom_size << endl;
28 
29     if(strncmp(str_buf, decom_buf, buf_len)) {
30         cout << "decompress text is not equal to source text" << endl;
31     }
32 
33     return 0;
34 }

  执行结果如下:

   1485868-20220809192931760-1344766981.png

   通过smaz压缩后的短字符串长度为77,和源字符串相比,减少了30Byte。

2、Smaz的压缩和解压缩性能

 1 #include <stdio.h>
 2 #include <string.h>
 3 #include <iostream>
 4 #include <sys/time.h>
 5 #include "smaz.h"
 6 
 7 using namespace std;
 8 
 9 int main()
10 {
11     int cnt = 0;
12     int buf_len;
13     int com_size;
14     int decom_size;
15 
16     timeval st, et;
17     
18     char *com_ptr = NULL;
19     char* decom_ptr = NULL;
20 
21     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
22 
23     buf_len = strlen(str_buf);
24     gettimeofday(&st, NULL);
25     while(1) {
26         
27         com_ptr = (char *)malloc(buf_len);
28         com_size = smaz_compress(str_buf, buf_len, com_ptr, buf_len);
29 
30         free(com_ptr);
31         cnt++;
32         
33         gettimeofday(&et, NULL);
34         if(et.tv_sec - st.tv_sec >= 10) {
35             break;
36         }
37     }
38 
39     cout << endl <<"compress per second:" << cnt/10 << " times" << endl;
40 
41     cnt = 0;
42     com_ptr = (char *)malloc(buf_len);
43     com_size = smaz_compress(str_buf, buf_len, com_ptr, buf_len);
44     
45     gettimeofday(&st, NULL);
46     while(1) {
47 
48         // decompress length not more than origin buf length
49         decom_ptr = (char *)malloc(buf_len + 1);
50         decom_size = smaz_decompress(com_ptr, com_size, decom_ptr, buf_len + 1);
51 
52         // check decompress length
53         if(buf_len != decom_size) {
54             cout << "decom error" << endl;
55         }
56         
57         free(decom_ptr);
58         cnt++;
59         
60         gettimeofday(&et, NULL);
61         if(et.tv_sec - st.tv_sec >= 10) {
62             break;
63         }
64     }
65 
66     cout << "decompress per second:" << cnt/10 << " times" << endl << endl;
67     
68     free(com_ptr);
69     return 0;
70 }

  结果如何?

  

1485868-20220810101834587-1922351535.png

   压缩性能在40w条/S,解压在百万级,好像还不错哈!

(2)Shoco

1、Shoco的压缩和解压缩

 1 #include <stdio.h>
 2 #include <string.h>
 3 #include <iostream>
 4 #include "shoco.h"
 5 
 6 using namespace std;
 7 
 8 int main()
 9 {
10     int buf_len;
11     int com_size;
12     int decom_size;
13     
14     char com_buf[4096] = {0};
15     char decom_buf[4096] = {0};
16 
17     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
18 
19     buf_len = strlen(str_buf);
20     com_size = shoco_compress(str_buf, buf_len, com_buf, 4096);
21     
22     cout << "text size:" << buf_len << endl;
23     cout << "compress text size:" << com_size << endl;
24     cout << "compress ratio:" << (float)buf_len / (float)com_size << endl << endl;
25 
26     decom_size = shoco_decompress(com_buf, com_size, decom_buf, 4096);
27     cout << "decompress text size:" << decom_size << endl;
28 
29     if(strncmp(str_buf, decom_buf, buf_len)) {
30         cout << "decompress text is not equal to source text" << endl;
31     }
32 
33     return 0;
34 }

  执行结果如下:

  1485868-20220809193802348-1168192544.png

   通过shoco压缩后的短字符串长度为86,和源字符串相比,减少了21Byte。压缩率比smaz要低。

 2、Shoco的压缩和解压缩性能

 1 #include <stdio.h>
 2 #include <string.h>
 3 #include <iostream>
 4 #include <sys/time.h>
 5 #include "shoco.h"
 6 
 7 using namespace std;
 8 
 9 int main()
10 {
11     int cnt = 0;
12     int buf_len;
13     int com_size;
14     int decom_size;
15 
16     timeval st, et;
17     
18     char *com_ptr = NULL;
19     char* decom_ptr = NULL;
20 
21     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
22 
23     buf_len = strlen(str_buf);
24     gettimeofday(&st, NULL);
25     while(1) {
26         
27         com_ptr = (char *)malloc(buf_len);
28         com_size = shoco_compress(str_buf, buf_len, com_ptr, buf_len);
29 
30         free(com_ptr);
31         cnt++;
32         
33         gettimeofday(&et, NULL);
34         if(et.tv_sec - st.tv_sec >= 10) {
35             break;
36         }
37     }
38 
39     cout << endl <<"compress per second:" << cnt/10 << " times" << endl;
40 
41     cnt = 0;
42     com_ptr = (char *)malloc(buf_len);
43     com_size = shoco_compress(str_buf, buf_len, com_ptr, buf_len);
44     
45     gettimeofday(&st, NULL);
46     while(1) {
47 
48         // decompress length not more than origin buf length
49         decom_ptr = (char *)malloc(buf_len + 1);
50         decom_size = shoco_decompress(com_ptr, com_size, decom_ptr, buf_len + 1);
51 
52         // check decompress length
53         if(buf_len != decom_size) {
54             cout << "decom error" << endl;
55         }
56         
57         free(decom_ptr);
58         cnt++;
59         
60         gettimeofday(&et, NULL);
61         if(et.tv_sec - st.tv_sec >= 10) {
62             break;
63         }
64     }
65 
66     cout << "decompress per second:" << cnt/10 << " times" << endl << endl;
67     
68     free(com_ptr);
69     return 0;
70 }

  执行结果如何呢?

  

1485868-20220810095845204-1522560043.png

   holy shit!压缩和解压缩居然都达到了惊人的百万级。就像算法作者们自己说的一样:“在长字符串压缩领域,shoco不想与通用压缩算法竞争,我们的优势是短字符的快速压缩,虽然压缩率很烂!”。这样说,好像也没毛病。

(3)Unisox2

我们再来看看unisox2呢。

1、Unisox2的压缩和解压缩

 1 #include <stdio.h>
 2 #include <string.h>
 3 #include "unishox2.h"
 4 
 5 int main()
 6 {
 7     int buf_len;
 8     int com_size;
 9     int decom_size;
10     
11     char com_buf[4096] = {0};
12     char decom_buf[4096] = {0};
13     
14     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
15     
16     buf_len = strlen(str_buf);
17     com_size = unishox2_compress_simple(str_buf, buf_len, com_buf);
18 
19     printf("text size:%d\n", buf_len);
20     printf("compress text size:%d\n", com_size);
21     printf("compress ratio:%f\n\n", (float)buf_len / (float)com_size);
22     
23     decom_size = unishox2_decompress_simple(com_buf, com_size, decom_buf);
24 
25     printf("decompress text size:%d\n", decom_size);
26 
27     if(strncmp(str_buf, decom_buf, buf_len)) {
28         printf("decompress text is not equal to source text\n");
29     }
30 
31     return 0;
32 }

  结果如下:

  

1485868-20220810105331396-362705966.png

   通过Unisox2压缩后的短字符串长度为67,和源字符串相比,减少了40Byte,相当于是打了6折啊!不错不错。

 2、Unisox2的压缩和解压缩性能

Unisox2的压缩能力目前来看是三者中最好的,如果他的压缩和解压性能也不错的话,那就真的就比较完美了。再一起看看Unisox2的压缩和解压性能吧!

 1 #include <stdio.h>
 2 #include <string.h>
 3 #include <malloc.h>
 4 #include <sys/time.h>
 5 #include "unishox2.h"
 6 
 7 int main()
 8 {
 9     int cnt = 0;
10     int buf_len;
11     int com_size;
12     int decom_size;
13 
14     struct timeval st, et;
15     
16     char *com_ptr = NULL;
17     char* decom_ptr = NULL;
18 
19     char str_buf[1024] = "Narrator: It is raining today. So, Peppa and George cannot play outside.Peppa: Daddy, it's stopped raining.";
20 
21     buf_len = strlen(str_buf);
22     gettimeofday(&st, NULL);
23     while(1) {
24         
25         com_ptr = (char *)malloc(buf_len);
26         com_size = unishox2_compress_simple(str_buf, buf_len, com_ptr);
27 
28         free(com_ptr);
29         cnt++;
30         
31         gettimeofday(&et, NULL);
32         if(et.tv_sec - st.tv_sec >= 10) {
33             break;
34         }
35     }
36 
37     printf("\ncompress per second:%d times\n", cnt/10);
38 
39     cnt = 0;
40     com_ptr = (char *)malloc(buf_len);
41     com_size = unishox2_compress_simple(str_buf, buf_len, com_ptr);
42     
43     gettimeofday(&st, NULL);
44     while(1) {
45 
46         // decompress length not more than origin buf length
47         decom_ptr = (char *)malloc(buf_len + 1);
48         decom_size = unishox2_decompress_simple(com_ptr, com_size, decom_ptr);
49 
50         // check decompress length
51         if(buf_len != decom_size) {
52             printf("decom error\n");
53         }
54         
55         free(decom_ptr);
56         cnt++;
57         
58         gettimeofday(&et, NULL);
59         if(et.tv_sec - st.tv_sec >= 10) {
60             break;
61         }
62     }
63 
64     printf("decompress per second:%d times\n\n", cnt/10);
65     
66     free(com_ptr);
67     return 0;
68 }

  执行结果如下:

  

1485868-20220810113033395-354338860.png

  事与愿违,Unisox2虽然有三个算法中最好的压缩率,可是却也拥有最差的压缩和解压性能。跟前两章分析的不谋而合:有高压缩率,就会损失自身的压缩性能,两者不可兼得。

三、总结

  本篇分享了smaz,shoco,unisox2三种短字符串压缩算法,分别探索了它们各自的压缩率与压缩和解压缩性能,结果如下表所示。

1485868-20220810150956044-915270019.png

  shoco的压缩率最低,但是拥有最高的压缩和解压速率;smaz居中;unisox2拥有最高的压缩率,可是它的压缩和解压性能最低。

  结论与前两章有关长字符串压缩的分析不谋而合:拥有高压缩率,就会损失自身的压缩性能,两者不可兼得。

  实际使用还是看自身需求和环境吧。如果适当压缩就好,那就可以选用shoco,毕竟性能高;想要节约更多的空间,那就选择smaz或者unisox2。

  好了,字符串压缩系列的分享就到此为止了,如果对你有些许帮助,还请各位技术爱好者登录点赞呀,谢谢!

  本文来自博客园,作者:T-BARBARIANS,转载请注明原文链接:https://www.cnblogs.com/t-bar/p/16506289.html 谢谢!


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK