

Random access to zlib compressed files
source link: http://lh3.github.io/2014/07/05/random-access-to-zlib-compressed-files
Zlib/gzip is probably the most popular library/tool for general data compression. In zlib, there is an API gzseek() which places the file position indicator at a specified offset in the uncompressed file. However, for a file opened for reading the seek is only emulated: gzseek() has to decompress and discard all the data up to the specified offset, and a backward seek starts over from the beginning of the file. For huge files, this is very slow.
It is actually possible to achieve faster random access in a generic gzip file. The zran.c in the zlib source code package gives an example implementation. It works by keeping the 32kB of uncompressed data right before each access point. Deflate back-references reach at most 32kB into the preceding output, so with this 32kB of data we can decompress the data after the access point without decompressing from the beginning. My friend Jue Ruan found this example and implemented zrio, a small library that keeps the 32kB data in an index file to achieve random access to generic gzip files. This library is used in maqview.
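The role of the 32kB window can be illustrated with a minimal Python sketch. This is not zran.c itself: zran.c builds access points at arbitrary bit positions in an existing gzip file, whereas the sketch assumes a raw deflate stream that it writes itself, with byte-aligned access points created by Z_SYNC_FLUSH. The names compress_with_points and read_from_point are hypothetical; Python's zlib module is used because it wraps the same zlib library.

```python
import zlib

WINDOW = 32 * 1024  # deflate back-references reach at most 32 kB back


def compress_with_points(data, chunk=64 * 1024):
    """Raw-deflate `data` and record an access point before every chunk but the first.

    Each point is (uncompressed_offset, compressed_offset, window), where
    `window` is the 32 kB of uncompressed data preceding the point.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # wbits=-15: raw deflate
    out = bytearray()
    points = []
    for start in range(0, len(data), chunk):
        if start:
            points.append((start, len(out), data[max(0, start - WINDOW):start]))
        out += comp.compress(data[start:start + chunk])
        # Z_SYNC_FLUSH ends the block on a byte boundary but keeps the history,
        # so later blocks may still refer back to earlier data.
        out += comp.flush(zlib.Z_SYNC_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), points


def read_from_point(blob, point, offset, size):
    """Read `size` bytes at uncompressed `offset`, starting from `point`."""
    uoff, coff, window = point
    if offset < uoff:
        raise ValueError("offset lies before the access point")
    # Preloading the saved window replaces decompressing everything before it.
    d = zlib.decompressobj(-15, zdict=window)
    skipped = offset - uoff
    out = d.decompress(blob[coff:], skipped + size)
    return out[skipped:skipped + size]
```

Each access point in this sketch carries its own 32kB window, which is exactly the per-point cost that makes such an index heavy.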
However, keeping 32kB of data per access point is quite heavy. To drop this 32kB dependency, Jue sought a better solution: calling deflate(stream,Z_FULL_FLUSH) every 64kB. After Z_FULL_FLUSH, we can decompress the following data independently of the previous data, so keeping 32kB is not necessary any more. The resultant compressed stream is still fully compatible with zlib. Jue implemented this idea in RAZF. In addition to this stream reset, RAZF also writes an index table at the end of the file. Given an uncompressed offset, we can look up the table to find the nearest access point at or before the offset to achieve random access. The index is much smaller and the speed is much faster.
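The full-flush idea can be sketched in Python as follows. This is an illustration of the technique only, not RAZF's actual file format: the index here is an in-memory list rather than a table written at the end of the file, and the names compress_with_resets and read_at are hypothetical.

```python
import bisect
import zlib

CHUNK = 64 * 1024  # reset the compressor every 64 kB, as described above


def compress_with_resets(data):
    """Gzip `data`, issuing Z_FULL_FLUSH after every chunk.

    Returns the compressed bytes and an index of
    (uncompressed_offset, compressed_offset) access points.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31: gzip container
    out = bytearray()
    index = []
    for start in range(0, len(data), CHUNK):
        if start:
            index.append((start, len(out)))
        out += comp.compress(data[start:start + CHUNK])
        # Z_FULL_FLUSH byte-aligns the output and discards the history,
        # so the next block does not depend on anything before it.
        out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), index


def read_at(blob, index, offset, size):
    """Read `size` bytes at uncompressed `offset` via the nearest earlier point."""
    starts = [u for u, _ in index]
    i = bisect.bisect_right(starts, offset) - 1
    if i < 0:
        uoff, coff, wbits = 0, 0, 31    # before the first point: parse the gzip header
    else:
        (uoff, coff), wbits = index[i], -15  # mid-stream: raw deflate, no window needed
    d = zlib.decompressobj(wbits)
    skipped = offset - uoff
    out = d.decompress(blob[coff:], skipped + size)
    return out[skipped:skipped + size]
```

The output is still an ordinary gzip stream that any zlib-based reader can decompress from the start. Each index entry is only two integers, compared with a 32kB window per access point in the previous approach.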
The first prototype of BAM used RAZF. At that time, a major concern was that RAZF relied on low-level zlib APIs which were not available in other programming languages. This would limit the adoption of BAM. The size of the index might also become a concern for files larger than 100GB. In the discussion, Gerton Lunter directed us to dictzip, another tool for random access in gzip-compatible files. Dictzip would not work well for a huge BAM due to the constraint of the gzip header. However, its key idea, concatenating small gzip blocks, led Bob Handsaker to design something better: BGZF (section 4.1 of the SAM specification).
The key observation Bob made in BGZF is that when we seek to the middle of a compressed file, all we need is a virtual position, which is not necessarily the real position in the uncompressed file. In BGZF, the virtual position is a tuple (block_file_position,in_block_offset), where block_file_position is the file position, in the compressed file, of the start of a gzip block and in_block_offset is the offset within the uncompressed gzip block. With the tuple, we can unambiguously pinpoint a byte in the uncompressed file. When we keep the tuple in an index file, we can jump to the position without looking up another index. BGZF is smaller than RAZF and easier to implement. It has been implemented in C, Java, JavaScript and Go. Recently, Petr Danecek has extended BGZF with an extra index file to achieve random access by offset in the uncompressed file.
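A minimal Python sketch of seeking by virtual position is given below. It illustrates only the idea of concatenated gzip blocks and is not a conforming BGZF writer: real BGZF blocks carry an extra field in the gzip header recording the compressed block size, which this sketch omits. The SAM specification packs the tuple into one 64-bit integer, with the block position in the upper 48 bits and the in-block offset in the lower 16 bits; the sketch follows that packing. BLOCK_SIZE and the function names are assumptions made for illustration.

```python
import gzip
import zlib

BLOCK_SIZE = 0xFF00  # assumed uncompressed bytes per block; must not exceed 65536


def write_blocks(data):
    """Compress `data` as concatenated gzip members.

    Returns the compressed bytes and the file position of each block start.
    """
    out = bytearray()
    positions = []
    for start in range(0, len(data), BLOCK_SIZE):
        positions.append(len(out))
        out += gzip.compress(data[start:start + BLOCK_SIZE])
    return bytes(out), positions


def make_voffset(block_file_position, in_block_offset):
    """Pack the (block_file_position, in_block_offset) tuple into one integer."""
    return (block_file_position << 16) | in_block_offset


def read_at(blob, voffset, size):
    """Read `size` uncompressed bytes starting at a virtual position."""
    block_pos, in_block = voffset >> 16, voffset & 0xFFFF
    out = bytearray()
    buf = blob[block_pos:]  # a real reader would seek in the file instead
    while buf and len(out) < size:
        d = zlib.decompressobj(wbits=31)  # decodes exactly one gzip member
        block = d.decompress(buf)
        out += block[in_block:]
        in_block = 0                      # later blocks are read from their start
        buf = d.unused_data               # bytes belonging to the following blocks
    return bytes(out[:size])
```

Because every block is a complete gzip member, the concatenated output remains readable by standard gzip tools, and a reader needs nothing but the virtual position to start decoding.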
In the analysis of high-throughput sequencing data, BGZF plays a crucial role in reducing the storage cost while keeping the data easily accessible. It is a proven technology that has scaled to terabytes of data.