

Format, quality binning and file size
source link: http://lh3.github.io/2020/05/25/format-quality-binning-and-file-sizes

This short post evaluates the effect of format and quality binning on file sizes. I am taking SRR2052362 as an example. It gives 4.3-fold coverage of the human genome. For 2-binning, I set original qualities of 20 or above to 30 and original qualities below 20 to 10. For 8-binning, I took the scheme from a white paper (PDF) published by Illumina. Illumina has been using quality binning for more than seven years. In this experiment, I only retained the original read names. To produce CRAM files, I mapped the short reads to the GRCh38 primary assembly. The following table shows the file sizes:
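| Format        | # qual bins | Size (GB) | Change relative to SRA |
|---------------|-------------|-----------|------------------------|
| Sorted CRAM   | 2 bins      | 1.187     | -85%                   |
| Unsorted CRAM | 2 bins      | 1.279     | -84%                   |
| Unsorted CRAM | 8 bins      | 2.115     | -73%                   |
| Gzip'd FASTA  | No quality  | 4.172     | -47%                   |
| Unsorted CRAM | Lossless    | 4.536     | -43%                   |
| Gzip'd FASTQ  | 2 bins      | 4.784     | -40%                   |
| SRA           | Lossless    | 7.917     | 0%                     |
| Gzip'd FASTQ  | Lossless    | 9.210     | +16%                   |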
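The 2-binning rule described before the table (quality 20 or above becomes 30, anything lower becomes 10) can be sketched in Python as below. This is only an illustration, not the script used to produce the numbers above. The function name `bin2_quality` and its parameters are hypothetical, and the sketch assumes the quality string uses the standard Phred+33 FASTQ encoding.

```python
def bin2_quality(qual: str, threshold: int = 20, high: int = 30,
                 low: int = 10, offset: int = 33) -> str:
    """Collapse a FASTQ quality string into two bins.

    Assumes Phred+33 encoding (offset 33). Qualities >= threshold map
    to `high`; all others map to `low`.
    """
    return "".join(
        chr((high if ord(c) - offset >= threshold else low) + offset)
        for c in qual
    )


# 'I' is Q40, '5' is Q20, '+' is Q10 and '#' is Q2 under Phred+33.
print(bin2_quality("II5+#"))  # prints "???++"
```

After binning, every quality string is drawn from a two-symbol alphabet (`?` for Q30 and `+` for Q10), which is much easier for a compressor to encode. That is consistent with the smaller 2-bin files in the table.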
It is clear that the CRAM format is the winner here, and its advantage becomes more prominent at lower quality resolution. A key question is how much quality binning affects variant calling. Brad Chapman concluded that 8-binning had little effect on variant calling accuracy. With Crumble, James Bonfield could get slightly higher accuracy with lossy compression. FermiKit effectively uses 2-binning and can achieve decent results. I applied 2-binning to GATK many years ago and observed that it barely reduced accuracy. The GATK team at the Broad Institute also evaluated 2-binning and 4-binning. They found that 4-binning was better than 2-binning and as good as the original quality. The overall message is that we don’t need full quality resolution to make accurate variant calls for germline samples. The effect on tumor samples is more of an open question, though.
It is worth noting that completely discarding base quality dramatically reduces variant calling accuracy. I have observed this both with FermiKit and with GATK (unfortunately, I didn’t keep the results). This is because low-quality Illumina sequencing errors are correlated: if one low-quality base is wrong, other low-quality bases tend to be wrong in the same way. Without base quality, variant callers wouldn’t be able to identify such recurrent errors.