

auN: a new metric to measure assembly contiguity
source link: http://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity
Given a de novo assembly, we often measure the “average” contig length by N50. N50 is neither the real average nor the median. It is the length of the contig such that this and longer contigs cover at least 50% of the assembly. A longer N50 indicates better contiguity. We can similarly define Nx such that contigs no shorter than Nx cover x% of the assembly. The Nx curve plots Nx as a function of x, where x ranges from 0 to 100.
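As a minimal sketch of the definition above, Nx can be computed by sorting contigs from longest to shortest and walking down the list until the running total reaches x% of the assembly. The function name `nx` and the contig lengths are made up for illustration; they are not from any particular tool.

```python
def nx(lengths, x=50):
    """Return Nx: the length of the contig such that this contig and all
    longer ones together cover at least x% of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one contig")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison, equivalent to running / total >= x / 100.
        if running * 100 >= total * x:
            return length
    return min(lengths)  # only reached if x > 100


# Hypothetical contig lengths (arbitrary units), total = 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(nx(contigs, 50))  # 80 + 70 = 150 reaches half of 300, so N50 = 70
print(nx(contigs, 90))  # the five longest contigs sum to 270, so N90 = 30
```

Evaluating `nx` for every x from 0 to 100 gives the points of the Nx curve for this toy assembly.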
In my opinion, there are two problems with N50. First, N50 is not continuous: a small change to an assembly can make it jump. For a good human assembly, contigs of lengths around N50 can differ by several megabases in length. Discarding tiny contigs may lead to a big jump in N50. Relatedly, between two assemblies, a more contiguous assembly might happen to have a smaller N50 just by chance. Second, N50 may not reflect some improvements to the assembly. If we connect two contigs longer than N50 or connect two contigs shorter than N50, N50 is not changed; N50 is only improved if we connect a contig shorter than N50 and a contig longer than N50. If we assembler developers solely target N50, we may be misled by it.
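The second problem can be seen on a toy example. The sketch below uses made-up contig lengths and two hypothetical helpers, `n50` and `merge`, to join pairs of contigs and recompute N50. It illustrates the three cases on one small input and is not a general proof.

```python
def n50(lengths):
    """Length of the contig at which the running total of the
    longest-first contigs first reaches half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def merge(lengths, i, j):
    """Return a new list in which the contigs at indices i and j are joined."""
    rest = [l for k, l in enumerate(lengths) if k not in (i, j)]
    return rest + [lengths[i] + lengths[j]]


# Hypothetical contig lengths, total = 400, N50 = 60.
contigs = [100, 90, 60, 50, 40, 30, 20, 10]

print(n50(contigs))               # 60
print(n50(merge(contigs, 0, 1)))  # join 100 + 90, both longer than N50: still 60
print(n50(merge(contigs, 6, 7)))  # join 20 + 10, both shorter than N50: still 60
print(n50(merge(contigs, 0, 7)))  # join 100 + 10, one on each side: 90
```

In this example, two of the three joins improve the assembly without moving N50 at all.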
Here is an idea about how to overcome the two issues. N50 is a single point on the Nx curve. The entire Nx curve in fact gives us a better sense of contiguity. The following figure from a NIBS workshop shows a good example:
Notably, the NG50 (similar to N50) of several assemblers/settings is about the same, around 300 kb, but it is clear the black curve achieves better contiguity: a single contig on that curve covers more than 40% of the assembly. Intuitively, a better Nx curve is “higher”, or has a larger area under the curve. Then we can take the area under the curve, abbreviated as “auN”, as a measurement of contiguity. The formula to calculate the area is:
$$
{\rm auN} = \sum_i L_i \cdot \frac{L_i}{\sum_j L_j} = \frac{\sum_i L_i^2}{\sum_j L_j}
$$
where $L_i$ is the length of contig $i$. Although auN is inspired by the Nx curve, its calculation doesn’t actually require sorting contigs by their lengths, so it is easier to calculate in practice. For multiple human assemblies I have at hand, auN falls between N50 and N40, though this observation doesn’t hold for other assemblies in general.
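A minimal sketch of the calculation, assuming auN is the sum of squared contig lengths divided by the total length, which is the area under the Nx step curve when x is taken as a fraction between 0 and 1 rather than a percentage. The function names and contig lengths are hypothetical. `aun_from_curve` adds up the rectangles of the step curve explicitly, only to show that both routes give the same number.

```python
def aun(lengths):
    """auN as sum(L_i^2) / sum(L_i); no sorting required."""
    total = sum(lengths)
    return sum(l * l for l in lengths) / total


def aun_from_curve(lengths):
    """Area under the Nx step curve, with x as a fraction in [0, 1].
    Each contig contributes a rectangle of height L_i and width L_i / total."""
    total = sum(lengths)
    area = 0.0
    for l in sorted(lengths, reverse=True):
        area += l * (l / total)
    return area


# Hypothetical contig lengths, total = 300.
contigs = [80, 70, 50, 40, 30, 20, 10]
print(aun(contigs))             # 16800 / 300 = 56.0
print(aun_from_curve(contigs))  # same value up to floating-point rounding
```

Because the result is a plain sum over contigs, it can be accumulated in a single pass while reading contig lengths, in any order.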
auN doesn’t have the two problems with N50. It is more stable and less affected by big jumps in contig lengths. It considers the entire Nx curve. Connecting two contigs of any lengths will always lead to a longer auN. If we want to summarize contig contiguity with a single number, auN is a better choice than N50. Similarly, we can define auNG and auNGA. I don’t think auN will be widely used given the inertia behind N50, but it is fun to ponder new metrics anyway.
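The claim that joining any two contigs increases auN follows from a short calculation. As a sketch, assume auN is the sum of squared contig lengths divided by the total length $S$, and that joining contigs of lengths $a$ and $b$ yields one contig of length $a+b$, so $S$ is unchanged (the symbols $a$, $b$ and $S$ are introduced here for illustration):

```latex
\begin{aligned}
{\rm auN}_{\text{after}} - {\rm auN}_{\text{before}}
  &= \frac{(a+b)^2 - a^2 - b^2}{S} \\
  &= \frac{2ab}{S} > 0 \qquad \text{for } a, b > 0 .
\end{aligned}
```

The gain $2ab/S$ depends only on the two lengths being joined, not on where they sit relative to N50, which is why auN registers the improvements that N50 can miss.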
Update: Gregory Concepcion pointed out that the GAGE benchmark was using the same metric to evaluate assemblies, though the authors were not interpreting it as the area under the Nx curve. Ivan Sovic and Jens-Uwe Ulrich have independently come up with auN as well.