Assembly statistics: ABySS & BBMap
4.5 years ago
el97004 ▴ 80

Hi! I have noticed some differences between the assembly statistics reported by ABySS and by BBMap stats.sh, and was wondering if anyone knows why. For example, this is the output I get from ABySS:

n    n:500  L50  min    N80   N50    N20    E-size   max      sum  name
3854  1282  71  500    2119  17327  40129   23269   95498   4735231 unitigs.fa
3448   997  78  500   10022  27708  46504   31492   108468  6954249 contigs.fa
3367   945  70  500   12423  30013  61035   35301   108468  6952419 scaffolds.fa

If I take the scaffolds.fa file and run BBMap stats.sh on it (stats.sh in=scaffolds.fa), these are the resulting values:

Main genome scaffold total:             3367
Main genome contig total:               3391
Main genome scaffold sequence total:    7.679 MB
Main genome contig sequence total:      7.678 MB    0.020% gap
Main genome scaffold N/L50:             82/27.296 KB
Main genome contig N/L50:               88/24.476 KB
Main genome scaffold N/L90:             864/549
Main genome contig N/L90:               891/547
Max scaffold length:                    108.468 KB
Max contig length:                      108.468 KB
Number of scaffolds > 50 KB:            24
% main genome in scaffolds > 50 KB:     22.09%


Minimum     Number          Number          Total           Total           Scaffold
Scaffold    of              of              Scaffold        Contig          Contig  
Length      Scaffolds       Contigs         Length          Length          Coverage
--------    --------------  --------------  --------------  --------------  --------
    All              3,367           3,391       7,679,180       7,677,680    99.98%
    100              3,367           3,391       7,679,180       7,677,680    99.98%
    250              2,878           2,902       7,594,310       7,592,810    99.98%
    500                945             969       6,953,926       6,952,429    99.98%
   1 KB                415             439       6,574,059       6,572,562    99.98%
 2.5 KB                327             350       6,421,782       6,420,460    99.98%
   5 KB                250             270       6,154,098       6,153,098    99.98%
  10 KB                186             201       5,680,411       5,679,661    99.99%
  25 KB                 85              96       3,920,558       3,920,008    99.99%
  50 KB                 24              32       1,696,196       1,695,796    99.98%
 100 KB                  2               2         211,268         211,268   100.00%

As you can see, the contig and scaffold N50s/L50s are close but not identical. In addition, the total scaffold/contig lengths (for a minimum scaffold length of 500; ABySS uses a minimum of 500 bp) are close but not identical. Has anyone seen this before, and can anyone shed some light?

Thank you.

assembly abyss bbmap • 3.5k views

ABySS uses a specific approach to calculate those stats (as pointed out here already).

There are a few issues on this topic on the ABySS GitHub repo, e.g.:

When searching, use the term "abyss-fac", as this is the tool/step from the ABySS pipeline that does the actual calculations.


Keep in mind that, by default, BBMap stats.sh requires at least 10 consecutive Ns between two contigs to consider the sequence a scaffold. Also, stats.sh is likely considering all contigs/scaffolds when calculating N/L50. Usually, contigs shorter than 250 or 500 bp are removed from draft assemblies, and I think you should not consider them when calculating assembly statistics.
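To illustrate why the length cutoff matters, here is a minimal Python sketch of the 50% statistic computed with and without a minimum length. It is not the code used by abyss-fac or stats.sh; the function name `nx_stats` and the toy lengths are made up for illustration. It returns the length and the count separately, since the two tools label them differently (ABySS prints the length as N50 and the count as L50, while the stats.sh output above lists count/length under "N/L50").

```python
def nx_stats(lengths, min_len=0, fraction=0.5):
    """Return (length, count) at the given fraction of the total assembly size.

    Sequences shorter than min_len are discarded before anything is summed,
    so the cutoff changes both the total and the resulting statistic.
    """
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= total * fraction:
            return length, count
    return 0, 0  # nothing passed the cutoff


# Toy assembly (hypothetical lengths): three long sequences and ten short ones.
lengths = [1000, 800, 600] + [200] * 10

print(nx_stats(lengths))               # all sequences counted: (600, 3)
print(nx_stats(lengths, min_len=500))  # short ones dropped:    (800, 2)
```

The same set of sequences gives a different value once the short ones are excluded, which is why the two tools should be compared at the same cutoff.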


Thanks for your reply, alex.zaccaron. I think I know how to solve the second item you mention (I will filter for scaffolds > 500 bp and rerun BBMap stats.sh). For the first item, do you know what I should change this value to in BBMap stats.sh so that it matches ABySS?


You can change the parameter n in stats.sh to adjust the number of contiguous Ns required to consider the sequence a scaffold instead of a contig. For example, if you specify n=1, then stats.sh will "break" a contig at every single N. I am not sure what ABySS considers, but it has to be between 1 and 10. You could run stats.sh a few times with different values of n to see when it reports the same number of contigs as ABySS.
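If you want to see what such a threshold does to your own file, the idea can be sketched in a few lines of Python. This is only an approximation of the behaviour described above, not BBMap's actual implementation; `read_fasta`, `split_scaffold`, `count_contigs` and the `min_gap` argument are hypothetical names chosen for this example.

```python
import re


def read_fasta(lines):
    """Yield one sequence string per FASTA record from an iterable of lines."""
    chunks = []
    seen_header = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                yield "".join(chunks)
            chunks = []
            seen_header = True
        elif line:
            chunks.append(line)
    if seen_header:
        yield "".join(chunks)


def split_scaffold(seq, min_gap=10):
    """Split a scaffold into contigs at runs of at least min_gap Ns."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Runs shorter than min_gap stay inside the contig; empty pieces
    # (from leading or trailing gaps) are dropped.
    return [piece for piece in gap.split(seq) if piece]


def count_contigs(scaffolds, min_gap=10):
    """Count the contigs obtained by splitting every scaffold."""
    return sum(len(split_scaffold(seq, min_gap)) for seq in scaffolds)


# Toy scaffolds (made-up sequences): one gap of 10 Ns, one gap of 3 Ns.
toy = ["ACGT" + "N" * 10 + "GGCC", "AAAA" + "N" * 3 + "TTTT"]
for n in (1, 3, 10):
    print(n, count_contigs(toy, min_gap=n))
```

On a real file you could pass an open file handle, e.g. `count_contigs(read_fasta(open("scaffolds.fa")), min_gap=n)`, and compare the counts for n from 1 to 10.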


Thanks! I tried all values of n from 1 to 10 but unfortunately could not get the same number of contigs as ABySS (997); the closest I got was 970 at n=1.

Edit: I wonder if the contig statistics are off because I am using the scaffolds file as input to BBMap stats.sh. What if this is the only file that one has?
