Question

Genome assembly statistical tools

0

Entering edit mode

4.7 years ago

margab ▴ 10

Does anyone know any tools for calculating assembly statistics such as N50, L50,assembly size, number of contigs/scaffolds and GC%? Thanks in advance!

Assembly statistics • 11k views

ADD COMMENT • link updated 10 days ago by skdv2522 • 0 • written 4.7 years ago by margab ▴ 10

2

Entering edit mode

You just have to google it and you will find the number of options out there.

Like I did and found quast.

ADD REPLY • link 4.7 years ago by Nitin Narwade ★ 1.6k

0

Entering edit mode

Can anyone suggest me which tool can be used to find the total gap percentage in assembled genome file?

ADD REPLY • link 10 days ago by skdv2522 • 0

score 4 · Answer 1 · 2019-08-08

From the GAAS toolkit you can use a simple perl script:

conda install -c bioconda gaas
gaas_fasta_statistics.pl -f genome.fa

You get that kind of output:

--------------------------------------------------------------------------------
|                  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa                 |
|                Analysis launched the 04/29/2020 at 09h32m36s                 |
|------------------------------------------------------------------------------|
| Nb of sequences                                         |          7         |
|------------------------------------------------------------------------------|
| Nb of sequences >1kb                                    |          7         |
|------------------------------------------------------------------------------|
| Nb of sequences >10kb                                   |          7         |
|------------------------------------------------------------------------------|
| Nb of nucleotides (counting Ns)                         |      119667750     |
|------------------------------------------------------------------------------|
| Nb of nucleotides U                                     |          0         |
|------------------------------------------------------------------------------|
| Nb of sequences with U nucleotides                      |          0         |
|------------------------------------------------------------------------------|
| Nb of IUPAC nucleotides                                 |        469         |
|------------------------------------------------------------------------------|
| Nb of sequences with IUPAC nucleotides                  |          4         |
|------------------------------------------------------------------------------|
| Nb of Ns                                                |       185738       |
|------------------------------------------------------------------------------|
| Nb of internal N-regions (possibly links between contigs)|        159        |
|------------------------------------------------------------------------------|
| Nb of long internal N-regions >10000                    |                    |
| /!\ This is problematic for Genemark                    |          4         |
|------------------------------------------------------------------------------|
| Nb of pure (only) N sequences                           |          0         |
|------------------------------------------------------------------------------|
| Nb of sequences that begin or end with Ns               |          3         |
|------------------------------------------------------------------------------|
| GC-content (%)                                          |        36.0        |
|------------------------------------------------------------------------------|
| GC-content not counting Ns(%)                           |        36.1        |
|------------------------------------------------------------------------------|
| Nb of sequences with lowercase nucleotides              |          0         |
|------------------------------------------------------------------------------|
| Nb of lowercase nucleotides                             |          0         |
|------------------------------------------------------------------------------|
| N50                                                     |      23459830      |
|------------------------------------------------------------------------------|
| L50                                                     |          3         |
|------------------------------------------------------------------------------|
| N90                                                     |      18585056      |
|------------------------------------------------------------------------------|
| L90                                                     |          5         |
|------------------------------------------------------------------------------|
This result is saved in the <result> directory along with plots in <pdf> format.

score 2 · Answer 2 · 2019-08-08

2

Entering edit mode

4.7 years ago

Buffo ★ 2.4k

You can use Biopieces: read_fasta your_assembly.fasta | analyze_assembly -x

ADD COMMENT • link 4.7 years ago by Buffo ★ 2.4k

score 1 · Answer 3 · 2019-08-08

1

Entering edit mode

4.7 years ago

erwan.scaon ▴ 930

You should have a look at SQUAT

ADD COMMENT • link 4.7 years ago by erwan.scaon ▴ 930

score 1 · Answer 4 · 2019-08-08

1

Entering edit mode

4.7 years ago

biobiu ▴ 150

Try CheckM. Notice that their strain heterogeneity estimation might be a bit confusing ( for me it is counter intuitive).

ADD COMMENT • link 4.7 years ago by biobiu ▴ 150

0

Entering edit mode

When referring to a package please provide a link for it. There can be multiple packages with similar names and that can lead to confusion.

ADD REPLY • link 4.7 years ago by GenoMax 141k

score 1 · Answer 5 · 2019-08-13

I'm a big fan of stats.sh which is from Brian Bushnells' bbtools package, installable via bioconda.

Useful for long read FASTQ length assessments and assembly contig and scaffold stats.

stats.sh

Written by Brian Bushnell
Last modified December 7, 2017

Description:  Generates basic assembly statistics such as scaffold count, 
N50, L50, GC content, gap percent, etc.  For multiple files, please use
statswrapper.sh.  Works with fasta and fastq only (gzipped is fine).
Please read bbmap/docs/guides/StatsGuide.txt for more information.

Usage:        stats.sh in=<file>

Parameters:
in=file         Specify the input fasta file, or stdin.
gc=file         Writes ACGTN content per scaffold to a file.
gchist=file     Filename to output scaffold gc content histogram.
shist=file      Filename to output cumulative scaffold length histogram.
gcbins=200      Number of bins for gc histogram.
n=10            Number of contiguous Ns to signify a break between contigs.
k=13            Estimate memory usage of BBMap with this kmer length.
minscaf=0       Ignore scaffolds shorter than this.
phs=f           (printheaderstats) Set to true to print total size of headers.
n90=t           (printn90) Print the N/L90 metrics.
extended=f      Print additional metrics such as N/L90 and log sum.
logoffset=1000  Minimum length for calculating log sum.
logbase=2       Log base for calculating log sum.
pdl=f           (printduplicatelines) Set to true to print lines in the 
                scaffold size table where the counts did not change.
n_=t            This flag will prefix the terms 'contigs' and 'scaffolds'
                with 'n_' in formats 3-6.
addname=f       Adds a column for input file name, for formats 3-6.

format=<0-7>    Format of the stats information; default 1.
        format=0 prints no assembly stats.
        format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.
        format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.
        format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.
        format=4 is like 3 but with scaffold data only.
        format=5 is like 3 but with contig data only.
        format=6 is like 3 but the header starts with a #.
        format=7 is like 1 but only prints contig info.

gcformat=<0-4>  Select GC output format; default 1.
        gcformat=0:     (no base content info printed)
        gcformat=1:     name    length  A       C       G       T       N       GC
        gcformat=2:     name    GC
        gcformat=4:     name    length  GC
        Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

score 0 · Answer 6 · 2023-06-01

I'll follow up what a few others have mentioned, but I like stats.sh within the BBTools package for raw assembly stats. I'll then polish my assembly and pare it down to the targeted contigs (eg. those annotated as bacteria/viruses/your genome of interest/etc.) then I like to use quast/metaquast to evaluate those contigs/bins/etc. If you have a mock control or reference sequence for evaluation, you can give that to quast and it will show your contig alignment, assembly %, and several other options, as well as let you compare across assemblies, which is invaluable for evaluating your assembly quality.

examples:

stats.sh from BBTools

stats.sh in=contigs.fasta out=assembly_stats.txt gchist=gchist.txt shist=shist.txt score=t

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2576  0.2355  0.2359  0.2709  0.0000  0.0000  0.0000  0.4714  0.1067

Main genome scaffold total:             14645
Main genome contig total:               14645
Main genome scaffold sequence total:    9.225 MB
Main genome contig sequence total:      9.225 MB        0.000% gap
Main genome scaffold N/L50:             2725/901
Main genome contig N/L50:               2725/901
Main genome scaffold N/L90:             9896/303
Main genome contig N/L90:               9896/303
Main genome assembly score:             8.662
Max scaffold length:                    83.033 KB
Max contig length:                      83.033 KB
Number of scaffolds > 50 KB:            2
% main genome in scaffolds > 50 KB:     1.62%


Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                 14,645          14,645       9,224,646       9,224,646   100.00%
    100                 14,645          14,645       9,224,646       9,224,646   100.00%
    250                 11,512          11,512       8,766,752       8,766,752   100.00%
    500                  5,983           5,983       6,788,048       6,788,048   100.00%
   1 KB                  2,336           2,336       4,246,150       4,246,150   100.00%
 2.5 KB                    282             282       1,268,180       1,268,180   100.00%
   5 KB                     39              39         467,306         467,306   100.00%
  10 KB                      9               9         277,465         277,465   100.00%
  25 KB                      3               3         185,822         185,822   100.00%
  50 KB                      2               2         149,603         149,603   100.00%

metaquest from quast

metaquast.py -o metaquast_sc_k121_norm_eval -r /Users/bbrow6/Desktop/InFANT/Viral_metagenome/assembly_eval/ATCC_MSA_viral_genomes/complete_fastas --gene-finding --min-identity 95 --threads 6 sc_k121_norm__OTUs_gt1000bps.fna

enter image description here