Genome assembly statistical tools
6
0
Entering edit mode
4.1 years ago
margab ▴ 10

Does anyone know any tools for calculating assembly statistics such as N50, L50,assembly size, number of contigs/scaffolds and GC%? Thanks in advance!

Assembly statistics • 9.4k views
2
Entering edit mode

You just have to google it and you will find the number of options out there.

Like I did and found quast.

4
Entering edit mode
4.1 years ago
Juke34 8.2k

From the GAAS toolkit you can use a simple perl script:

conda install -c bioconda gaas
gaas_fasta_statistics.pl -f genome.fa


You get that kind of output:

--------------------------------------------------------------------------------
|                  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa                 |
|                Analysis launched the 04/29/2020 at 09h32m36s                 |
|------------------------------------------------------------------------------|
| Nb of sequences                                         |          7         |
|------------------------------------------------------------------------------|
| Nb of sequences >1kb                                    |          7         |
|------------------------------------------------------------------------------|
| Nb of sequences >10kb                                   |          7         |
|------------------------------------------------------------------------------|
| Nb of nucleotides (counting Ns)                         |      119667750     |
|------------------------------------------------------------------------------|
| Nb of nucleotides U                                     |          0         |
|------------------------------------------------------------------------------|
| Nb of sequences with U nucleotides                      |          0         |
|------------------------------------------------------------------------------|
| Nb of IUPAC nucleotides                                 |        469         |
|------------------------------------------------------------------------------|
| Nb of sequences with IUPAC nucleotides                  |          4         |
|------------------------------------------------------------------------------|
| Nb of Ns                                                |       185738       |
|------------------------------------------------------------------------------|
| Nb of internal N-regions (possibly links between contigs)|        159        |
|------------------------------------------------------------------------------|
| Nb of long internal N-regions >10000                    |                    |
| /!\ This is problematic for Genemark                    |          4         |
|------------------------------------------------------------------------------|
| Nb of pure (only) N sequences                           |          0         |
|------------------------------------------------------------------------------|
| Nb of sequences that begin or end with Ns               |          3         |
|------------------------------------------------------------------------------|
| GC-content (%)                                          |        36.0        |
|------------------------------------------------------------------------------|
| GC-content not counting Ns(%)                           |        36.1        |
|------------------------------------------------------------------------------|
| Nb of sequences with lowercase nucleotides              |          0         |
|------------------------------------------------------------------------------|
| Nb of lowercase nucleotides                             |          0         |
|------------------------------------------------------------------------------|
| N50                                                     |      23459830      |
|------------------------------------------------------------------------------|
| L50                                                     |          3         |
|------------------------------------------------------------------------------|
| N90                                                     |      18585056      |
|------------------------------------------------------------------------------|
| L90                                                     |          5         |
|------------------------------------------------------------------------------|
This result is saved in the <result> directory along with plots in <pdf> format.

2
Entering edit mode
4.1 years ago
Buffo ★ 2.3k

You can use Biopieces: read_fasta your_assembly.fasta | analyze_assembly -x

1
Entering edit mode
4.1 years ago
erwan.scaon ▴ 920

You should have a look at SQUAT

1
Entering edit mode
4.1 years ago
biobiu ▴ 140

Try CheckM. Notice that their strain heterogeneity estimation might be a bit confusing ( for me it is counter intuitive).

0
Entering edit mode

When referring to a package please provide a link for it. There can be multiple packages with similar names and that can lead to confusion.

1
Entering edit mode
4.1 years ago

I'm a big fan of stats.sh which is from Brian Bushnells' bbtools package, installable via bioconda.

Useful for long read FASTQ length assessments and assembly contig and scaffold stats.

stats.sh

Written by Brian Bushnell

Description:  Generates basic assembly statistics such as scaffold count,
N50, L50, GC content, gap percent, etc.  For multiple files, please use
statswrapper.sh.  Works with fasta and fastq only (gzipped is fine).

Usage:        stats.sh in=<file>

Parameters:
in=file         Specify the input fasta file, or stdin.
gc=file         Writes ACGTN content per scaffold to a file.
gchist=file     Filename to output scaffold gc content histogram.
shist=file      Filename to output cumulative scaffold length histogram.
gcbins=200      Number of bins for gc histogram.
n=10            Number of contiguous Ns to signify a break between contigs.
k=13            Estimate memory usage of BBMap with this kmer length.
minscaf=0       Ignore scaffolds shorter than this.
n90=t           (printn90) Print the N/L90 metrics.
extended=f      Print additional metrics such as N/L90 and log sum.
logoffset=1000  Minimum length for calculating log sum.
logbase=2       Log base for calculating log sum.
pdl=f           (printduplicatelines) Set to true to print lines in the
scaffold size table where the counts did not change.
n_=t            This flag will prefix the terms 'contigs' and 'scaffolds'
with 'n_' in formats 3-6.

format=<0-7>    Format of the stats information; default 1.
format=0 prints no assembly stats.
format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.
format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.
format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.
format=4 is like 3 but with scaffold data only.
format=5 is like 3 but with contig data only.
format=6 is like 3 but the header starts with a #.
format=7 is like 1 but only prints contig info.

gcformat=<0-4>  Select GC output format; default 1.
gcformat=0:     (no base content info printed)
gcformat=1:     name    length  A       C       G       T       N       GC
gcformat=2:     name    GC
gcformat=4:     name    length  GC
Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.


0
Entering edit mode
3 months ago
Bryan ▴ 10

I'll follow up what a few others have mentioned, but I like stats.sh within the BBTools package for raw assembly stats. I'll then polish my assembly and pare it down to the targeted contigs (eg. those annotated as bacteria/viruses/your genome of interest/etc.) then I like to use quast/metaquast to evaluate those contigs/bins/etc. If you have a mock control or reference sequence for evaluation, you can give that to quast and it will show your contig alignment, assembly %, and several other options, as well as let you compare across assemblies, which is invaluable for evaluating your assembly quality.

examples:

stats.sh from BBTools

stats.sh in=contigs.fasta out=assembly_stats.txt gchist=gchist.txt shist=shist.txt score=t

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2576  0.2355  0.2359  0.2709  0.0000  0.0000  0.0000  0.4714  0.1067

Main genome scaffold total:             14645
Main genome contig total:               14645
Main genome scaffold sequence total:    9.225 MB
Main genome contig sequence total:      9.225 MB        0.000% gap
Main genome scaffold N/L50:             2725/901
Main genome contig N/L50:               2725/901
Main genome scaffold N/L90:             9896/303
Main genome contig N/L90:               9896/303
Main genome assembly score:             8.662
Max scaffold length:                    83.033 KB
Max contig length:                      83.033 KB
Number of scaffolds > 50 KB:            2
% main genome in scaffolds > 50 KB:     1.62%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
All                 14,645          14,645       9,224,646       9,224,646   100.00%
100                 14,645          14,645       9,224,646       9,224,646   100.00%
250                 11,512          11,512       8,766,752       8,766,752   100.00%
500                  5,983           5,983       6,788,048       6,788,048   100.00%
1 KB                  2,336           2,336       4,246,150       4,246,150   100.00%
2.5 KB                    282             282       1,268,180       1,268,180   100.00%
5 KB                     39              39         467,306         467,306   100.00%
10 KB                      9               9         277,465         277,465   100.00%
25 KB                      3               3         185,822         185,822   100.00%
50 KB                      2               2         149,603         149,603   100.00%


metaquest from quast

metaquast.py -o metaquast_sc_k121_norm_eval -r /Users/bbrow6/Desktop/InFANT/Viral_metagenome/assembly_eval/ATCC_MSA_viral_genomes/complete_fastas --gene-finding --min-identity 95 --threads 6 sc_k121_norm__OTUs_gt1000bps.fna