Question: Genome assembly statistical tools
8 months ago, margab (10) wrote:

Does anyone know of any tools for calculating assembly statistics such as N50, L50, assembly size, number of contigs/scaffolds, and GC%? Thanks in advance!


A quick Google search turns up a number of options.

That is how I found QUAST.

Reply written 8 months ago by Nitin Narwade (440)

8 months ago, Buffo (1.8k) wrote:

You can use Biopieces: read_fasta -i your_assembly.fasta | analyze_assembly -x


8 months ago, Juke-34 (3.7k) wrote:

From the GAAS toolkit you can use a simple Perl script: -f genome.fa

The output looks like this:

There are 1879 sequences
There are 980 sequences > 10kb 
There are 1877 sequences > 1kb 
There are 78770088 nucleotides, of which 11276 are Ns
There are 2196 N-regions (possibly links between contigs)
There are 0 pure (only) N sequences. Assembler doing that must be notified ! 
There are 0 sequences that begin or end with Ns (see problem_sequences.txt)
The GC-content is 31.4% (not counting Ns 31.4%)
There are 1878 sequences with lowercase nucleotides (Ns not considered)
There are 26012854 lowercase nucleotides (Ns not considered)
The N50 is 121247
The N90 is 25683
The N50 for sequences over 1000bp is 121247
The N50 for sequeces over 10000bp is 128954
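
For readers wondering how figures such as the N50 and N90 above are derived, here is a minimal Python sketch. It is not the GAAS implementation; the function names `fasta_lengths` and `nx_lx` are made up for illustration. It assumes the common definition: sort sequences from longest to shortest and report the length (Nx) and count (Lx) at which the running total first reaches x% of the assembly size. Tools can differ slightly in edge cases, so small discrepancies between programs are possible.

```python
def fasta_lengths(lines):
    """Return the length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def nx_lx(lengths, percent=50):
    """Return (Nx, Lx) for a list of sequence lengths.

    Nx is the length of the sequence at which the cumulative sum of
    lengths (longest first) reaches `percent` of the total; Lx is how
    many sequences were needed to get there.
    """
    if not lengths:
        raise ValueError("no sequences given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count
    return ordered[-1], len(ordered)
```

To try it on a file, something like `nx_lx(fasta_lengths(open("genome.fa")))` would give the N50 and L50, and passing `percent=90` would give N90 and L90.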

8 months ago, erwan.scaon (750), Nantes - France, wrote:

You should have a look at SQUAT

8 months ago, biobiu (110), United States, wrote:

Try CheckM. Note that its strain heterogeneity estimate can be a bit confusing (I find it counterintuitive).


When referring to a package please provide a link for it. There can be multiple packages with similar names and that can lead to confusion.

Reply written 8 months ago by genomax (80k)

8 months ago, colindaven (2.1k), Hannover Medical School, wrote:

I'm a big fan of which is from Brian Bushnell's BBTools package, installable via bioconda.

It is useful for assessing long-read FASTQ lengths and for contig and scaffold stats on assemblies.

Written by Brian Bushnell
Last modified December 7, 2017

Description:  Generates basic assembly statistics such as scaffold count, 
N50, L50, GC content, gap percent, etc.  For multiple files, please use  Works with fasta and fastq only (gzipped is fine).
Please read bbmap/docs/guides/StatsGuide.txt for more information.

Usage: in=<file>

in=file         Specify the input fasta file, or stdin.
gc=file         Writes ACGTN content per scaffold to a file.
gchist=file     Filename to output scaffold gc content histogram.
shist=file      Filename to output cumulative scaffold length histogram.
gcbins=200      Number of bins for gc histogram.
n=10            Number of contiguous Ns to signify a break between contigs.
k=13            Estimate memory usage of BBMap with this kmer length.
minscaf=0       Ignore scaffolds shorter than this.
phs=f           (printheaderstats) Set to true to print total size of headers.
n90=t           (printn90) Print the N/L90 metrics.
extended=f      Print additional metrics such as N/L90 and log sum.
logoffset=1000  Minimum length for calculating log sum.
logbase=2       Log base for calculating log sum.
pdl=f           (printduplicatelines) Set to true to print lines in the 
                scaffold size table where the counts did not change.
n_=t            This flag will prefix the terms 'contigs' and 'scaffolds'
                with 'n_' in formats 3-6.
addname=f       Adds a column for input file name, for formats 3-6.

format=<0-7>    Format of the stats information; default 1.
        format=0 prints no assembly stats.
        format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.
        format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.
        format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.
        format=4 is like 3 but with scaffold data only.
        format=5 is like 3 but with contig data only.
        format=6 is like 3 but the header starts with a #.
        format=7 is like 1 but only prints contig info.

gcformat=<0-4>  Select GC output format; default 1.
        gcformat=0:     (no base content info printed)
        gcformat=1:     name    length  A       C       G       T       N       GC
        gcformat=2:     name    GC
        gcformat=4:     name    length  GC
        Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

Please contact Brian Bushnell at if you encounter any problems.
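
To make two of the options above more concrete, here is a rough Python sketch of the ideas behind `n=10` (a run of at least that many Ns separates contigs within a scaffold) and behind GC content reported over A, C, G, and T only. This is an illustration under those assumptions, not BBTools code; `split_contigs` and `gc_fraction` are hypothetical names, and the real tool may treat edge cases differently.

```python
import re


def split_contigs(scaffold, min_gap=10):
    """Split a scaffold into contigs at runs of at least `min_gap` Ns.

    Shorter runs of N stay inside the contig they occur in.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces produced by gaps at the very start or end.
    return [piece for piece in gap.split(scaffold) if piece]


def gc_fraction(seq):
    """Return GC content as a fraction of A+C+G+T, ignoring Ns."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt
```

With the default `min_gap=10`, a scaffold containing one run of ten Ns and one run of three Ns would be split into two contigs, and the three Ns would remain inside the second one.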