Question: Genome assembly statistical tools
12 days ago by
margab (10) wrote:

Does anyone know of any tools for calculating assembly statistics such as N50, L50, assembly size, number of contigs/scaffolds, and GC%? Thanks in advance!
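
For reference, the metrics named in the question can also be computed directly. The following is a minimal Python sketch, not taken from any of the tools mentioned in this thread; the function names `read_fasta` and `assembly_stats` are made up for illustration. It assumes the common convention that N50 is the length of the shortest sequence in the smallest set of longest sequences covering at least half of the assembly, and that L50 is the number of sequences in that set. Some tools swap the two labels, so check the documentation of whichever one you use. GC% is computed here over A/C/G/T only, ignoring N.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def assembly_stats(seqs: Iterable[str]) -> Dict[str, float]:
    """Compute basic statistics for a collection of contigs or scaffolds."""
    seqs = [s.upper() for s in seqs]
    lengths = sorted((len(s) for s in seqs), reverse=True)
    total = sum(lengths)

    # Walk from the longest sequence down until half the assembly is covered.
    running, n50, l50 = 0, 0, 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, count
            break

    gc = sum(s.count("G") + s.count("C") for s in seqs)
    acgt = sum(s.count(base) for s in seqs for base in "ACGT")
    return {
        "num_sequences": len(lengths),
        "total_length": total,
        "n50": n50,
        "l50": l50,
        # Assumption: GC% is taken over unambiguous bases only.
        "gc_percent": 100.0 * gc / acgt if acgt else 0.0,
    }


if __name__ == "__main__":
    import sys

    # Usage: python assembly_stats.py your_assembly.fasta
    with open(sys.argv[1]) as handle:
        print(assembly_stats(seq for _, seq in read_fasta(handle)))
```

This reads the whole assembly into memory, which is fine for a quick check but not for very large genomes; the dedicated tools suggested in the answers handle that case and report many more metrics.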

statistics tools assembly • 178 views

You just have to Google it and you will find a number of options out there.

That is how I found QUAST.

Reply written 12 days ago by Nitin Narwade (420)
12 days ago by
Buffo (1.6k) wrote:

You can use Biopieces: read_fasta your_assembly.fasta | analyze_assembly -x

12 days ago by
Juke-34 (2.4k) wrote:

From the GAAS toolkit you can use a simple perl script:

12 days ago by
Nantes - France
erwan.scaon (720) wrote:

You should have a look at SQUAT.

12 days ago by
United States
biobiu (100) wrote:

Try CheckM. Note that its strain heterogeneity estimate can be a bit confusing (I find it counterintuitive).


When referring to a package, please provide a link to it. There can be multiple packages with similar names, which can lead to confusion.

Reply written 7 days ago by genomax (70k)
7 days ago by
Hannover Medical School
colindaven (1.6k) wrote:

I'm a big fan of the assembly stats tool from Brian Bushnell's BBTools package, installable via bioconda.

It is useful for long-read FASTQ length assessments and for assembly contig and scaffold stats.

Written by Brian Bushnell
Last modified December 7, 2017

Description:  Generates basic assembly statistics such as scaffold count, 
N50, L50, GC content, gap percent, etc.  For multiple files, please use  Works with fasta and fastq only (gzipped is fine).
Please read bbmap/docs/guides/StatsGuide.txt for more information.

Usage: in=<file>

in=file         Specify the input fasta file, or stdin.
gc=file         Writes ACGTN content per scaffold to a file.
gchist=file     Filename to output scaffold gc content histogram.
shist=file      Filename to output cumulative scaffold length histogram.
gcbins=200      Number of bins for gc histogram.
n=10            Number of contiguous Ns to signify a break between contigs.
k=13            Estimate memory usage of BBMap with this kmer length.
minscaf=0       Ignore scaffolds shorter than this.
phs=f           (printheaderstats) Set to true to print total size of headers.
n90=t           (printn90) Print the N/L90 metrics.
extended=f      Print additional metrics such as N/L90 and log sum.
logoffset=1000  Minimum length for calculating log sum.
logbase=2       Log base for calculating log sum.
pdl=f           (printduplicatelines) Set to true to print lines in the 
                scaffold size table where the counts did not change.
n_=t            This flag will prefix the terms 'contigs' and 'scaffolds'
                with 'n_' in formats 3-6.
addname=f       Adds a column for input file name, for formats 3-6.

format=<0-7>    Format of the stats information; default 1.
        format=0 prints no assembly stats.
        format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.
        format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.
        format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.
        format=4 is like 3 but with scaffold data only.
        format=5 is like 3 but with contig data only.
        format=6 is like 3 but the header starts with a #.
        format=7 is like 1 but only prints contig info.

gcformat=<0-4>  Select GC output format; default 1.
        gcformat=0:     (no base content info printed)
        gcformat=1:     name    length  A       C       G       T       N       GC
        gcformat=2:     name    GC
        gcformat=4:     name    length  GC
        Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

Please contact Brian Bushnell if you encounter any problems.
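
Two points in the help text above are easier to see with a small example: the `n=10` option (a run of at least 10 Ns marks a break between contigs) and the note that in gcformat 1, A+C+G+T=1 even when N is nonzero. The Python sketch below is only an illustration of those two ideas and is not BBTools code; the function names `split_contigs` and `base_fractions` are hypothetical. It also assumes that the N column is reported as a fraction of the full sequence length, which the help text does not state.

```python
import re
from typing import Dict, List


def split_contigs(scaffold: str, min_gap: int = 10) -> List[str]:
    """Split a scaffold into contigs at runs of at least min_gap Ns.

    Shorter runs of N are left inside the contig they occur in.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(scaffold) if piece]


def base_fractions(seq: str) -> Dict[str, float]:
    """Return A, C, G, T, N and GC fractions for one sequence.

    A, C, G and T are normalised by the number of unambiguous bases,
    so they sum to 1 regardless of how many Ns are present.
    """
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGTN"}
    acgt = sum(counts[base] for base in "ACGT")
    fractions = {
        base: (counts[base] / acgt if acgt else 0.0) for base in "ACGT"
    }
    # Assumption: N is expressed relative to the full sequence length.
    fractions["N"] = counts["N"] / len(seq) if seq else 0.0
    fractions["GC"] = fractions["G"] + fractions["C"]
    return fractions
```

For example, `split_contigs("ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "AT")` yields two contigs, because only the first run of Ns reaches the threshold. The contig counts and contig N50 a tool reports therefore depend on the gap threshold chosen.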