Question: Genome assembly statistical tools
margab10 wrote:

Does anyone know of any tools for calculating assembly statistics such as N50, L50, assembly size, number of contigs/scaffolds, and GC%? Thanks in advance!

Tags: statistics, tools, assembly
modified 15 months ago by colindaven2.5k • written 16 months ago by margab10

A quick web search will turn up a number of options.

That is how I found QUAST.

written 16 months ago by Nitin Narwade450
Buffo1.8k wrote:

You can use Biopieces:

read_fasta -i your_assembly.fasta | analyze_assembly -x

written 16 months ago by Buffo1.8k
Juke344.9k wrote:

From the GAAS toolkit you can use a simple Perl script:

conda install -c bioconda gaas
gaas_fasta_statistics.pl -f genome.fa

You get output like this:

--------------------------------------------------------------------------------
|                  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa                 |
|                Analysis launched the 04/29/2020 at 09h32m36s                 |
|------------------------------------------------------------------------------|
| Nb of sequences                                         |          7         |
|------------------------------------------------------------------------------|
| Nb of sequences >1kb                                    |          7         |
|------------------------------------------------------------------------------|
| Nb of sequences >10kb                                   |          7         |
|------------------------------------------------------------------------------|
| Nb of nucleotides (counting Ns)                         |      119667750     |
|------------------------------------------------------------------------------|
| Nb of nucleotides U                                     |          0         |
|------------------------------------------------------------------------------|
| Nb of sequences with U nucleotides                      |          0         |
|------------------------------------------------------------------------------|
| Nb of IUPAC nucleotides                                 |        469         |
|------------------------------------------------------------------------------|
| Nb of sequences with IUPAC nucleotides                  |          4         |
|------------------------------------------------------------------------------|
| Nb of Ns                                                |       185738       |
|------------------------------------------------------------------------------|
| Nb of internal N-regions (possibly links between contigs)|        159        |
|------------------------------------------------------------------------------|
| Nb of long internal N-regions >10000                    |                    |
| /!\ This is problematic for Genemark                    |          4         |
|------------------------------------------------------------------------------|
| Nb of pure (only) N sequences                           |          0         |
|------------------------------------------------------------------------------|
| Nb of sequences that begin or end with Ns               |          3         |
|------------------------------------------------------------------------------|
| GC-content (%)                                          |        36.0        |
|------------------------------------------------------------------------------|
| GC-content not counting Ns(%)                           |        36.1        |
|------------------------------------------------------------------------------|
| Nb of sequences with lowercase nucleotides              |          0         |
|------------------------------------------------------------------------------|
| Nb of lowercase nucleotides                             |          0         |
|------------------------------------------------------------------------------|
| N50                                                     |      23459830      |
|------------------------------------------------------------------------------|
| L50                                                     |          3         |
|------------------------------------------------------------------------------|
| N90                                                     |      18585056      |
|------------------------------------------------------------------------------|
| L90                                                     |          5         |
|------------------------------------------------------------------------------|
This result is saved in the <result> directory along with plots in <pdf> format.
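
For anyone who wants to see what figures such as N50, L50, N90, L90 and GC% mean in a report like this, here is a minimal Python sketch using the usual definitions (sort sequences from longest to shortest and accumulate lengths until the running total reaches 50% or 90% of the assembly size). It is not how GAAS computes them internally; the function names read_fasta, nx_lx and assembly_stats are made up for this illustration, and only plain G/C/N characters are counted.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def nx_lx(lengths: List[int], percent: int) -> Tuple[int, int]:
    """Return (Nx, Lx) for the given percentage, e.g. percent=50 for N50/L50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count
    return 0, 0  # no sequences at all


def assembly_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Compute basic assembly statistics from FASTA-formatted lines."""
    lengths: List[int] = []
    gc = 0
    n_count = 0
    for _name, seq in read_fasta(lines):
        seq = seq.upper()
        lengths.append(len(seq))
        gc += seq.count("G") + seq.count("C")
        n_count += seq.count("N")
    total = sum(lengths)
    n50, l50 = nx_lx(lengths, 50)
    n90, l90 = nx_lx(lengths, 90)
    return {
        "n_sequences": len(lengths),
        "total_length": total,
        "gc_percent": 100.0 * gc / total if total else 0.0,
        "gc_percent_no_n": 100.0 * gc / (total - n_count) if total > n_count else 0.0,
        "n50": n50,
        "l50": l50,
        "n90": n90,
        "l90": l90,
    }


if __name__ == "__main__":
    # Tiny in-memory example; for a real file use: with open("genome.fa") as fh: ...
    demo = [">a", "GGGGGCCCCC", ">b", "ATATATAT", ">c", "ACGTNN", ">d", "AAAA", ">e", "TT"]
    for key, value in assembly_stats(demo).items():
        print(key, value)
```

This reads the whole assembly into memory one sequence at a time, which is fine for a quick sanity check but no substitute for the dedicated tools mentioned in this thread.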
modified 7 months ago • written 16 months ago by Juke344.9k
erwan.scaon810 wrote:

You should have a look at SQUAT.

written 16 months ago by erwan.scaon810
biobiu120 wrote:

Try CheckM. Note that its strain heterogeneity estimate can be a bit confusing (I find it counterintuitive).

written 16 months ago by biobiu120

When referring to a package please provide a link for it. There can be multiple packages with similar names and that can lead to confusion.

written 15 months ago by GenoMax92k
colindaven2.5k wrote:

I'm a big fan of stats.sh, which is part of Brian Bushnell's BBTools package and is installable via bioconda.

Useful for long read FASTQ length assessments and assembly contig and scaffold stats.

stats.sh

Written by Brian Bushnell
Last modified December 7, 2017

Description:  Generates basic assembly statistics such as scaffold count, 
N50, L50, GC content, gap percent, etc.  For multiple files, please use
statswrapper.sh.  Works with fasta and fastq only (gzipped is fine).
Please read bbmap/docs/guides/StatsGuide.txt for more information.

Usage:        stats.sh in=<file>

Parameters:
in=file         Specify the input fasta file, or stdin.
gc=file         Writes ACGTN content per scaffold to a file.
gchist=file     Filename to output scaffold gc content histogram.
shist=file      Filename to output cumulative scaffold length histogram.
gcbins=200      Number of bins for gc histogram.
n=10            Number of contiguous Ns to signify a break between contigs.
k=13            Estimate memory usage of BBMap with this kmer length.
minscaf=0       Ignore scaffolds shorter than this.
phs=f           (printheaderstats) Set to true to print total size of headers.
n90=t           (printn90) Print the N/L90 metrics.
extended=f      Print additional metrics such as N/L90 and log sum.
logoffset=1000  Minimum length for calculating log sum.
logbase=2       Log base for calculating log sum.
pdl=f           (printduplicatelines) Set to true to print lines in the 
                scaffold size table where the counts did not change.
n_=t            This flag will prefix the terms 'contigs' and 'scaffolds'
                with 'n_' in formats 3-6.
addname=f       Adds a column for input file name, for formats 3-6.

format=<0-7>    Format of the stats information; default 1.
        format=0 prints no assembly stats.
        format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.
        format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.
        format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.
        format=4 is like 3 but with scaffold data only.
        format=5 is like 3 but with contig data only.
        format=6 is like 3 but the header starts with a #.
        format=7 is like 1 but only prints contig info.

gcformat=<0-4>  Select GC output format; default 1.
        gcformat=0:     (no base content info printed)
        gcformat=1:     name    length  A       C       G       T       N       GC
        gcformat=2:     name    GC
        gcformat=4:     name    length  GC
        Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
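
The n=10 parameter in the help text above is what separates contig statistics from scaffold statistics: a run of at least that many Ns inside a scaffold is treated as a gap between two contigs. As a rough Python sketch of that idea (an illustration only, not BBTools' actual implementation; split_scaffold is a made-up name, and how stats.sh handles N runs shorter than the threshold is an assumption here, namely that they stay inside the contig):

```python
import re
from typing import List


def split_scaffold(sequence: str, min_gap: int = 10) -> List[str]:
    """Split a scaffold into contigs at runs of at least `min_gap` Ns.

    Shorter N runs are left inside the contig (an assumption of this sketch).
    Empty pieces from leading or trailing gaps are dropped.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    return [piece for piece in gap.split(sequence) if piece]


if __name__ == "__main__":
    scaffold = "ACGT" + "N" * 10 + "GGCC" + "N" * 3 + "TT"
    contigs = split_scaffold(scaffold)            # default threshold of 10 Ns
    print([len(c) for c in contigs])
    print(split_scaffold(scaffold, min_gap=3))    # stricter threshold splits more
```

Feeding the resulting contig lengths into an N50 calculation instead of the scaffold lengths gives contig-level figures, which is why the two sets of numbers can differ a lot for heavily scaffolded assemblies.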
written 15 months ago by colindaven2.5k