understanding contig statistics
Entering edit mode
4.6 years ago
mary • 0

I am trying to assemble a bacterial isolate genome from illumina miseq reads. the propose is to go forward for closing the complete genome and use it for comparative genomics (for now, I just have draft genome in 100 contigs). I have gone trough each step of read quality trimming, de novo assembly and scaffolding and then mapped the reads to these scaffolds. then I checked the completeness and contamination of the genome.

I have many problems understanding the basics. anybody can help me through these questions?

1- I get around 115 contigs larger than 500bp, each one with a different coverage, ranging from 4 to 180! basically I know that low coverage means it is not reliable or high coverage ones might be of plasmid origin. but how much coverage is the minimum acceptable or how much coverage is considered as high in my case?

p.s: I have checked the contigs against plasmid databases but no plasmid gene or plasmid ori seqs was detected. same for bacteriophages

2-when I assemble using Spades, a coverage is reported for each contig. when I map reads back to these contigs using qualimap, another coverage is reported by qualimap which is very different from Spades report. e. g. for contig number 99 (the second one in linked image), spades reports a 27x coverage where qualimap reports 75! what's the reason they are so different and which one should I rely on?

3- when I check the mapped contigs, in most cases, both ends of each contig has much lower coverage than the mid part and sometimes it is just mapped by one or few reads. are these contig ends reliable or should I omit (trim) these ends?

coverage, standard deviation and length of contigs as reported by Spades and qualimap

thank you in advance

Assembly coverage spades qualimap mapping reads • 3.7k views
Entering edit mode
4.6 years ago

Mapping and assembly give different values for coverage. Generally, mapping is a more accurate way of evaluating coverage. The coverage reported by Spades is probably the kmer coverage, which is both lower than the read coverage, and also calculated only from perfect matches.

I do not recommend trimming the ends of contigs; there's generally no reason to do so and it will just make your assembly more incomplete. The reason ends have low coverage is probably because the coverage became too low there to continue assembly - e.g., once coverage drops to zero, you can't assemble any more.

There's no specific level of coverage that is "acceptable" in general, but 4 is too low for a good assembly. And for a microbe, max coverage of 180 is also pretty low; I try to target at least 100x average (that's 100x read coverage, not kmer coverage; I'm not sure which one you were describing). Low coverage gives you more assembly errors and lower contiguity.

You are almost certainly not going to create a closed bacterial assembly with a single relatively low-coverage Illumina library; if you want to do so, it's best to run a single PacBio smart cell. You may be able to improve your current assembly - I think we typically end up with Spades/Illumina bacterial assemblies under 50 contigs - but especially with such high coverage variability, it will be difficult, and it also depends on the repeat and GC content of the organism.

Often, you can improve assemblies by error-correcting and merging paired reads. For example, with the BBMap package, I might preprocess and assemble reads like this:

#Trim adapters
bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq ktrim=r k=23 mink=11 hdist=1 ref=/bbmap/resources/adapters.fa tbo tpe

#Remove contaminants
bbduk.sh in=trimmed.fq out=filtered.fq k=31 ref=/bbmap/resources/sequencing_artifacts.fa.gz,/bbmap/resources/phix174_ill.ref.fa.gz

#Error-correct 1
bbmerge.sh in=filtered.fq out=ecco.fq ecco mix vstrict adapters=default

#Error-correct 2
clumpify.sh in=ecco.fq out=eccc.fq ecc passes=6

#Error-correct 3
tadpole.sh in=eccc.fq out=ecct.fq ecc k=62

bbmerge.sh in=eccc.fq out=merged.fq outu=unmerged.fq rem k=62 extend2=50 adapters=default

spades.py -k 21,41,71,101,127 -o out -12 unmerged.fq -s merged.fq

stats.sh in=/out/scaffolds.fasta

#Quantify coverage
bbmap.sh ref=/out/scaffolds.fasta in=trimmed.fq out=mapped.sam covstats=covstats.txt
Entering edit mode

Hi Brian

Would you advise a different quality filtering/trimming pipeline for going into taxonomic profiling instead of assembly?

Entering edit mode

What do you mean by taxonomic profiling? Is this a 16S metagenomic analysis, or a shotgun metagenome, for example?

Entering edit mode

Hi, I'm starting with genome assemblies and I think my situation is pretty similar to the one of mery. I'm assembling some genomes with an expected length of around 4Mb and after trimming with bbduk and getting nice Fastqc reports, I run spades but I get about 150 contigs. The last of them are quite short (less than 1000 bp), and, if I remove all these short contigs I end up with a 84 contigs assembly. I don't know if this approach is correct or if I should try any other thing.

When following the instructions you gave above, I get 177 contigs, once again if I remove all the small contigs below 1000 bp, I finish with a 85 contigs assembly. Do you know any other pipeline or an additional step to improve the assembly?



Login before adding your answer.

Traffic: 2475 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6