Question: Average coverage of de novo SPAdes assembly deviates a lot from theretical coverage
0
gravatar for ALchEmiXt
3.5 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

Usually when we assemble PE (100, 150, 250, 300) fastq Illumina data we see that roughly the expected theoretical coverage is close to the actual coverage over de novo assembled contigs using for instance SPAdes.

However I currently have an older dataset PE 76->51 bases with a theoretical coverage of over 200 fold. However when I inspect the assembly the average coverage is only ~5 fold....

I checked;

  1. reads are of top quality (1.9 encoding) all far above Q36 (according to fastQC)
  2. used custom kmer settings for SPAdes in the range 21 -> 49 in PE mode
  3. SUM length of assembled contigs is approximately the expected genome size ~900k
  4. There are some repeats with high (but not extreme high) coverage.

For a contaminant I would expect to find many more small low coverage contigs.... as we usually for other species do...

Is the high AT% of this species disregulating SPAdes?

Any pointers where to look at are welcomed!

 

 

ADD COMMENTlink modified 9 months ago by Biostar ♦♦ 20 • written 3.5 years ago by ALchEmiXt1.9k

What does the k-mer spectra look like? Do you see an extremely high peak at low coverage? 

ADD REPLYlink written 3.5 years ago by Damian Kao15k

Is your data set heterozygous/from pooled samples...? And how high is AT rich?

ADD REPLYlink written 3.5 years ago by thackl2.6k
2
gravatar for ALchEmiXt
3.4 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

To conclude (for those searching for similar questions/answers):

1) Indeed the k-mer coverage is reported. Please give points to the respective comments.

2) Indeed the higher k-mers did not significantly improve the assembly. If contig length is a parameter for quality the latest SPAdes 4.6.1 did a better job then 4.6.0.

3) The AT richness (70%) did not influence this.

Thanks for solving this basic question.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by ALchEmiXt1.9k
1

Are you sure about SPAdes 4.6.2 ? According to their website as it opens up in my browser, their latest version is 3.6.1? Thanks for reporting back, will be very helpful to others!

ADD REPLYlink written 3.4 years ago by Adrian Pelin2.2k

oepsie. Corrected to 3.6.1

ADD REPLYlink written 3.4 years ago by ALchEmiXt1.9k

Please note: This answer is just to have the question answered. Not to score points. Please give points to the respective comments pointing out the summarised answer.

ADD REPLYlink written 3.4 years ago by ALchEmiXt1.9k
2
gravatar for thackl
3.5 years ago by
thackl2.6k
MIT
thackl2.6k wrote:

My guess, it is related to your very short reads and the fact that SPAdes runs with increasing kmer size. Coverage is probably estimated in the final iteration using k55 or larger and your reads each only contain a few good 55-mers.

EDIT: Just saw that you used 49mers - then this answer probably does not make sense..

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by thackl2.6k
3

This answer does make sense. I think the OP is confusing Nucleotide Coverage, which computes average coverage for each base pair with k-mer coverage. The latter is what is reported by SPAdes based on the last k chosen, in this case 49. k-mer coverage is always smaller than nucleotide coverage. This is from Velvet, but it also applies to SPAdes:

All coverage values in Velvet are provided in k-mer coverage, i.e. how many times has a k-mer been seen among the reads. The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is Ck = C * (L - k + 1) / L where k is your hash length, and L you read length.

ADD REPLYlink written 3.5 years ago by Adrian Pelin2.2k
1

Yeah, I had the same formula in mind. However, with 76 bp reads and 49-mers, you still would expect a lot more then 5X, even if you account for a large portion of reads having bad tails... It probably is contributing factor though. A good way to check probably would be to compute a kmer-profile with 49mers..

read length 76: 200 * (76-49+1) / 76 = 74
read length 55: 200 * (55-49+1) / 55 = 25

 

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by thackl2.6k

Good point. I should have thought about that. It would fit for these cases since it would give me roughly 5 fold coverage x 49 kmer ~ 250x... the expected coverage. Or would that be the wrong way to calculate it? My trimmed corrected reads in SPAdes are mostly 50 bases and my highest k-mer 49.

Strange I didn't notice this with other datasets before. Maybe because coverage was generally very high so this was not a critical point in absolute measures.

Is there anywhere a description of this value in the header? I searched the website and looked through the original paper: http://online.liebertpub.com/doi/pdf/10.1089/cmb.2012.0021 but no luck

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by ALchEmiXt1.9k

If you average is 50 bp after trimming, that means that a significant portion of your reads would be discarded when k=49, as they would be smaller than 49bp. I recommend a smaller final k-mer, perhaps 35 or 41. You will surely notice higher k-mer coverage values.

ADD REPLYlink written 3.5 years ago by Adrian Pelin2.2k

I can bench mark this. But why would that happen since 49 is still lower than 50-51. Most likely a matter of probability?

Anyway my reads are all 51 after adapter trimming. After readcorrection in SPAdes effectively I see many truncated at the first base. The majority (read almost all) reads are 50 bases or longer.

ADD REPLYlink written 3.5 years ago by ALchEmiXt1.9k
2

The header are formatted analogous to Velvet.

http://seqanswers.com/forums/showthread.php?t=51394
http://seqanswers.com/forums/showthread.php?t=1529
http://seqanswers.com/forums/showthread.php?t=6887

And if in fact the majority of your reads is at ~50bp, than the kmer-vs-read coverage would account for the 5X.

Have a look at spades/K49/final_contigs.fasta and spades/<second to last K>/final_contigs.fasta. I'm assuming that the final iteration didn't really improve the assembly, given the low coverage connections in the underlying graph.

 

ADD REPLYlink written 3.5 years ago by thackl2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1853 users visited in the last hour