Average coverage of de novo SPAdes assembly deviates a lot from theretical coverage
2
0
Entering edit mode
6.7 years ago
ALchEmiXt ★ 1.9k

Usually when we assemble PE (100, 150, 250, 300) fastq Illumina data we see that roughly the expected theoretical coverage is close to the actual coverage over de novo assembled contigs using for instance SPAdes.

However I currently have an older dataset PE 76->51 bases with a theoretical coverage of over 200 fold. However when I inspect the assembly the average coverage is only ~5 fold....

I checked;

1. reads are of top quality (1.9 encoding) all far above Q36 (according to fastQC)
2. used custom kmer settings for SPAdes in the range 21 -> 49 in PE mode
3. SUM length of assembled contigs is approximately the expected genome size ~900k
4. There are some repeats with high (but not extreme high) coverage.

For a contaminant I would expect to find many more small low coverage contigs.... as we usually for other species do...

Is the high AT% of this species disregulating SPAdes?

Any pointers where to look at are welcomed!

SPAdes Assembly de novo bacterial species • 5.6k views
0
Entering edit mode

What does the k-mer spectra look like? Do you see an extremely high peak at low coverage?

0
Entering edit mode

Is your data set heterozygous/from pooled samples...? And how high is AT rich?

2
Entering edit mode
6.7 years ago
ALchEmiXt ★ 1.9k

To conclude (for those searching for similar questions/answers):

1) Indeed the k-mer coverage is reported. Please give points to the respective comments.

2) Indeed the higher k-mers did not significantly improve the assembly. If contig length is a parameter for quality the latest SPAdes 4.6.1 did a better job then 4.6.0.

3) The AT richness (70%) did not influence this.

Thanks for solving this basic question.

1
Entering edit mode

Are you sure about SPAdes 4.6.2 ? According to their website as it opens up in my browser, their latest version is 3.6.1? Thanks for reporting back, will be very helpful to others!

0
Entering edit mode

oepsie. Corrected to 3.6.1

0
Entering edit mode

2
Entering edit mode
6.7 years ago
thackl ★ 2.9k

My guess, it is related to your very short reads and the fact that SPAdes runs with increasing kmer size. Coverage is probably estimated in the final iteration using k55 or larger and your reads each only contain a few good 55-mers.

EDIT: Just saw that you used 49mers - then this answer probably does not make sense..

4
Entering edit mode

This answer does make sense. I think the OP is confusing Nucleotide Coverage, which computes average coverage for each base pair with k-mer coverage. The latter is what is reported by SPAdes based on the last k chosen, in this case 49. k-mer coverage is always smaller than nucleotide coverage. This is from Velvet, but it also applies to SPAdes:

All coverage values in Velvet are provided in k-mer coverage, i.e. how many times has a k-mer been seen among the reads. The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is Ck = C * (L - k + 1) / L where k is your hash length, and L you read length.

1
Entering edit mode

Yeah, I had the same formula in mind. However, with 76 bp reads and 49-mers, you still would expect a lot more then 5X, even if you account for a large portion of reads having bad tails... It probably is contributing factor though. A good way to check probably would be to compute a kmer-profile with 49mers..

read length 76: 200 * (76-49+1) / 76 = 74
read length 55: 200 * (55-49+1) / 55 = 25


0
Entering edit mode

Good point. I should have thought about that. It would fit for these cases since it would give me roughly 5 fold coverage x 49 kmer ~ 250x... the expected coverage. Or would that be the wrong way to calculate it? My trimmed corrected reads in SPAdes are mostly 50 bases and my highest k-mer 49.

Strange I didn't notice this with other datasets before. Maybe because coverage was generally very high so this was not a critical point in absolute measures.

Is there anywhere a description of this value in the header? I searched the website and looked through the original paper: http://online.liebertpub.com/doi/pdf/10.1089/cmb.2012.0021 but no luck

0
Entering edit mode

If you average is 50 bp after trimming, that means that a significant portion of your reads would be discarded when k=49, as they would be smaller than 49bp. I recommend a smaller final k-mer, perhaps 35 or 41. You will surely notice higher k-mer coverage values.

0
Entering edit mode

I can bench mark this. But why would that happen since 49 is still lower than 50-51. Most likely a matter of probability?

2
Entering edit mode

The header are formatted analogous to Velvet.

And if in fact the majority of your reads is at ~50bp, than the kmer-vs-read coverage would account for the 5X.

Have a look at spades/K49/final_contigs.fasta and spades/<second to last K>/final_contigs.fasta. I'm assuming that the final iteration didn't really improve the assembly, given the low coverage connections in the underlying graph.