Usually, when we assemble paired-end (PE) Illumina FASTQ data (100, 150, 250 or 300 bp reads), the theoretical coverage is close to the actual coverage over the de novo assembled contigs, for instance with SPAdes.
However, I currently have an older dataset (PE, 76->51 bases) with a theoretical coverage of over 200-fold, yet when I inspect the assembly the average coverage is only ~5-fold.
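For reference, this is the kind of calculation meant by "theoretical coverage": total sequenced bases divided by genome size. A minimal sketch; the read count below is a hypothetical number picked to land near 200-fold, not a figure from this dataset.

```python
def theoretical_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected per-base coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Hypothetical example: 3.6 million reads of 51 bases on a ~900 kb genome.
example = theoretical_coverage(3_600_000, 51, 900_000)
print(example)  # 204.0
```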
I checked:
- reads are of top quality (1.9 encoding), all far above Q36 (according to FastQC)
- used custom k-mer settings for SPAdes in the range 21 -> 49 in PE mode
- the summed length of the assembled contigs is approximately the expected genome size (~900 kb)
- there are some repeats with high (but not extremely high) coverage
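One way to double-check the summed length and the average coverage is to read them straight from the contig names. The sketch below assumes the usual SPAdes header layout (`NODE_<id>_length_<bp>_cov_<value>`); the two headers are made up for illustration. Note that SPAdes documents this `cov` value as k-mer coverage for the largest k, which is lower than per-base coverage. Whether the ~5-fold above came from these headers is not stated, so the conversion helper is only a common approximation to try, not a diagnosis.

```python
import re

# Assumed SPAdes header layout: NODE_<id>_length_<bp>_cov_<k-mer coverage>
HEADER = re.compile(r"NODE_\d+_length_(\d+)_cov_([\d.]+)")


def summarize_headers(headers):
    """Return (total length, length-weighted mean k-mer coverage)."""
    total_len = 0
    weighted = 0.0
    for header in headers:
        match = HEADER.search(header)
        if match is None:
            continue  # skip headers that do not follow the assumed layout
        length, cov = int(match.group(1)), float(match.group(2))
        total_len += length
        weighted += length * cov
    mean_cov = weighted / total_len if total_len else 0.0
    return total_len, mean_cov


def kmer_to_base_coverage(kmer_cov, read_length, k):
    """Approximate per-base coverage from k-mer coverage: Ck * L / (L - k + 1)."""
    return kmer_cov * read_length / (read_length - k + 1)


# Hypothetical headers, not taken from the real assembly.
headers = [">NODE_1_length_600000_cov_5.0", ">NODE_2_length_300000_cov_8.0"]
total_len, mean_cov = summarize_headers(headers)
print(total_len, mean_cov)
print(kmer_to_base_coverage(mean_cov, read_length=51, k=49))
```

With 51-base reads and k = 49, each read contributes only 3 k-mers, so k-mer coverage can be far below per-base coverage.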
If contamination were the cause, I would expect to find many more small, low-coverage contigs, as we usually do for other species.
Could the high AT% of this species be throwing SPAdes off?
Any pointers on where to look are welcome!
What does the k-mer spectrum look like? Do you see an extremely high peak at low coverage?
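In case it helps, here is a minimal sketch of what a k-mer spectrum is: for each multiplicity, the number of distinct k-mers seen that many times in the reads. It is a toy in-memory version on made-up reads and does not merge reverse complements; for a real dataset a dedicated counter such as Jellyfish or KMC is the practical choice.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # ignore k-mers containing ambiguous bases
                counts[kmer] += 1
    return Counter(counts.values())


# Toy reads for illustration only.
reads = ["ATATAT", "ATATGC"]
spectrum = kmer_spectrum(reads, 3)
for multiplicity in sorted(spectrum):
    print(multiplicity, spectrum[multiplicity])
```

A large peak at multiplicity 1 or 2 usually points to sequencing errors or low-abundance material, while the main genomic peak sits near the k-mer coverage.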
Is your dataset heterozygous or from pooled samples? And how AT-rich is it?
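To put a number on the AT content, something like the following would do. It is a small sketch that counts only unambiguous bases; the sequence is a made-up example, and in practice you would run it over the contigs or a sample of reads.

```python
def at_fraction(seq: str) -> float:
    """Fraction of A/T among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    total = at + gc
    return at / total if total else 0.0


# Made-up sequence for illustration.
print(at_fraction("ATATTAGC"))  # 0.75
```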