First post to Biostars... hope I do this right.
Hi All,
I've been fighting a bit with an attempt at a genome assembly.
Based on reading this forum, I suspect there is an issue with the raw data, but I just want to make sure I've understood everything. FastQC looks OK. There are some over-represented kmers, but I don't think that is the problem.
The problem is that I seem to have an excess of rare kmers (which seemingly indicates sequencing errors). Both ALLPATHS and ABYSS seem to be telling the same story (as they should!), but I am not sure why. I have pasted the first few lines of the coverage.hist from ABYSS below.
These data have been quality trimmed using trim_galore. One relevant post I found suggests using QUAKE for correction. Is that still my next step? Or is it possible these data are fundamentally flawed? I guess one question is, what causes an excess of rare kmers if the quality scores of the data are very high?
Thanks!
Chris
1   1182909997
2   84699927
3   9033507
4   5000923
5   3572263
6   3223489
7   3322965
8   3843986
9   4777951
10  6183795
11  8121580
12  10580154
13  13495444
Looks like every other dataset to me. Plot column 1 vs column 2 and look at how many peaks you have. You will need to play with the x/y axis ranges.
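If you want a quick check before plotting, here is a minimal sketch in plain Python that reads the two columns and finds the first valley, i.e. the multiplicity where the counts stop falling and start rising again. Everything to the left of that valley is usually treated as the error spike. The function names (parse_hist, first_valley) are made up for this example and are not part of ABYSS; the numbers are the ones posted in the question.

```python
def parse_hist(text):
    """Parse 'multiplicity count' lines into a list of (multiplicity, count) tuples."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        rows.append((int(fields[0]), int(fields[1])))
    return rows


def first_valley(rows):
    """Return the multiplicity of the first local minimum in the counts, or None."""
    for i in range(1, len(rows) - 1):
        falling = rows[i][1] < rows[i - 1][1]
        rising_next = rows[i][1] <= rows[i + 1][1]
        if falling and rising_next:
            return rows[i][0]
    return None


# First lines of coverage.hist as posted in the question.
HIST = """
1   1182909997
2   84699927
3   9033507
4   5000923
5   3572263
6   3223489
7   3322965
8   3843986
9   4777951
10  6183795
11  8121580
12  10580154
13  13495444
"""

rows = parse_hist(HIST)
valley = first_valley(rows)
print("first valley at multiplicity", valley)
```

On the posted lines the counts bottom out at multiplicity 6 and climb from there, so the real coverage peak(s) sit somewhere past the end of the excerpt. To run it on a whole file, pass the file contents to parse_hist instead of the HIST string.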
As Adrian says, kmer frequency histograms (for isolates) usually look like that. I recommend adapter-trimming if you have not already done so, however. You can also check for and remove contaminants, particularly human, which sometimes will contribute to low-frequency kmers.
oops, messed up plotting, one second...
OK, here is the full plot. Is this really normal? This is from ~1 lane of PE reads from a bird genome. R isn't labelling my x axis at the moment, sorry.
That looks fine; typical diploid pattern. It looks much more sane if you plot it on a log scale.
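To see why the log scale helps, here is a rough sketch (plain Python, no plotting library) that prints log10 of each count as a text bar. The log_bars helper and the width_per_decade parameter are just names invented for this illustration; the counts are the ones posted in the question.

```python
import math

# Multiplicity -> number of kmers, from the coverage.hist lines in the question.
COUNTS = {
    1: 1182909997,
    2: 84699927,
    3: 9033507,
    4: 5000923,
    5: 3572263,
    6: 3223489,
    7: 3322965,
    8: 3843986,
    9: 4777951,
    10: 6183795,
    11: 8121580,
    12: 10580154,
    13: 13495444,
}


def log_bars(counts, width_per_decade=5):
    """Return one text line per multiplicity with a bar proportional to log10(count)."""
    lines = []
    for mult in sorted(counts):
        log_count = math.log10(counts[mult])
        bar = "#" * round(log_count * width_per_decade)
        lines.append(f"{mult:>3} {log_count:5.2f} {bar}")
    return lines


bars = log_bars(COUNTS)
print("\n".join(bars))
```

On a linear axis the singleton bin (over 10^9 kmers) is a few hundred times taller than the bins around the valley (around 10^6.5), so everything else gets squashed flat. On a log axis that gap is only about 2.5 decades, and the dip and the rise after it are easy to see.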
thanks... now I know... on to other forms of assembly trouble-shooting!
Looks like you have a diploid pattern like Brian said. Adapter trimming and quality trimming can remove the low-frequency k-mers, as well as contaminants if any. Try SPAdes or QUAKE for read error correction before assembly (SPAdes does error correction and assembly as part of its pipeline).