Question: K-mer Correction in RNA-Seq Data for Transcriptome Assembly?
9.2 years ago, Ryan Thompson (TSRI, La Jolla, CA) wrote:

In whole-genome high-throughput sequencing data, one expects a clear separation between high-frequency k-mers (signal) and low-frequency k-mers (noise arising from sequencing errors):

[Figure: k-mer frequency distribution]

Software such as Quake takes advantage of this separation to identify and correct low-frequency k-mers that represent sequencing errors. Removing these low-frequency k-mers should greatly reduce the memory usage of de Bruijn graph-based assemblers, since every distinct k-mer takes up the same amount of memory regardless of whether it occurs once or a billion times.
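To make the signal/noise separation concrete, here is a minimal sketch of k-mer counting with a fixed abundance cutoff. It is not Quake's actual method (Quake weights counts by base quality and fits a model to choose its cutoff); the function names, the toy reads, and the cutoff of 3 are all assumptions for illustration, and reverse complements are ignored.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map each abundance to the number of distinct k-mers seen that often."""
    return Counter(counts.values())

def low_abundance_kmers(counts, cutoff):
    """Return the k-mers seen fewer than `cutoff` times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < cutoff}

# Toy data: 20 identical reads plus one read carrying a single substitution.
reads = ["ACGTACGTAC"] * 20 + ["ACGTACCTAC"]
counts = count_kmers(reads, k=5)
spectrum = kmer_spectrum(counts)
suspect = low_abundance_kmers(counts, cutoff=3)
```

In this toy case the single substitution creates four k-mers that each occur once, while the true k-mers occur 20 or more times, so a fixed cutoff separates them cleanly. That clean gap is exactly what uniform coverage provides.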

However, for RNA-Seq transcriptome assembly, the situation is different. Coverage is not even remotely uniform, so one cannot automatically assume a reasonable separation between the noise and signal peaks. Or to put it another way, the Quake website explicitly mentions that it is designed for use with WGS data with a coverage of at least 15x, and in an RNA-Seq experiment, many low-expressed transcripts will probably occur at well below 15x coverage.
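As a rough illustration of why low expression is a problem, the sketch below converts base coverage to expected k-mer coverage using the standard relation k-mer coverage = base coverage × (L − k + 1) / L for read length L. The read length of 100, k of 19, and the coverage values are hypothetical numbers chosen for the example, not figures from Quake's documentation.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer coverage for a given base coverage and read length."""
    return base_coverage * (read_length - k + 1) / read_length

# Hypothetical values: 100 bp reads, k = 19.
well_covered = kmer_coverage(15, read_length=100, k=19)   # about 12.3
low_expressed = kmer_coverage(3, read_length=100, k=19)   # about 2.46
```

Under these assumptions, true k-mers from a transcript at 3x base coverage are expected only two or three times each, which puts them in the same abundance range as k-mers created by sequencing errors.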

So, is k-mer correction like that performed by Quake appropriate as a preprocessing step for RNA-Seq data before running a de novo assembly with something like Velvet/Oases or Trinity, or is it likely to misidentify k-mers from low-coverage genes as error k-mers and attempt to correct them inappropriately?

9.2 years ago, Torst wrote:

When you align your reads to your reference genome/exome/transcriptome, the alignment process already allows for some substitutions (and maybe insertions and deletions). This works whether the errors are due to small differences between the reference and your organism, or due to actual sequencing errors. All those "low-frequency k-mers" won't get ignored; they will still be aligned to their closest match if they aren't too erroneous. You need to assess how many UNALIGNED/UNMAPPED reads you are getting. If that number is too high, you can consider correcting your reads using k-mer frequency methods, but I suspect you won't need to. The chance of a corrected read now aligning to a different part of your genome is low.


I completely agree, but perhaps the poster was thinking about transcriptome assembly, when there might be a greater need for k-mer correction?

Reply written 9.2 years ago by Mikael Huss

Oops, yes, I somehow managed to write that entire question without once writing the word "assembly". I'll edit my question to clarify.

Reply written 9.2 years ago by Ryan Thompson
7.5 years ago, johnstantongeddes (Burlington, VT) wrote:

I realize this is an old post, but I've recently come across the same issue. The best solution I've found is digital normalization from C. Titus Brown's group. The paper on arXiv states that "Digital normalization ... normalizes average coverage to a specified value, reducing sampling variation while removing reads, and also removing the many errors contained within those reads."
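For anyone wondering what that looks like in practice, here is a minimal sketch of the median-based idea: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below the target coverage. It uses an exact in-memory Counter for simplicity, whereas the real implementation (khmer) uses a probabilistic counting structure; the function names, toy reads, k, and cutoff are assumptions for illustration.

```python
from collections import Counter
from statistics import median

def kmers(read, k):
    """Return the list of k-mers in a read (forward strand only)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalize(reads, k, cutoff):
    """Keep a read only while its median k-mer abundance is below `cutoff`."""
    counts = Counter()  # abundances over the reads kept so far
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# Toy data: a highly expressed "transcript" (50 reads) and a rare one (2 reads).
reads = ["ACGTACGTAC"] * 50 + ["TTGCAATGCC"] * 2
kept = digital_normalize(reads, k=5, cutoff=5)
```

In this toy run most reads from the abundant sequence are discarded once its estimated coverage reaches the cutoff, while both reads from the rare sequence are retained, which is why the approach suits the uneven coverage of RNA-Seq data.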

Hope this helps someone else!
