Question

GC content and Kmer



8.5 years ago

Xin ▴ 70

Hi all ~~

I'm quality-checking my reads before further RNA-seq analysis.

When I ran FastQC for the first time, these modules failed:

Per base sequence content
Per sequence GC content
Sequence duplication levels
Overrepresented sequences
Kmer content

Then I checked the overrepresented sequences with BLAST, and they matched the chloroplast genome. So I removed all chloroplast sequences from my data.

Then I ran FastQC again, and all the modules listed above still failed except Overrepresented sequences, so I still had 4 failures. After reading around, I understood that a failure in Sequence duplication levels is OK for RNA-seq because the duplicates may come from highly expressed transcripts. So I stopped worrying about Sequence duplication levels.

The failure in Per base sequence content was due to the first 13 bases. So I trimmed them, and this module now passes.

But Per sequence GC content and Kmer content still fail. Kmer content is even worse than before: the peaks are now spread across all positions, whereas at the beginning they were around the first 12 positions (which I trimmed).

To make sure no adapter sequences were left (the Adapter Content module in FastQC looks completely fine), I put all the adapter sequences used by Illumina in a file and tried to trim them off. As I expected, there were no adapters, and nothing changed in the quality report.

So do you have any suggestions for getting better-quality reads? Did I do something wrong? Why did the Kmer content get worse?

RNA-Seq • 5.6k views

Updated 19 months ago by Ram 43k



8.5 years ago

James Ashmore ★ 3.4k

Have a look at the expected and observed number of Kmers in your FastQC report. If you have, say, 30 million reads and your top Kmer appears only 2,000 or so times, this doesn't mean there is a problem with your library. For example, in a ChIP-seq dataset you may see the binding motif come up as an enriched Kmer, which would make sense. For RNA-seq I imagine there is something similar, such as sequencing the same region of the most highly expressed gene many times. Just a thought, although others may have better recommendations.
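To get a feel for what "observed versus expected" means here, a rough Python sketch might help. It is not FastQC's actual statistic (FastQC tests for positional bias of each Kmer); it only compares how often each Kmer is seen with how often it would be expected if bases were drawn independently at the library's overall base frequencies. The function name `kmer_obs_exp`, the default `k=7`, and the toy reads are all made up for illustration.

```python
from collections import Counter


def kmer_obs_exp(reads, k=7):
    """Return {kmer: (observed, expected, observed / expected)}.

    The expected count assumes independent bases at the overall base
    frequencies of the input reads. This is a simplification and is
    NOT the test that FastQC itself applies.
    """
    base_counts = Counter()
    kmer_counts = Counter()
    total_windows = 0

    for read in reads:
        read = read.upper()
        base_counts.update(b for b in read if b in "ACGT")
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing N or other ambiguity codes.
            if set(kmer) <= set("ACGT"):
                kmer_counts[kmer] += 1
                total_windows += 1

    total_bases = sum(base_counts.values())
    if total_bases == 0:
        return {}
    freqs = {b: base_counts[b] / total_bases for b in "ACGT"}

    result = {}
    for kmer, observed in kmer_counts.items():
        prob = 1.0
        for b in kmer:
            prob *= freqs[b]
        expected = prob * total_windows
        result[kmer] = (observed, expected, observed / expected)
    return result


if __name__ == "__main__":
    # Toy data: one "highly expressed" fragment repeated many times,
    # plus a few unrelated reads (all sequences are invented).
    toy_reads = ["GATTACAGATTACA"] * 50 + ["ACGTTGCAAGCTAG", "TTGACCGTAAGGCT"]
    stats = kmer_obs_exp(toy_reads, k=7)
    top = sorted(stats.items(), key=lambda kv: kv[1][0], reverse=True)[:3]
    for kmer, (obs, exp, ratio) in top:
        print(kmer, obs, round(exp, 3), round(ratio, 1))
```

In the toy data the Kmers from the repeated fragment dominate simply because that fragment was "sequenced" many times, which is the RNA-seq situation described above rather than a sign of a broken library.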


Ram · Accepted Answer · 2015-10-05

For RNAseq datasets one typically sees failures in those modules. I should point out that FastQC is really geared toward whole-genome sequencing. We use it for all types of datasets, but many of them will have failures in various modules that can be ignored. In particular, the modules you mentioned will commonly fail on RNAseq samples and I wouldn't worry about that. Additionally, I would encourage you to not trim off the first 13 bases that are biased. The bases are actually correct and you'll lower your mapping rate (and cause some false alignments) by doing that.

BTW, remember with RNAseq data you expect anything short and highly expressed to be sequenced many many many times. That alone will cause a shift in the GC and kmer profile and cause "failures" in these tests. Don't worry about stuff like that.
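As a small illustration of the GC point, here is a hedged Python sketch that bins reads by their GC percentage, which is roughly the quantity the Per sequence GC content plot is built from. The helper names `gc_percent` and `gc_histogram` and the toy reads are invented for this example; it does not reproduce FastQC's fitted reference curve or its pass/fail thresholds.

```python
from collections import Counter


def gc_percent(read):
    """GC percentage of one read, ignoring N and other non-ACGT symbols.

    Returns None when the read has no called bases at all.
    """
    called = [b for b in read.upper() if b in "ACGT"]
    if not called:
        return None
    gc = sum(1 for b in called if b in "GC")
    return 100.0 * gc / len(called)


def gc_histogram(reads):
    """Count reads per integer GC percentage (via Python's round())."""
    hist = Counter()
    for read in reads:
        pct = gc_percent(read)
        if pct is not None:
            hist[round(pct)] += 1
    return hist


if __name__ == "__main__":
    # Toy data: a mixed background plus one short, GC-rich sequence that
    # is "expressed" far more than everything else (all invented).
    background = ["ATGCATGCAT", "AATTGGCCAA", "ATATGCGCAT", "TTAACCGGAT"]
    abundant = ["GCGCGGCCGC"] * 40
    hist = gc_histogram(background + abundant)
    for pct in sorted(hist):
        print(pct, hist[pct])
```

With the abundant GC-rich sequence included, the histogram gets a sharp extra peak at high GC on top of the background, which is the kind of distortion that makes the module report a "failure" on otherwise fine RNA-seq data.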