Question: GC content and Kmer
Xin wrote, 3.5 years ago:

Hi all ~~

I'm quality-checking my reads before further RNA-seq analysis.

When I ran FastQC for the first time, the following modules failed:

  • Per base sequence content
  • Per sequence GC content
  • Sequence duplication levels
  • Overrepresented sequences
  • Kmer content

Then I checked my overrepresented sequences with BLAST, and they matched the chloroplast genome. So I removed all chloroplast reads from my data.

Then I ran FastQC again, and all of the modules listed above failed except Overrepresented sequences, so I still had 4 failures.
After reading around online, I understood that a failure in Sequence duplication levels is OK because it may be due to highly expressed transcripts. So I stopped worrying about Sequence duplication levels.

On the other hand, the failure in Per base sequence content was due to the first 13 bases. So I trimmed them, and this module now passes.

But Per sequence GC content and Kmer content still fail. Kmer content is even worse than before: the peaks are now spread across all positions, whereas at the beginning they were around the first 12 positions (which I trimmed).

To make sure no adapter sequences remained (the Adapter content module in FastQC is completely fine), I put all the adapter sequences used by Illumina in a file and tried to trim them off. As I expected, there were no adapters and nothing changed in quality.

So do you have any suggestions for getting better-quality reads? Did I do something wrong? Why did Kmer content get worse?

 

Devon Ryan (Freiburg, Germany) wrote, 3.5 years ago:

For RNAseq datasets one typically sees failures in those modules. I should point out that FastQC is really geared toward whole-genome sequencing. We use it for all types of datasets, but many of them will have failures in various modules that can be ignored. In particular, the modules you mentioned will commonly fail on RNAseq samples and I wouldn't worry about that. Additionally, I would encourage you to not trim off the first 13 bases that are biased. The bases are actually correct and you'll lower your mapping rate (and cause some false alignments) by doing that.
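One way to act on this is to separate the failures that are expected for RNA-seq from the ones that deserve a look. The sketch below is only an illustration: it assumes the tab-separated `summary.txt` layout that FastQC writes (status, module name, file name), the "expected" set is just the five modules listed in the question, and `unexpected_failures` is a hypothetical helper name.

```python
# Modules the question lists as failing; treated here as "expected" for RNA-seq.
# This set is an assumption taken from the thread, not an official FastQC list.
EXPECTED_RNASEQ_FAILURES = {
    "Per base sequence content",
    "Per sequence GC content",
    "Sequence Duplication Levels",
    "Overrepresented sequences",
    "Kmer Content",
}


def unexpected_failures(summary_text):
    """Return FAIL modules that are not in the expected-for-RNA-seq set."""
    expected = {name.lower() for name in EXPECTED_RNASEQ_FAILURES}
    flagged = []
    for line in summary_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        status, module = parts[0].strip(), parts[1].strip()
        # Compare case-insensitively, since module capitalisation varies.
        if status == "FAIL" and module.lower() not in expected:
            flagged.append(module)
    return flagged


# Made-up summary content for illustration only.
example = "\n".join([
    "PASS\tBasic Statistics\treads.fastq",
    "FAIL\tPer sequence GC content\treads.fastq",
    "FAIL\tKmer Content\treads.fastq",
    "FAIL\tPer base sequence quality\treads.fastq",
])
print(unexpected_failures(example))
```

In this made-up example only the per-base quality failure is reported, which is the kind of failure that would still be worth investigating.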

BTW, remember with RNAseq data you expect anything short and highly expressed to be sequenced many many many times. That alone will cause a shift in the GC and kmer profile and cause "failures" in these tests. Don't worry about stuff like that.
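To see why a single abundant transcript is enough to move the GC profile, here is a toy sketch. The reads are invented for illustration, and `gc_fraction` and `mean_gc` are hypothetical helper names, not part of FastQC.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in one read."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def mean_gc(reads):
    """Mean per-read GC fraction over a list of reads."""
    return sum(gc_fraction(r) for r in reads) / len(reads)


# Invented "background" reads with moderate GC content.
background = ["ATGCATGCAT", "GGCATTAACC", "TTAACCGGAT", "CATGATCGTA"]

# One invented GC-rich read, sequenced many times because its
# (hypothetical) transcript is short and highly expressed.
abundant = "GCGCGGCCGC"
with_abundant = background + [abundant] * 6

print(mean_gc(background))
print(mean_gc(with_abundant))
```

The second mean is much higher than the first even though no read is wrong; the library composition alone has shifted the distribution.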
 

James Ashmore (UK/Edinburgh/MRC Centre for Regenerative Medicine) wrote, 3.5 years ago:

Have a look at the expected and observed number of Kmers in your FastQC report. If you have, say, 30 million reads and your top Kmer appears only 2,000 or so times, this doesn't mean there is a problem with your library. For example, in a ChIP-seq dataset you may see the binding motif come up as an enriched Kmer, which would make sense. For RNA-seq I imagine there is something similar, such as sequencing the same region of the most highly expressed gene many times. Just a thought, although others may have better recommendations.
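A rough sketch of the observed-versus-expected idea, for anyone who wants to check a Kmer by hand. This is not FastQC's own statistic (its test is position-aware); it simply assumes bases are independent and derives the expected count from the overall base composition. The function name `kmer_observed_expected` and the reads are made up for illustration.

```python
from collections import Counter


def kmer_observed_expected(reads, k):
    """Map each k-mer to (observed count, expected count).

    Expected counts assume independent bases drawn from the overall
    base composition of the reads, which is a simplification.
    """
    base_counts = Counter()
    kmer_counts = Counter()
    for read in reads:
        read = read.upper()
        base_counts.update(read)
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    total_bases = sum(base_counts.values())
    total_kmers = sum(kmer_counts.values())

    result = {}
    for kmer, observed in kmer_counts.items():
        # Probability of this k-mer under the independence assumption.
        p = 1.0
        for base in kmer:
            p *= base_counts[base] / total_bases
        result[kmer] = (observed, p * total_kmers)
    return result


# Invented reads in which ACGT recurs.
reads = ["ACGTACGT", "TTTTACGT", "ACGTGGGG"]
stats = kmer_observed_expected(reads, 4)
top = max(stats, key=lambda kmer: stats[kmer][0] / stats[kmer][1])
print(top, stats[top])
```

With so few reads the ratios are wildly inflated, which is the point made above: look at the absolute observed count next to the total number of reads before deciding that an enriched Kmer indicates a problem.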
