What's wrong with this sample? (kmers found by FastQC of RNA-Seq)
5.2 years ago
Nick ▴ 290

I've run FastQC on a sample of Illumina RNA-Seq. It identifies issues with abnormal kmer counts:



There is also the kmer table:


What's wrong with this sample? What do I need to do? I reckon this points to contamination of some sort. I've already run Trimmomatic on this sample using the standard TruSeq adapters, so I am not sure what to make of this.


Here is the sequence content across all bases:


Here is the GC content:


As advised by Amitm, I reran FastQC using the latest version (0.11.4). The previous analysis was done with version 0.10.1. There are interesting differences. I am posting below the graph for the per base sequence content:

per base sequence content

In particular, the kmer abnormalities seem to be concentrated at the start of the reads:



This looks not too hopeless to me even though I still have no clue how to deal with it. On a different note, I am slightly worried about the discrepancies between the different FastQC versions. I've always assumed that they would produce, more or less, identical results. I realise now that this is somewhat naive. I dread to think what I would find if I rerun the latest version of FastQC on samples I've analysed in the past. 


RNA-Seq • 3.1k views
5.2 years ago
John 13k

This happens when you have an over-representation of poly-something. Because "TGTGT" and "GTGTG" look the same to the k-mer tool (as do their reverse complements ACACA and CACAC), you see these weird slopes with peaks that interleave.
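To make the interleaving concrete, here is a minimal Python sketch on a made-up poly-TG read. It is not how FastQC computes its Kmer Content module; the helper names `revcomp` and `kmers_by_position` and the toy read are assumptions for illustration only. It just shows that a dinucleotide repeat yields only two distinct 5-mers, alternating from one position to the next.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmers_by_position(read, k=5):
    """List (start_position, kmer) pairs for every k-mer in the read."""
    return [(i, read[i:i + k]) for i in range(len(read) - k + 1)]

# Toy read: a pure TG dinucleotide repeat (assumption, not the poster's data).
read = "TG" * 10

positions = kmers_by_position(read, k=5)
counts = Counter(kmer for _, kmer in positions)

# Even start positions give TGTGT, odd start positions give GTGTG.
even_kmers = {kmer for i, kmer in positions if i % 2 == 0}
odd_kmers = {kmer for i, kmer in positions if i % 2 == 1}

for kmer in sorted(counts):
    print(kmer, counts[kmer], "revcomp:", revcomp(kmer))
```

On the reverse strand the same repeat shows up as ACACA and CACAC, which is why splitting the k-mer counts by forward and reverse reads would be needed to say more.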

It's difficult to be any more specific than that without breaking down the kmers for forward and reverse reads separately.
Not that you should spend any time trying to - the polymer contamination is so significant that there is no way for you to clean the reads up without FastQC pulling a Gandalf, so trying to salvage any information from this is a lost cause. You're going to have to repeat the sequencing, mate :(

5.2 years ago
Amitm ★ 2.1k


Your "Per base sequence content" module shows a warning (the exclamation mark). For good (mammalian) data it should look like this -

Does yours?

How many reads did you have to start with? Also, if there is contamination from 'known' adapters, you would already see it in the "Overrepresented sequences" module. Btw, if you use the newer version of FastQC, you also get a separate plot for "Adapter content". Just helps.

Also, the "Per sequence GC content" plot should be unimodal, with the theoretical distribution more or less overlapping the observed one.
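If you want to look at the GC distribution outside FastQC, a rough Python sketch follows. It is only an illustration: the function names `gc_fraction` and `gc_histogram` and the toy reads are made up, and the binning is a plain histogram, not FastQC's own calculation or its theoretical curve.

```python
def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases of one read (N is ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def gc_histogram(reads, bins=10):
    """Count reads per GC bin; a read with 100% GC falls into the last bin."""
    hist = [0] * bins
    for read in reads:
        idx = min(int(gc_fraction(read) * bins), bins - 1)
        hist[idx] += 1
    return hist

# Toy reads (assumption, not real data).
toy_reads = ["ATGC", "AAAA", "GGCC", "ATAT", "GCGCAT"]
toy_hist = gc_histogram(toy_reads, bins=4)
print(toy_hist)
```

A single hump in the histogram is what you hope for; a second peak or a sharp spike suggests a contaminant or an over-represented sequence.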

And a good RNA-seq run (I mean one with good coverage of the transcriptome) would always throw a "fail" for the "Sequence Duplication Levels" module. That means you have sequenced most mRNAs in multiple copies, which is the idea of RNA-seq. See if these are satisfied.
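As a rough illustration of what a duplication figure measures, here is a small Python sketch that counts exact duplicate reads. This is a simplification and not FastQC's method; `duplication_summary` and the toy reads are hypothetical.

```python
from collections import Counter

def duplication_summary(reads):
    """Summarise exact-match duplication among a list of read sequences."""
    counts = Counter(reads)
    total = sum(counts.values())
    distinct = len(counts)
    # Percentage of reads that would be removed by keeping one copy of each.
    percent_duplicated = 100.0 * (1 - distinct / total) if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        "percent_duplicated": percent_duplicated,
    }

# Toy reads (assumption): one highly expressed transcript fragment seen 3 times.
toy_summary = duplication_summary(["AAA", "AAA", "AAA", "CCC"])
print(toy_summary)
```

In RNA-seq a high value here mostly reflects highly expressed transcripts being sampled many times, so on its own it is not a reason to discard the data.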


Thanks for suggesting to use the latest FastQC. The adaptor content is green (no issues identified).

