Question

What's wrong with this sample? (kmers found by FastQC of RNA-Seq)

8.2 years ago

Nick ▴ 290

I've run FastQC on a sample of Illumina RNA-Seq. It identifies issues with abnormal kmer counts:

kmers

There is also the kmer table:

kmer-table

What's wrong with this sample? What do I need to do? I reckon this points to contamination of some sort. I've already run Trimmomatic on this sample using the standard TruSeq adapters, so I am not sure what to make of this.

EDIT

Here is the sequence content across all bases:

Here is the GC content:

EDIT2

As advised by Amitm, I reran FastQC using the latest version (0.11.4). The previous analysis was done with version 0.10.1. There are interesting differences. I am posting below the graph for the per base sequence content:

per base sequence content

In particular, the kmer abnormalities seem to be concentrated at the start of the reads:

kmer-content

This doesn't look too hopeless to me, even though I still have no clue how to deal with it. On a different note, I am slightly worried about the discrepancies between the different FastQC versions. I've always assumed that they would produce more or less identical results. I realise now that this was somewhat naive. I dread to think what I would find if I reran the latest version of FastQC on samples I've analysed in the past.

RNA-Seq • 4.4k views

updated 21 months ago by Ram 43k • written 8.2 years ago by Nick ▴ 290

score 2 · Answer 1 · 2016-01-30

This happens when you have an over-representation of poly-something. Because "TGTGT" and "GTGTG" are just one-base shifts of the same repeat and so look the same to the k-mer tool (as do their reverse complements, ACACA and CACAC), you see these weird slopes with peaks that interleave.
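To make the interleaving concrete, here is a minimal sketch on a made-up poly-TG read. It is not how FastQC counts k-mers internally; the helper names `reverse_complement` and `kmer_positions` are just illustrative.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmer_positions(read, k=5):
    """Map each k-mer to the list of 0-based offsets at which it starts."""
    positions = {}
    for i in range(len(read) - k + 1):
        positions.setdefault(read[i:i + k], []).append(i)
    return positions


# Hypothetical read made purely of a TG repeat (12 bases).
read = "TG" * 6
pos = kmer_positions(read)

# The two 5-mers alternate: one starts at even offsets, the other at odd ones.
print(pos["TGTGT"])  # [0, 2, 4, 6]
print(pos["GTGTG"])  # [1, 3, 5, 7]

# On the opposite strand the same repeat shows up as ACACA / CACAC.
print(reverse_complement("TGTGT"), reverse_complement("GTGTG"))  # ACACA CACAC
```

Summed over many reads that start at different phases of the repeat, that alternation is what gives the interleaved peaks.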

It's difficult to be any more specific than that without breaking down the kmers for forward and reverse reads separately.
Not that you should spend any time trying to - the polymer contamination is so significant that there is no way for you to clean the reads up without FastQC pulling a Gandalf, so trying to salvage any information from this is a lost cause. You're going to have to repeat the sequencing, mate :(

score 1 · Answer 2 · 2016-01-30

hi,

Your "Per base sequence content" module shows a warning (the exclamation mark). For OK (mammalian) data it should look like this:

Does yours?

How many reads did you have to start with? Also, if there were contamination from 'known' adapters, you would already see it in the "Overrepresented sequences" module. Btw, if you use the newer version of FastQC, you also get a separate plot for "Adapter Content". Just helps.

Also, the "Per sequence GC content" plot should be unimodal, with the theoretical distribution more or less overlapping the observed one.
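As a rough way to eyeball this outside FastQC, per-read GC can be binned as in the sketch below. The reads are made up, the 10% bin width is an arbitrary choice, and FastQC's own plot also fits a theoretical curve, which this does not attempt.

```python
from collections import Counter


def gc_percent(read):
    """GC percentage of a read, ignoring anything that is not A/C/G/T."""
    read = read.upper()
    called = sum(read.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (read.count("G") + read.count("C")) / called


def gc_histogram(reads, bin_width=10):
    """Count reads per GC bin; keys are the lower edge of each bin."""
    hist = Counter()
    for read in reads:
        lower = int(gc_percent(read) // bin_width) * bin_width
        # Put reads with exactly 100% GC into the last bin.
        hist[min(lower, 100 - bin_width)] += 1
    return dict(sorted(hist.items()))


# Hypothetical reads; in practice these would come from the FASTQ file.
reads = ["ACGTACGT", "GGGGCCCC", "ATATATAT", "GCGCATAT"]
hist = gc_histogram(reads)
print(hist)  # {0: 1, 50: 2, 90: 1}
```

A single hump across the bins is what you hope for; a second hump or a sharp spike hints at contamination or an over-represented sequence.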

And a good RNA-seq run (I mean one with good coverage of the transcriptome) would always throw a "fail" for "Sequence Duplication Levels". That means that you have sequenced most mRNAs in multiple copies, which is the idea of RNA-seq. See if these are satisfied.
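For a quick feel of what a duplication level means, here is a simplified sketch that counts exact duplicate reads. It is only an approximation of the idea, not FastQC's actual procedure, and the reads and the `duplication_summary` name are made up.

```python
from collections import Counter


def duplication_summary(reads):
    """Summarise exact-sequence duplication in a collection of reads."""
    counts = Counter(reads)
    total = sum(counts.values())
    distinct = len(counts)
    percent_remaining = 100.0 * distinct / total if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        # Share of reads left if every sequence were kept only once.
        "percent_remaining_if_deduplicated": percent_remaining,
    }


# Hypothetical reads: one sequence seen three times, two seen once.
reads = ["AAAA", "AAAA", "AAAA", "CCCC", "GGGG"]
summary = duplication_summary(reads)
print(summary)
```

In RNA-seq, highly expressed transcripts produce many identical reads, so a low "remaining" percentage is not by itself a sign of a bad library.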