Question: Why does this plot of human GC content have a peak at around 60%?
4.8 years ago by
willj40 wrote:

I'm new to RNA-Seq and have just run FastQC on my dataset. On the plots of GC content, all of the samples have a peak at around 60%, as shown here:

I've blasted a few of the most overrepresented sequences and each one hits multiple genes of multiple mammalian species with 100% identity. Each one hits the human signal recognition particle RNA (SRP 7SL), but also hits predicted targets in other mammals. Here's an example sequence:


Can anyone suggest what could be causing this? As I say, I'm new to RNA-Seq, so it could be a beginner's misunderstanding or ignorance. I haven't processed the data in any way (no trimming or other quality cut-offs); the reads were run directly through FastQC. As far as I can tell, the main quality measures (Per base sequence quality, Per sequence quality scores) are good, though several of the others (Per base sequence content, Adapter content, and Kmer content) show red flags.

In case it's useful, these are paired-end reads from an Illumina TruSeq Total RNA library.

Thank you for any help.

rna-seq • 2.1k views
modified 4.8 years ago by dariober11k • written 4.8 years ago by willj40

Update: so I've tried trimming adapters but the GC peak is still there...

written 4.8 years ago by willj40

The same happened to me with this overrepresented sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC. In my case it was ChIP-seq data from lab mouse models. I trimmed, and the QC report just got worse (the GC content plot barely changed). I have now BLASTed the sequence: it shows a 93% match with Staphylococcus phage Andhra, but it also appears in the adapter catalog. Going by the BLAST hit, I could think the sample is contaminated with DNA from that virus (it is a double-stranded DNA virus), but since the sequence is also an adapter, adapter contamination seems the more likely explanation. But if it is an adapter, why doesn't it appear in the "adapter content" plot? I would like to see a well-founded explanation, because so far I have only read suggestions such as "proceed with the mapping anyway, it probably won't affect much", with no real explanation.

modified 20 months ago • written 20 months ago by msimmer92260
4.8 years ago by
WCIP | Glasgow | UK
dariober11k wrote:

It might be adapter contamination causing the spike. Try trimming the adapters and running FastQC again.

written 4.8 years ago by dariober11k
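
To make the trimming suggestion concrete, here is a minimal Python sketch of what adapter trimming does to a read. It is only an illustration under assumptions: the function names are hypothetical, the adapter string is the 33 nt shared by the two overrepresented sequences quoted in this thread, and matching is exact, whereas dedicated trimmers also tolerate mismatches.

```python
# Assumption: the 33-nt prefix common to both overrepresented sequences
# quoted in this thread (the part before the 6-nt index).
ADAPTER_PREFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"


def trim_adapter(read, adapter=ADAPTER_PREFIX, min_overlap=10):
    """Cut the read at the adapter (exact matching only).

    Handles a full adapter anywhere in the read, or a partial adapter
    of at least `min_overlap` bases at the 3' end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Look for the longest adapter prefix that the read ends with.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def adapter_fraction(reads, adapter=ADAPTER_PREFIX):
    """Fraction of reads that contain the full adapter string."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(adapter in r for r in reads) / len(reads)
```

A read that begins with the adapter is trimmed to an empty string, which is why trimmed output is usually also filtered by a minimum length.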

In the overrepresented sequences, I do sometimes get a hit on the TruSeq Adapter (below). However, when I blast this it does not give similar hits to the other sequences I mentioned above. Anyway, I'll try trimming as you say.

Sequence                                            Count     Percentage             Possible Source
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC  232993    0.48295712295995535    TruSeq Adapter, Index 10 (100% over 50bp)
modified 10 months ago by RamRS30k • written 4.8 years ago by willj40

Hi, I've now trimmed the adapters and removed low quality reads but the peak is still there.

written 4.8 years ago by willj40
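
One way to check whether the overrepresented sequences account for the peak is to recompute the per-read GC distribution with and without them. The Python sketch below is a rough illustration, not FastQC's own implementation: the function names are hypothetical, and the FASTQ reader assumes an uncompressed file with unwrapped four-line records.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage of a sequence, ignoring N and other non-ACGT symbols."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads, exclude=()):
    """Count reads per rounded GC percentage.

    Reads that start with any sequence in `exclude` (for example the
    overrepresented sequences reported by FastQC) are skipped.
    """
    hist = Counter()
    for read in reads:
        if any(read.startswith(seq) for seq in exclude):
            continue
        hist[round(gc_percent(read))] += 1
    return hist


def read_fastq_sequences(path):
    """Yield the sequence line of each record (assumes 4 lines per record)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()
```

If the histogram built with `exclude` set to the overrepresented sequences no longer peaks near 60%, those sequences explain the spike; if the peak remains, the cause lies elsewhere in the library.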