Question

Why does this plot of human GC content have a peak at around 60%?

4

8.3 years ago

willj ▴ 60

I'm new to RNA-Seq and have just run FastQC on my dataset. On the plots of GC content, all of the samples have a peak at around 60%, as shown here:

I've blasted a few of the most overrepresented sequences and each one hits multiple genes of multiple mammalian species with 100% identity. Each one hits the human signal recognition particle RNA (SRP 7SL), but also hits predicted targets in other mammals. Here's an example sequence:

GTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGG
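For reference, the GC content of the example sequence above can be checked directly. This is a minimal sketch in Python; the helper name `gc_percent` is only illustrative and is not part of FastQC.

```python
# Example overrepresented sequence quoted in the question.
SEQ = "GTTCTGGGCTGTAGTGCGCTATGCCGATCGGGTGTCCGCACTAAGTTCGG"


def gc_percent(seq: str) -> float:
    """Return the percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100 * gc / len(seq)


if __name__ == "__main__":
    print(f"{len(SEQ)} bases, GC = {gc_percent(SEQ):.1f}%")
```

For this 50-base sequence the result is 60.0%, which lines up with the position of the peak, so a very abundant transcript with this composition could plausibly contribute to it.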

Can anyone suggest what could be causing this? As I say, I'm new to RNA-Seq, so it could be a beginner's misunderstanding. I haven't touched the data in any way (no trimming or other quality cut-offs); the reads were run directly through FastQC. As far as I can tell, the main quality measures (Per base sequence quality, Per sequence quality scores) are good, though several of the others (Per base sequence content, Adapter Content, and Kmer Content) show red flags.

In case it's useful, these are paired-end reads from an Illumina TruSeq Total RNA library prep.

Thank you for any help.

RNA-Seq • 3.2k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.3 years ago by willj ▴ 60

0

Update: so I've tried trimming adapters but the GC peak is still there...

ADD REPLY • link 8.3 years ago by willj ▴ 60

0

The same happened to me with this overrepresented sequence: GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC. In my case it was ChIP-seq data from lab mouse models. I trimmed, and the QC report just got worse (and the GC content plot barely changed). I have now BLASTed the sequence and it shows a 93% match with Staphylococcus phage Andhra, but it also appears in the adapter catalog. Going by the BLAST hit alone, I might suspect contamination with DNA from that virus (it's a double-stranded DNA virus), but since the sequence is also a known adapter, adapter contamination seems the more plausible explanation. But if it is an adapter, why doesn't it appear in the "Adapter Content" plot? I would like to see a well-founded explanation of this, because so far I have only read suggestions such as "proceed with the mapping anyway, it probably won't affect much", with no real explanation.
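One quick way to test the adapter explanation is to compare this sequence base by base with the sequence that FastQC labelled "TruSeq Adapter, Index 10" elsewhere in this thread. The sketch below is Python, and the function name `mismatch_positions` is only illustrative. It assumes both sequences are the same length and already aligned at the first base.

```python
# Sequence reported in this comment (ChIP-seq data).
reported = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC"
# Sequence FastQC labelled "TruSeq Adapter, Index 10" in this thread.
truseq_index10 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC"


def mismatch_positions(a: str, b: str) -> list[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    diffs = mismatch_positions(reported, truseq_index10)
    print(f"{len(diffs)} mismatches at positions {diffs}")
```

All of the differences fall between positions 34 and 39. The first 33 bases and the last 11 are identical, which is what one would expect from the same indexed adapter carrying a different 6-base index rather than from phage DNA.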

ADD REPLY • link 5.2 years ago by msimmer92 ▴ 300

Ram · Answer 1 · 2015-12-17

0

8.3 years ago

dariober 14k

Adapter contamination might be causing the spike. Try trimming the adapters and running FastQC again.
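To make clear what trimming does, here is a rough Python sketch of exact-match 3' adapter removal. It is an illustration only: the function name `trim_3prime_adapter` and the `min_overlap` parameter are made up for this example, and the adapter string is the commonly used Illumina TruSeq adapter prefix, which is an assumption about this library. Dedicated trimmers also tolerate mismatches and handle quality scores and read pairs, so use one of those for real data.

```python
# Assumed adapter: the widely used Illumina TruSeq adapter prefix.
ADAPTER = "AGATCGGAAGAGC"


def trim_3prime_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Remove an exact-match 3' adapter (full or partial) from a read."""
    # Case 1: the whole adapter occurs somewhere in the read;
    # cut the read at the start of the adapter.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter;
    # look for the longest adapter prefix that matches the read's 3' end.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter found: return the read unchanged.
    return read


if __name__ == "__main__":
    for read in ("ACGTACGTAGATCGGAAGAGCACACG", "ACGTACGTTTAGATCGG", "ACGTACGT"):
        print(read, "->", trim_3prime_adapter(read))
```

Note that trimming of this kind only removes adapter bases. It would not remove a genuinely abundant GC-rich transcript, so a peak that survives trimming points away from adapters as the sole cause.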

ADD COMMENT • link 8.3 years ago by dariober 14k

1

In the overrepresented sequences, I do sometimes get a hit on the TruSeq adapter (below). However, when I BLAST it, it does not give hits similar to those for the other sequences I mentioned above. Anyway, I'll try trimming as you suggest.

Sequence                                            Count     Percentage             Possible Source
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGC  232993    0.48295712295995535    TruSeq Adapter, Index 10 (100% over 50bp)

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.3 years ago by willj ▴ 60

1

Hi, I've now trimmed the adapters and removed low quality reads but the peak is still there.

ADD REPLY • link 8.3 years ago by willj ▴ 60