RNA Seq: GC Content extra peak
1
1
Entering edit mode
4.9 years ago
rmf ★ 1.1k

I have some rna-seq data and they all have this odd extra peak in GC content. Adapter trimming was carried out using Skewer to produce trimmed reads. The adapter contamination is now gone but the GC profile still looks strange. Is this something I need to be concerned about? Is there a way to pull out or examine the reads that constitute this extra peak? I also noticed that the kmer profile has changed after trimming especially in the 3' end.

rna-seq fastqc gc • 5.2k views
1
Entering edit mode

If you expect this sample to contain two genomes (one example we had come across was of some Wolbachia that infects Drosophila, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1449785/ ).
You could try to bin the reads (use the genome you are working with) with BBSplit. See what remains and then map those reads to see what is going on.
Not sure if this is a good thing but the peak to the right is unchanged post-trimming.

1
Entering edit mode

% of reads map? Any of the unmapped reads adapter pairs or primer pairs? If you look at only the reads that map and reads that map to non-repeat regions do you see the same issues? Any sign of bacterial contamination in the unmapped reads? What about over-represented kmers in the report: if found do they hit anything in a blast search? Also have a look at http://qcfail.com for other ideas.

2
Entering edit mode
4.9 years ago
John 13k

In your final plots we can see that there is a sequence of DNA common to a very large proportion of your reads that occurs at the start and/or end of your reads. Trying to glue those kmers together in my head I think you end up with 2 distinct adapter sequences: "GCCGGCATCACCGCG", and "GTCGGCATCACCGCG" which are obviously both highly GC rich, but one is 2% higher than the other.

I believe this would be the cause of your other peak(s), and if you can find a way to trim that stuff off your reads, you'll be fine. You can test this if you like with SeQC, but you'd need to have a .stat method for identifying reads as starting with GCCGGCATCACCGCG. I could make one for you very quickly if you don't know python.

1
Entering edit mode

That's some excellent insight. The two sequences you suggested nicely coincides with the extra peaks in the GC content plots. I do find the two above sequences quite a lot in my reads. Blasting one of the sequences GCCGGCATCACCGCG, they hit various microbes. But, I found this sequence is part of a full read GCCGGCATCACCGCGGACCTCGGGCGCCCTTTTGGACGTGGGCCTGTCCCCGCCCTGGGCGGCACGGCGCGGTCGGAAACGGACTGAGGACAGTCCGTTCCGGTCGACGCACGCGCCGGGAGCGAG, which is also found around 500,000 times. Blasting this large sequence correctly hits my sequenced target organism Zebrafish. So I am not sure if I should remove this over-represented sequence from my reads. Because this is rna-seq data and they could genuinely be highly expressed transcripts. Besides, this sequence hits the rRNA gene which could have escaped the rRNA depletion step.

3
Entering edit mode

Overrepresentation of an rRNA gene could skew the normal distribution of your GC plot.

1
Entering edit mode

That looks about right, since the end is "GAGCGAG" which FastQC reports as being common to the end of the reads. I'm a bit confused as to why that sequence, if it is just that sequence, doesn't appear in the over-represented sequences list of FastQC, or why the position-kmer-plot doesn't show it occurring throughout all positions of the reads, so I think there's something else going on here. Perhaps write a little script to read the BAM file and print the Seq if the Seq starts with GCCGGCATCACCGCG or GTCGGCATCACCGCG, and ends with GAGCGAG, and then just eyeball if it's predominantly 1 sequence, or if those starts/end are constant but the stuff in the middle is changing.

I'd also look to see how many reads start with TCGGCATCACCGCG. If it's LESS than what you get for GTCGGCATCACCGCG, then something weird is going on. However... when is RNA-Seq not weird? :)