Question: When should one be concerned about the GC content in an RNA-seq experiment?
dsastre wrote (6 months ago):

I'm learning on my own how to perform differential expression analysis of RNA-seq data from GEO (human cell lines). I selected a dataset that has a very weird GC content profile (see image). Is this indicative of rRNA contamination? Is SortMeRNA adequate to remove rRNA contamination from these samples? FastQC also reports overrepresented sequences (up to 0.4%) that BLAST to mtDNA. Could this be driving the GC content? Your help is appreciated! Thanks!

[Image not preserved: GC content plot]


Note that the theoretical GC content you are plotting is that of the genome, and for mouse. You can use the theoretical GC content customization in MultiQC to plot the GC content of the human transcripts instead.

Comment written 6 months ago by Michael Love
Devon Ryan (Freiburg, Germany) wrote (6 months ago):

That just looks like there are a few highly expressed genes, which is nothing to worry about. There's no reason to bother removing rRNA for a regular DE analysis; that just wastes CPU time. RNA-seq will often have an unusual GC profile, since it'll have peaks for highly expressed genes.
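
To make the point about highly expressed genes concrete, here is a minimal Python sketch of roughly what a per-read GC content histogram measures. It is not FastQC's implementation; the function names and the toy reads are made up for illustration.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of one read, ignoring N bases."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads):
    """Count reads per rounded GC percentage (0 to 100)."""
    return Counter(round(gc_percent(read)) for read in reads)


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (path is yours to supply)."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip()


# Toy data (invented): a mixed background plus many copies of one GC-rich fragment,
# standing in for reads from a single very abundant transcript.
background = ["ACGTACGTAC", "AATTAATTGC", "GGCCAATTAA", "ATATATGCGC"]
abundant = ["GGCGCCGGCA"] * 6
hist = gc_histogram(background + abundant)

for gc in sorted(hist):
    print(f"{gc:3d}% GC: {hist[gc]} reads")
```

In the toy data the six copies of one GC-rich fragment pile up in a single bin (90% GC), which is the kind of spike a few very abundant transcripts can produce on top of the broader background.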


Adding to this, if you simply make sure that rRNA genes are not in the reference file that you use to make the count matrix, then they are excluded anyway and will not affect the analysis. FastQC is a good tool but very low-level. Be sure to normalize your data with a proper method such as TMM from edgeR or RLE from DESeq2, and then use PCA to see whether replicates cluster well and to identify potential outliers or batch effects. MA-plots are a good way to check whether normalization performs reasonably well. DESeq2 has a plotPCA function that is easy to use if you are new to this field. Its manual is worth reading, and the same goes for the edgeR manual.
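
As a rough illustration of the two ideas above (leaving rRNA genes out of the count matrix, then normalizing), here is a small pure-Python sketch of the median-of-ratios idea behind size-factor normalization. It is not the DESeq2 or edgeR code, and the gene IDs and counts are invented; for real data use those packages.

```python
import math
from statistics import median


def drop_genes(counts, unwanted):
    """Return a copy of the count table without the unwanted gene IDs."""
    return {gene: row for gene, row in counts.items() if gene not in unwanted}


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column)."""
    n_samples = len(next(iter(counts.values())))
    # Use only genes with a non-zero count in every sample, so the log is defined.
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    log_ratios = [[] for _ in range(n_samples)]
    for row in usable:
        # Log of the gene's geometric mean across samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical count table: gene ID -> counts in sample 1 and sample 2.
counts = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [10, 20],
    "rRNA_1": [9000, 100],  # made-up rRNA entry, removed before normalization
}

filtered = drop_genes(counts, {"rRNA_1"})
factors = size_factors(filtered)
normalized = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in filtered.items()
}
print(factors)
```

Here sample 2 was sequenced twice as deeply as sample 1, so after dividing by the size factors the remaining genes have equal normalized counts in both samples.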

Comment written 6 months ago by ATpoint
Antonio R. Franco (Universidad de Córdoba, Spain) wrote (6 months ago):

I got a similar GC content graphic when analyzing the transcriptome of olive roots infected with a fungus.

It turned out that after 7 and 15 days a population of new, opportunistic microorganisms had developed over the course of the infection and changed the GC profile from a single peak to a composite curve like yours.

What I am trying to say is that GC curves can also be compatible with the presence of contaminants.

To answer this, you need to know what percentage of the total reads map to your human sequences. If it is high, that is compatible with a set of very highly expressed genes. If that percentage drops substantially, it may indicate contaminant reads.
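
One way to get that percentage, sketched under the assumption that you have the alignments as SAM text: count the primary records and check the unmapped flag (0x4). The function name and the toy records are hypothetical; in practice the aligner's own summary or a tool such as samtools reports the same number.

```python
def mapped_fraction(sam_lines):
    """Fraction of primary alignment records that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        flag = int(line.split("\t")[1])
        # Skip secondary (0x100) and supplementary (0x800) records,
        # so that each read is counted once.
        if flag & 0x900:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Toy SAM records (invented): two mapped reads, two unmapped reads,
# and one secondary alignment that is ignored.
sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr2\t500\t60\t10M\t*\t0\t0\tGGCCAATTAA\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tATATATGCGC\tIIIIIIIIII",
    "r1\t256\tchr5\t900\t0\t10M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tGGCGCCGGCA\tIIIIIIIIII",
]

rate = mapped_fraction(sam)
print(f"{100 * rate:.1f}% of reads mapped")
```

For paired-end data each mate is its own record, so this counts reads rather than fragments.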
