When should one be concerned about the GC content in an RNA-seq experiment?
11 months ago
dsastre • 0

I'm teaching myself how to perform differential expression analysis on RNA-seq data from GEO (human cell lines). I selected a dataset that has a very unusual GC content profile (see image above). Is this indicative of rRNA contamination? Is SortMeRNA adequate to remove rRNA contamination from these samples? FastQC also reports overrepresented sequences (up to 0.4%) that BLAST to mtDNA. Could these be driving the GC content? Your help is appreciated! Thanks!

RNA-Seq • 456 views

Note that the theoretical GC content you are plotting is for the genome, and for mouse at that. You can use the theoretical GC content customization in MultiQC to plot the GC content of the human transcripts instead:

https://multiqc.info/docs/#theoretical-gc-content

11 months ago

That just looks like there are a few highly expressed genes, nothing to worry about. There's no reason to bother removing rRNA for a regular DE analysis; that just wastes CPU time. RNA-seq will often have an unusual GC profile, since it'll have peaks for highly expressed genes.
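To make the "peaks for highly expressed genes" point concrete, here is a minimal Python sketch of how a per-read GC histogram is built. The function names and the toy reads are made up for illustration, and this is not how FastQC itself is implemented; it only shows that many reads from one abundant transcript pile up at that transcript's GC percentage.

```python
from collections import Counter


def gc_percent(seq):
    """GC percentage of one sequence, ignoring N and other ambiguous bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0


def gc_histogram(reads):
    """Number of reads at each rounded GC percentage (0 to 100)."""
    return Counter(round(gc_percent(read)) for read in reads)


# Toy data (hypothetical): four background reads plus six reads
# that all come from one highly expressed, GC-rich transcript.
background = ["ACGTACGTAT", "AACGTTGCAT", "ATGCATGCAA", "ACGTTGCAAT"]
abundant = ["GCGCGCGGCA"] * 6

hist = gc_histogram(background + abundant)
for gc, n_reads in sorted(hist.items()):
    print(f"{gc:3d}% GC: {n_reads} reads")
```

In this toy case the abundant transcript creates a second peak that is larger than the background one, which is the same kind of shape you can get from a handful of dominant genes in real data.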


Adding to this: if you simply make sure that rRNA genes are not in the reference file that you use to make the count matrix, then rRNA reads will not be counted and will not affect the analysis. FastQC is a good tool but very low-level. Be sure to normalize your data with a proper method such as TMM from edgeR or RLE from DESeq2, and then use PCA to see whether replicates cluster well and to identify potential outliers or batch effects. MA-plots are a good way to check whether normalization performs reasonably well. DESeq2 has a plotPCA function that is easy to use if you are new to this field. Its manual is worth reading, and the same goes for the edgeR manual.
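If it helps to see what the RLE idea does, below is a simplified Python sketch of median-of-ratios size factors. This is an illustration written for this thread, not DESeq2's actual code, and the function name and toy counts are made up; for a real analysis use DESeq2 or edgeR directly.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of gene rows, each row a list of raw counts per sample.
    Genes with a zero in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")

    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - lgm for row, lgm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy counts (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],  # skipped: contains a zero
]
sf = size_factors(toy_counts)
print(sf)
```

Because the factor is a median over genes, a few extremely highly expressed genes (or leftover rRNA) have little influence on it, which is part of why those reads are not a big concern for DE.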

11 months ago

I got a similar GC content graphic when analyzing the transcriptome of olive roots infected with a fungus.

It turned out that after 7 and 15 days, populations of new, opportunistic microorganisms had developed along with the infection and changed the GC profile from a single peak to a composite curve like yours.

What I am trying to say is that GC curves can also be compatible with the presence of contaminants.

To answer this, you need to know what percentage of the total reads map to your human sequences. If it is high, the profile is compatible with the presence of a set of very highly expressed genes. If that percentage drops substantially, that is an indication of contaminant reads.
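Your aligner's summary log normally reports this percentage already. If you only have the alignments, a rough Python sketch like the one below computes it from SAM-format lines using the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name and the toy records are made up, and what counts as "high" is a judgment call for your library type.

```python
def mapping_rate(sam_lines):
    """Fraction of primary alignment records that are mapped.

    Header lines start with '@' and are skipped. Secondary (0x100) and
    supplementary (0x800) records are skipped so that each read (or each
    mate, for paired-end data) is counted once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Toy SAM records (hypothetical), truncated to the first few columns.
toy_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100",
    "read2\t4\t*\t0",
    "read3\t16\tchr2\t500",
    "read3\t256\tchr5\t900",  # secondary alignment, not counted
    "read4\t0\tchrM\t42",
]
rate = mapping_rate(toy_sam)
print(f"{100 * rate:.1f}% of reads mapped")
```

For a real BAM file you would stream the records through something like `samtools view` rather than loading a list into memory.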