Question: How to interpret Per sequence GC content module in FastQC for RNA-seq data?
2
gravatar for Arindam Ghosh
13 months ago by
Arindam Ghosh300
India
Arindam Ghosh300 wrote:

The Per sequence GC content graph gives an idea of contamination in present. How do we identify the type of contaminant? The shift in the curve towards left or right might give some idea, but how to interpret graph with multi peaks. I trimmed to remove the adopters and low quality bases but the peaks did not change.

enter image description here

Edit (21Aug19):: Is multiple peaks linked to samples with rRNA depletion method. I suspect this because I have both poly-A selected and rRNA depleted samples and multiple peaks occurred only in rRNA depleted samples. I further performed the downstream analysis and through Picard I observed higher intronic and intergenic bases and almost no rRNA aligned bases. Is this due to the presence of ncRNA besides mRNA?

fastqc rna-seq gc-content qc • 2.3k views
ADD COMMENTlink modified 11 months ago • written 13 months ago by Arindam Ghosh300
2

If you have rRNA contamination in your data you may see a shifted GC content peak. So that would be first thing to check.

There may be rare instances where having a dual GC peak (e.g. bacterial symbionts in Drosophila) may be expected.

ADD REPLYlink modified 13 months ago • written 13 months ago by genomax87k

My sample is from Human cell.

For rRNA contamination I believe it will be a smooth curve just shifted to one side. If you see the attached image, it seems to me a bit unusual, though not sure.

I did check with FastQ Screen for rRNA contamination but none detected. The histogram from the tool has three major bars, human, mouse and rat. Human is the tallest as expected.

ADD REPLYlink modified 13 months ago • written 13 months ago by Arindam Ghosh300
2

If you feel you have eliminated obvious contaminants then I suggest you go ahead and start with alignments/other analysis. If you detect a problem with those you can come back and investigate this further.

ADD REPLYlink written 13 months ago by genomax87k

do you have over-represented sequences ( fastqc should report you that ) ?

ADD REPLYlink written 11 months ago by Nicolas Rosewick9.0k

Yes. After trimming, few of the sequences that were reported as primer/adapter were removed and the once that remained have no known hit. I randomly selected few sequences and performed blast. For most there was no significant hit against human genomic + transcript db. While against all nucleotide I did observe BAC and few other species.

ADD REPLYlink modified 11 months ago • written 11 months ago by Arindam Ghosh300

I am curious to know if there's a method for understanding if rRNA contamination (or other sources that could drive a shift in the GC content distribution) has an effect on differential gene expression. I am also analyzing samples with abnormal GC distribution (for which overrepresented sequences included ribosomal proteins, specifically), but I don't really understand what else I can do...

ADD REPLYlink written 9 months ago by rodd100
1

It might if you have very uneven rRNA contamination across your samples. What fraction of your data do you estimate to be rRNA?

ADD REPLYlink written 9 months ago by genomax87k

Thanks, genomax! How can I estimate this? My alignment rate is between 75-80% for all samples, and the GC content distribution (after adapter trimming or not) looks like this (multiqc summary of fastqc analysis of all samples). Overrepresented sequences blast to ribosomal genes.

ADD REPLYlink modified 9 months ago • written 9 months ago by rodd100

Oh, 3 "overrepresented sequences" detected in FASTQC (blasting to the same ribosomal protein) each account for around 12% of the sequences in one of my samples... eeek Is this bad? All other samples look similar.

ADD REPLYlink modified 9 months ago • written 9 months ago by rodd100
1

If all samples have a similar rRNA content then it is possible that ribodepletion or mRNA enrichment did not work great. Not much you can do at this stage about that part.

You may be able to a) either remove the rRNA reads by binning bbsplit.sh from BBMap suite or SortMeRNA) or b) in RNAseq counting rRNA reads would not be counted.

ADD REPLYlink modified 9 months ago • written 9 months ago by genomax87k

Thanks for your suggestions, genomax, they're super helpful! I used SortMeRNA to remove rRNA from the reads in one sample, but the GC content distribution seen in FASTQC is pretty identical to the one observed prior to removing the rRNA. I would like to try option b), but I don't understand 100% what you mean. I am using kallisto for gene expression quantitation - does that mean that I should remove rRNA sequences from my reference index file (from kallisto)? Or should I just not call any ribosomal genes in R when I am importing my counts obtained with kallisto (using the regular reference file)? Thanks, and sorry for the basic Qs.

ADD REPLYlink modified 9 months ago • written 9 months ago by rodd100

I used SortMeRNA to remove rRNA from the reads in one sample, but the GC content distribution seen in FASTQC is pretty identical to the one observed prior to removing the rRNA

That GC distribution looks pretty extreme. Is this a metagenomic sample where one would expect to see GC distribution like this since multiple genome species would be present. What kind of data is this and is something different being done with the samples during prep?

ADD REPLYlink written 9 months ago by genomax87k

Hi genomax, thanks once again for your input! No, this is RNA-seq data from Poly-A enriched total RNA extracted from a human cell line (NEBNext Ultra Directional RNA Library Prep Kit for Illumina), so I am a bit surprised with the high GC peak I see in every sample. The technicians who constructed the library mentioned that all samples were treated equally, according to their protocols, but maybe one/some of their reagents had gone off? What do you think is the best course of action? This post recommends not deleting rRNA as this could bias the differential gene expression analysis.

ADD REPLYlink modified 9 months ago • written 9 months ago by rodd100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1500 users visited in the last hour