The Per sequence GC content graph gives an idea of contamination in present. How do we identify the type of contaminant? The shift in the curve towards left or right might give some idea, but how to interpret graph with multi peaks. I trimmed to remove the adopters and low quality bases but the peaks did not change.
Edit (21Aug19):: Is multiple peaks linked to samples with rRNA depletion method. I suspect this because I have both poly-A selected and rRNA depleted samples and multiple peaks occurred only in rRNA depleted samples. I further performed the downstream analysis and through Picard I observed higher intronic and intergenic bases and almost no rRNA aligned bases. Is this due to the presence of ncRNA besides mRNA?
If you have rRNA contamination in your data you may see a shifted GC content peak. So that would be first thing to check.
There may be rare instances where having a dual GC peak (e.g. bacterial symbionts in Drosophila) may be expected.
My sample is from Human cell.
For rRNA contamination I believe it will be a smooth curve just shifted to one side. If you see the attached image, it seems to me a bit unusual, though not sure.
I did check with FastQ Screen for rRNA contamination but none detected. The histogram from the tool has three major bars, human, mouse and rat. Human is the tallest as expected.
If you feel you have eliminated obvious contaminants then I suggest you go ahead and start with alignments/other analysis. If you detect a problem with those you can come back and investigate this further.
do you have over-represented sequences ( fastqc should report you that ) ?
Yes. After trimming, few of the sequences that were reported as primer/adapter were removed and the once that remained have no known hit. I randomly selected few sequences and performed blast. For most there was no significant hit against human genomic + transcript db. While against all nucleotide I did observe BAC and few other species.
I am curious to know if there's a method for understanding if rRNA contamination (or other sources that could drive a shift in the GC content distribution) has an effect on differential gene expression. I am also analyzing samples with abnormal GC distribution (for which overrepresented sequences included ribosomal proteins, specifically), but I don't really understand what else I can do...
It might if you have very uneven rRNA contamination across your samples. What fraction of your data do you estimate to be rRNA?
Thanks, genomax! How can I estimate this? My alignment rate is between 75-80% for all samples, and the GC content distribution (after adapter trimming or not) looks like this (multiqc summary of fastqc analysis of all samples). Overrepresented sequences blast to ribosomal genes.
Oh, 3 "overrepresented sequences" detected in FASTQC (blasting to the same ribosomal protein) each account for around 12% of the sequences in one of my samples... eeek Is this bad? All other samples look similar.
If all samples have a similar rRNA content then it is possible that ribodepletion or mRNA enrichment did not work great. Not much you can do at this stage about that part.
You may be able to a) either remove the rRNA reads by binning bbsplit.sh from BBMap suite or SortMeRNA) or b) in RNAseq counting rRNA reads would not be counted.
Thanks for your suggestions, genomax, they're super helpful! I used SortMeRNA to remove rRNA from the reads in one sample, but the GC content distribution seen in FASTQC is pretty identical to the one observed prior to removing the rRNA. I would like to try option b), but I don't understand 100% what you mean. I am using kallisto for gene expression quantitation - does that mean that I should remove rRNA sequences from my reference index file (from kallisto)? Or should I just not call any ribosomal genes in R when I am importing my counts obtained with kallisto (using the regular reference file)? Thanks, and sorry for the basic Qs.
That GC distribution looks pretty extreme. Is this a metagenomic sample where one would expect to see GC distribution like this since multiple genome species would be present. What kind of data is this and is something different being done with the samples during prep?
Hi genomax, thanks once again for your input! No, this is RNA-seq data from Poly-A enriched total RNA extracted from a human cell line (NEBNext Ultra Directional RNA Library Prep Kit for Illumina), so I am a bit surprised with the high GC peak I see in every sample. The technicians who constructed the library mentioned that all samples were treated equally, according to their protocols, but maybe one/some of their reagents had gone off? What do you think is the best course of action? This post recommends not deleting rRNA as this could bias the differential gene expression analysis.