I'm hoping to get some help resolving issues of apparent rRNA contamination in RNA-seq data, which isn't something I've dealt with in my limited previous experience. I have sequence data from tissues of a non-model insect, prepared using RiboZero which had previously worked well. The species is not well annotated, though I do have access to a draft genome fasta. My intention is to perform a differential expression analysis.
FastQC showed fastq files (post-trimming for adapter sequences) had additional peaks outside the expected range. To check for cross-species contamination I tried aligning the reads to the draft genome and found that the alignment rates were quite good, but the resulting SAM files showed the same additional peaks. Because this seemed to suggest rRNA contamination, I removed known eukaryotic rRNA sequences (from silva and rfam databases) using SortMeRNA. This produced better-looking GC content plots but there were still large additional peaks, positions of which seemed consistent with the sequences FASTQC identified as overrepresented. These sequences are also very consistent across the samples. I removed the 'overrepresented' sequences using Trimmomatic and the GC content plots look better, but still look a little off.
Representative (R2) GC plots for one of the samples
Before rRNA trimming:
After rRNA trimming:
After removal of overrepresented sequences:
I presume these additional overrepresented sequences are portions of rRNAs that aren't included in the database of eukaryotic rRNA: they're present in the draft genome assembly, and FastQC doesn't identify any known hits. They're also all 50 bases, which I'm guessing is the limit of the length of overrepresented sequences FASTQC will report.
I'm hoping for advice on what to do next. While the steps mentioned above improved the appearance of the GC plots, I'm concerned that the small remaining peaks could still bias normalisation of counts and subsequent DE analysis (though the remaining peaks actually look very consistent across samples). If these overrepresented sequences are rRNAs then I hope someone might be able to suggest a better or more complete way to identify and/or remove them (additional to the apparent 50bp limit, from what I remember FastQC bases identification of overrepresented sequences on just a portion of the total number of sequences).
Any help or advice will be much appreciated.
I hope you are not using this as input for the differential gene expression analysis: you should trim adapters and remove ribosomal RNA ( and maybe bacterial contaminants), but nothing else. You will be introducing biases onto your DGE analysis if you remove over-represented sequences.
I will extend genomax comment: insects are frequently associated with particular bacteria, and often times, the number of bacterial cells is really high. You used RiboZero, which removes ribosomal RNA and leaves all remaining RNAs - both from your insect and its associated bacteria. Wolbachia is a prime example: it is intracellular, with generally very high cell number in its host ovaries, but generally found in all tissues.
If you are mapping to the host genome, these bacterial genes will have almost no impact on your analysis, as they won't map. If the ribosomal genes are annotated on the reference genome, you could filter them out before proceeding with differential expression. However, if bacterial / rRNA contamination is too strong, they may reduce too much other genes counts, thus reducing the "signal" you want to study.
Thanks for the response. As mentioned above the additional peaks do align to the draft genome. I suppose it's possible they are bacterial sequences present in the draft genome, but I don't know how likely that is. To me, the fact filtering rRNAs improved the GC content plots seems to imply the leftover peaks (at the same locations along the X-axis as the peaks associated with rRNA) are also rRNAs that don't align to the known sequences in existing databases, but perhaps I'm mistaken.
My understanding is that the effectiveness of RiboZero differs between species, and is known to be affected by input concentrations (these tissues are small, so the input was relatively low). My interpretation was therefore that in this case it hadn't worked well due to low input and/or distant relatedness of the study species to the organisms for which it was designed.
I understand that removing overrepresented sequences is a touchy subject, but I don't see how it would strongly bias my DE analysis - I am removing the same set of overrepresented sequences (which are consistent between samples) from all samples. However I accept that this probably isn't the best approach, which is why I'm looking for alternatives.
It isn't a touchy subject, you just shouldn't do it if you want to perform differential expression analysis. You don't want to alter gene abundances if you want to analyze differences in gene abundances. Highly expressed genes will be considered as "over-represented sequences", you will be down-sampling them, possibly to the same abundance in all samples.
Mapping to the reference genome and TMM normalization will take care of the bias you are concerned.
I appreciate the advice but don't see how the situation is resolved by ignoring the fact that a relatively large proportion of the sequences for each sample appear to be non-mRNAs as an artefact of library preparation. I would think adjusting counts by library size would be more heavily biased by any differences in the representation of non-mRNAs than it would by removing the same set of sequences - which strongly appear to be rRNA - from all samples.
In previously sequenced samples, which show GC content plots that do not hint at the presence of additional non-mRNAs, nor any cross-species contamination (i.e. are consistently normally distributed), no sequences are identified by FastQC as overrepresented (i.e. no sequences account for more than 0.1% of the total sequences, let alone >1% as in the first 2 plots above).
As mentioned in the original question, I don't see removing overrepresented sequences (as identified by FastQC) as an ideal solution, and am looking for better alternatives, however looking at the second GC content plot in the original post, I can't imagine it would be advisable to perform DE analysis without some additional filtering.
So the 85% alignment rate refers to cleaned sequences? What is the original % alignment, if you only removed the adapter sequence?
Is that statement based on alignment to a related rDNA repeat (assuming that you don't know/have the actual rDNA repeat for this organism)?
You have a reference genome (and possibly annotation?). Then:
if these over-represented sequences do not map to the reference genome, they will have no influence on the downstream gene / transcript expression estimation and DGE analysis.
if these over-represented sequences map to to the reference genome, it means it is being expressed, and removing them without knowing what they are seems to me a bad idea. Even if they map, they will only have an influence on DGE if they are counted - that is, if they map to an annotated feature.
Did you check the GC content of the mapped reads?
You performed RNAseq and FastQC showd no over-represented sequences? I consider this as strange.
(in response to both above comments)
I'll run the genome alignment pre-filtering now and update.
The reason I say the additional overrepresented sequences appear to be rRNA is that they are associated with peaks at very similar positions with respect to the x-axis as those that were identified as known rRNAs.
I only have access to a draft genome fasta, I don't have an annotation, which I think is ongoing (this is not my work). I don't actually intend to align to the genome for my analysis, at least for now, because its development is ongoing - I am just using it for QC purposes. My initial plan was to produce a de novo transcriptome from the sequenced reads, which is why I'm concerned by the presence of any additional non-mRNAs.
If all else fails, a plan B is to produce a de novo transcriptome using the previously sequenced data for similar tissues mentioned above, which don't show the same evidence of contamination, and align the reads to this. This is not a perfect solution though as those data are only for males, whereas the new data include both sexes.
Have you considered the possibility that you have:
Have you tried to align the data (irrespective of what FastQC says) and checked what the alignment % look like?
The alignment rate to the draft genome is as good as I could have expected (~85%), and the aligned SAM files show the same additional GC content peaks. The overrepresented sequences don't have any significant hits when BLASTed against the nr database, but do align to the draft genome, which I have taken as indicating they're rRNA (or else some other sequences represented in the genome) rather than contamination of a second species.
Intuitively, I had also thought contamination of a second species would produce a bimodal GC content graph, rather than the very messy multiple additional peaks in the plot above (I'm not sure what would be expected in the case of rRNA).
If there is a related species (where the rDNA repeat seq is known) then you could try and align them against each other to see if there are any similarities. Since you had good alignment you may be able to move forward with your DE analysis (as long as all samples show this pattern).
Please follow the directions here and edit the original post: How to add images to a Biostars post