I'm hoping to get some help resolving issues of apparent rRNA contamination in RNA-seq data, which isn't something I've dealt with in my limited previous experience. I have sequence data from tissues of a non-model insect, prepared using RiboZero which had previously worked well. The species is not well annotated, though I do have access to a draft genome fasta. My intention is to perform a differential expression analysis.
FastQC showed fastq files (post-trimming for adapter sequences) had additional peaks outside the expected range. To check for cross-species contamination I tried aligning the reads to the draft genome and found that the alignment rates were quite good, but the resulting SAM files showed the same additional peaks. Because this seemed to suggest rRNA contamination, I removed known eukaryotic rRNA sequences (from silva and rfam databases) using SortMeRNA. This produced better-looking GC content plots but there were still large additional peaks, positions of which seemed consistent with the sequences FASTQC identified as overrepresented. These sequences are also very consistent across the samples. I removed the 'overrepresented' sequences using Trimmomatic and the GC content plots look better, but still look a little off.
Representative (R2) GC plots for one of the samples
Before rRNA trimming:
After rRNA trimming:
After removal of overrepresented sequences:
I presume these additional overrepresented sequences are portions of rRNAs that aren't included in the database of eukaryotic rRNA: they're present in the draft genome assembly, and FastQC doesn't identify any known hits. They're also all 50 bases, which I'm guessing is the limit of the length of overrepresented sequences FASTQC will report.
I'm hoping for advice on what to do next. While the steps mentioned above improved the appearance of the GC plots, I'm concerned that the small remaining peaks could still bias normalisation of counts and subsequent DE analysis (though the remaining peaks actually look very consistent across samples). If these overrepresented sequences are rRNAs then I hope someone might be able to suggest a better or more complete way to identify and/or remove them (additional to the apparent 50bp limit, from what I remember FastQC bases identification of overrepresented sequences on just a portion of the total number of sequences).
Any help or advice will be much appreciated.