I am analyzing differential gene expression for a plant based on 3'cDNA tag libraries generated from total RNA. I used SortMeRNA to remove rRNA followed by using trimmomatic that removed the multiplex adapter as well as quality filtered the reads. After the filtering and trimming, I checked my read quality with FASTQC (see at bottom of post). I am getting multiple overrepresented sequences that all contain the same 46bp:
I ran a BLASTn of this sequence on both NCBI nr/nt database and the refseq_rna database. When not specifying my plant, the results are found in the chloroplast genome of other plants. I ran against my specific plant genome and the only hit is for 26bp that is for the 10B subunit of the mediator of RNA polymerase (92% ident, E = 0.13).
My assumption so far is that this is a contaminant and due being present in ~25% of my reads it is causing the skew in the GC content as well as per base sequence content results of the FASTQC report.
So my questions are:
1) Should I remove this sequence and if so, how?
2) What other reasons could explain the per base sequence content issues since the adapters were removed?
3) Since this data is RNA seq looking at differential gene expression and I will be mapping against the transcriptome via Salmon - does it matter about the FASTQC flags I am seeing in this report?
*Added note (10.24.19) - These are unsterilized plant leaf samples that were used for this experiment so other euk RNA may be present - such as fungi - though very low to null in comparison. Additionally the read depth was not high enough to really capture the fungal RNA. Also, since this is a RNAseq analysis, we expect to have high duplication levels for some transcripts.