I am running an RNA-seq analysis on raw reads from Solanum lycopersicum (tomato). I am aligning the raw reads of 24 samples to the reference genome obtained from Ensembl using STAR. Most samples have mapping rates above 90%, and all are above 86% except for three samples with mapping rates of 25.7%, 7.5%, and 11.3%. In those samples the unmapped reads are reported as "too short", which, based on my research, seems to be related to rRNA contamination.
This is peculiar because, as far as I know, samples with rRNA contamination typically exhibit more than one peak in the Per Sequence GC content plot. However, my samples only show a single peak and pass this test!
Regardless, my primary goal is to conduct a differential expression analysis. It's not possible for me to redo sequencing. I am uncertain whether I can exclude unmapped reads from the BAM file and proceed with the analysis for these three samples, or if I should omit them from the analysis altogether.
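To make concrete what "excluding unmapped reads" would mean: in SAM/BAM, an unmapped read is a record with flag bit 0x4 set, so the filter amounts to dropping those records (the same thing `samtools view -F 4` does). Below is a minimal sketch on SAM text lines. The function name `drop_unmapped` and the example records are made up for illustration, not taken from my data.

```python
UNMAPPED_FLAG = 0x4  # SAM specification: bit 0x4 means the segment is unmapped


def drop_unmapped(sam_lines):
    """Yield header lines and alignment records whose 0x4 flag bit is not set."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # FLAG is the second column of a SAM record
        if not flag & UNMAPPED_FLAG:
            yield line


# Hypothetical records: one mapped read (flag 0) and one unmapped read (flag 4).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\t1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
kept = list(drop_unmapped(example))
print(len(kept))
```

As far as I understand, STAR only writes unmapped reads into the BAM when `--outSAMunmapped Within` is set, so with default settings there may be nothing to filter in the first place.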
The samples with low mapping rates can be discarded as they are likely to be contaminated.
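Before dropping samples, it can help to tabulate the relevant numbers from each STAR `Log.final.out`. The sketch below parses the text of one such log and flags a sample as low-mapping. The field names are the ones STAR writes to that file; the 50% cutoff, the function names, and the example log values are assumptions chosen for illustration only.

```python
def parse_star_log(text):
    """Turn the 'key | value' lines of a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings such as 'UNIQUE READS:' have no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value like '25.70%' to the float 25.7."""
    return float(stats[key].rstrip("%"))


def summarize(text, min_unique=50.0):
    """Report unique-mapping and 'too short' percentages for one sample.

    min_unique is an arbitrary illustrative cutoff, not a recommended value.
    """
    stats = parse_star_log(text)
    unique = percent(stats, "Uniquely mapped reads %")
    too_short = percent(stats, "% of reads unmapped: too short")
    return {
        "unique_pct": unique,
        "too_short_pct": too_short,
        "low_mapping": unique < min_unique,
    }


# Invented log excerpt for illustration; the numbers are not from a real run.
example_log = """\
                          Number of input reads |\t20000000
                                  UNIQUE READS:
                        Uniquely mapped reads % |\t25.70%
             % of reads mapped to multiple loci |\t3.10%
                 % of reads unmapped: too short |\t70.10%
"""
result = summarize(example_log)
print(result)
```

Running this over all 24 logs gives a quick table showing whether the three outliers differ from the rest only in the "too short" fraction or in other categories as well.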