Something has been troubling me about a public dataset that I am currently working on. The sequencing quality is not great, and many reads contain runs of As and Ts. What was strange is that when I switched from the older 2011 rice genome reference provided by the Rice Genome Annotation Project to the Ensembl top-level genome FASTA, release 48 (2020), I expected more reads to map uniquely because of improvements in the genome assembly. That did not happen. Instead, the percentage of uniquely mapping reads consistently decreased, by 0.05-1% in most samples and by up to 12-18% in a few. I ran the alignments with the ENCODE STAR parameters and also with STAR's default settings, but the percentages stayed about the same.
In addition, the proportion of multi-mapping and unmapped reads was high, ranging from 20% to 75%. I suspected that the samples with many multi-mapping reads contained novel transcripts, so I ran STAR in 2-pass mode. However, this decreased the uniquely mapping reads even further, and chimeric reads came to barely 2%.
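To make the comparison between the two references concrete, here is a minimal sketch of how the per-sample percentages could be pulled out of STAR's `Log.final.out` files and subtracted. It assumes the usual `metric name | value` layout of that file; the function names `parse_star_log` and `compare_logs` are hypothetical helpers, and the numbers in the demo are made up to show the layout, not taken from this dataset.

```python
def parse_star_log(text):
    """Return {metric name: percentage as float} for each '%' line of a STAR Log.final.out."""
    metrics = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        key, value = (part.strip() for part in line.split("|", 1))
        if value.endswith("%"):
            metrics[key] = float(value.rstrip("%"))
    return metrics


def compare_logs(old_text, new_text):
    """Return {metric: new minus old} for metrics present in both logs."""
    old, new = parse_star_log(old_text), parse_star_log(new_text)
    return {key: round(new[key] - old[key], 2) for key in old if key in new}


# Made-up values purely to illustrate the file layout (assumption, not real results).
old_log = "Uniquely mapped reads % |\t61.20%\n% of reads mapped to multiple loci |\t25.30%\n"
new_log = "Uniquely mapped reads % |\t60.70%\n% of reads mapped to multiple loci |\t25.90%\n"

for metric, delta in compare_logs(old_log, new_log).items():
    print(f"{metric}: {delta:+.2f} percentage points")
```

In practice the two strings would be read from the `Log.final.out` written for the same sample against each reference, so that a drop in unique mapping can be checked against a matching rise in the multi-mapping or unmapped categories.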
I also checked the unmapped reads and found that most of them have long poly-A and poly-T tails. I can't work out whether there is a way to rescue the data.
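To put a number on "most of them", a rough sketch like the one below could count the fraction of unmapped reads that contain a long A or T homopolymer run. It assumes the unmapped reads are available as plain four-line FASTQ records (for example, what STAR writes with `--outReadsUnmapped Fastx`); the 20-base threshold and the names `has_poly_at` and `poly_at_fraction` are arbitrary choices for illustration.

```python
import re


def has_poly_at(seq, min_run=20):
    """True if seq contains a run of at least min_run consecutive As or Ts."""
    pattern = r"A{%d,}|T{%d,}" % (min_run, min_run)
    return re.search(pattern, seq.upper()) is not None


def poly_at_fraction(fastq_lines, min_run=20):
    """Fraction of reads with a long poly-A/T run, assuming 4-line FASTQ records."""
    total = hits = 0
    for index, line in enumerate(fastq_lines):
        if index % 4 == 1:  # the second line of each record is the sequence
            total += 1
            hits += has_poly_at(line.strip(), min_run)
    return hits / total if total else 0.0


# Tiny hand-written example records (hypothetical reads, not from the dataset).
example = [
    "@read1", "ACGT" + "A" * 25, "+", "I" * 29,
    "@read2", "ACGTACGTACGT", "+", "I" * 12,
]
print(poly_at_fraction(example))
```

For a real file, `fastq_lines` could simply be an open file handle, e.g. `with open("Unmapped.out.mate1") as fh: poly_at_fraction(fh)`, where the file name is the one STAR uses by default for unmapped mate 1 reads.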