I am new to RNA-seq data analysis and need some help.
I sampled 1,000,000 reads from a mRNA RNA-seq experiment of yeast (SRR1177156). The library is a single-end, forward-stranded one, with 50bp reads, sequencing the reference strain S288C at standard growth conditions. I then created a reference genome index using STAR, with the latest genome sequence and annotation from GDB. I mapped the reads to the genome, using STAR again and then created a QA report using Qualimap (attached).
I was quite surprised to see that only ~53% of the reads were reliably mapped. As you can see in the attached report, I have lots of secondary and non-unique alignments. I looked at some example data sets provided by Qualimap, and it seems that mapping proportion is usually close to 100%. Can you help me think of possible reasons for why I'm getting such low values?
After a bit of looking around I note that the data you are looking is 50bp read length, single-end.
If you look at the paper (I hope this is the main publication), the authors themselves say that they expect only about ~30bp of the mRNA to have been protected by the ribosome. Further in the Supple (p.5), authors say that beyond the first 21bases, the read stretch could be from homopolymer tail.
I have not delved further into specifics. But the gist I suppose is that this dataset probably has reads that have information content only upto 20-25 bases. In this situation it would be difficult for any aligner to find unique 'hits' to the yeast genome, because of the small query size. On top of that, since this is RNA, there would be many reads that wouldn't be aligning in one single stretch on genome, but would be split across introns. As to start with, you probably have only 20-25 bases of good quality length, it would be quite tricky for STAR to find 'unique' hits for reads that span splice-junction, as STAR tries to break the read (which for 20-25bp valid len. would be still smaller) and look for unique hit onto genome and from there use the transcriptome info. (GTF provided) to bridge over the intron. This is the reason probably why you see only ~53% of reads reliably mapped by STAR.
Is that what you are calling uniquely mapped reads? What was the total % of reads that mapped? It was probably over 90%. If you have a lot of secondary and non-unique alignments that is a characteristic of the dataset you have. Nothing you can do about it.
Thanks for the answer. This is the Qualimap "Number of mapped reads", which according to the manual is "total number of mapped reads (left/right in case of paired-end reads, secondary alignments are ignored)". I actually have 0 unmapped (SAM flag 4) reads. Can you explain a bit more about this "haracteristic of the dataset" and what could cause it?