Low percent of RNA-seq reads mapping
1
0
Entering edit mode
3.2 years ago
liorglic ★ 1.2k

Hello,
I am new to RNA-seq data analysis and need some help. I sampled 1,000,000 reads from a mRNA RNA-seq experiment of yeast (SRR1177156). The library is a single-end, forward-stranded one, with 50bp reads, sequencing the reference strain S288C at standard growth conditions. I then created a reference genome index using STAR, with the latest genome sequence and annotation from GDB. I mapped the reads to the genome, using STAR again and then created a QA report using Qualimap (attached).
I was quite surprised to see that only ~53% of the reads were reliably mapped. As you can see in the attached report, I have lots of secondary and non-unique alignments. I looked at some example data sets provided by Qualimap, and it seems that mapping proportion is usually close to 100%. Can you help me think of possible reasons for why I'm getting such low values?

Thanks a lot!

RNA-Seq STAR mapping • 1.7k views
0
Entering edit mode

~53% of the reads were reliably mapped

Is that what you are calling uniquely mapped reads? What was the total % of reads that mapped? It was probably over 90%. If you have a lot of secondary and non-unique alignments that is a characteristic of the dataset you have. Nothing you can do about it.

0
Entering edit mode

Thanks for the answer. This is the Qualimap "Number of mapped reads", which according to the manual is "total number of mapped reads (left/right in case of paired-end reads, secondary alignments are ignored)". I actually have 0 unmapped (SAM flag 4) reads. Can you explain a bit more about this "haracteristic of the dataset" and what could cause it?

2
Entering edit mode
3.2 years ago
Amitm ★ 2.1k

Hi, After a bit of looking around I note that the data you are looking is 50bp read length, single-end. If you look at the paper (I hope this is the main publication), the authors themselves say that they expect only about ~30bp of the mRNA to have been protected by the ribosome. Further in the Supple (p.5), authors say that beyond the first 21bases, the read stretch could be from homopolymer tail. I have not delved further into specifics. But the gist I suppose is that this dataset probably has reads that have information content only upto 20-25 bases. In this situation it would be difficult for any aligner to find unique 'hits' to the yeast genome, because of the small query size. On top of that, since this is RNA, there would be many reads that wouldn't be aligning in one single stretch on genome, but would be split across introns. As to start with, you probably have only 20-25 bases of good quality length, it would be quite tricky for STAR to find 'unique' hits for reads that span splice-junction, as STAR tries to break the read (which for 20-25bp valid len. would be still smaller) and look for unique hit onto genome and from there use the transcriptome info. (GTF provided) to bridge over the intron. This is the reason probably why you see only ~53% of reads reliably mapped by STAR.

0
Entering edit mode

Thanks for taking this deep dive into the data. I actually took them from this paper. In this study, both mRNA-seq and ribosome profiling (what they call footprint - FP) were performed, and as far as I understand, I used the mRNA data, so why would only part of the reads be eligible?

0
Entering edit mode

Hi, I would suggest to run FastQC on it to see if there is adapter contam., or over-represented sequences. Presence of adapter stretch in the fastq read effectively makes the 'usable' length shorter. So that is one thing to check. Also, if this is total RNA-sequencing then after rRNA depletion there is fair chance of rRNA pseudogenes. I am not sure if yeast has them, but you could check.

0
Entering edit mode

Thanks again. There is no sign for adapter (or other) issues with the data. I also don't know about rRNA genes in yeast. What I did notice is that in one of the supplementary tables of the paper, it is indicated that a low % of reads were mapped uniquely (actually below 50%), so maybe I should ask the authors for their thoughts about this.

0
Entering edit mode

Hi, I am not sure how much time you would want to put into finding out about the data quality. Of course the authors might be able to give you some reasons. You could also, if you haven't done already, use samtools to extract the 'secondary alignments' and manually check some of the reads using any the NCBI or SGD BLAST. samtools view -f 256 My.bam |head -n 3