While looking at RNA-seq mapping statistics from STAR (same batch, same library prep), I often see the number of mapped reads range from 18 million to 40 million. Samples at either end of that range (20 million vs. 40 million mapped reads) can give a median log FPKM of around 1, which I think is considered good quality for downstream processing.
However, despite having, say, 40 million mapped reads, some samples end up with a median log FPKM of around 0.5.
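To make the metric concrete, here is a minimal sketch of one way a "median of log FPKM" could be computed per sample. The log base (log10 here), the choice to skip genes with zero FPKM, the function name `median_log_fpkm`, and the example values are all assumptions for illustration, not necessarily how the numbers above were produced.

```python
import math
from statistics import median


def median_log_fpkm(fpkms):
    """Median of log10(FPKM) over genes with FPKM > 0.

    Genes with zero FPKM are skipped because log10(0) is undefined;
    using a pseudocount instead would give different medians.
    """
    logs = [math.log10(value) for value in fpkms if value > 0]
    if not logs:
        raise ValueError("no genes with FPKM > 0")
    return median(logs)


# Made-up FPKM values for two hypothetical samples.
sample_a = [0.0, 1.0, 10.0, 100.0]
sample_b = [0.0, 0.1, 1.0, 10.0]

print(median_log_fpkm(sample_a))
print(median_log_fpkm(sample_b))
```

Whether zeros are dropped or a pseudocount is added changes the median noticeably, so the same convention should be applied to every sample being compared.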
I have a few questions in this regard.
Can we include this sample in the same cohort as one with a median log FPKM of around 1 (both having 40 million mapped reads)?
Can we compare a sample with 20 million mapped reads and a median log FPKM of around 0.5 to a sample with 40 million mapped reads and a median log FPKM of around 1, or vice versa (that combination also occurs)?
Is there a minimum number of mapped reads that should be used as a threshold for passing QC, e.g. 20 million mapped reads?
If we cannot compare them directly, should we apply batch correction before comparing them, even though they are from the same batch?
Lastly, I am using split_bam.py from RSeQC (http://rseqc.sourceforge.net/#spilt-bam-py) to get a count of rRNA reads. Are these reads counted as part of the mapped reads, or as part of the overall input reads? Since I am giving it the mapped BAM file, my guess is that they are counted only among the mapped reads. If they are high in number, I suspect this results in lower-quality FPKMs even though we have 40 million or more mapped reads (that is, nominally sufficient mapped reads).
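The arithmetic behind that guess can be sketched as follows, assuming the rRNA and non-rRNA counts both come from the mapped BAM file, so that together they add up to the mapped reads. The function name `rrna_summary` and all read counts are made up for illustration; they are not output from any real run.

```python
def rrna_summary(rrna_reads, non_rrna_reads):
    """Summarise how much of the mapped depth is taken up by rRNA.

    Assumes both counts come from the same mapped BAM, so their sum
    is the number of mapped reads that were classified.
    """
    mapped = rrna_reads + non_rrna_reads
    if mapped == 0:
        raise ValueError("no mapped reads")
    return {
        "mapped": mapped,
        "rrna_fraction": rrna_reads / mapped,
        "non_rrna_reads": non_rrna_reads,
    }


# Hypothetical samples: one deep but rRNA-heavy, one shallower but clean.
deep_sample = rrna_summary(rrna_reads=16_000_000, non_rrna_reads=24_000_000)
shallow_sample = rrna_summary(rrna_reads=1_000_000, non_rrna_reads=19_000_000)

for name, summary in [("deep", deep_sample), ("shallow", shallow_sample)]:
    print(name, summary)
```

In this made-up case the 40 million read sample has only 24 million non-rRNA reads, not far above the 19 million of the 20 million read sample, which is the kind of effect that the mapped-read count alone would hide.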