RNA-seq quality control
7.4 years ago
Ron ★ 1.2k

Hi,

While looking at the RNA-seq mapping statistics from STAR (same batch, same library prep), I often see the number of mapped reads range from 18 million to 40 million. Samples at either end (20 million vs. 40 million mapped reads) can have a median log FPKM of around 1, which I think is considered a good-quality sample for downstream processing.

However, despite having, say, 40 million mapped reads, some samples end up with a median log FPKM of around 0.5.

I have a few questions in this regard.

Can we compare such a sample, within the same cohort, to one that has a median log FPKM of around 1 (both having 40 million mapped reads)?

Can we compare a sample with 20 million mapped reads and a median log FPKM of around 0.5 to a sample with 40 million mapped reads and a median log FPKM of around 1, or vice versa (which also occurs)?

Is there a minimum number of mapped reads that should be used as a threshold for passing QC, e.g. 20 million mapped reads?

If we cannot compare them directly, should we apply batch correction before comparing them, even though they are from the same batch?

Lastly, I am using RSeQC (http://rseqc.sourceforge.net/#spilt-bam-py) to get a count of rRNA reads. Are these reads part of the mapped reads, or part of the overall input reads? Since I am giving it the mapped BAM file, my guess is that they are counted among the mapped reads only, and that a high rRNA count lowers the quality of the FPKMs even when we have 40 million or more mapped reads (i.e. sufficient mapped reads).

Thanks,

Ron

RNA-Seq next-gen alignment QC STAR • 1.9k views
7.4 years ago
  1. FPKMs shouldn't be used for any actual statistics, so having a different median value isn't an issue.
  2. The absolute number of mapped reads isn't the important thing; what matters is the relative difference between libraries. In general, you tend to run into issues when there's more than a 10x difference in the number of alignments between libraries.
  3. You don't have a batch to correct for.
  4. Again, you shouldn't do statistics with FPKMs. Just take the counts from featureCounts (or STAR, which I think can directly output them these days).
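
To make points 2 and 4 concrete, here is a minimal sketch of summing gene counts per sample and checking the max/min depth ratio. It assumes the tab-separated per-gene table STAR writes with `--quantMode GeneCounts` (gene ID followed by unstranded, strand-1 and strand-2 counts, preceded by four `N_*` summary rows). The two tables, the sample names and the helper functions are made up for illustration, and the 10x figure is only the rule of thumb from point 2, not a hard cutoff.

```python
import io

# Hypothetical, tiny stand-ins for two STAR ReadsPerGene.out.tab files.
# Columns: gene ID, unstranded counts, strand-1 counts, strand-2 counts.
SAMPLE_A = """N_unmapped\t100\t100\t100
N_multimapping\t50\t50\t50
N_noFeature\t30\t400\t35
N_ambiguous\t5\t1\t6
GENE1\t900\t10\t890
GENE2\t100\t2\t98
"""
SAMPLE_B = """N_unmapped\t80\t80\t80
N_multimapping\t40\t40\t40
N_noFeature\t20\t300\t25
N_ambiguous\t3\t1\t4
GENE1\t450\t5\t445
GENE2\t50\t1\t49
"""


def library_size(handle, column=1):
    """Sum per-gene counts in one column, skipping the N_* summary rows."""
    total = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if not fields[0] or fields[0].startswith("N_"):
            continue
        total += int(fields[column])
    return total


def depth_ratio(sizes):
    """Largest library size divided by the smallest."""
    return max(sizes.values()) / min(sizes.values())


sizes = {
    "sample_a": library_size(io.StringIO(SAMPLE_A)),
    "sample_b": library_size(io.StringIO(SAMPLE_B)),
}
ratio = depth_ratio(sizes)
# 10x is the rule of thumb quoted above, not a formal threshold.
verdict = "check depth" if ratio > 10 else "ok"
print(sizes, f"max/min = {ratio:.1f}x", verdict)
```

Which count column to use depends on the strandedness of the library prep; column 1 (unstranded) is only the default chosen for this sketch.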

Hi Devon,

Thanks! I have an example where most samples have similar mapping statistics and similar FPKM quality, but one sample in the batch, with similar mapping statistics but different FPKM quality, clusters separately. I am not sure whether it's a real difference or a batch effect. All samples are RNA-seq from the same disease.

On another note, are rRNA reads included in the BAM file as well, or only in the total reads?


Whether rRNA alignments are included in the BAM file and/or the FPKMs depends on how you made both. My guess is that your FPKMs are calculated using total reads, rather than mapped reads, and that you just have differences in rRNA amounts between the samples.
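
A rough illustration of why rRNA content alone can shift the median: with the textbook FPKM formula, any rRNA fragments counted in the denominator push down the FPKM of every other gene. This is only a sketch; the formula is the standard definition rather than the exact normalization of any particular tool, and the read totals, rRNA fractions, gene length and `gene_share` value are invented.

```python
def fpkm(fragments, length_bp, total_fragments):
    """Textbook FPKM: fragments per kilobase of transcript per million fragments."""
    return fragments * 1e9 / (length_bp * total_fragments)


def gene_fpkm(total_fragments, rrna_fraction, gene_share, length_bp):
    """FPKM of one gene when rRNA fragments stay in the denominator.

    gene_share is a hypothetical fraction of the non-rRNA fragments
    that come from this gene.
    """
    non_rrna = total_fragments * (1 - rrna_fraction)
    gene_fragments = non_rrna * gene_share
    return fpkm(gene_fragments, length_bp, total_fragments)


# Same depth and same underlying expression, different rRNA depletion.
well_depleted = gene_fpkm(40_000_000, 0.05, 1e-5, 2000)
poorly_depleted = gene_fpkm(40_000_000, 0.50, 1e-5, 2000)

print(f"5% rRNA:  FPKM = {well_depleted:.2f}")
print(f"50% rRNA: FPKM = {poorly_depleted:.2f}")
```

In this toy case both samples have 40 million fragments, yet the poorly depleted one reports a lower FPKM for the same gene, which would also lower the median across genes.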


FPKMs and rRNA read counts are both calculated from the STAR-aligned BAM files.

  1. Aligned against what?
  2. FPKMs calculated by what?

The exact details here are important.


Human samples aligned against the human genome hg19 (STAR, default parameters); FPKMs calculated by Cufflinks (default parameters).


As long as the reference contained the GL000228.1 contig, the BAM file contains 45S rRNA alignments. So since you used Cufflinks, it's likely that you're just seeing a difference in rRNA depletion. If that sample is causing problems, then either exclude it or make an rRNA presence covariate that can be added to your GLM.
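
One possible shape for such a covariate, sketched under several assumptions: the per-sample rRNA and mapped tallies, the sample names, the `condition` labels and the 0.3 exclusion cutoff are all hypothetical, and the rows built here would still have to be handed to whatever GLM framework is used for the count-based analysis.

```python
# Hypothetical per-sample tallies; in practice "rrna" could come from counting
# alignments on rRNA loci and "mapped" from the aligner's summary.
samples = {
    "s1": {"condition": "control", "rrna": 1_000_000, "mapped": 40_000_000},
    "s2": {"condition": "control", "rrna": 2_000_000, "mapped": 40_000_000},
    "s3": {"condition": "disease", "rrna": 1_600_000, "mapped": 40_000_000},
    "s4": {"condition": "disease", "rrna": 20_000_000, "mapped": 40_000_000},
}


def rrna_fraction(sample):
    """Fraction of mapped fragments that fall on rRNA loci."""
    return sample["rrna"] / sample["mapped"]


def design_matrix(samples, reference="control"):
    """Rows of [intercept, condition indicator, rRNA fraction] per sample."""
    return {
        name: [1.0, 0.0 if s["condition"] == reference else 1.0, rrna_fraction(s)]
        for name, s in samples.items()
    }


def flag_high_rrna(samples, threshold=0.3):
    """Samples whose rRNA fraction exceeds an arbitrary, illustrative cutoff."""
    return [name for name, s in samples.items() if rrna_fraction(s) > threshold]


design = design_matrix(samples)
for name, row in design.items():
    print(name, row)
print("candidates for exclusion:", flag_high_rrna(samples))
```

The continuous fraction is used here rather than a yes/no flag so that moderate differences in depletion are also modelled; a binary "rRNA-high" indicator would be the simpler alternative.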
