Question

RNA-seq quality control

7.4 years ago

Ron ★ 1.2k

Hi,

While looking at the RNA-seq mapping statistics from STAR (same batch, same library prep used for sequencing), I often see different numbers of mapped reads, ranging from 18 million to 40 million. Both kinds of samples (20 million vs. 40 million mapped reads) can give a median log FPKM of around 1, which I think is considered a good-quality sample for downstream processing.

However, despite having, let's say, 40 million mapped reads, some samples end up with a median log FPKM of around 0.5.

I have a few questions in this regard.

Can we compare such a sample within the same cohort as one that has a median log FPKM of around 1 (both having 40 million mapped reads)?

Can we compare a sample with 20 million mapped reads and a median log FPKM of around 0.5 to a sample with 40 million mapped reads and a median log FPKM of around 1, or vice versa (which is also possible)?

Is there a minimum number of mapped reads that should be used as a threshold for passing QC, e.g. 20 million mapped reads?

If we cannot compare them directly, should we do batch correction before comparing them, even though they are from the same batch?

Lastly, I am using split_bam.py from RSeQC (http://rseqc.sourceforge.net/#spilt-bam-py) to get a count of rRNA reads. Do these reads form part of the mapped reads, or are they part of the overall input reads? Since I am using the mapped BAM file, my guess is that they are counted only among the mapped reads, and that if there are many of them, this results in lower-quality FPKMs even though we have 40 million or more mapped reads (i.e. sufficient mapped reads).

Thanks,

Ron

RNA-Seq next-gen alignment QC STAR • 1.9k views

ADD COMMENT • link updated 7.4 years ago by Devon Ryan 104k • written 7.4 years ago by Ron ★ 1.2k

score 2 · Answer 1 · 2016-11-10

7.4 years ago

Devon Ryan 104k

FPKMs shouldn't be used for any actual statistics, so having a different median value isn't an issue.
The absolute number of mapped reads isn't the important thing; what matters is the difference in numbers between libraries. In general, you tend to run into issues when there's more than a 10x difference in the number of alignments between libraries.
You don't have a batch to correct for.
Again, you shouldn't do statistics with FPKMs. Just take the counts from featureCounts (or STAR, which I think can directly output them these days).
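As a rough illustration of the 10x rule of thumb above, here is a minimal sketch that compares the largest and smallest library in a cohort. The sample names and read counts are made up (chosen to sit in the 18 to 40 million range from the question), and `depth_ratio` is a hypothetical helper, not part of STAR or featureCounts.

```python
def depth_ratio(mapped_reads):
    """Return largest / smallest library size across samples."""
    sizes = list(mapped_reads.values())
    return max(sizes) / min(sizes)


# Hypothetical mapped-read counts, in the range described in the question.
mapped_reads = {
    "sample_A": 18_000_000,
    "sample_B": 25_000_000,
    "sample_C": 40_000_000,
}

ratio = depth_ratio(mapped_reads)
# Rule of thumb from the answer: trouble tends to start beyond roughly 10x.
exceeds_rule_of_thumb = ratio > 10

print(f"max/min depth ratio: {ratio:.2f}")
print(f"exceeds 10x rule of thumb: {exceeds_rule_of_thumb}")
```

With these numbers the ratio is a little over 2, well inside the range where depth differences alone are unlikely to be the problem.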

ADD COMMENT • link 7.4 years ago by Devon Ryan 104k

Hi Devon,

Thanks! I have an example where the samples have similar mapping statistics and similar FPKM quality, but one sample in the batch, with similar mapping statistics but different FPKM quality, clusters separately. I am not sure whether it's a real difference or a batch effect. All samples are RNA-seq from the same disease.

On another note, I had another question: are rRNA reads included in the BAM file as well, or only in the total reads?

ADD REPLY • link 7.4 years ago by Ron ★ 1.2k

Whether rRNA alignments are included in the BAM file and/or the FPKMs depends on how you made both. My guess is that your FPKMs are calculated using total reads, rather than mapped reads, and that you just have differences in rRNA amounts between the samples.
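To make the denominator effect concrete, here is a toy sketch of how rRNA content alone can shift FPKMs between two samples of equal depth. It assumes the standard FPKM definition and that the library size in the denominator counts rRNA fragments too; all numbers, and the helper `gene_fpkm_with_rrna`, are hypothetical and not taken from any particular tool.

```python
def fpkm(gene_fragments, gene_length_bp, library_fragments):
    """Fragments per kilobase of transcript per million library fragments."""
    return gene_fragments * 1e9 / (gene_length_bp * library_fragments)


def gene_fpkm_with_rrna(mapped, rrna_fraction, gene_share, gene_length_bp):
    """FPKM of one gene when the denominator includes rRNA fragments.

    gene_share is the gene's (hypothetical) share of the non-rRNA fragments.
    """
    non_rrna = mapped * (1 - rrna_fraction)
    gene_fragments = gene_share * non_rrna
    return fpkm(gene_fragments, gene_length_bp, mapped)


# Toy numbers: same depth, same gene, different rRNA content.
well_depleted = gene_fpkm_with_rrna(40_000_000, 0.05, 1e-5, 2_000)
poorly_depleted = gene_fpkm_with_rrna(40_000_000, 0.50, 1e-5, 2_000)

print(f"5% rRNA:  FPKM = {well_depleted:.2f}")
print(f"50% rRNA: FPKM = {poorly_depleted:.2f}")
```

In this example the same gene comes out at about 4.75 in the well-depleted sample and about 2.5 in the poorly depleted one, even though both have 40 million mapped reads. A shift like that applied to every gene would lower the median log FPKM without any real change in expression.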

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

The FPKMs and rRNA read counts are both calculated from the STAR-aligned BAM files.

ADD REPLY • link 7.4 years ago by Ron ★ 1.2k

Aligned against what?
FPKMs calculated by what?

The exact details here are important.

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

Human samples aligned against the human genome hg19 (STAR, default parameters); FPKMs calculated by cufflinks (default parameters).

ADD REPLY • link 7.4 years ago by Ron ★ 1.2k

As long as that reference contained the GL000228.1 contig, the BAM file contains 45S rRNA alignments. So, since you used cufflinks, it's likely that you're just seeing a difference in rRNA depletion. If that sample is causing problems, then either exclude it or make an rRNA-presence covariate that can be added to your GLM.
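One way to build such a covariate, sketched under assumptions: take the per-sample rRNA read count (e.g. from the rRNA BAM produced by your split step) and divide by the mapped read count. The sample names and counts below are invented, and the two-column design matrix is only a placeholder; columns for whatever you are actually testing would sit alongside it in the GLM of your count-based tool.

```python
def rrna_fraction(rrna_reads, mapped_reads):
    """Fraction of mapped reads that align to rRNA."""
    return rrna_reads / mapped_reads


# Hypothetical per-sample counts; s3 stands in for the poorly depleted sample.
samples = [
    {"name": "s1", "mapped": 40_000_000, "rrna": 2_000_000},
    {"name": "s2", "mapped": 38_000_000, "rrna": 1_900_000},
    {"name": "s3", "mapped": 40_000_000, "rrna": 16_000_000},
]

fractions = {s["name"]: rrna_fraction(s["rrna"], s["mapped"]) for s in samples}

# One design-matrix row per sample: [intercept, rRNA fraction].
design = [[1.0, fractions[s["name"]]] for s in samples]

for s, row in zip(samples, design):
    print(s["name"], row)
```

A continuous fraction is used here rather than a yes/no flag so that intermediate levels of depletion are also modelled; a binary covariate would be a reasonable alternative if only one sample stands out.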

ADD REPLY • link 7.4 years ago by Devon Ryan 104k