I was curious how much the -M (mask file) option can improve the FPKM values from Cufflinks. The manual says:
-M/--mask-file <mask.(gtf/gff)>
Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates.
So I would expect that providing a mask file containing rRNA, tRNA, mt genes etc. would decrease the "total mapped reads" (i.e. the denominator), which would lead to increased FPKM values. But what I actually see is that, for most mRNA genes, the FPKM values with the -M option are smaller than those without -M. See the attached figures (i.e. I expected most of the dots to be under the red dotted line, which is x=y).
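To spell out the arithmetic behind that expectation, here is a minimal sketch. It assumes the standard FPKM definition and that masked reads are simply subtracted from the denominator; that second part is my assumption, not something I have checked in the Cufflinks source. All numbers are made up for illustration.

```python
def fpkm(fragments, length_bp, total_mapped):
    """Standard FPKM: fragments per kilobase of transcript per million mapped fragments."""
    return 1e9 * fragments / (length_bp * total_mapped)

# Hypothetical values, not taken from my data.
gene_fragments = 500        # fragments assigned to one mRNA gene
gene_length = 2000          # transcript length in bp
total_all = 20_000_000      # all mapped fragments
masked = 5_000_000          # fragments from rRNA/tRNA/chrM (assumed removed by -M)

fpkm_without_mask = fpkm(gene_fragments, gene_length, total_all)
fpkm_with_mask = fpkm(gene_fragments, gene_length, total_all - masked)

print(fpkm_without_mask, fpkm_with_mask)
```

Under these assumptions the masked run has the smaller denominator, so the same gene should get a higher FPKM, which is the opposite of what I observe.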
I have to admit that -M does substantially reduce the FPKM of rRNA genes. But it is still a mystery why most mRNA genes have lower FPKM after applying the -M option. Has anyone observed something similar?
btw, here are my cufflinks arguments with -M:
cufflinks --library-type fr-unstranded -o cufflink_w_M -p 8 -G /data/iGenome/Homo_sapiens/UCSC/hg19/Annotation/Genes/gencode.v13.annotation.karotyped.gtf -M /data/iGenome/Homo_sapiens/UCSC/hg19/Annotation/Genes/chrM.rRNA.tRNA.gtf --multi-read-correct accepted_hits.bam
and without -M:
cufflinks --library-type fr-unstranded -o cufflink_wo_M -p 8 -G /data/iGenome/Homo_sapiens/UCSC/hg19/Annotation/Genes/gencode.v13.annotation.karotyped.gtf --multi-read-correct accepted_hits.bam
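In case anyone wants to check the same thing on their own data, here is a rough sketch of how the two runs could be compared. It assumes each output directory contains a tab-delimited genes.fpkm_tracking file with tracking_id and FPKM columns; the helper names read_fpkm and compare_fpkm are my own, and the demo tables are made-up stand-ins for the real files.

```python
import csv
import io

def read_fpkm(handle):
    """Map tracking_id -> FPKM from a tab-delimited table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["tracking_id"]: float(row["FPKM"]) for row in reader}

def compare_fpkm(without_mask, with_mask):
    """Count genes whose FPKM went up, down, or stayed the same with -M."""
    counts = {"up": 0, "down": 0, "same": 0}
    # Only genes present in both runs are compared.
    for gene in without_mask.keys() & with_mask.keys():
        if with_mask[gene] > without_mask[gene]:
            counts["up"] += 1
        elif with_mask[gene] < without_mask[gene]:
            counts["down"] += 1
        else:
            counts["same"] += 1
    return counts

# Made-up tables standing in for the two genes.fpkm_tracking files.
demo_wo_M = "tracking_id\tFPKM\ngeneA\t10.0\ngeneB\t5.0\ngeneC\t0.0\n"
demo_w_M = "tracking_id\tFPKM\ngeneA\t8.0\ngeneB\t6.5\ngeneC\t0.0\n"

counts = compare_fpkm(read_fpkm(io.StringIO(demo_wo_M)),
                      read_fpkm(io.StringIO(demo_w_M)))
print(counts)
```

For the real runs, the StringIO objects would be replaced with open("cufflink_wo_M/genes.fpkm_tracking") and open("cufflink_w_M/genes.fpkm_tracking"). Masked genes would need to be excluded first if only mRNA genes are of interest.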