Question

Difference in FPKM values of lncRNA when using different annotation files

0

6.9 years ago

piyushjo ▴ 710

Hi,

I am using the GENCODE mouse annotation (vM17) to quantify genes from RNA-seq data, and I am also interested in lncRNA quantification. I used two different annotation files. First, to quantify all transcripts, I used the comprehensive annotation file (primary assembly). Then, to quantify just the lncRNAs, I used the GENCODE mouse lncRNA annotation file. Since the comprehensive file should contain all of the lncRNAs, I compared the FPKM values calculated with the comprehensive and the lncRNA annotation files.

What I observe is that the FPKM values are different, although the trend is the same. For example, across three conditions the comprehensive annotation file gives A=2, B=4, C=8, whereas the lncRNA annotation file gives A=6, B=11, C=23 (illustrative values only). I wanted to ask the experts whether I should use the FPKM values from the lncRNA annotation or from the comprehensive file.

My assumption is that with the lncRNA annotation, a read that falls in a region overlapping an mRNA is counted towards the lncRNA, because there is no mRNA annotation to compete with it. With the comprehensive annotation, the read is assigned to whichever feature it overlaps more. This is just my thinking.

Please help me understand which I should choose: comprehensive or lncRNA?

Thanks!!

gencode reference annotations • 2.1k views

ADD COMMENT • link updated 6.9 years ago by grant.hovhannisyan ★ 2.6k • written 6.9 years ago by piyushjo ▴ 710

score 2 · Answer 1 · 2018-08-18

2

6.9 years ago

grant.hovhannisyan ★ 2.6k

IMHO, when you use only the lncRNA annotation, your library size (the total number of mapped reads overlapping features) is always smaller than when you use a comprehensive annotation. Thus, you always get higher FPKM values with the lncRNA-only annotation. There is a recent bioRxiv preprint addressing this exact issue, https://www.biorxiv.org/content/early/2018/01/09/241869, which might be helpful for you (not peer-reviewed though). The authors claim that pseudoalignment software like salmon/kallisto, used together with full genome annotations, has advantages over other combinations of methods.
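To make the library-size effect concrete, here is a minimal Python sketch. It assumes, as described above, that the FPKM denominator is the number of fragments assigned to annotated features; whether a given tool normalizes this way should be checked in its documentation. The function name and all numbers are hypothetical.

```python
def fpkm(fragments, length_bp, total_assigned_fragments):
    """Fragments per kilobase of transcript per million assigned fragments."""
    return fragments * 1e9 / (length_bp * total_assigned_fragments)


# Hypothetical values for one lncRNA in one sample.
lnc_fragments = 200
lnc_length_bp = 2_000

# Hypothetical library sizes: fragments assigned to any annotated feature.
total_comprehensive = 30_000_000  # all genes in the annotation
total_lncrna_only = 10_000_000    # only lncRNA features in the annotation

fpkm_comprehensive = fpkm(lnc_fragments, lnc_length_bp, total_comprehensive)
fpkm_lncrna_only = fpkm(lnc_fragments, lnc_length_bp, total_lncrna_only)

# Same fragments and same length, but a smaller denominator gives a larger FPKM.
print(fpkm_comprehensive, fpkm_lncrna_only)
```

With these made-up numbers the lncRNA-only value is three times the comprehensive one, simply because the denominator is three times smaller.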

ADD COMMENT • link 6.9 years ago by grant.hovhannisyan ★ 2.6k

2

To add to what Grant has said, which is perfectly valid, I have to state that FPKM should not be used anymore. There is a very well cited manuscript (A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis) that states the following:

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

"The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis."

Also, by Harold Pimentel: What the FPKM? A review of RNA-Seq expression units

"The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one."
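The "relative measurement" point can be illustrated with a small Python sketch using TPM. It uses the standard TPM definition (length-normalized rates rescaled to sum to one million); the transcript names and counts are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {name: counts[name] / lengths_bp[name] for name in counts}
    total_rate = sum(rates.values())
    return {name: rate / total_rate * 1e6 for name, rate in rates.items()}


# Hypothetical read counts and transcript lengths for one sample.
counts = {"lncRNA_1": 100, "lncRNA_2": 300, "mRNA_1": 1600}
lengths_bp = {"lncRNA_1": 1000, "lncRNA_2": 1500, "mRNA_1": 2000}

# Quantify against everything, then against the lncRNA subset only.
tpm_all = tpm(counts, lengths_bp)
lnc_counts = {name: c for name, c in counts.items() if name.startswith("lncRNA")}
tpm_lnc = tpm(lnc_counts, lengths_bp)

print(tpm_all["lncRNA_1"], tpm_lnc["lncRNA_1"])
```

The value for lncRNA_1 changes with the set of features it is normalized against, while the ratio between lncRNA_1 and lncRNA_2 within the sample stays the same. That matches the "same trend, different values" pattern described in the question.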

ADD REPLY • link 6.8 years ago by Kevin Blighe 89k

0

Thanks Kevin and Grant.

I am using StringTie to convert BAM files to GTF files (which contain the FPKM and TPM values). There is also a Python script from the same group that can convert the output into read counts, but I guess the algorithm converts FPKM into counts. Do you have experience with that? I just want to make sure that the algorithm doesn't suffer from the same bias.

ADD REPLY • link 6.9 years ago by piyushjo ▴ 710

2

You have (at least) two trustworthy and reliable ways to convert your BAM files to read counts:

1. Use featureCounts. This is the most straightforward way: it counts the number of reads overlapping the features in a GFF/GTF file and generates gene-level quantifications.
2. If you have used StringTie to generate FPKM/TPM values (StringTie makes transcript-level quantifications), then you can use tximport to convert TPMs to read counts. If you are planning to do differential gene expression analysis, the second option is more advisable according to https://f1000research.com/articles/4-1521/v1
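On the earlier question about StringTie's Python script: as far as I understand the StringTie manual, prepDE.py derives counts from the per-transcript coverage rather than from FPKM, roughly as coverage × transcript length / read length. A minimal Python sketch of that arithmetic follows; the function name and the example values are hypothetical, and the rounding and the default read length of 75 are assumptions to verify against the script itself.

```python
import math


def estimated_read_count(coverage, transcript_len_bp, read_len_bp=75):
    """Approximate reads on a transcript from its average per-base coverage.

    Assumed formula: reads = coverage * transcript length / read length,
    rounded up to a whole read.
    """
    return math.ceil(coverage * transcript_len_bp / read_len_bp)


# Hypothetical transcript: 7.5x average coverage, 2 kb long, 100 bp reads.
example_count = estimated_read_count(
    coverage=7.5, transcript_len_bp=2000, read_len_bp=100
)
print(example_count)
```

Because this calculation uses coverage and length only, it does not involve the library-size denominator that makes FPKM depend on the annotation file. The coverage itself can still differ if reads are assigned differently under the two annotations.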

ADD REPLY • link 6.9 years ago by grant.hovhannisyan ★ 2.6k