FPKM values mismatch
3 months ago
ranjits • 0

I ran the TOPMed RNA-seq pipeline on GRCh38 for a set of fastq files (belonging to different samples) that were previously aligned to GRCh37 (I will call the earlier analysis the original study).

I observed that there is a list of genes that had FPKM > 0 in the original study, but in TOPMed's RNA-SeQC output these genes have FPKM = 0.

1) Is this behavior normal?

2) Is it OK to compare FPKM values like this?

3) If this behavior is not normal and it is OK to compare FPKM values, what are the possible causes of this mismatch in FPKM for the same gene_id?

4) What reference files (e.g. .fasta, .gtf) do I need to check to identify the cause(s) of this issue?

5) What additional inputs do we need to analyse this issue?

rna topmed seq fpkm • 211 views
3 months ago
seidel 8.3k

FPKM values derived from mapping fastq files depend on (1) the genome (fasta), (2) the gene descriptions (gtf), and (3) the aligner. For your comparison, you've changed at least two of these things, so it is expected that some values will change. Different genome versions offer different sequence to map against, which affects the mapping process. Different sets of gene annotations also differ, which affects how reads are counted for genes. People will argue about the best values for comparison, but either way, FPKM values are best compared within a highly constrained universe and are not absolute measurements of abundance.

If you were to plot all values from your two result sets against each other, you would find that most of them fall along the diagonal and are virtually identical, but you would also see scatter and some outliers. You can look into the details of why individual genes differ, but what is it that you actually want to achieve? You're better off using a consistent set of resources to answer a defined set of questions.
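To make that dependence concrete, here is a minimal sketch of the standard FPKM definition (fragments per kilobase of gene length per million mapped fragments). The numbers are invented for illustration, and a given pipeline may compute gene length or library size somewhat differently, so treat this as the textbook formula rather than what any particular tool does.

```python
def fpkm(fragments: int, gene_length_bp: int, total_mapped_fragments: int) -> float:
    """Textbook FPKM: fragments * 1e9 / (gene length in bp * total mapped fragments)."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


# The same 500 fragments assigned to the same gene under two hypothetical setups.
# The gene length comes from the annotation (gtf); the library size comes from
# how many fragments the aligner managed to map to the genome (fasta).
setup_a = fpkm(500, gene_length_bp=2_000, total_mapped_fragments=20_000_000)
setup_b = fpkm(500, gene_length_bp=2_500, total_mapped_fragments=25_000_000)

print(f"setup A: {setup_a:.2f} FPKM")
print(f"setup B: {setup_b:.2f} FPKM")
```

Even with an identical fragment count for the gene, the value shifts once the annotated length or the total number of mapped fragments changes. If the aligner or annotation stops assigning any fragments to the gene, the value becomes 0 regardless of the other terms.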

So: (1) Yes, it is normal to get different results when you change the resources used to answer your questions.

(2) No, I would not recommend comparing FPKM values this way, as they were generated in different contexts.

(3) Different versions of a genome have different regions available for mapping, and depending on your alignment parameters this will affect how reads are assigned to genes. In addition, different genome versions require different sets of gene descriptions. A given gene ID may have a different set of descriptions (transcripts) between genome versions. This is a common problem.

(4) You need to check the GTF, as well as the genome (fasta), if you want to find out why a given gene changes values between genome versions.
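As a starting point for checking the GTF, something like the following sketch could list gene IDs that exist in only one annotation, or whose total exonic length differs between the two. It assumes GENCODE-style attributes (`gene_id "ENSG...";`) and strips the version suffix after the dot so IDs can be matched across releases. The function names and the toy lines are made up for illustration; for real files you would pass in the lines of each GTF instead.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def exon_lengths(gtf_lines):
    """Return {unversioned gene ID: number of bases covered by its exons}.

    Assumes all exons of a gene ID lie on one chromosome.
    """
    spans_by_gene = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1).split(".")[0]  # drop version suffix
        spans_by_gene[gene_id].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene_id, spans in spans_by_gene.items():
        spans.sort()
        total = 0
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:  # overlapping or adjacent: merge
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1  # GTF coordinates are inclusive
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


def compare_annotations(old_lines, new_lines):
    """Summarise gene-level differences between two annotations."""
    old, new = exon_lengths(old_lines), exon_lengths(new_lines)
    shared = old.keys() & new.keys()
    return {
        "only_old": sorted(old.keys() - new.keys()),
        "only_new": sorted(new.keys() - old.keys()),
        "length_changed": sorted(g for g in shared if old[g] != new[g]),
    }


def toy_exon(gene_id, start, end):
    # Builds one minimal GTF exon line; the IDs used below are invented.
    return "\t".join(["chr1", "toy", "exon", str(start), str(end),
                      ".", "+", ".", f'gene_id "{gene_id}";'])


old_gtf = [toy_exon("ENSG0001.1", 100, 199), toy_exon("ENSG0001.1", 150, 249),
           toy_exon("ENSG0002.1", 10, 19)]
new_gtf = [toy_exon("ENSG0001.5", 100, 199), toy_exon("ENSG0003.1", 1, 5)]
print(compare_annotations(old_gtf, new_gtf))
```

Genes that appear under "only_old" cannot receive counts in the new run under that ID, and genes under "length_changed" have a different exon model, which changes both which reads are counted and the length used for normalisation.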

(5) You don't say explicitly that the same aligner, and indeed the same version of the aligner, was used between comparisons. Aligners, and the parameters they are called with, can make a big difference in FPKM values for some genes.

If you really want to get to the bottom of the differences, you should reproduce each result set in your own hands, so you know all the parameters, and can examine the differences in an isolated fashion - just like any good experiment.
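Once both result sets are in hand, a first pass at isolating the differences might look like this sketch. It takes two `{gene_id: FPKM}` mappings and separates genes that were expressed in the first set into those that dropped to zero in the second and those that are absent from the second altogether. The function names, the threshold parameter, and the toy values are hypothetical; loading the real tables is left out because the output format depends on the pipeline.

```python
def strip_version(gene_id: str) -> str:
    """ENSG00000000003.14 -> ENSG00000000003, so IDs match across releases."""
    return gene_id.split(".")[0]


def discordant_genes(study, rerun, min_fpkm=0.0):
    """Find genes with FPKM > min_fpkm in `study` that are zero or missing in `rerun`."""
    study = {strip_version(g): v for g, v in study.items()}
    rerun = {strip_version(g): v for g, v in rerun.items()}

    dropped_to_zero, absent = [], []
    for gene_id, value in sorted(study.items()):
        if value <= min_fpkm:
            continue  # not expressed in the original study
        if gene_id not in rerun:
            absent.append(gene_id)  # likely an annotation difference
        elif rerun[gene_id] == 0:
            dropped_to_zero.append(gene_id)  # annotated, but no fragments counted
    return {"dropped_to_zero": dropped_to_zero, "absent": absent}


# Toy values only; these gene IDs and numbers are invented.
study_fpkm = {"ENSG0001.3": 4.2, "ENSG0002.1": 0.0, "ENSG0003.2": 1.5, "ENSG0004.1": 7.0}
rerun_fpkm = {"ENSG0001.9": 0.0, "ENSG0002.4": 0.0, "ENSG0003.5": 1.4}

print(discordant_genes(study_fpkm, rerun_fpkm))
```

Splitting the list this way points to where to look next: genes in "absent" suggest checking the annotation, while genes in "dropped_to_zero" suggest checking the alignments and counting rules for that locus.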

Thanks a lot seidel, your reply is very informative.

The original study (against GRCh37) used BWA, whereas TOPMed (against GRCh38) used the STAR aligner.

TOPMed used the GENCODE v30 gtf, whereas the original study used GENCODE v13 annotations.

The .fasta used for TOPMed is the GRCh38 reference from the Broad Institute, whereas the original study used GRCh37.

Once again, thanks a lot for taking the time to respond to my query.
