Question

Why there are several different genes with exact same FPKM in the genes.fpkm_tracking of cuffdiff?

1

Entering edit mode

9.3 years ago

Parham ★ 1.6k

Hi,

I am willing to filter the highly transcribed genes of my experiment based on their fpkm values. First I need a confirmation whether that is a correct strategy?

If so I am filtering from genes.fpkm_tracking of the cuffdiff outputs not the rest (cds, isoforms or tss). I hope that is correct also.

Then I am wondering why several different genes show up with exact fpkm value in the genes.fpkm_tracking file? Can some body explain that please? And let's assume initially I select the 500 highest fpkms, then I split those genes and I end up with 1000 genes. Now my aim was to filter the top 500 highly transcribed genes, that 1000 gene result will ruin it so I have to select for less genes for a second time! Is that how it works?

Thanks in advance!

cuffdiff fpkm • 3.1k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.3 years ago by Parham ★ 1.6k

Ram · Answer 1 · 2015-02-09

1

Entering edit mode

9.3 years ago

Devon Ryan 104k

Filtering by level of evidence typically a worthwhile step. I'll assume that you're aware with the myriad of problems surrounding FPKMs, so I'll not go into that. Well, I guess I'll mention that it's somewhat less useful to filter by FPKM than by raw or expected count.

Anyway, there's no reason that two genes can't have the same FPKM. That's essentially just a count metric normalized by length, so you could even have all genes with the same FPKM (this would never happen in practice). There's no reason to expect uniqueness here.

I have no idea what you mean by "split those genes".

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.3 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for explanation on why genes can have same FPKM. Previously I filtered on counts produced from DESeq2 but then I thought maybe its more accurate to do it on FPKM and compare the outcome. I have no idea about problems surrounding FPKMs, so if you could clarify or share a link I appreciate that. So your recommendation is to stick to the counts? However I am not familiar with raw/expected counts term. A newbie here!

By splitting I meant splitting the string characters. Now you don't need to consider it, you gave the answer already.

Thanks a lot!

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.3 years ago by Parham ★ 1.6k

1

Entering edit mode

One of the biggest problems with FPKMs is that they're easily skewed by differences in abundance of a single highly expressed gene. Since samples will have different amounts of rRNA depletion and rRNAs are the majority of transcripts this can create real problems. If you calculate FPKM from normalized counts, then this isn't a problem.

The other issue with FPKMs is that by normalizing by gene length, you're throwing away genes for which you have more statistical power.

Expected counts are produced by programs like RSEM and eXpress (cufflinks uses them, but I don't know if it actually outputs them). They're essentially just standard/raw counts that include ambiguous alignments via an expectation maximization algorithm. Read the papers on any of the aforementioned tools for details.

Yes, sticking with counts is typically the most useful route. I did have one experiment where making FPKMs from the normalized counts and filtering according to that had the most power, but that's more the exception than the rule.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.3 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for the very thorough explanation Devon. However that made me to come up with more questions!

I would like to ask how can I calculate FPKM from normalized counts? The only way I know to make normalized counts is using DESeq package but then I could have your advice how to calculate back the FPKM from it.

Next thing regarding eXpress! Now I am trying it but since I work on S. pombe and all the analysis I did so far was on a workstation, I am doing eXpress the same but it seems it is taking a very loooooong time (over-night and running) creating hits.bam which I assume is equal to acceptedhits.bam from Tuxedo. The file is very big and is expanding. Its 40GB right now in eXpress comparing to about 0.5GB in Tuxedo output. Is it the way it works?

Thank you so much for all the helps.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.3 years ago by Parham ★ 1.6k

1

Entering edit mode

From normalized counts, just divide by the length of each gene and then by 1 million.

Regarding express, that seems a bit over the top in terms of file size. You might look through that file and see if it's just writing gibberish or repeating alignments or something like that.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.3 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for the guidance. I tried to create FPKM from counts but I got some questions that if you would read what I've done below I would like to ask.

library(GenomicFeatures)
txdb <- loadDb("txdb_fungi_mart_23_2014DEC.sqlite")
exByGn <- exonsBy(txdb, "gene")
library('GenomicAlignments')
fls <- list.files(c("path/to/acceptedhits.bam")), pattern="bam$", full.names = T)
bamfls <- BamFileList(fls)
flag <- scanBamFlag(isNotPrimaryRead = F)
param <- ScanBamParam(flag=flag)
gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand = T, param=param)
cnts=assay(gnCnt)
library(DESeq)
condition = factor(c("control", "control", "control","treatment", "treatment", "treatment"))
cds = newCountDataSet(cnts, condition)
cds=estimateSizeFactors(cds)
sizeFactors(cds)
counts <- counts(cds, normalized=T)
counts.sel <- counts[, c(1,2,3)]
counts.sum <- rowSums(counts.sel)

Then here the values I get for counts.sum vary between hundreds to several thousands. Therefore, when I divide them to gene-length and then a million the resulted FPKM will become so small. Is it correct? However this doesn't matter if I just want to sort them and pick the highly transcribed genes but in case of expectations for those values I am asking.

I have another question regarding creating gene-length. I am wondering if this will do for creating them since I am not very familiar with txdb s.

txdb <- loadDb("txdb_fungi_mart_23_2014DEC.sqlite")
txsByGene=transcriptsBy(txdb, "gene")
lengthData=median(width(txsByGene))

Please let me know what you think and thanks in advance.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.2 years ago by Parham ★ 1.6k

1

Entering edit mode

Remember that gene length is in kilobases, otherwise you'll get 1000x smaller numbers than you should. Anyway, in general, RPKM values will be much smaller than the normalized (or raw) counts. The exception is short genes (namely, anything <1kb).

Here is an R script that computes union-gene model gene lengths (among other things). I would personally follow that route. You'd just use the txdb for your fungus to get the exons split by gene, reduce that and then sum the widths.

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.2 years ago by Devon Ryan 104k

1

Entering edit mode

Thanks Devon. I had GTF and produced the lengths with your codes and the output was similar to above mentioned codes if it is changed as below. Just in case some one else is reading this and interested for a shorter way.

exByGene <- exonsBy(txdb, "gene")
lengthData <- sum(width(exByGene))

ADD REPLY • link updated 2.4 years ago by Ram 43k • written 9.2 years ago by Parham ★ 1.6k