Question: Why there are several different genes with exact same FPKM in the genes.fpkm_tracking of cuffdiff?
1
gravatar for Parham
4.7 years ago by
Parham1.4k
Sweden
Parham1.4k wrote:

Hi,

I am willing to filter the highly transcribed genes of my experiment based on their fpkm values. First I need a confirmation whether that is a correct strategy?

If so I am filtering from genes.fpkm_tracking of the cuffdiff outputs not the rest (cds, isoforms or tss). I hope that is correct also. 

Then I am wondering why several different genes show up with exact fpkm value in the genes.fpkm_tracking file? Can some body explain that please? And let's assume initially I select the 500 highest fpkms, then I split those genes and I end up with 1000 genes. Now my aim was to filter the top 500 highly transcribed genes, that 1000 gene result will ruin it so I have to select for less genes for a second time! Is that how it works?

Thanks in advance!

ADD COMMENTlink modified 4.7 years ago by Devon Ryan92k • written 4.7 years ago by Parham1.4k
1
gravatar for Devon Ryan
4.7 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

Filtering by level of evidence typically a worthwhile step. I'll assume that you're aware with the myriad of problems surrounding FPKMs, so I'll not go into that. Well, I guess I'll mention that it's somewhat less useful to filter by FPKM than by raw or expected count.

Anyway, there's no reason that two genes can't have the same FPKM. That's essentially just a count metric normalized by length, so you could even have all genes with the same FPKM (this would never happen in practice). There's no reason to expect uniqueness here.

I have no idea what you mean by "split those genes".

ADD COMMENTlink written 4.7 years ago by Devon Ryan92k

Thanks for explanation on why genes can have same FPKM. Previously I filtered on counts produced from DESeq2 but then I thought maybe its more accurate to do it on FPKM and compare the outcome. I have no idea about problems surrounding FPKMs, so if you could clarify or share a link I appreciate that. So your recommendation is to stick to the counts? However I am not familiar with raw/expected counts term. A newbie here!

By splitting I meant splitting the string characters. Now you don't need to consider it, you gave the answer already. 

Thanks a lot!

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Parham1.4k
1

One of the biggest problems with FPKMs is that they're easily skewed by differences in abundance of a single highly expressed gene. Since samples will have different amounts of rRNA depletion and rRNAs are the majority of transcripts this can create real problems. If you calculate FPKM from normalized counts, then this isn't a problem.

The other issue with FPKMs is that by normalizing by gene length, you're throwing away genes for which you have more statistical power.

Expected counts are produced by programs like RSEM and eXpress (cufflinks uses them, but I don't know if it actually outputs them). They're essentially just standard/raw counts that include ambiguous alignments via an expectation maximization algorithm. Read the papers on any of the aforementioned tools for details.

Yes, sticking with counts is typically the most useful route. I did have one experiment where making FPKMs from the normalized counts and filtering according to that had the most power, but that's more the exception than the rule.
 

ADD REPLYlink written 4.7 years ago by Devon Ryan92k

Thanks for the very thorough explanation Devon. However that made me to come up with more questions! 

I would like to ask how can I calculate FPKM from normalized counts? The only way I know to make normalized counts is using DESeq package but then I could have your advice how to calculate back the FPKM from it. 

Next thing regarding eXpress! Now I am trying it but since I work on S. pombe and all the analysis I did so far was on a workstation, I am doing eXpress the same but it seems it is taking a very loooooong time (over-night and running) creating hits.bam which I assume is equal to acceptedhits.bam from Tuxedo. The file is very big and is expanding. Its 40GB right now in eXpress comparing to about 0.5GB in Tuxedo output. Is it the way it works? 

Thank you so much for all the helps. 

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Parham1.4k
1

From normalized counts, just divide by the length of each gene and then by 1 million.

Regarding express, that seems a bit over the top in terms of file size. You might look through that file and see if it's just writing gibberish or repeating alignments or something like that.

ADD REPLYlink written 4.7 years ago by Devon Ryan92k

Thanks for the guidance. I tried to create FPKM from counts but I got some questions that if you would read what I've done below I would like to ask.

library(GenomicFeatures)
txdb <- loadDb("txdb_fungi_mart_23_2014DEC.sqlite")
exByGn <- exonsBy(txdb, "gene")
library('GenomicAlignments')
fls <- list.files(c("path/to/acceptedhits.bam")), pattern="bam$", full.names = T)
bamfls <- BamFileList(fls)
flag <- scanBamFlag(isNotPrimaryRead = F)
param <- ScanBamParam(flag=flag)
gnCnt <- summarizeOverlaps(exByGn, bamfls, mode="Union", ignore.strand = T, param=param)
cnts=assay(gnCnt)
library(DESeq)
condition = factor(c("control", "control", "control","treatment", "treatment", "treatment"))
cds = newCountDataSet(cnts, condition)
cds=estimateSizeFactors(cds)
sizeFactors(cds)
counts <- counts(cds, normalized=T)
counts.sel <- counts[, c(1,2,3)]
counts.sum <- rowSums(counts.sel)

Then here the values I get for counts.sum vary between hundreds to several thousands. Therefore, when I divide them to gene-length and then a million the resulted FPKM will become so small. Is it correct? However this doesn't matter if I just want to sort them and pick the highly transcribed genes but in case of expectations for those values I am asking. 

I have another question regarding creating gene-length. I am wondering if this will do for creating them since I am not very familiar with txdb s.

txdb <- loadDb("txdb_fungi_mart_23_2014DEC.sqlite")
txsByGene=transcriptsBy(txdb, "gene")
lengthData=median(width(txsByGene))

Please let me know what you think and thanks in advance. 

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Parham1.4k
1

Remember that gene length is in kilobases, otherwise you'll get 1000x smaller numbers than you should. Anyway, in general, RPKM values will be much smaller than the normalized (or raw) counts. The exception is short genes (namely, anything <1kb).

Here's an R script that computes union-gene model gene lengths (among other things). I would personally follow that route. You'd just use the txdb for your fungus to get the exons split by gene, reduce that and then sum the widths.

ADD REPLYlink written 4.7 years ago by Devon Ryan92k
1

Thanks Devon. I had GTF and produced the lengths with your codes and the output was similar to above mentioned codes if it is changed as below. Just in case some one else is reading this and interested for a shorter way. 

exByGene <- exonsBy(txdb, "gene")
lengthData <- sum(width(exByGene))
ADD REPLYlink written 4.7 years ago by Parham1.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2038 users visited in the last hour