Inconsistency Between Number Of Genes In Cuffdiff And Htseq
2
0
9.0 years ago
narges ▴ 180

Hi,

I have the same BAM files from TopHat, and I have used both Cuffdiff and HTSeq (followed by an R package) to get DE genes. The problem is that the number of genes is not the same in the HTSeq and Cuffdiff outputs. I expected the counts for each gene to differ, but not the number of genes, since the reference was the same for both. The HTSeq output contains more genes than the Cuffdiff output. Any idea about the reason would be appreciated.

cuffdiff htseq • 3.8k views
ADD COMMENT
3
9.0 years ago

The reason for this is that producing a list of differentially expressed genes is not an exact science. You will get different answers with different tools, or even with different versions of the same tool.

A list of genes on its own is meaningless. What matters is the secondary layer of information that you derive from this list. Seemingly radically different sets of genes could still support the same hypothesis. Conversely, nearly identical lists of genes could have contradictory interpretations.

Compare and validate the hypotheses that the results support or contradict, not the gene lists themselves.

ADD COMMENT
0

Sorry, maybe I did not explain well. I am not talking about a difference in the number of DE genes, but about the difference in the total number of input genes for these methods. I have 23368 genes after running HTSeq (that is, before running DESeq to get DE genes), but 23284 genes in total in the gene_exp.diff file from Cuffdiff, regardless of how many of them are DE. As I used the same BAM file and the same GTF file for both, I expected the same total number of genes, although of course the number of DE genes should differ because the algorithms are different.

ADD REPLY
1

Well, that does not actually change what I said.

It does not matter if, at any given point in the analysis, one method gives you fewer or more inputs than another. The only thing that matters is the end result. If, by the end of the analyses, choosing one method gives you a radically different answer than the other, then there is reason to worry. Right now it is premature to fret over a difference of 84 genes (23368 versus 23284).

A robust observation should be reproducible by different analysis methods.

ADD REPLY
0

Yes, now I understand. Thank you.

ADD REPLY
2
9.0 years ago

I've found that Cufflinks (and presumably Cuffdiff) sometimes skips certain genes in the output. This has been mentioned several times on SeqAnswers (e.g. http://seqanswers.com/forums/showthread.php?t=19970 and http://seqanswers.com/forums/showthread.php?t=9222; there are others), and I think someone has said that it is related to Cufflinks' algorithm "choking" on certain regions with very high coverage (sorry, I cannot locate the source now).

Different versions of Cufflinks (and presumably Cuffdiff) can also return different numbers of genes. By contrast, HTSeq always returns counts for all genes in the GTF file.
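To see which genes account for the difference, one option is to compare the gene identifiers in the two outputs directly. The following is only a rough sketch under several assumptions: that the HTSeq output is the usual two-column, tab-separated table (gene ID, count) with summary rows whose names start with `__`, that the Cuffdiff table has a header row with a `gene_id` column, and that both tools report the same identifiers from the GTF. The function names and the toy data are made up for illustration.

```python
import csv
import io


def read_htseq_ids(handle):
    """Collect gene IDs from a tab-separated HTSeq count table (id, count)."""
    ids = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id = line.split("\t")[0]
        # Assumption: summary rows such as __no_feature are not genes.
        if gene_id.startswith("__"):
            continue
        ids.add(gene_id)
    return ids


def read_cuffdiff_ids(handle, column="gene_id"):
    """Collect gene IDs from a Cuffdiff-style table with a header row.

    Assumption: the identifier column is named 'gene_id'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[column] for row in reader}


def compare_gene_sets(htseq_ids, cuffdiff_ids):
    """Report which IDs appear in only one of the two outputs."""
    return {
        "only_htseq": sorted(htseq_ids - cuffdiff_ids),
        "only_cuffdiff": sorted(cuffdiff_ids - htseq_ids),
        "shared": len(htseq_ids & cuffdiff_ids),
    }


# Toy in-memory data standing in for the real files (hypothetical genes).
htseq_text = "geneA\t10\ngeneB\t0\ngeneC\t250\n__no_feature\t42\n"
cuffdiff_text = (
    "test_id\tgene_id\tgene\tlocus\n"
    "geneA\tgeneA\tA\tchr1:1-100\n"
    "geneB\tgeneB\tB\tchr1:200-300\n"
)

report = compare_gene_sets(
    read_htseq_ids(io.StringIO(htseq_text)),
    read_cuffdiff_ids(io.StringIO(cuffdiff_text)),
)
print(report)
```

With real data, the `io.StringIO` objects would be replaced by opened files. If the list of genes found only in the HTSeq output is short, checking the coverage of those loci in a genome browser may show whether they fit the high-coverage explanation above.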

ADD COMMENT
0

Many thanks, that helped a lot.

ADD REPLY
