Question: Inconsistency Between Number Of Genes In Cuffdiff And Htseq
8.0 years ago by
narges180 wrote:


I have the same BAM files from TopHat and I have used both cuffdiff and HTSeq (followed by an R package) to get DE genes. The problem is that the number of genes is not the same in HTSeq and cuffdiff. I expected the counts for each gene to differ, but not the number of genes, since the reference is the same for both. I have more genes in the HTSeq output than in the cuffdiff output. Any idea about the reason would be appreciated.

htseq cuffdiff • 3.5k views
8.0 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

The reason for this is that getting a list of differentially expressed genes is not an exact science. You will get different answers with different tools or even different versions of the same tool.

A list of genes on its own is meaningless. What matters is the secondary layer of information that you derive from this list. Seemingly radically different sets of genes could still support the same hypothesis. Conversely, nearly identical lists of genes could have contradictory interpretations.

Compare/validate hypotheses supported by or contradicted by the results and not the gene lists themselves.

ADD COMMENTlink written 8.0 years ago by Istvan Albert ♦♦ 85k

Sorry, maybe I did not explain well. I am not talking about the difference in the number of DE genes but about the difference in the total number of input genes for these methods. I have 23368 genes after running HTSeq (which is before running DESeq to get DE genes), but I have 23284 genes in total in the gene_exp.diff file from cuffdiff, regardless of how many of them are DE genes. As I have used the same BAM file and the same GTF file for both, I expected the same total number of genes, though of course, because of their different algorithms, the number of DE genes should differ.

ADD REPLYlink written 8.0 years ago by narges180

Well that does not actually change what I said.

It does not matter if at any given point in the analysis you have fewer or more inputs when following one method over another. The only thing that matters is the end result. If by the end of the analyses choosing one method gives you a radically different answer than the other, then there is reason to worry. Right now it is premature to fret over a difference of 84 genes.

A robust observation should be reproducible by different analysis methods.

ADD REPLYlink written 8.0 years ago by Istvan Albert ♦♦ 85k

Yes, now I understand. Thank you.

ADD REPLYlink written 8.0 years ago by narges180
8.0 years ago by
Mikael Huss4.7k
Mikael Huss4.7k wrote:

I've found that Cufflinks (and presumably Cuffdiff) sometimes skips certain genes in the output. This has been mentioned several times on SeqAnswers, and I think that someone has said that it is related to Cufflinks' algorithm "choking" on certain regions with very high coverage (sorry, I cannot locate the source now).

Different versions of Cufflinks (and presumably Cuffdiff) can also return different numbers of genes. By contrast, HTSeq always returns counts for all genes in the GTF file.
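
To see which genes account for the gap, one option is to compare the gene identifiers in the two outputs directly. The following is a minimal sketch in Python, not a definitive tool. It assumes the HTSeq output is the usual two-column, tab-separated table from htseq-count (gene ID, count) ending with summary rows such as no_feature (with or without a leading "__", depending on the version), and that the cuffdiff file is a tab-separated gene_exp.diff with a header containing a gene_id column. The function names are made up for illustration.

```python
import csv

# Summary rows that htseq-count appends after the per-gene counts.
# Assumption: newer releases prefix them with "__", older ones do not.
HTSEQ_SPECIAL = {
    "no_feature", "ambiguous", "too_low_aQual",
    "not_aligned", "alignment_not_unique",
}


def htseq_gene_ids(lines):
    """Collect gene IDs from htseq-count output lines (id <TAB> count)."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        gene_id = line.split("\t")[0]
        # Skip the summary counters, which are not genes.
        if gene_id.lstrip("_") in HTSEQ_SPECIAL:
            continue
        ids.add(gene_id)
    return ids


def cuffdiff_gene_ids(lines, column="gene_id"):
    """Collect gene IDs from a cuffdiff gene_exp.diff style table."""
    ids = set()
    for row in csv.DictReader(lines, delimiter="\t"):
        # A field may hold several comma-separated names; split them.
        for name in row[column].split(","):
            if name:
                ids.add(name)
    return ids


def missing_from_cuffdiff(htseq_lines, cuffdiff_lines, column="gene_id"):
    """Return the sorted gene IDs present in HTSeq output but not in cuffdiff."""
    htseq = htseq_gene_ids(htseq_lines)
    cuff = cuffdiff_gene_ids(cuffdiff_lines, column)
    return sorted(htseq - cuff)
```

With real data, open file handles can be passed in place of the line lists, for example `missing_from_cuffdiff(open("htseq_counts.txt"), open("gene_exp.diff"))`, where htseq_counts.txt is a hypothetical file name. If the cuffdiff run went through cuffmerge, the gene_id column may hold XLOC identifiers rather than the GTF gene IDs; in that case passing `column="gene"` may be the more useful comparison, provided the HTSeq IDs are gene names.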

ADD COMMENTlink written 8.0 years ago by Mikael Huss4.7k

Many thanks, that helped a lot.

ADD REPLYlink written 8.0 years ago by narges180