Question: Low number of differentially expressed genes
2.5 years ago, Biogeek (280) wrote:

Dear all,




My pipeline is as follows: Trinity (min_kmer_cov 3, min_glue 3) for assembly, CD-HIT-EST followed by CAP3 for redundancy removal, RSEM for calculating expression values, and edgeR for differential expression.

After redundancy removal I have about 300,000 unigenes (CAP3 contigs plus singletons), which still seems high.

My problem is that RSEM reports an FPKM of 0 for a lot of transcripts, and the differential expression analysis (fold change 2, FDR = 0.05) gives only around 100 differentially expressed genes between some conditions. I was expecting a lot more. In some comparisons there are only 10-30 differentially expressed genes.

Does anyone have any idea what could be wrong? I am fairly new to transcriptomics. Between 85% and 95% of the reads aligned successfully in the RSEM step.



What did you give as input to edgeR, the raw counts or FPKM? If the latter, that is likely the reason. How many replicates do you have? Fewer than 3 might be another reason. 300k transcripts is a lot; I guess there is no reference genome. No real organism has 300k genuinely different transcripts, so this artificially inflates the number of tests conducted by about one order of magnitude. More tests mean higher adjusted p-values, so I suggest further reducing the number of unigenes.
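To make the multiple-testing point concrete, here is a minimal sketch in plain Python of the Benjamini-Hochberg adjustment (the default FDR method in edgeR's topTags). Everything in it is made up for illustration: 100 "signal" genes that all have a raw p-value of 1e-4, padded with either 30,000 or 300,000 null contigs whose p-values are spread evenly over (0, 1). The function names are hypothetical and this is not how edgeR computes anything internally.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone and never exceed 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(n_signal, n_null, signal_p=1e-4, fdr=0.05):
    """Count signal genes passing the FDR cutoff (hypothetical data)."""
    # Assumption: null p-values are evenly spaced rather than random,
    # which keeps the example deterministic.
    nulls = [(k + 0.5) / n_null for k in range(n_null)]
    adjusted = benjamini_hochberg([signal_p] * n_signal + nulls)
    return sum(1 for a in adjusted[:n_signal] if a <= fdr)


print(count_significant(100, 30_000))   # 100
print(count_significant(100, 300_000))  # 0
```

The raw p-values of the 100 signal genes are identical in both runs; only the number of extra tests changes. With ten times as many contigs, their adjusted p-value rises from about 0.03 to about 0.23, so none of them passes FDR 0.05.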

written 2.5 years ago by Michael Dondrup (44k)

The input was the TMM matrix file created in edgeR. Any idea why Trinity is assembling some 600,000 transcripts, which CD-HIT-EST and CAP3 then reduce to 300k?

written 2.5 years ago by Biogeek (280)

Can you provide the sequencing parameters: organism, whether there is a reference genome, approximate genome size, sequencing technologies used, paired or single end, read length, and number of reads?

You should use the raw integer read counts as input for the DE test in edgeR, not counts that have been normalized by any means. If the input counts contain fractional numbers, that is an indication that they have been normalized.
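A quick way to check a matrix for fractional values before loading it, sketched in plain Python. The function name and the tab-separated layout (first column is the contig ID, remaining columns are samples) are assumptions for illustration, not a fixed RSEM or edgeR format. One caveat: RSEM's expected counts can themselves be non-integer, so a flagged column is a hint to look closer rather than proof of normalization.

```python
import csv
import io


def non_integer_columns(matrix_text, delimiter="\t"):
    """Return the sample columns that contain at least one fractional value."""
    reader = csv.reader(io.StringIO(matrix_text), delimiter=delimiter)
    header = next(reader)
    samples = header[1:]  # assumption: first column holds the contig ID
    flagged = set()
    for row in reader:
        for name, value in zip(samples, row[1:]):
            if not float(value).is_integer():
                flagged.add(name)
    # Keep the original column order in the result.
    return [name for name in samples if name in flagged]


# Hypothetical two-sample matrix: sample B holds a fractional value.
example = "contig\tA\tB\nc1\t10\t3.5\nc2\t0\t7.0\n"
print(non_integer_columns(example))  # ['B']
```

For a real file, read its contents into a string first and pass that in.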

Try to follow the simple approach, namely sections 2.5 (creating a DGEList) and 2.9 (estimating dispersion and testing DE between two groups) in the documentation, assuming you have replicates and only a single factor with a two-group comparison. Then post the output of

y # (given your dge.list is in y)

topTags(et) # your test result is in et





modified 2.5 years ago • written 2.5 years ago by Michael Dondrup (44k)