Question

Low number of differentially expressed genes

9.4 years ago

Biogeek ▴ 480

Dear all,

My pipeline is as follows: Trinity (min_kmer_cov 3, min_glue 3), then CD-HIT-EST followed by CAP3 for redundancy removal, RSEM for calculating expression values, and edgeR for differential expression.

After redundancy removal I have about 300,000 unigenes (CAP3 contigs and singletons combined), which still seems high.

Now my problem is with the FPKM values from RSEM: I seem to be getting a lot of 0 values for transcripts, and in the differential expression analysis (fold change 2, FDR = 0.05) I am getting only around 100 differentially expressed genes between some conditions. I was expecting a lot more. In some comparisons there are only 10-30 differentially expressed genes.

Does anyone have any idea what could be wrong? I am a bit of a noob at transcriptomics. RSEM reported alignment rates of 85-95% for the reads.

Thanks

RNA-Seq differential-expression edgeR • 3.4k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.4 years ago by Biogeek ▴ 480

What did you give edgeR as input: raw counts or FPKM? If the latter, that is likely the reason. How many replicates do you have? Fewer than 3 might be another reason. 300k transcripts is a lot; I guess there is no reference genome. No real organism has 300k genuinely distinct transcripts, so the number of tests conducted is artificially inflated by about one order of magnitude. More tests mean higher adjusted p-values, so I suggest reducing the number of unigenes further.
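To illustrate the point about more tests raising adjusted p-values, here is a minimal sketch in plain Python of the Benjamini-Hochberg adjustment (the default method used by edgeR's topTags). The p-values and the 30,000 vs. 300,000 test counts are made-up numbers chosen only for illustration, not results from this dataset.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical data: 5 genes with strong signal, everything else null-like.
signal = [1e-7, 1e-6, 2e-6, 4e-6, 8e-6]
null_p = 0.5

small = bh_adjust(signal + [null_p] * (30_000 - len(signal)))
large = bh_adjust(signal + [null_p] * (300_000 - len(signal)))

# Count how many of the 5 signal genes pass FDR <= 0.05 in each setting.
n_small = sum(q <= 0.05 for q in small[:5])
n_large = sum(q <= 0.05 for q in large[:5])
print(n_small, n_large)
```

With these made-up numbers, the same five raw p-values all pass the FDR cutoff among 30,000 tests, but only the smallest one passes among 300,000 tests. Real data will behave less cleanly, but the direction of the effect is the same.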

ADD REPLY • link 9.4 years ago by Michael 56k

The input was the TMM matrix file created in edgeR. Any idea why Trinity is assembling 600,000-odd transcripts, which CD-HIT-EST and CAP3 then reduce to 300k?

ADD REPLY • link 9.4 years ago by Biogeek ▴ 480

Can you provide the sequencing parameters: organism, whether there is a reference genome, approximate genome size, sequencing technologies used, paired or single end, read length, and number of reads?

You should use the raw integer read counts as input for the DE test in edgeR, not counts normalized by any means. If the input counts contain fractional numbers, that is an indication that they have been normalized.
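A quick way to apply the fractional-number check is sketched below in plain Python. The function name and the two small matrices are hypothetical, made up for illustration. One caveat: RSEM's expected counts can themselves be non-integer because multi-mapping reads are assigned fractionally, so treat this as a heuristic rather than proof of normalization.

```python
def looks_like_raw_counts(matrix, tol=1e-9):
    """Return True if every value is a non-negative whole number."""
    return all(
        v >= 0 and abs(v - round(v)) <= tol
        for row in matrix
        for v in row
    )


# Hypothetical gene-by-sample matrices (rows = genes, columns = samples).
raw_like = [[10, 0, 3], [250, 13, 7]]
normalized_like = [[9.73, 0.0, 3.41], [243.8, 12.6, 7.2]]

print(looks_like_raw_counts(raw_like))         # whole numbers only
print(looks_like_raw_counts(normalized_like))  # contains fractional values
```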

Try to follow the simple approach, namely sections 2.5 (creating a DGEList) and 2.9 (estimating dispersion and testing DE between two groups) in the documentation: https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf, assuming you have replicates and only a single-factor, two-group comparison. Then post the output of:

y # (given your dge.list is in y)
topTags(et) # your test result is in et

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 9.4 years ago by Michael 56k