Question: Cuffdiff takes an extremely long time at "Testing for differential expression and regulation in locus"
14 months ago by
tunl40
tunl40 wrote:

I am running Cuffdiff on our server, as follows:

./cuffdiff -o diff_out -b Bowtie2Index/genome.fa -p 8 --library-type fr-firststrand -L control,LS -u genes.gtf \
./result/CTR/accepted_hits.bam ./result/LS/accepted_hits.bam

where the two BAM files are the outputs from two Tophat2 runs on hg19 (using GENCODE annotation) with paired-end reads.

In “Calculating preliminary abundance estimates”, Cuffdiff processed 34053 loci.

At the step “Testing for differential expression and regulation in locus”, Cuffdiff became extremely slow: after 24 hours, it had progressed only from:

Processing Locus chr1:33772366-33896653 [ ] 1%

to:

Processing Locus chr1:108614103-108617141 [* ] 4%

Is this normal?

Is there a way to speed up this process?

I’d greatly appreciate any ideas and suggestions.

Thank you very much!

rna-seq cuffdiff • 1.1k views
modified 14 months ago by Satyajeet Khare1000 • written 14 months ago by tunl40
14 months ago by
aditi.qamra210
Singapore
aditi.qamra210 wrote:

Try running cuffquant to generate an abundances.cxb file for each sample, and then pass the .cxb files to cuffdiff to speed up the process.
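As a rough sketch of this suggestion, assuming Cufflinks 2.2.0 or later is installed and reusing the paths and options from the command in the question. The output directory names quant_CTR and quant_LS, the RUN dry-run switch and the quantify helper are made up for illustration. Putting -u on cuffquant rather than on cuffdiff is also an assumption worth checking against the manual; -b could be added to cuffquant in the same way if bias correction is wanted.

```shell
#!/usr/bin/env bash
# Sketch only: quantify each sample once with cuffquant, then run cuffdiff on the CXB files.
# quant_CTR / quant_LS are hypothetical output directories, not names from the thread.
set -eu

# Dry run by default: commands are printed, not executed.
# Run with RUN= (empty) in the environment to execute them for real.
RUN=${RUN-echo}

quantify() {  # $1 = sample name, $2 = BAM file
  $RUN cuffquant -o "quant_$1" -p 8 --library-type fr-firststrand -u genes.gtf "$2"
}

quantify CTR ./result/CTR/accepted_hits.bam
quantify LS  ./result/LS/accepted_hits.bam

# cuffquant writes abundances.cxb into its output directory;
# cuffdiff takes those files in place of the BAM files.
$RUN cuffdiff -o diff_out -p 8 --library-type fr-firststrand -L control,LS genes.gtf \
  quant_CTR/abundances.cxb quant_LS/abundances.cxb
```

Saved as a script, this prints the three commands so they can be reviewed first; running it with RUN= set to empty executes them.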

written 14 months ago by aditi.qamra210

Thank you so much for your advice!

I looked into the new Cufflinks 2.2.0 workflow (http://cole-trapnell-lab.github.io/cufflinks/manual/). It says: “Cuffquant allows you to compute the gene and transcript expression profiles and save these profiles to files that you can analyze later with Cuffdiff or Cuffnorm. This can help you distribute your computational load over a cluster.”

And, in the Cufflinks 2.2.0 Release Notes, it says: “Cuffquant quantifies gene and transcript expression levels for a single BAM file. These levels are stored in a new binary file type, the CXB file… Because expression levels for each sample are quantified by Cuffquant, Cuffdiff doesn't have to perform this step, which speeds up Cuffdiff runs substantially and lowers their memory footprints.”

I am just slightly confused about how running “Cuffquant + Cuffdiff” speeds up the process. Is the total processing time of “Cuffquant + Cuffdiff” significantly shorter than running Cuffdiff (with BAM inputs) alone? Or does the new workflow mean distributing the computational load over a cluster?

Cuffquant provides pre-calculation of gene expression levels for each sample. I can see this saves time for multiple Cuffdiff runs, since multiple Cuffdiff runs don’t have to re-calculate gene expression levels for the same sample.

The problem I am having right now is that a single Cuffdiff run is extremely slow at the step “Testing for differential expression” (4% progress per day). So does running "Cuffquant + Cuffdiff" also speed up the process for a single Cuffdiff run?

Thank you very much for your help!

modified 14 months ago • written 14 months ago by tunl40

Hi. What they mean by "distribute your computational load over a cluster" is that you can run cuffquant on each file individually and then use the resulting abundances.cxb files downstream, rather than estimating the abundances for all the files (and then doing the differential analysis) in a single cuffdiff run. In my experience, this not only lessens the computational load but also saves a significant amount of time. Running cuffquant + cuffdiff splits the "cuffdiff with BAM files" step into two smaller, more manageable steps.

written 14 months ago by aditi.qamra210

Thank you very much for your further explanation!

So you mean, for a single Cuffdiff run with 2 BAM files, the total process time of “Cuffquant with BAM #1 + Cuffquant with BAM #2 + Cuffdiff” is significantly shorter than “Cuffdiff with 2 BAM files”, right?

When you say “trying to estimate the abundances for all the files”, do you mean the step of “Calculating preliminary abundance estimates” or something else?

In my case, the step of “Calculating preliminary abundance estimates” took just 19 min; but the step of “Testing for differential expression and regulation in locus” progressed only 4% after 24 hours. So I’m wondering if “trying to estimate the abundances for all the files” is actually a part of the step of “Testing for differential expression and regulation in locus”?

Thank you very much!

written 14 months ago by tunl40

It is only shorter if you run Cuffquant for BAM #1 and BAM #2 in parallel (as separate jobs). It also depends entirely on the size of your BAM files (and whether they are comparable to each other, etc.). But if the abundance estimation took hardly any time, then maybe skshare's answer is right that learning the bias parameters is your choke point. Cuffquant will also take time at that step, but since you can run the files in parallel, it should still save you time compared with running cuffdiff directly. I don't have the log of a successfully completed cuffdiff run to check whether there are any steps other than quantifying abundances, learning the bias parameters, and then testing for differential expression.
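Running the two cuffquant jobs side by side on one machine could look roughly like the sketch below. It assumes the 8 cores from the original command are split as 4 threads per job; the quantify_all helper, the RUN dry-run switch and the quant_* directory names are illustrative, not part of Cufflinks. On a cluster, each cuffquant call would instead be submitted as its own job.

```shell
#!/usr/bin/env bash
# Sketch only: start one cuffquant job per sample in the background, then wait for both.
set -eu

RUN=${RUN-echo}        # dry run by default; run with RUN= (empty) to execute
THREADS_PER_JOB=4      # assumption: 8 cores shared by two concurrent jobs

quantify_all() {
  pids=""
  for sample in CTR LS; do
    $RUN cuffquant -o "quant_$sample" -p "$THREADS_PER_JOB" \
      --library-type fr-firststrand -u genes.gtf \
      "./result/$sample/accepted_hits.bam" &
    pids="$pids $!"    # remember the background job's PID
  done
  # Wait for each job; with set -e a failed job stops the script here.
  for pid in $pids; do
    wait "$pid"
  done
}

quantify_all
```

Because both jobs read their BAM files at the same time, memory use and disk I/O roughly double, so the speed-up may be less than twofold.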

written 14 months ago by aditi.qamra210

Thank you very much for your reply!

In fact, the step of “Learning bias parameters” only took 6 min in my run.

It’s the step of “Testing for differential expression and regulation in locus” that is extremely slow. Three days have now passed, and this step is only 10% complete. It’s doing something like: Processing Locus ……………………

From my log, it looks like there are only “Calculating preliminary abundance estimates” and “Learning bias parameters” before “Testing for differential expression and regulation in locus”.

Thanks a lot!

written 14 months ago by tunl40

I am curious now. Why don't you run cuffquant + cuffdiff and tell us how long it took?

written 14 months ago by aditi.qamra210

O.K., I'll give it a try. I need to upgrade the Cufflinks package, since our current version is old and does not have cuffquant.

Thanks a lot for the help!
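A quick way to see whether a given installation already ships cuffquant is to look for it on the PATH. This is a minimal sketch; the has_tool helper is a made-up name, and the version hint in the message rests only on the 2.2.0 release notes quoted earlier in the thread.

```shell
#!/usr/bin/env bash
# Sketch only: check whether cuffquant is available before switching workflows.
set -eu

has_tool() {  # succeeds if $1 resolves to a command on the PATH
  command -v "$1" >/dev/null 2>&1
}

if has_tool cuffquant; then
  echo "cuffquant found: $(command -v cuffquant)"
else
  echo "cuffquant not found on PATH; the installed Cufflinks is probably older than 2.2.0"
fi
```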

written 14 months ago by tunl40
14 months ago by
Satyajeet Khare1000
Pune, India
Satyajeet Khare1000 wrote:

If you do not use the -b option, it will be faster. I think -b is not required for cuffdiff. It's for cufflinks, and it does take time.
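For comparison, here is the command from the question with -b made optional, as a sketch only. The BIAS_FASTA variable, the run_cuffdiff helper and the RUN dry-run switch are made up for illustration; whether dropping -b changes the results in a way that matters for a given data set is not something this sketch can answer.

```shell
#!/usr/bin/env bash
# Sketch only: the original cuffdiff call, with bias correction (-b) switched off by default.
set -eu

RUN=${RUN-echo}              # dry run by default; run with RUN= (empty) to execute
BIAS_FASTA=${BIAS_FASTA-}    # empty = no -b; set to Bowtie2Index/genome.fa to re-enable it

run_cuffdiff() {
  if [ -n "$BIAS_FASTA" ]; then
    $RUN cuffdiff -o diff_out -b "$BIAS_FASTA" -p 8 --library-type fr-firststrand \
      -L control,LS -u genes.gtf \
      ./result/CTR/accepted_hits.bam ./result/LS/accepted_hits.bam
  else
    $RUN cuffdiff -o diff_out -p 8 --library-type fr-firststrand \
      -L control,LS -u genes.gtf \
      ./result/CTR/accepted_hits.bam ./result/LS/accepted_hits.bam
  fi
}

run_cuffdiff
```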

written 14 months ago by Satyajeet Khare1000

Thank you very much for your advice!

Both Cuffdiff and Cuffquant have the -b option, and the manual says “it can significantly improve accuracy of transcript abundance estimates.”

So I’m just wondering what the impact on the results might be if the -b option is not used.

Some people have said online that when they use the -b option, Cuffquant also runs forever in their case, but when they remove it, they get results fast.

Thank you very much for your help!

written 14 months ago by tunl40

Hi,

This reference tried cufflinks with and without the -b option and found that using -b makes the analysis much slower without a detectable improvement in the results. It also mentions that -b is for cufflinks. Even the online manual says the following: "Providing Cufflinks with a multifasta file ... ". So I guessed that it's for cufflinks rather than cuffdiff.

modified 14 months ago • written 14 months ago by Satyajeet Khare1000