Question: Falcon - Estimation of free disc space needed
alslonik (Israel) wrote 4 months ago:

Hi, I am running a de novo assembly of a ~500 Mb plant genome from PacBio reads at 120X coverage with Falcon. It has been running for about 5 days, has completed ~20% of its 160,000 jobs, and currently takes up 2.5T of disk space. My disk holds up to 6T, meaning that I have 3.5T free. I am trying to estimate how much space I will need. How does Falcon handle the temp files it creates, and how much free disk space will it need? Is this something that can be calculated?
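A first-order check on these numbers is a straight-line projection. The sketch below assumes disk usage grows linearly with the fraction of jobs completed and that nothing is cleaned up along the way; neither assumption is confirmed for Falcon, so treat the result as a rough upper-bound guess, not a prediction. The function name is made up for illustration.

```python
def extrapolate_disk_usage(used_tb, fraction_done):
    """Naive linear projection of total disk usage at 100% completion."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return used_tb / fraction_done


used_tb = 2.5         # disk space used so far (from the post)
fraction_done = 0.20  # share of jobs finished (from the post)
capacity_tb = 6.0     # total space available on the disk (from the post)

projected_tb = extrapolate_disk_usage(used_tb, fraction_done)
shortfall_tb = projected_tb - capacity_tb
print(f"projected total: {projected_tb:.1f}T, shortfall: {shortfall_tb:.1f}T")
```

With the figures from the post, the projection comes to about 12.5T, well above the 6T available, which suggests the run would not fit as configured.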

pacbio falcon assembly • 229 views
alslonik (Israel) wrote 4 months ago:

OK, so I am answering my own question in case someone reading this needs it. Disk usage and job counts like these are abnormal for Falcon when assembling a genome smaller than 1 Gb at 120X coverage. The DBsplit option in the cfg file, which sets the block size used at the daligner stage, is very important for the run. In my case I was running with -s 50; changing it to -s 400 reduced the number of jobs to 196 (!!), the runtime to 34 h and the total disk usage to 1.3T.
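To see why the block size matters so much, here is a rough sketch of how the number of database blocks scales with the DBsplit -s value (the target block size in Mbp). It assumes every base of the 120X data set ends up in the database and that each pair of blocks is compared once. It ignores length cutoffs, Falcon's internal filtering and the way comparisons are grouped into jobs, so it will not reproduce the job counts reported above; it only illustrates the trend. The function names are hypothetical.

```python
import math


def estimate_blocks(genome_mb, coverage, block_size_mb):
    """Approximate number of blocks for a given DBsplit -s value (in Mbp)."""
    total_mb = genome_mb * coverage  # total bases in the read set, in Mbp
    return math.ceil(total_mb / block_size_mb)


def estimate_block_pairs(n_blocks):
    """Unordered block-vs-block comparisons, including each block against itself."""
    return n_blocks * (n_blocks + 1) // 2


for s in (50, 400):
    blocks = estimate_blocks(500, 120, s)
    pairs = estimate_block_pairs(blocks)
    print(f"-s {s}: ~{blocks} blocks, ~{pairs} block comparisons")
```

Under these assumptions the block count falls in proportion to the block size, while the number of block comparisons falls roughly with its square, so a larger -s value shrinks the amount of overlap work to schedule very quickly.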


Oof, 34 h to finish a run compared with 5 days to reach 20% of it, that's a nice improvement :) Glad you found a solution!

written 4 months ago by Roxane Boyer

Also, in case someone needs it, here are most of the answers to my questions:

https://github.com/PacificBiosciences/FALCON/wiki/Somethings-to-think-about-for-tuning-assembly-parameters

Important reading for newbies like me who are assembling de novo for the first time.

written 4 months ago by alslonik
Roxane Boyer (France / Toulouse / GeT-Plage) wrote 4 months ago:

Hello alslonik!

I have used Canu rather than Falcon, but I once had a similar issue while trying to assemble a large genome from PacBio data at high depth. I think the space needed is highly dependent on the parameters you use for your run, and it is not that easy to predict.

I'm not sure whether Falcon deletes temporary files and, if so, when it does. To avoid waiting through several days of assembly only to abort because of space issues, you could read https://github.com/marbl/canu/issues/193, an old post where I got a lot of feedback from the Canu developers. They helped me understand why I was producing so much data and how I could reduce the volume (for example, by filtering the raw reads using a histogram of read lengths).
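One way to turn the read-length idea into numbers is sketched below. It builds a simple length histogram and picks the shortest read length that still leaves a chosen coverage when only the longest reads are kept. This is a generic illustration, not the method used by Canu or Falcon; the function names, the toy read lengths and the target coverage are all assumptions made for the example.

```python
def length_histogram(read_lengths, bin_width=1000):
    """Count reads per length bin; keys are the lower edge of each bin."""
    hist = {}
    for length in read_lengths:
        edge = (length // bin_width) * bin_width
        hist[edge] = hist.get(edge, 0) + 1
    return dict(sorted(hist.items()))


def length_cutoff_for_coverage(read_lengths, genome_size, target_coverage):
    """Smallest read length to keep so that the longest reads reach the target coverage."""
    target_bases = genome_size * target_coverage
    total = 0
    # Add reads from longest to shortest until the target is reached.
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    # Not enough data to reach the target: keep every read.
    return min(read_lengths, default=0)


# Toy data, not real PacBio reads.
lengths = [500, 1200, 3000, 8000, 12000, 15000, 20000]
print(length_histogram(lengths, bin_width=5000))

cutoff = length_cutoff_for_coverage(lengths, genome_size=10000, target_coverage=4)
kept = [n for n in lengths if n >= cutoff]
print(f"cutoff: {cutoff}, reads kept: {len(kept)} of {len(lengths)}")
```

Dropping the short reads below such a cutoff reduces the number of reads that have to be overlapped, at the cost of some coverage, so the target coverage is the knob to experiment with.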

In conclusion, buying a disk is not the only solution: you may be able to produce less data by filtering your reads and adjusting some assembly parameters (such as the minimum overlap length between two reads). I'll be glad to do my best to discuss this!

Hope this was a bit helpful,

Cheers,

Roxane


Thanks! Falcon does some internal filtering at the beginning, but yes, I guess you are right, and if there is no additional option I will filter the reads! I'll look deeper into these parameter settings... Thank you!

written 4 months ago by alslonik

You are welcome :) Glad I could help

written 4 months ago by Roxane Boyer