BUSCOs running time
4.9 years ago
qwzhang0601 ▴ 70

Hello:

I am trying to use BUSCO to train Augustus for a new genome. I am training it on 20 nodes and it has been running for about 20 days. Is there any information I can use to estimate how long the training will take?

Since the following message appeared, I have not seen any updated information. Can I estimate how long it will take?

WARNING 02/17/2017 19:23:40 => Optimizing augustus metaparameters, this may take a very long time...

#######
#below is the detail message from Busco
INFO    ****************** Start a BUSCO 2.0 analysis, current time: 02/17/2017 13:29:03 ******************
INFO    The lineage dataset is: mammalia_odb9 (eukaryota)
INFO    Mode is: genome
INFO    Maximum number of regions limited to: 3
INFO    To reproduce this run: python /public/apps/busco/v2.0/python.2.7.8/BUSCO.py -i ../rawData/CasCan.a.10000.fasta -o Beaver -l /public/apps/busco/v2.0/python.2.7.8/mammalia_odb9/ -m genome -c 20 --long -sp human
INFO    Check dependencies...
INFO    Check input file...
INFO    Temp directory is ./tmp/

INFO    ****** Phase 1 of 2, initial predictions ******
INFO    ****** Step 1/3, current time: 02/17/2017 13:29:18 ******
INFO    Create blast database...
INFO    [makeblastdb]   Building a new DB, current time: 02/17/2017 13:29:19
INFO    [makeblastdb]   New DB name:   /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/tmp/Beaver_1784058097
INFO    [makeblastdb]   New DB title:  ../rawData/CasCan.a.10000.fasta
INFO    [makeblastdb]   Sequence type: Nucleotide
INFO    [makeblastdb]   Keep MBits: T
INFO    [makeblastdb]   Maximum file size: 1000000000B
INFO    [makeblastdb]   Adding sequences from FASTA; added 10000 sequences in 14.3675 seconds.
INFO    Running tblastn, writing output to /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/run_Beaver/blast_output/tblastn_Beaver.tsv...
INFO    ****** Step 2/3, current time: 02/17/2017 14:13:07 ******
INFO    Getting coordinates for candidate regions...
INFO    Pre-Augustus scaffold extraction...
INFO    Running Augustus prediction using human as species:
INFO    [augustus] Please find all logs related to Augustus here: /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/run_Beaver/augustus_output/augustus.log
INFO    02/17/2017 14:13:24 =>  0% of predictions performed (4453 to be done)
INFO    02/17/2017 14:21:55 =>  10% of predictions performed (490/4453 candidate regions)
INFO    02/17/2017 14:30:23 =>  20% of predictions performed (936/4453 candidate regions)
INFO    02/17/2017 14:38:25 =>  30% of predictions performed (1381/4453 candidate regions)
INFO    02/17/2017 14:46:05 =>  40% of predictions performed (1826/4453 candidate regions)
INFO    02/17/2017 14:53:21 =>  50% of predictions performed (2272/4453 candidate regions)
INFO    02/17/2017 15:00:53 =>  60% of predictions performed (2717/4453 candidate regions)
INFO    02/17/2017 15:08:04 =>  70% of predictions performed (3162/4453 candidate regions)
INFO    02/17/2017 15:15:06 =>  80% of predictions performed (3607/4453 candidate regions)
INFO    02/17/2017 15:23:43 =>  90% of predictions performed (4053/4453 candidate regions)
INFO    02/17/2017 15:31:05 =>  100% of predictions performed
INFO    Extracting predicted proteins...
INFO    ****** Step 3/3, current time: 02/17/2017 15:32:45 ******
INFO    Running HMMER to confirm orthology of predicted proteins:
INFO    02/17/2017 15:32:45 =>  0% of predictions performed (4238 to be done)
INFO    02/17/2017 15:32:51 =>  10% of predictions performed (468/4238 candidate proteins)
INFO    02/17/2017 15:33:02 =>  20% of predictions performed (891/4238 candidate proteins)
INFO    02/17/2017 15:33:21 =>  30% of predictions performed (1315/4238 candidate proteins)
INFO    02/17/2017 15:33:47 =>  40% of predictions performed (1739/4238 candidate proteins)
INFO    02/17/2017 15:34:20 =>  50% of predictions performed (2163/4238 candidate proteins)
INFO    02/17/2017 15:35:00 =>  60% of predictions performed (2586/4238 candidate proteins)
INFO    02/17/2017 15:35:47 =>  70% of predictions performed (3009/4238 candidate proteins)
INFO    02/17/2017 15:36:41 =>  80% of predictions performed (3433/4238 candidate proteins)
INFO    02/17/2017 15:37:43 =>  90% of predictions performed (3857/4238 candidate proteins)
INFO    02/17/2017 15:38:39 =>  100% of predictions performed
INFO    Results:
INFO    C:51.0%[S:50.5%,D:0.5%],F:10.1%,M:38.9%,n:4104
INFO    2094 Complete BUSCOs (C)
INFO    2073 Complete and single-copy BUSCOs (S)
INFO    21 Complete and duplicated BUSCOs (D)
INFO    413 Fragmented BUSCOs (F)
INFO    1597 Missing BUSCOs (M)
INFO    4104 Total BUSCO groups searched

INFO    ****** Phase 2 of 2, predictions using species specific training ******
INFO    ****** Step 1/3, current time: 02/17/2017 15:38:40 ******
INFO    Extracting missing and fragmented buscos from the ancestral_variants file...
INFO    Running tblastn, writing output to /gs/gsfs0/users/qzhang/Beaver/data/train_augustus_busco/run_Beaver/blast_output/tblastn_Beaver_missing_and_frag_rerun.tsv...
INFO    Getting coordinates for candidate regions...
INFO    ****** Step 2/3, current time: 02/17/2017 18:40:21 ******
INFO    Training Augustus using Single-Copy Complete BUSCOs:
INFO    02/17/2017 18:40:22 =>  Converting predicted genes to short genbank files...
INFO    02/17/2017 19:23:33 =>  All files converted to short genbank files, now running the training scripts...
WARNING 02/17/2017 19:23:40 => Optimizing augustus metaparameters, this may take a very long time...
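
For the steps that do print percentage lines (the Augustus prediction and HMMER steps in the log above), a rough linear estimate of the remaining time can be derived from the timestamps. The metaparameter optimization step prints no such progress lines in this log, so the sketch below cannot estimate that step. The function name and the regular expression are illustrative assumptions based only on the line format shown here; they are not part of BUSCO.

```python
import re
from datetime import datetime

# Matches progress lines in the format shown in the log above, e.g.
# INFO    02/17/2017 14:21:55 =>  10% of predictions performed (490/4453 candidate regions)
PROGRESS = re.compile(
    r"(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) =>\s+(\d+)% of predictions performed"
)


def estimate_remaining_seconds(log_lines):
    """Linear estimate of remaining seconds for ONE step.

    Pass only the lines of a single step; mixing steps gives nonsense.
    Returns None when fewer than two usable progress lines are found.
    """
    points = []
    for line in log_lines:
        match = PROGRESS.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
            points.append((stamp, int(match.group(2))))
    if len(points) < 2:
        return None
    (t0, p0), (t1, p1) = points[0], points[-1]
    if p1 <= p0:
        return None
    elapsed = (t1 - t0).total_seconds()
    # Assume the remaining percentage proceeds at the average rate so far.
    return elapsed * (100 - p1) / (p1 - p0)
```

Fed the 0% and 50% lines of the Augustus prediction step above, this extrapolates that the second half takes as long as the first. It is only a first-order guess, since the per-region cost is not constant.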

BUSCOs Augustus • 3.6k views
0
Entering edit mode

Hello !

I did not clearly understand whether you want to know how long BUSCO will run, how long Augustus calling BUSCO will run, or just how long the optimization part will take.

I want to know whether I can estimate when BUSCO will finish, so that I can then use the trained Augustus to annotate a new genome. Since BUSCO has been running for 20 days, I am afraid I will have to wait a really long time. In that case, I would have to find another solution rather than wait.

Thanks

Wow, 20 days, that's impressive! Are you using BUSCO v1 or v2?

Anyway, with either version, my runs have never lasted more than one hour using several cores (~30 cores).

Sorry, I still don't understand whether you are launching BUSCO alone or within some kind of wrapper that is part of the Augustus training. I'm not familiar with training Augustus using BUSCO, which is why I'm a bit confused.

Are you launching BUSCO independently in a terminal?

I used v2. It takes a long time because I used the --long parameter, which turns on Augustus optimization mode for self-training.

Thanks

As long as a program is running (consuming CPU cycles in top etc.), there is not much you can do but be patient. But if you see error messages, or output files that are no longer growing, you may want to consider aborting.
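
One way to check the "output files no longer growing" condition is to look at how long ago the files were last modified. The following is a minimal sketch; the function name is hypothetical, and which files to watch (for example the augustus.log mentioned in the BUSCO output, or the species parameter files under the Augustus config directory) is an assumption you would adapt to your own run.

```python
import os
import time


def shortest_idle_seconds(paths, now=None):
    """Return the idle time of the most recently modified existing file.

    Idle time is the number of seconds since the last modification.
    Returns None if none of the given paths exist.
    """
    now = time.time() if now is None else now
    idle_times = [now - os.path.getmtime(p) for p in paths if os.path.exists(p)]
    return min(idle_times) if idle_times else None
```

If the returned value keeps growing over several checks while the process still shows CPU usage, the job may be computing without writing, so this is a hint rather than proof that it has stalled.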

Have you tried running it without --long? It only takes a few hours without --long in my experience.

Thanks. No. According to the manual, the --long parameter is valuable when using BUSCO sets to train gene predictors. Since my goal is to train Augustus on a new genome, I used this parameter. If I have to wait another 20 days or even longer, I may have to drop the --long parameter. So I wonder whether I can predict how long it will take, so that I can make a plan. The program still seems to be running, and the Augustus model for the new genome appears to have been updated early this morning.

Were you able to find a workaround for this? I am having the same issue.

UPDATE:

For me, with ~63k scaffolds, it took ~2 days with 1 CPU. My problem is that the script fails on my end when multiple cores are given with the -c option.

Finally, I found we need to install the perl module “Parallel::ForkManager” to run optimize_augustus.pl in parallel. You can look at the "augustus.log" file, and check if you have the same problem.

I found the following message in the "augustus.log" file, so I installed Parallel::ForkManager, and after that it ran much faster:

.....
Writing exon model parameters [1] to file /gs/gsfs0/users/qzhang/tools/augustus3.2.3/config/species/BUSCO_NMR_100kb_Long_2426001641/BUSCO_NMR_100kb_Long_2426001641_exon_probs.pbl.
The perl module Parallel::ForkManager is required to run optimize_augustus.pl in parallel.
Install this module first. On Ubuntu linux install with
sudo apt-get install libparallel-forkmanager-perl
Will now run sequentially (--cpus=1)...
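
To check for this situation without reading the whole log by hand, something like the following sketch could be used. It assumes the log contains the exact wording quoted in this post, and it tests for the Perl module by asking perl to load it; both function names are hypothetical helpers, not part of BUSCO or Augustus.

```python
import shutil
import subprocess

# Phrases taken from the augustus.log excerpt quoted in this thread.
FALLBACK_MARKERS = (
    "Parallel::ForkManager is required",
    "Will now run sequentially",
)


def fell_back_to_sequential(log_text):
    """True if the log text contains the sequential-fallback message."""
    return any(marker in log_text for marker in FALLBACK_MARKERS)


def perl_module_available(module):
    """True if `perl -M<module> -e 1` exits successfully."""
    perl = shutil.which("perl")
    if perl is None:
        return False  # no perl on PATH, so the module cannot be loaded
    result = subprocess.run([perl, f"-M{module}", "-e", "1"], capture_output=True)
    return result.returncode == 0
```

For example, reading augustus.log into a string and passing it to `fell_back_to_sequential`, then calling `perl_module_available("Parallel::ForkManager")`, would indicate whether the optimization ran on one CPU and whether the module is now installed for the perl found on PATH (which may differ from the perl the cluster job actually uses).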

How many threads (-c N or --cpu N) are you using? If you have a large number of processors, set this accordingly; it will greatly increase the speed (the default is one, I think).

FYI, none of my BUSCO runs (even for genomes >2.5 Gbp) has ever taken more than 12 hours (with 16 procs), but I have never run with the --long option either.

It has been quite some time since the last post on this thread. I don't know if it is still relevant, but in case somebody lands on this post, I want to share my experience.

Interestingly, in BUSCO training mode using the --long option, it failed for me with the -c option, no matter how many threads I provided. So in my case I didn't use the -c option; it ran on 1 thread and took ~2 days.

In the non-training mode, i.e. evaluation mode, -c did work; this time I provided 55 threads and it completed in an hour or so. It was much faster because this time I was not training Augustus.