Question: Getting different results with BUSCO for different numbers of threads
2
gravatar for Chris Cole
3.4 years ago by
Chris Cole740
Scotland
Chris Cole740 wrote:

I've been recommended to use BUSCO in favour of CEGMA for the assessment of some de novo transcriptome assemblies, but I'm seeing inconsistent results with repeated runs of BUSCO and different numbers of threads (1-6). Despite rerunning with the same parameters I'm getting results which vary by upto 2 BUSCOs. I can't see anywhere suggesting that BUSCO is non-deterministic so am wondering if this is a bug.

I'm using the latest version of BUSCO v1.22, hmmer v3.1b2, python v3.4.3 and blast 2.2.29+ against the eukaryotic dataset in transcriptome mode: python3 BUSCO_v1.22.py -o test -in transcripts.fasta -l /db/busco/eukaryota -m trans -c 2

Grepping for the complete BUSCOs I see this:

./run_test_pe1/short_summary_test_pe1: 302 Complete BUSCOs

./run_test_pe2/short_summary_test_pe2: 300 Complete BUSCOs

./run_test_pe3/short_summary_test_pe3: 302 Complete BUSCOs

./run_test_pe4/short_summary_test_pe4: 301 Complete BUSCOs

./run_test_pe5/short_summary_test_pe5: 301 Complete BUSCOs

./run_test_pe6/short_summary_test_pe6: 302 Complete BUSCOs

Anyone else seen this? Any comments appreciated.

bug busco inconsistency • 1.5k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Chris Cole740
1

Don't have an answer but a wild speculation. The BUSCO python script seems to call hmmsearch which as a --seed option which is not set by the python script, which might cause some randomness in results. Second might be some sort of random sampling for calculations (evalue may be). The results should not be drastically different though. The seed option could be set by editing the python script to check if results become consistent.

ADD REPLYlink written 3.4 years ago by microfuge1.5k

Ah, thanks. Will have a poke at that.

ADD REPLYlink written 3.4 years ago by Chris Cole740

I added a --seed 12345 option to all the hmmsearch commands I found in the script but still got inconsistent results. Not really a surprise as the hmmsearch default is to set a fixed seed of '42'.

I did try '--seed 0' which does generate a random seed everytime, but that didn't change anything either.

Thanks for the suggestion anyway.

ADD REPLYlink written 3.4 years ago by Chris Cole740
6
gravatar for Chris Cole
3.4 years ago by
Chris Cole740
Scotland
Chris Cole740 wrote:

Fixed it!

There's a bug in the code where BUSCO reads in the filenames in 'translated_proteins/' with os.listdir(), which returns them in an arbitrary order. Unfortunately, this order matters for BUSCO to define whether a HMMer hit is complete, duplicated or fragmented - I've not fully pinned down why.

Sorting the list of filenames fixes the inconsistency, and shouldn't have any downstream side-effects. Although, I can't be sure.

It's a trivial change and here's the patch for anyone who is interested. I've also sent it to the author.

--- BUSCO_v1.22.py      2016-07-01 16:10:50.000000000 +0100
+++ BUSCO_v1.22c.py     2016-09-20 15:43:53.850117074 +0100
@@ -833,6 +833,7 @@
 if mode == 'trans' or mode == 'transcriptome':
   print('*** Running HMMER to confirm transcript orthology ***')
   files = os.listdir('%stranslated_proteins/' % mainout)
+  files.sort()
   if os.path.exists('%shmmer_output' % mainout) == False:
     subprocess.call(['mkdir', '%shmmer_output' % mainout])
   group = ''; grouplist = []
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Chris Cole740
1

Ah great!! Thanks for posting the solution and sorry for the misdirection earlier.

ADD REPLYlink written 3.4 years ago by microfuge1.5k

Note that the number of cpu threads was unrelated to the issue.

ADD REPLYlink written 3.4 years ago by Chris Cole740
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1672 users visited in the last hour