Question

Tutorial: Parallel AUGUSTUS Execution via GNU Parallel

11 months ago

Vijith ▴ 100

Hi, fellow bioinformaticians,

I recently ran AUGUSTUS for ab initio gene prediction on my 2.6GB plant genome, using an 8-core server with 244GB of memory.

However, AUGUSTUS used only one CPU core, so the run was slow (about 25MB of *.gff3 output per day). I reviewed the AUGUSTUS documentation but couldn't find a parameter for setting the number of CPU cores.

To work around this, I split the main FASTA file into as many chunks as there are cores and used GNU parallel to run one AUGUSTUS instance per chunk. I've documented my protocol in a tutorial on my page and would love to share it with the community.

Are there alternative methods to make AUGUSTUS utilize multiple cores? Please share your insights. Link to tutorial: https://lifescienceshub.wixsite.com/lifesciencehub/post/how-to-leverage-gnu-parallel-to-utilize-multiple-cores-while-running-augustus
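For readers who want a feel for the splitting step before opening the tutorial, here is a minimal Python sketch. It is not the script from the linked tutorial: the function name split_fasta_records, the round-robin assignment, and the demo sequences are all assumptions made for illustration. It only groups records in memory; writing each chunk to a file and passing the files to GNU parallel is left out.

```python
def split_fasta_records(fasta_text, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of record strings."""
    records, current = [], []
    for line in fasta_text.splitlines():
        # A header line closes the previous record (if any) and starts a new one.
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")

    # Record i goes to chunk i modulo n_chunks; records are never cut in half.
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return chunks


# Hypothetical toy input, for illustration only.
demo = ">seq1\nACGT\nACGT\n>seq2\nGG\n>seq3\nTTTT\n"
chunks = split_fasta_records(demo, 2)
for i, chunk in enumerate(chunks):
    print(f"chunk {i}: {[rec.splitlines()[0] for rec in chunk]}")
```

Round-robin keeps the record counts even but ignores sequence length, so chunk sizes can still differ a lot when the assembly mixes chromosomes and small scaffolds.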

augustus genome ngs parallel • 6.8k views

6 days ago

Meiers • 0

Hi, I know I'm a little late to the thread, but I wanted to share a potential improvement to the script. I used the Python script provided by Vijithkumar and noticed that the subsets were often unevenly sized. In the left pane of the image below, the largest subset is 229MB while the second largest is only 6MB.

My idea was to assign sequences to subsets so that the size of the largest subset is minimized, which should reduce the overall runtime. Several algorithms tackle this problem; I modified the Python script to use the prtpy package, which provides a greedy number partitioning algorithm. The right pane shows the subset sizes after this change: the largest subset is nearly 5x smaller than before.

The arguments for the updated Python script are the same, but you will first need to run: $ pip install prtpy so that Python can find the package. Here is a link to the GitHub repo. I have called the updated script greedy_fasta_subset.py.
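To show the idea without any dependencies, greedy number partitioning can be sketched in plain Python as below. This is not greedy_fasta_subset.py and it does not call prtpy; the function name greedy_partition and the sequence lengths are hypothetical. The rule is: sort sequences from longest to shortest, then always add the next one to the subset that is currently smallest.

```python
import heapq


def greedy_partition(sizes, n_bins):
    """Assign named items to n_bins, always filling the currently lightest bin."""
    # Heap of (current_total, bin_index); the lightest bin is always on top.
    heap = [(0, i) for i in range(n_bins)]
    heapq.heapify(heap)
    bins = [[] for _ in range(n_bins)]
    # Place the longest sequences first.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        total, idx = heapq.heappop(heap)
        bins[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    totals = [sum(sizes[name] for name in b) for b in bins]
    return bins, totals


# Hypothetical sequence lengths, for illustration only.
lengths = {"chr1": 90, "chr2": 70, "chr3": 50, "scaf1": 30, "scaf2": 20, "scaf3": 10}
bins, totals = greedy_partition(lengths, 3)
print(bins)
print(totals)
```

The greedy rule is a heuristic, so it does not always find the best possible split, but it is fast and usually gets the largest subset close to the average.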

I hope this is helpful! This is also my first post on this website, so apologies if I miss something.

[Image: subset sizes before (left pane) and after (right pane) greedy partitioning]

score 2 · Accepted Answer · 2024-10-05

11 months ago

Pierre Lindenbaum 166k

Nice, but you would be better off learning to use a workflow manager such as Snakemake or Nextflow. See the Nextflow sketch below (not tested):

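workflow {
    // genome FASTA given with --fasta
    ch0 = Channel.fromPath(params.fasta)
    // index the FASTA; the .fai lists one contig per line
    ch1 = FAIDX(ch0).output
    // first column of the .fai = contig name
    ch2 = ch1.splitCsv(header:false,sep:'\t').map{it[0]}
    // one AUGUSTUS job per contig: [fai, fasta, contig]
    ch3 = APPLY_AUGUSTUS(ch1.combine(ch0).combine(ch2))
    // concatenate all per-contig GFF3 files
    MERGE(ch3.output.collect())
}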


process FAIDX {
input:
    path(fasta)
output:
    path("*.fai"),emit:output
script:
"""
samtools faidx ${fasta}
"""
}

process APPLY_AUGUSTUS {
input:
    tuple path(fai),path(fasta),val(contig)
output:
    path("${contig}.gff3"),emit:output
script:
"""
samtools faidx ${fasta} ${contig} > tmp.fa
augustus --species=maize --progress=true --gff3=on tmp.fa > "${contig}.gff3"
rm tmp.fa
"""
}


process MERGE {
input:
    path(gff3)
output:
    path("final.output.gff3"),emit:output
script:
"""
cat ${gff3} > final.output.gff3
"""
}

Thank you so much, Dr. Lindenbaum, for the valuable comment. I'm not very experienced with Nextflow, but I'd like to test this out. Could you explain the code a little, or point me to resources for understanding it?

Reply written 11 months ago by Vijith ▴ 100

Start here: https://www.nextflow.io/

Reply written 11 months ago by GenoMax 153k

score 2 · Accepted Answer · 2024-10-07

Instead of writing a Python script for the splitting, you can use --block -1 --pipepart --cat --recend "\n" --recstart ">":

parallel --block -1 -a big.fasta --pipepart --cat --recend "\n" --recstart ">" augustus [...] {}

This will automatically split the fasta file into 1 chunk per CPU thread. It will save the chunks into temporary files before calling augustus.

If augustus can read from stdin (e.g. by: augustus -) you can bypass generating the temporary files:

parallel --block -1 -a big.fasta --pipepart --recend "\n" --recstart ">" augustus [...] -

If the runtime of augustus varies a lot between chunks, it may make sense to split big.fasta into more chunks, say 3 per CPU thread: --block -3. That way, if a single chunk takes a very long time, the other CPU threads will pick up the remaining chunks.
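The effect of using more chunks than CPU threads can be illustrated with a small scheduling simulation in Python. This is a sketch under stated assumptions, not a model of GNU parallel's internals: the function name makespan and the chunk runtimes are made up, and it simply assumes each idle worker takes the next chunk in the queue.

```python
import heapq


def makespan(runtimes, n_workers):
    """Total wall time when each idle worker takes the next chunk in queue order."""
    # Each entry is the time at which a worker becomes free again.
    workers = [0] * n_workers
    heapq.heapify(workers)
    for t in runtimes:
        free_at = heapq.heappop(workers)  # earliest available worker
        heapq.heappush(workers, free_at + t)
    return max(workers)


# Hypothetical runtimes in hours on 2 CPU threads, for illustration only.
one_chunk_per_thread = [20, 4]                  # like --block -1
three_chunks_per_thread = [8, 6, 6, 2, 1, 1]    # same 24 h of work, like --block -3

print(makespan(one_chunk_per_thread, 2))
print(makespan(three_chunks_per_thread, 2))
```

In this toy case the coarse split leaves one thread idle for most of the run, while the finer split lets both threads finish at about the same time. Real chunks will not divide this neatly, so the gain depends on how the slow regions fall across chunks.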