Question: How to run a set or batch of genome assemblies at once in one go?
jerrybug109 wrote:

Hi All,

I'm trying to assemble several dozen prokaryotic genomes using SPAdes. My inputs are paired-end Illumina reads (2x125). I've learned how to use the software but am unfamiliar with programming. When it comes to bioinformatics, I just know basic Unix commands and how to navigate and manipulate files and directories on my university's Linux server.

The command in SPAdes I use for a single genome assembly is: spades.py --careful -1 my_forward.fastq.gz -2 my_reverse.fastq.gz -o /my/output/directory.

It seems time-consuming to run each genome assembly one by one. Is there a way to run the entire set of separate genome assemblies in one go, so as to save time and trouble? Do I need to know Python scripting? I would appreciate your input, thank you!

modified 2.8 years ago by tans0307 • written 3.6 years ago by jerrybug109
Philipp Bayer wrote:

SPAdes, by default, uses 16 threads (says the manual). Are you running it with that default? Is your university server a distributed system (PBS, SLURM etc.) or is it just one big server? If you want to run all of your SPAdes on one big shared server you might want to talk to the system administrators first, they'll get angry if you block the entire thing for days.

If it's one big server and you're OK to go, there are several ways to do this. You can send jobs to the background, one for each assembly, for example in bash:

spades.py --careful -1 my_forward.fastq.gz -2 my_reverse.fastq.gz -o /my/output/directory &

spades.py --careful -1 my_forward2.fastq.gz -2 my_reverse2.fastq.gz -o /my/output/directory2 &

(notice the &)

Then with the "jobs" command you can see all running jobs, and with "fg 1", "fg 2" etc. you can get them back to the foreground, and with CTRL+Z and then entering "bg" you can send them back to the background.

You can also use a for-loop to start all jobs at once:

for file1 in *R1*fastq
do
    file2=${file1/R1/R2}
    out=${file1%%.fastq}_output
    spades.py --careful -1 "$file1" -2 "$file2" -o "$out" &
done

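This iterates over all files that contain "R1" and end in "fastq", gets the second file by replacing R1 with R2, and writes the output to a path based on the R1 file name with ".fastq" cut off and "_output" added. If your files end in ".fastq.gz", as in the question, change the pattern to *R1*fastq.gz and the suffix to .fastq.gz.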
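The file-name handling is the part most likely to go wrong, so it can help to try it on its own first. Here is a minimal sketch of the same parameter-expansion idea for gzipped reads. It assumes a naming scheme such as sampleA_R1.fastq.gz / sampleA_R2.fastq.gz, and pair_info is a made-up helper name; neither comes from this thread. It only prints names and does not run SPAdes.

```shell
#!/usr/bin/env bash
# Sketch only: derive the mate file and an output directory from a forward-read name.
# Assumes a hypothetical naming scheme: sampleA_R1.fastq.gz / sampleA_R2.fastq.gz.

pair_info() {
    local file1=$1
    local file2=${file1/_R1/_R2}          # replace the first "_R1" with "_R2"
    local out=${file1%.fastq.gz}_output   # strip the ".fastq.gz" suffix, append "_output"
    printf '%s %s %s\n' "$file1" "$file2" "$out"
}

pair_info sampleA_R1.fastq.gz
# prints: sampleA_R1.fastq.gz sampleA_R2.fastq.gz sampleA_R1_output
```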

That's the easiest way, but then you can't simply log out of the current session while the jobs are running. For that, have a look at the "screen" command.

modified 3.6 years ago • written 3.6 years ago by Philipp Bayer

Ah, you bring up a good reminder for me - our uni has two servers, one that's shared and one that's not. I'm currently on the shared one, so I'll see if I can work something out. Thanks for the examples! I'll have a go at it given the chance.

written 3.6 years ago by jerrybug109

Your first example says how to run those two jobs simultaneously in the background, correct?

If I want to run a series of jobs sequentially instead of simultaneously, would this do it:

( job1 ; job2) &

or more specifically:

(spades.py --careful -1 my_forward.fastq.gz -2 my_reverse.fastq.gz -o /my/output/directory ; spades.py --careful -1 my_forward2.fastq.gz -2 my_reverse2.fastq.gz -o /my/output/directory2) &

It seems that I can avoid the potential to hog the server if I just let things run one by one instead of simultaneously. Thanks!

modified 3.6 years ago • written 3.6 years ago by jerrybug109

Yes, if you add "&" each job is run in the background, so they all run at the same time, possibly killing your server.

If you want to run the jobs sequentially, you can do it your way with ";". Try this example:

echo "hi" ; sleep 2; echo "hello again"

This will print "hi", then sleep for 2 seconds, then print "hello again".

(Side-note: you can also use "&&",

echo "hi" && sleep 2 && echo "hello again"

this will abort if one of the commands returns an error)
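To see the difference between ";" and "&&" without waiting for a real assembly to fail, here is a small sketch in which "false" (a command that always fails) stands in for a failed job. The variable names are only illustrative.

```shell
#!/usr/bin/env bash
# "false" always returns an error, so it stands in for a failed assembly.

# With ";" the next command runs regardless of the failure.
with_semicolon=$(false ; echo "ran anyway")

# With "&&" the next command is skipped after a failure.
# The "|| true" only keeps this demo going if the script runs under "set -e".
with_and=$(false && echo "ran anyway") || true

echo "semicolon: '${with_semicolon}'"   # prints: semicolon: 'ran anyway'
echo "and:       '${with_and}'"         # prints: and:       ''
```

For a long batch, ";" is probably what you want if one bad sample should not stop the rest, and "&&" if later steps depend on earlier ones.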

You can also run the above for loop without the "&" if you're feeling lazy and don't want to spell out all commands:

for file1 in *R1*fastq
do
    file2=${file1/R1/R2}
    out=${file1%%.fastq}_output
    spades.py --careful -1 "$file1" -2 "$file2" -o "$out"
done

To test it, put echo in front of the command; this only prints each command without running it:

for file1 in *R1*fastq
do
    file2=${file1/R1/R2}
    out=${file1%%.fastq}_output
    echo spades.py --careful -1 "$file1" -2 "$file2" -o "$out"
done

It's easier to put that into a bash script and execute that script via "bash run_all_assemblies.sh"
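As a rough idea of what such a script could look like, here is a sketch of run_all_assemblies.sh. The file naming (*_R1.fastq.gz / *_R2.fastq.gz), the run_all function and the ASSEMBLER variable are all assumptions made for illustration, not anything SPAdes itself requires. ASSEMBLER exists only so the real command can be swapped for echo in a dry run.

```shell
#!/usr/bin/env bash
# run_all_assemblies.sh -- sketch of a sequential batch run.
# Assumptions (not from the thread): reads are named *_R1.fastq.gz / *_R2.fastq.gz,
# and ASSEMBLER is a hypothetical variable that can be set to "echo" for a dry run.

ASSEMBLER=${ASSEMBLER:-spades.py}

run_all() {
    local dir=$1 file1 file2 out
    for file1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$file1" ] || continue              # the glob matched nothing
        file2=${file1%_R1.fastq.gz}_R2.fastq.gz  # mate file
        out=${file1%_R1.fastq.gz}_output         # one output directory per sample
        if [ ! -e "$file2" ]; then
            echo "skipping $file1: no mate file $file2" >&2
            continue
        fi
        # No "&" here, so each assembly finishes before the next one starts.
        "$ASSEMBLER" --careful -1 "$file1" -2 "$file2" -o "$out"
    done
}

# Dry run:  ASSEMBLER=echo bash run_all_assemblies.sh /path/to/reads
# Real run: bash run_all_assemblies.sh /path/to/reads
run_all "${1:-.}"
```

Samples without a mate file are skipped with a message rather than passed to the assembler, so one missing file does not stop the whole batch.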

written 3.6 years ago by Philipp Bayer

Probably also worth pointing out, even if you're the sole user of your server resource, you are still limited by your number of cores. For example, if your server is a 32 core machine, and you try to launch 3 instances of SPAdes each with 16 threads, all that will happen is that those 3 will complete slowly as they fight for CPU time, and it'll probably end up slower than running 3 sequentially - assuming it completes at all.
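The arithmetic behind this can be sketched in a few lines. jobs_that_fit is a made-up helper, not part of SPAdes; it simply divides the core count by the threads each job uses (SPAdes takes the thread count via -t / --threads) and never goes below 1.

```shell
#!/usr/bin/env bash
# Sketch: how many assemblies fit on the machine at once?
# jobs_that_fit is a hypothetical helper using integer division, with a minimum of 1.

jobs_that_fit() {
    local cores=$1 threads_per_job=$2
    local n=$(( cores / threads_per_job ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

jobs_that_fit 32 16   # prints 2
jobs_that_fit 32 8    # prints 4

# On Linux, "nproc" reports the number of available cores:
# jobs_that_fit "$(nproc)" 16
```

So on the 32 core machine in the example, two 16-thread assemblies at a time is the most that fits without the jobs competing for cores. Memory may well be the tighter limit, and this sketch does not account for it.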

written 2.8 years ago by Joe14k

@Philipp Bayer Thank you for your example. I was able to apply this to another program (UPARSE fastq_mergepairs) using a directory of over 700 files. This saved me a lot of time.

written 2.0 years ago by Tawny130
tans0307 wrote:

Hello,

I am trying to combine multiple files into one big assembly with SPAdes, so that I just get one scaffolds file.

The responses above are helpful for multiple assemblies, but I am just aiming for one.

I appreciate any suggestions that I can get, thanks!

written 2.8 years ago by tans0307

Merge the paired-end read files into two files.

Given many paired-end read files:

A_1.fq.gz   A_2.fq.gz  B_1.fq.gz  B_2.fq.gz ...

Merge them:

gzip -d -c  *_1.fq.gz | gzip -c > merged_1.fq.gz
gzip -d -c  *_2.fq.gz | gzip -c > merged_2.fq.gz

PS: replace gzip with pigz if you have it installed; it is much faster than gzip.

PS2: gzip -d -c is equivalent to zcat.

PS3: if your files are already decompressed, just use cat *_1.fq > merged_1.fq (and likewise for the *_2.fq files).
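One thing to watch: the output name merged_1.fq.gz itself matches the *_1.fq.gz pattern, so rerunning the merge in the same directory would fold the old merged file into the new one. Below is a sketch that sidesteps this by writing into a separate directory. merge_reads and the "merged" directory name are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: merge gzipped paired-end read files into one pair, writing to a separate
# directory so the merged output can never match the input glob.
# merge_reads and the "merged" directory name are hypothetical.

merge_reads() {
    local in_dir=$1 out_dir=$2
    mkdir -p "$out_dir"
    # The shell sorts glob matches, so A, B, ... come in the same order in both files.
    gzip -d -c "$in_dir"/*_1.fq.gz | gzip -c > "$out_dir/merged_1.fq.gz"
    gzip -d -c "$in_dir"/*_2.fq.gz | gzip -c > "$out_dir/merged_2.fq.gz"
}

# Usage: merge_reads . merged
```

Keeping the two merged files in the same sample order matters because the assembler pairs the reads by their position in the files.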

modified 2.8 years ago • written 2.8 years ago by shenwei356

@shenwei356, thanks a lot for your help! :)

written 2.8 years ago by tans0307

You should probably ask this as a separate question, not as an answer in another thread.

written 2.8 years ago by Joe14k

@jrj.healey, noted! This is my first time posting, thanks. :)

written 2.8 years ago by tans0307