Question: Align Sequentially All Files In A Protein Directory
gravatar for Louis
7.7 years ago by
Louis50 wrote:

Thank you for reading my question. I am attempting to create multiple MUSCLE alignments for thousands of fasta in a directory.

The input would be: vbro1.fasta, vbro2.fasta... vbro6405.fasta; where a given file commonly contains 20 or more proteins.

The output would be: vbro1.afa, vbro2.afa... vbro6405.afa

My first thought was a for loop in bash like this. My second thought was to run this in biopython's MUSCLE wrapper. Either way, I'm concerned that the job is too big and that I'll need to write a script that will interate through the multifasta in more manageable pieces.

for i in outdir; do echo "muscle -in file${i} -out ${i}.afa -maxiters 1 -diags1 -sv" done

I would greatly appreciate some assistance if someone could point me in the right direction.

multiple • 3.1k views
ADD COMMENTlink modified 21 months ago by DataFanatic140 • written 7.7 years ago by Louis50
gravatar for Andreas
7.7 years ago by
Andreas2.4k wrote:

As an alternative, here a bash solution, based on your initial attempt:

for f in $(ls your-input-dir/vbro*.fasta); do
  muscle -in $f -out ${f%.fasta}.afa -maxiters 1 -diags1 -sv;

If you want to run things in parallel, then you could simply echo the muscle commands, redirect into a file and then use xargs to use the amount of processors you like. For example if your Muscle commands are listed in cmd.txt and you want to run four jobs in parallel:

cat cmd.txt | xargs -I cmd --max-procs=4 bash -c cmd &


ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Andreas2.4k

Nice! I like to see bash solutions. It is possible to avoid the creation of the command file: find . -name "vbro.fasta" -type f -print0 | xargs -0 --max-procs=4 -L1 -I FILE muscle -in FILE -out FILE.afa -maxiters 1 -diags1 -sv Caveat: output file names are in .fasta.afa

ADD REPLYlink written 7.7 years ago by Frédéric Mahé2.9k

Also and excellent solution that works very well! And thank you for showing me how to run the jobs in parallel.

ADD REPLYlink written 7.7 years ago by Louis50

I must say I learned new stuff about bash thanks to you.

ADD REPLYlink written 7.7 years ago by Darked894.2k
gravatar for Darked89
7.7 years ago by
Barcelona, Spain
Darked894.2k wrote:

Maybe this:

#!/usr/bin/env python

import glob, os

muscle_options = " -maxiters 1 -diags1 -sv "

for fasta_file in glob.glob("vbro*.fasta"):
    afa_file = fasta_file[:-5] + "afa"
    command = "muscle -in %s -out %s %s" % (fasta_file, afa_file, muscle_options)
    print "executing: " + command

Muscle is quite fast. If you need to split your muscle jobs you may tweak "vbro.fasta" and submit 10 jobs ("vbro0.fasta", "vbro*1.fasta" etc.) as a quick and dirty solution.

ADD COMMENTlink written 7.7 years ago by Darked894.2k

Thank you! Python's glob module is totally new to me. I ran this on a subset of 10 fasta and it worked like a charm. I'm inspecting those alignments now and if all look tight I'll run your script on the entire data set. This is especially useful bc I can use versions of this as I move forward TrimAl and RAxML.

ADD REPLYlink written 7.7 years ago by Louis50
gravatar for  DataFanatic
21 months ago by
DataFanatic140 wrote:

Hi, I have 366 RNAseq samples, reads for each sample are located in individual folders with the following name structure:SAME-DIFFERENT. I would like to write a loop to align all samples using the STAR aligner. Any help is deeply appreciated.

ADD COMMENTlink written 21 months ago by DataFanatic140
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1291 users visited in the last hour