Question: Creating alignments for 800+ multifasta files
5.9 years ago by
Caroline S20
Sweden
Caroline S20 wrote:

I have ~800 multifasta files (one gene/file) for which I would like to create independent alignments (preferably with MAFFT) automatically.

So instead of typing 'mafft --retree 1 input_file > output_file' for each file, I am looking for a way to have the next FASTA file picked up and aligned automatically once the previous alignment has finished.

mafft alignment • 2.9k views
modified 5.9 years ago by _r_am • written 5.9 years ago by Caroline S
5.9 years ago by
_r_am30k
Baylor College of Medicine, Houston, TX
_r_am30k wrote:

Use a UNIX for loop maybe?

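# Collect the input files first, so the output file created below
# (which also ends in .fasta) is not picked up by the glob
FILES=(*.fasta)
OUT_PUT=init_output.fasta
TMP_IN=$(mktemp)

# mafft expects a file name, so concatenate the first two files into a temporary input
cat "${FILES[0]}" "${FILES[1]}" > "${TMP_IN}"
mafft --retree 1 "${TMP_IN}" > "${OUT_PUT}"

# Merge each remaining file with the running result and realign
for fasta_file in "${FILES[@]:2}"
do
cat "${OUT_PUT}" "${fasta_file}" > "${TMP_IN}"
mafft --retree 1 "${TMP_IN}" > "${OUT_PUT}"
done
rm -f "${TMP_IN}"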

This should hold the final output in ${OUT_PUT} once done. I ran this in my head though, so take a backup to avoid loss of data owing to my errors in contingency planning.

written 5.9 years ago by _r_am

Ram, I think OP wants to align each of the 800+ FASTA files individually and write the corresponding output. So, there will be 800+ output files. Assuming OP has the FASTA files with .fasta extension, the following script will run MAFFT and write output with suffix .out added to the input file name.

for fasta_file in *.fasta
do
# Quote the name so that file names containing spaces are handled correctly
mafft --retree 1 "$fasta_file" > "$fasta_file.out"
done
written 5.9 years ago by Siva
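
A slightly more defensive variant of the loop above, as a minimal sketch: it wraps the loop in a bash function, writes each alignment into a separate output directory, and reports any file for which mafft exits with an error. The function name align_each, its two arguments, and the .aln/.log suffixes are illustrative assumptions, not something from the thread; the mafft call itself is the one used above.

```shell
#!/usr/bin/env bash
# Sketch: align every *.fasta file in a directory, one alignment per input file.
# align_each, in_dir and out_dir are hypothetical names chosen for illustration.
align_each() {
    local in_dir=$1 out_dir=$2 fasta_file base
    mkdir -p "$out_dir" || return 1
    for fasta_file in "$in_dir"/*.fasta; do
        # If nothing matches, the glob stays literal; skip that case.
        [ -e "$fasta_file" ] || continue
        base=$(basename "$fasta_file" .fasta)
        # Same MAFFT call as in the thread; progress messages go to a per-file log.
        if ! mafft --retree 1 "$fasta_file" > "$out_dir/$base.aln" 2> "$out_dir/$base.log"; then
            echo "alignment failed: $fasta_file" >&2
        fi
    done
}
```

Called as 'align_each . aligned', this would read ./*.fasta and write aligned/NAME.aln plus aligned/NAME.log for each input. The files are still processed one after another, as asked in the question.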

But MAFFT, according to its documentation, only aligns the sequences contained in the input FASTA file, and OP mentions one gene per file (which I took to mean one sequence). Aligning a single sequence makes no sense, no?

written 5.9 years ago by _r_am

I can see how "one gene/file" might lead you to think that there is only one sequence per file. However, the OP mentions "multifasta file", which is a file containing multiple sequences in FASTA format. So, the way I understand it, OP has 800+ files, each holding multiple related sequences of similar function ("one gene"), probably from different species, and wants to align the related sequences within each file. I think OP is a biologist like me :)

written 5.9 years ago by Siva
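
Whether a file holds one sequence or several can be checked directly, since every FASTA record begins with a '>' header line. The following is a small sketch of such a check; count_records is a hypothetical helper name, not something from the thread.

```shell
#!/usr/bin/env bash
# Sketch: print each file name and the number of FASTA records it contains.
# count_records is a hypothetical helper name chosen for illustration.
count_records() {
    local fasta_file n
    for fasta_file in "$@"; do
        # Count header lines, i.e. lines that start with '>'.
        n=$(grep -c '^>' "$fasta_file")
        printf '%s\t%s\n' "$fasta_file" "$n"
    done
}
```

For example, 'count_records *.fasta' prints one tab-separated line per file; any file with a count below 2 has nothing to align.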

Thank you both for your input! The option suggested by Siva is working out! And indeed, I have multiple sequences for one gene in each file; sorry if that wasn't clear.

written 5.9 years ago by Caroline S

Ah, that makes sense. I should've known to go for Occam's razor, but aligning each file on its own kind of seemed too simple to me, so I assumed that OP was looking for the solution to a far more complex problem. Thank you, Siva!

written 5.9 years ago by _r_am