Creating alignments for 800+ multifasta files
6.7 years ago
Caroline S ▴ 20

I have ~800 multifasta files (one gene/file) for which I would like to create independent alignments (preferably with MAFFT) automatically.

So instead of typing 'mafft --retree 1 input_file > output_file' for each file, I am looking for a way to have the next FASTA file picked up and aligned automatically once the previous alignment has finished.

alignment mafft • 3.4k views
6.7 years ago
Ram 34k

Use a UNIX for loop maybe?

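# Take the first two FASTA files as the starting pair
FILE_1=$(ls *.fasta | head -n 1)
FILE_2=$(ls *.fasta | head -n 2 | tail -n 1)
# Output and scratch names avoid the .fasta extension so the glob never picks them up
OUT_PUT=init_output.aln
TMP_IN=mafft_input.tmp
# MAFFT takes an input file name, not sequence text, so concatenate into a scratch file first
cat "${FILE_1}" "${FILE_2}" > "${TMP_IN}"
mafft --retree 1 "${TMP_IN}" > "${OUT_PUT}"

for fasta_file in $(ls *.fasta | tail -n +3)
do
cat "${OUT_PUT}" "${fasta_file}" > "${TMP_IN}"
mafft --retree 1 "${TMP_IN}" > "${OUT_PUT}"
done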

 

This should leave the final output in ${OUT_PUT} once done. I have only run this in my head, though, so back up your files first in case I have missed something.


Ram, I think OP wants to align each of the 800+ FASTA files individually and write a separate output for each, so there will be 800+ output files. Assuming the input files have a .fasta extension, the following script runs MAFFT on each one and writes the output to a file named after the input with the suffix .out added.

for fasta_file in *.fasta
do
# Quote the variable so file names containing spaces are handled correctly
mafft --retree 1 "$fasta_file" > "$fasta_file.out"
done
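
For anyone who wants the outputs kept apart from the inputs, here is a minimal sketch of the same loop wrapped in a function. The names align_all, in_dir and out_dir, the "aligned" default directory and the .aln.fasta suffix are all made up for illustration; only the mafft --retree 1 call comes from this thread. It assumes mafft is on your PATH.

```shell
# Sketch: align every *.fasta file in a directory, one alignment per input file.
# align_all, in_dir and out_dir are hypothetical names, not part of MAFFT.
align_all() {
    local in_dir="${1:-.}"
    local out_dir="${2:-aligned}"
    local fasta_file base
    mkdir -p "$out_dir" || return 1
    for fasta_file in "$in_dir"/*.fasta
    do
        # If nothing matches, the glob stays literal; skip that case
        [ -e "$fasta_file" ] || continue
        base=$(basename "$fasta_file" .fasta)
        # Same MAFFT call as in the thread; files are processed one after another
        if ! mafft --retree 1 "$fasta_file" > "$out_dir/$base.aln.fasta"
        then
            # Report the failure and carry on with the remaining files
            echo "mafft failed on $fasta_file" >&2
        fi
    done
}
```

Calling align_all . aligned from the directory holding the FASTA files would write one alignment per input into ./aligned. Writing to a separate directory means a rerun never picks up earlier outputs as inputs.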

But MAFFT, according to its documentation, only aligns the sequences contained in the input FASTA file, and OP mentions that each file has one gene (I assume one sequence). Aligning a single sequence makes no sense, no?


I can see how "one gene/file" might lead you to think that there is only one sequence per file. However, the OP mentions "multifasta file", which is a file containing multiple sequences in FASTA format. So, the way I understand it, OP has 800+ files, each holding multiple related sequences of similar function ("one gene") in FASTA format, probably from different species, and they want to align the related sequences within each file. I think OP is a biologist like me :)
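
One quick way to settle the "one sequence or many" question is to count the header lines in each file, since every FASTA record starts with a '>' line. The sketch below does that; the function name count_fasta_records and the TOO_FEW label are invented for illustration.

```shell
# Sketch: print the number of sequences in each FASTA file and flag
# files with fewer than two, which cannot be aligned meaningfully.
# count_fasta_records and the TOO_FEW label are hypothetical names.
count_fasta_records() {
    local fasta_file n
    for fasta_file in "$@"
    do
        # Skip arguments that are not existing files (e.g. an unmatched glob)
        [ -e "$fasta_file" ] || continue
        # Every FASTA record starts with a '>' header line;
        # grep -c prints 0 but exits non-zero when there is no match
        n=$(grep -c '^>' "$fasta_file" || true)
        if [ "$n" -lt 2 ]
        then
            printf '%s\t%s\tTOO_FEW\n' "$fasta_file" "$n"
        else
            printf '%s\t%s\n' "$fasta_file" "$n"
        fi
    done
}
```

Running count_fasta_records *.fasta prints one tab-separated line per file, so any single-sequence files can be spotted before starting the alignments.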


Thank you both for your input! The option suggested by Siva is working out! And indeed, I have multiple sequences for one gene in each file; sorry if that wasn't clear.


Ah, that makes sense. I should've known to go for Occam's razor, but aligning each file on its own seemed too simple to me, so I assumed that OP was looking for the solution to a far more complex problem. Thank you, Siva!
