End to End Solution for Multiple Sequence Alignment on Fasta Files in Python
Entering edit mode
8 months ago

Hello, I'm new to data analysis and I'm hoping to write a python algorithm that can take in fasta files as an input, conduct a multiple sequence alignment, and output them again in fasta format.

Are there any tutorials that can show this from end to end? I've tried using muscle but keep getting command errors, and my code with ClustalOmega hasn't worked either, as I keep finding differing answers online.

Multiple-Sequence-Alignment Fasta Python • 1.2k views
Entering edit mode

If you have linux, you can merge all fasta files with cat command, like cat a.fa b.fa c.fa > all.fa. Then you can run mafft to get MSA output, like mafft all.fa > msa.fa. If you want to use python, you can wrap this 2 commands, just use subprocess module of python.

Entering edit mode

Thank you, I'll try mafft!

I have also been trying to solve this with biopython, and this is the code I have pieced together from reading different references. I imagine there must be a tutorial, or even a book somewhere that shows this very clearly, but I am having trouble finding it. Would love some help.

import Bio
from Bio import AlignIO
from Bio import SeqIO
from Bio.Align import MultipleSeqAlignment
from Bio.Align.Applications import ClustalOmegaCommandline
import os
import tkinter as tk
from tkinter import filedialog

root = tk.Tk()
input_file = filedialog.askopenfilename()

file_name = f"analzyed{input_file}"

# Perform multiple sequence alignment using Clustal Omega
clustalomega_cline = ClustalOmegaCommandline(infile=input_file, outfile="{file_name}", outfmt = 'fasta', verbose=True, auto=False)


sequences = list(SeqIO.parse(file_name, "fasta"))

#not sure how to actually run the clustal omega on the list of sequences

SeqIO.write(aligned_sequences, file_name, "fasta")
Entering edit mode
8 months ago
Joe 21k

You don't need to write any algorithm for this. Most if not all contemporary multiple sequence aligners (MUSCLE/MAFFT/CLUSTALO etc) will ingest multi-fasta files and write back out to an aligned multifasta with a single command.

If the input files are separate fastas you can also just do this at the commandline in the program call similar to:

muscle < $(cat *.fa) > seqs.afa

Simple as that*.

If you're encountering issues with this, you'll need to show us the command you're using, the error you're getting and a snippet of your input data.

*one caveat is that _may not work_ if you have absolutely tons of files that meet the .fa glob, because there are limits on the max size terminal commands can be and it may expand too many args.


Login before adding your answer.

Traffic: 1056 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6