Question

Retrieving Multiple Sequences For Protein Alignments

1

12.3 years ago

Eric ▴ 90

I would like to run multiple alignments for proteins in ~20 mammalian species. At the moment, I am retrieving the sequences manually from Ensembl and entering them in ClustalW2. Is there a more efficient way to retrieve and align the sequences? Any help is appreciated.

Thanks, Eric

multiple protein sequence ensembl clustalw • 2.9k views

updated 12.1 years ago by Biojl ★ 1.7k • written 12.3 years ago by Eric ▴ 90

2

An aside: if you're aligning proteins, use Clustal Omega instead of ClustalW2. It's faster and produces higher-quality alignments.

12.3 years ago by Andreas ★ 2.5k

Ram · Answer 1 · 2012-01-22

1

12.3 years ago

2184687-1231-83- ★ 5.1k

If you only need to retrieve existing alignments from Ensembl, you can use the data dumps or the Perl API. If you need to add extra sequences to an alignment, you can do it with PAGAN.

updated 4.6 years ago by Ram 43k • written 12.3 years ago by 2184687-1231-83- ★ 5.1k

score 1 · Answer 2 · 2012-02-09

The easiest way is to download the FASTA sequences for all the species from the Ensembl FTP server. You can then store them in a Python dictionary or a Perl hash (or the equivalent in another language) and feed the alignment programme only the sequences you want to align in each run. This is fast and efficient. I do this all the time with Python + MAFFT (or prank-F), but it can be implemented with other programming languages and/or alignment programmes.
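To make the dictionary approach concrete, here is a minimal Python sketch using only the standard library. It is an illustration rather than my exact scripts: the function names, file names, and protein IDs are hypothetical. It assumes that the first whitespace-delimited token of each FASTA header is the sequence ID, and that a `mafft` executable is on your PATH.

```python
import gzip
import subprocess


def read_fasta(path):
    """Parse a FASTA file (plain or .gz) into a dict of {sequence_id: sequence}."""
    opener = gzip.open if str(path).endswith(".gz") else open
    sequences = {}
    seq_id = None
    chunks = []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Store the previous record before starting a new one
                if seq_id is not None:
                    sequences[seq_id] = "".join(chunks)
                # Assumption: the ID is the first token of the header line
                seq_id = line[1:].split()[0]
                chunks = []
            else:
                chunks.append(line)
    if seq_id is not None:
        sequences[seq_id] = "".join(chunks)
    return sequences


def write_fasta(records, path):
    """Write a dict of {name: sequence} to a FASTA file."""
    with open(path, "w") as out:
        for name, seq in records.items():
            out.write(f">{name}\n{seq}\n")


def align_with_mafft(in_fasta, out_fasta):
    """Run MAFFT on in_fasta; MAFFT prints the alignment to stdout."""
    with open(out_fasta, "w") as out:
        subprocess.run(["mafft", "--auto", str(in_fasta)], stdout=out, check=True)


# Hypothetical usage: one peptide FASTA per species, one protein ID per species.
# proteomes = {"human": read_fasta("human.pep.all.fa.gz"),
#              "mouse": read_fasta("mouse.pep.all.fa.gz")}
# wanted = {"human": "HUMAN_PROTEIN_ID", "mouse": "MOUSE_PROTEIN_ID"}
# write_fasta({sp: proteomes[sp][pid] for sp, pid in wanted.items()}, "gene.fa")
# align_with_mafft("gene.fa", "gene.aln.fa")
```

Loading each proteome once and reusing the dictionaries is what makes repeated runs cheap: only the small per-gene FASTA file is rewritten for each alignment. Any aligner that reads FASTA could be swapped in by changing the command in `align_with_mafft`.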