Ignore Error "Multiple Sequences Found With Same Name" In Clustalw
8.3 years ago
david ▴ 10

Hi,

I have a Python program that generates a clustalw2 alignment of about 500 sequences from a FASTA file. Each sequence name consists of the organism plus the substrate specificity of that sequence, so quite a few of the names are identical. As a result I get the error message "Error: Multiple sequences found with same name" and no alignment is generated. Is it possible to ignore this error without having to change all the sequence names?

Cheers, David

biopython clustalw • 3.7k views
8.3 years ago

Sequence names must be unique to run an alignment in ClustalW/X.

I would rename your 500 sequences with the numbers 0 to 499 and store the original names in a dictionary or a list.

For example:

d = {0: 'Organism1Substrate', 1: 'Organism1Substrate', ..., 499: 'Organism2Substrate'}


or:

l = ['Organism1Substrate', 'Organism1Substrate', 'Organism1Substrate', ...]


Once you have performed the alignment, just replace the numbers with the original names.
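
A minimal sketch of this renaming round trip in plain Python, assuming a simple FASTA file in which every record starts with a `>` header line. The function names `rename_fasta` and `restore_names` and the `s` prefix are illustrative choices, not part of ClustalW or Biopython.

```python
import re


def rename_fasta(fasta_text, prefix="s"):
    """Replace each FASTA header with a unique placeholder ID.

    Returns the rewritten FASTA text and a dict that maps each
    placeholder ID back to the original header text.
    """
    mapping = {}
    out_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            new_id = f"{prefix}{len(mapping)}"  # s0, s1, s2, ...
            mapping[new_id] = line[1:].strip()
            out_lines.append(f">{new_id}")
        else:
            out_lines.append(line)
    return "\n".join(out_lines) + "\n", mapping


def restore_names(text, mapping):
    """Swap placeholder IDs in `text` back to the original names.

    Only whole tokens are matched, so "s1" is not replaced inside "s10".
    """
    if not mapping:
        return text
    ids = "|".join(re.escape(key) for key in mapping)
    pattern = re.compile(r"(?<![A-Za-z0-9_])(" + ids + r")(?![A-Za-z0-9_])")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

Write the renamed text to a temporary FASTA file, run clustalw2 on it, then pass the alignment output through `restore_names`. Restoring longer names can shift the fixed-width name column in Clustal-format output, so it may be safer to restore names only in the final, human-readable copy.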


+1 for this. In the past I have just GREPed the names and added numbers or more information to make them unique, but I like this idea better.


Agreed. Many phylogenetic programs have problems handling fancy sequence names. The worst case is the PHYLIP format (used by RAxML etc.), which allows only 10 characters per name. So I always rename the sequences as "s1", "s2", "s3"... I don't recommend using 1, 2, 3... because some programs cannot handle numerical sequence names.
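
As a rough illustration of these constraints, the sketch below generates short placeholder IDs and checks a list of names against the limits mentioned above: unique, at most 10 characters, and not purely numeric. The function names `make_ids` and `check_ids` are hypothetical, and the default of 10 is simply the PHYLIP limit quoted above.

```python
def make_ids(n, prefix="s", start=1):
    """Return n placeholder IDs such as s1, s2, s3, ..."""
    return [f"{prefix}{i}" for i in range(start, start + n)]


def check_ids(ids, max_len=10):
    """Return a list of (name, problem) pairs; empty if all names pass."""
    problems = []
    seen = set()
    for name in ids:
        if name in seen:
            problems.append((name, "duplicate"))
        seen.add(name)
        if len(name) > max_len:
            problems.append((name, "too long"))
        if name.isdigit():
            problems.append((name, "numeric"))
    return problems
```

For 500 sequences, `check_ids(make_ids(500))` returns an empty list, since the longest ID, `s500`, has only four characters.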