Question

Parsing A Clustal Alignment File

0

Entering edit mode

11.8 years ago

s.charonis ▴ 100

Hello BioStar Community,

I have a CLUSTAL alignment file with 500 protein sequences whose format is as such:

S1 DERY .....

S2 RKH .... ....

S500 HERKKK ....

where S1,S2,....S500 are Uniprot codes.

However, each of the 500 entries appears multiple times because each individual line is limited to n characters. So, S1,S2,...S500 appear multiple times and I need to parse each component together in order to get each individual sequence stored as a single string. The code for doing this on a single sequence is straightforward, first I put all Uniprot codes into a list so that I can identify each protein sequence and then I use a snippet like

S1 = '' 
for line in alignment: 
if line.startswith('3NY8A'):  
    line = line.lstrip('3NY8A') 
    line = line.rstrip('\n') 
    line = line.rstrip('\t')    
    templateSeq += line 
    templateSeq = ''.join(templateSeq.split())

But S1 is only one of the 500 sequences and I need to automate this procedure for all of the 500 sequences. What I think is necessary is to write a function (e.g. sequencetostring) and then implement this function over all lines in the alignment file. My implementation is

seq = ''
def sequence_to_string():
''' make each MSA sequence a string ''' 
    global seq 
    line = line.lstrip(unique_uniprot_id[i]) 
    line = line.rstrip('\n')
    line = line.rstrip('\t') 

    seq += line 
    seq = ''.join(seq.split())
    return seq

And then loop over the file with:

for i in range(len(unique_uniprot_id)):
  sequence_to_string()

But this does not return anything. Is there a way to modify my function so that it builds 500 strings (maybe I need to put them in a list?), one per protein sequence? Many thanks in advance!

Regards, Spyros

python • 5.1k views

ADD COMMENT • link updated 11.8 years ago by Damian Kao 16k • written 11.8 years ago by s.charonis ▴ 100

score 7 · Answer 1 · 2012-07-10

7

Entering edit mode

11.8 years ago

Neilfws 49k

Any reason not to use the Bio.AlignIO module from Biopython to do this? It's generally better to use existing libraries than to reinvent the wheel.

ADD COMMENT • link 11.8 years ago by Neilfws 49k

1

Entering edit mode

That would be my suggestion as well.

ADD REPLY • link 11.8 years ago by cl.parallel ▴ 140

0

Entering edit mode

Mine too! BioPython rocks!! ;) Do something like the following to convert from CLUSTAL to FASTA:

from Bio import AlignIO

input_file = open("uniprot_alignment.clustal", "rU")
output_file = open("uniprot_alignment.fasta", "w")

alignments = AlignIO.parse(input_file, "clustal")
AlignIO.write(alignments, output_file, "fasta")

output_file.close()
input_file.close()

ADD REPLY • link 11.8 years ago by Steve Moss 2.3k

score 1 · Answer 2 · 2012-07-10

So you want to parse the alignment file and get each aligned sequence. As Neil suggested, I would just use BioPython.

For whatever reason, if you don't want to use BioPython, you can just do something like this:

seqs = {}
for line in alignmentFile:
   data = line.strip().split()
   seqID = data[0]
   seq = data[1]
   if not seqs.has_key(seqID):
      seqs[seqID] = seq
   else:
      seqs[seqID] += seq

for seqID, seq in seqs.items():
   print ">" + seqID 
   print ''.join(seq.split())

You don't need a separate function or anything like that. Use data structures to your advantage.

I am not entirely sure what your alignment file format looks like. Is it just a space delimited format? You might have to mess around with the parsing part. This should print out the sequences in fasta format.

score 0 · Answer 3 · 2012-07-10

As a rule, you should try to avoid using global references. Your other references are confusing as well (where does line come from in sequencetostring?)

You could use a dictionary of lists:

def seqtostring(alignment):
  sequences = {}
  # header = alignment.readline() # if present
  for line in alignment: # the actual file, line-by-line
    if len(line): # spaces between groups
      name, seq = line.split('\t')
      if name not in sequences:
          sequences[name] = []
      sequences[name].append(seq.strip().upper())

  return sequences

with open(alignment_file) as openfile:
    seqs = seqtostring(openfile)
print id_of_interest, ''.join(seqs[id_of_interest])