Question

Consensus Sequence

15.1 years ago

User 5217 ▴ 40

Hello, I have the 8 sequences below and I would like to calculate a consensus sequence from them.

sequences = [['C', 'C', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['T', 'T', 'T', 'C', 'T', 'G', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['T', 'C', 'A', 'A', 'T', 'T', 'G', 'T', 'T', 'T', 'A', 'G'],
             ['C', 'T', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'G', 'T', 'C'],
             ['T', 'C', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['C', 'C', 'T', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['T', 'C', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'G', 'T'],
             ['C', 'C', 'A', 'A', 'T', 'T', 'G', 'T', 'T', 'T', 'T', 'G']
            ]

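# Walk over the alignment column by column.
for i in range(len(sequences[1])):
  # Collect the base found at position i in every sequence.
  alignment = ""
  for j in range(len(sequences)):
    alignment += sequences[j][i]
  # Print the column followed by its A, C, G and T counts.
  print(alignment)
  print(alignment.count("A"))
  print(alignment.count("C"))
  print(alignment.count("G"))
  print(alignment.count("T"))
  print("----------")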

The above code calculates, for each position, how often each base occurs (a Position Frequency Matrix). I have found the following rules ( http://www.cisred.org/content/methods/help/pfm ) for deriving the consensus sequence from these counts, but unfortunately I do not yet understand them well enough to complete the implementation.

Thank you in advance.

Best regards,
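For reference, the per-position counting described above can be written more compactly. This is only a minimal sketch: it uses a simple plurality rule (most frequent base per column, ties shown in brackets) and does not implement the cisRED rules linked in the question. The function names `position_frequency_matrix` and `plurality_consensus` are made up for this illustration.

```python
from collections import Counter

BASES = "ACGT"

def position_frequency_matrix(seqs):
    """Return one Counter of base counts per alignment column."""
    # zip(*seqs) transposes rows (sequences) into columns (positions).
    return [Counter(column) for column in zip(*seqs)]

def plurality_consensus(seqs):
    """Most frequent base per column; ties are written as e.g. [CT].

    Assumption: a plain plurality rule, NOT the cisRED consensus rules.
    """
    consensus = []
    for counts in position_frequency_matrix(seqs):
        best = max(counts.values())
        winners = [b for b in BASES if counts[b] == best]
        if len(winners) == 1:
            consensus.append(winners[0])
        else:
            consensus.append("[" + "".join(winners) + "]")
    return "".join(consensus)

# The same 8 sequences as in the question, as lists of single characters.
sequences = [list(s) for s in (
    "CCCATTGTTCTC", "TTTCTGGTTCTC", "TCAATTGTTTAG", "CTCATTGTTGTC",
    "TCCATTGTTCTC", "CCTATTGTTCTC", "TCCATTGTTCGT", "CCAATTGTTTTG",
)]

# One row per position, with counts in the order A, C, G, T.
for counts in position_frequency_matrix(sequences):
    print([counts[b] for b in BASES])

print(plurality_consensus(sequences))
```

With this data the first column is a 4/4 tie between C and T, which is why the sketch reports it in brackets instead of picking one base arbitrarily.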

biopython python consensus • 9.2k views

updated 14.8 years ago by brentp • written 15.1 years ago by User 5217

You should look at Brad's suggestion using Biopython in this question: Create Consensus Sequences For Sequence Pairs Within A Multiple Alignment?

updated 6.2 years ago by Ram • written 15.1 years ago by Eric Normandeau

Notes: If you want the length of the first sequence, you should use len(sequences[0]) instead of len(sequences[1]), since Python indexing starts at 0. Also, without modifying the rest of the code, the sequences could be given in string format, e.g. "CCCATTGTTCTC". Cheers

updated 6.2 years ago by Ram • written 15.1 years ago by Eric Normandeau
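To illustrate the two notes in the comment above, here is a small sketch in which the sequences are plain strings and the loop length comes from `len(seqs[0])`. The helper name `column_strings` is hypothetical; the indexing logic is the same as in the question's loop.

```python
# The 8 sequences from the question, written as strings instead of lists.
sequences = [
    "CCCATTGTTCTC", "TTTCTGGTTCTC", "TCAATTGTTTAG", "CTCATTGTTGTC",
    "TCCATTGTTCTC", "CCTATTGTTCTC", "TCCATTGTTCGT", "CCAATTGTTTTG",
]

def column_strings(seqs):
    """Return each alignment column as a string (hypothetical helper)."""
    columns = []
    # seqs[0] is the first sequence; seqs[1] would be the second one.
    for i in range(len(seqs[0])):
        column = ""
        for j in range(len(seqs)):
            # seqs[j][i] works the same for strings and lists of characters.
            column += seqs[j][i]
        columns.append(column)
    return columns

for column in column_strings(sequences):
    print(column, column.count("A"), column.count("C"),
          column.count("G"), column.count("T"))
```

Because indexing a string and indexing a list of characters behave the same here, passing `[list(s) for s in sequences]` gives identical columns.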

Ram · Answer 1 · 2010-10-06

Check out motility which does exactly that:

import motility
sequences = [['C', 'C', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['T', 'T', 'T', 'C', 'T', 'G', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['T', 'C', 'A', 'A', 'T', 'T', 'G', 'T', 'T', 'T', 'A', 'G'],
             ['C', 'T', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'G', 'T', 'C'],
             ['T', 'C', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['C', 'C', 'T', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'T', 'C'],
             ['T', 'C', 'C', 'A', 'T', 'T', 'G', 'T', 'T', 'C', 'G', 'T'],
             ['C', 'C', 'A', 'A', 'T', 'T', 'G', 'T', 'T', 'T', 'T', 'G']
            ]

pwm = motility.make_pwm(sequences)
print(pwm.generate_sites_over(pwm.max_score()))

prints

('CCCATTGTTCTC', 'TCCATTGTTCTC')
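The two results can be understood without motility. The sketch below is not how motility works internally; it assumes that the best-scoring sites are the ones that use a most frequent base at every position, and the function names are invented for this example. It enumerates every such combination with `itertools.product`.

```python
from collections import Counter
from itertools import product

def top_bases_per_column(seqs):
    """For each column, list the base(s) sharing the highest count."""
    tops = []
    for column in zip(*seqs):
        counts = Counter(column)
        best = max(counts.values())
        tops.append(sorted(b for b, n in counts.items() if n == best))
    return tops

def all_plurality_consensus(seqs):
    """Every sequence that uses a most frequent base at each position."""
    # product() expands tied columns into one candidate per tied base.
    return ["".join(choice) for choice in product(*top_bases_per_column(seqs))]

sequences = [
    "CCCATTGTTCTC", "TTTCTGGTTCTC", "TCAATTGTTTAG", "CTCATTGTTGTC",
    "TCCATTGTTCTC", "CCTATTGTTCTC", "TCCATTGTTCGT", "CCAATTGTTTTG",
]

print(all_plurality_consensus(sequences))
```

For these sequences only the first position is tied (C and T each occur 4 times), so the sketch yields the same two candidates as the output shown above.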