Conserved sequence for MSA
Entering edit mode
3.4 years ago
mhwaida258 • 0

Hi , I am working on SARs sequence, I made Multible Alignment for the sequences unsing Mafft and the output in clustal format. My question is how i can extract the conserved sequences from the whole file by python script I use a python script to count this conserved region by counting number of stars (*) but for but the conserved region in the above line how to tell that in python script ( if you see the * print the previous line in the same position and so on ) Any help Or suggestion? Thank you in advance

python alignment sequence covid • 1.7k views
Entering edit mode

if you see the * print the previous line in the same position and so on

You can do exactly that. If you have an alignment that looks like this


then you could read them in blocks of three lines and use the position information from the stars. You could walk through the line with the stars and when you find groups of stars of a certain length (single positions probably don't make sense), then you can extract the sequence information corresponding to the group from the sequence lines. Be careful not to trim whitespaces off the 'conservation' line, otherwise the positions will be wrong.

Something like this:

import re
seq1, stars, seq2 = f.readline(), f.readline(), f.readline()
conserved_blocks = re.finditer("\*+", stars)
for block in conserved_blocks:
    if len( > 1:
         print(seq2[block.start(): block.end()])
Entering edit mode

it look like that exactly ;

MW345928.1      ------------------------------accaaccaactttcgatctcttgtagatct

NC_045512.2     attaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatct

MW322968.1      attaaaggtataaaccttcccaggtaacaaaccaaccaactttcgatctcttgtagatct

MW575140.1      ------------------------------------------------------agatct

MW368440.1      ------------------------------------------------------agatct

MT610913.1      -------------------------aacaaaccaaccaactttcgatctcttgtagatct

MT459925.1      -------------------------aacaaaccaaccaactttcgatctcttgtagatct

MT820466.1      ---------------------------------------ctttcgatctcttgtagatct

MT412338.1      ------------------------taacaaaccaaccaactttcgatctcttgtagatct

MW420482.1      -------------------------------------------------------gatct

MT627325.1      -------gtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatct

MT994988.1      -----------------------------aaccaaccaactttcgatctcttgtagatct

MT990449.1      -------------------------------------------cgatctcttgtagatct

MT291836.1      --------tttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatct

MT658505.1      ----------------------------------accaactttcgatctcttgtagatct

MW580562.1      --------------------------------------actttcgatctcttgtagatct

MW581643.1      -ttaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatct

MW467482.1      attaaaggtttataccttcccaggtaacaaaccaaccaactttcgatctcttgtagatct


MW345928.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

NC_045512.2     gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MW322968.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MW575140.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MW368440.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT610913.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT459925.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT820466.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT412338.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MW420482.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT627325.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT994988.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT990449.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT291836.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MT658505.1      gttctctaaacgaacwtkaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MW580562.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MW581643.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

MW467482.1      gttctctaaacgaactttaaaatctgtgtggctgtcactcggctgcatgcttagtgcact

                *************** * ******************************************
Entering edit mode

it's a similar approach, though.

  1. store each line until you reach one with stars (if a star means this position is conserved in all sequences, then you may just store one single sequence)
  2. store the conservation line
  3. walk through the next block of sequences, append them to the sequences you have stored (s. remark at 1.)
  4. append the next conservation line to the stored one 5 .. rinse repeat
  5. use the conserved_blocks regular expression on the full conservation line
  6. for each sequence you stored, extract the sequences according to the positions of the conserved blocks (as in the code above), you can print this, or store this
  7. if you have stored the extracted conserved sequences, you can now remove duplicates, or do statistics on them etc.
Entering edit mode

Thank you so much, your code is inspiring, can you tell me how to iterate over the file as read line read one line in the file to do the same ?

Entering edit mode

You can iterate over rows and columns of a multiple sequence alignment using the AlignIO module in biopython.

It should be reasonably easy to adapt this code: A: Trim sequences based on alignment in python


Login before adding your answer.

Traffic: 2336 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6