Find conserved blocks and extract subalignment with Python
Entering edit mode
6.4 years ago
Kevin D ▴ 30

Hi everyone,

After aligning DNA sequences, I would like to extract a sub alignment as follow:

  • scroll the alignment and find the first and last X-bp conserved blocks where X is a number given by the user.
  • extract the sub alignment surronded by the two conserved blocks.

I tried this with python wich uses a sliding window to get chunks and analyze each of them:

#! /bin/python
# coding: utf-8

import os, sys, re, glob

# Perform alignment
from Bio.Align.Applications import ClustalwCommandline
cline = ClustalwCommandline("clustalw2", infile = "opuntia2.fasta")
stdout, stderr = cline()
# Print alignment
from Bio import AlignIO
align ="opuntia2.aln", "clustal")
# Calculate alignment length in bp
n = int(align.get_alignment_length())
# Define sliding window parameters
winSize = 100
step = 1

# Define a sliding window
def slidingWindow(align,winSize,step):

    # Pre-compute number of chunks to emit
    numOfChunks = ((n-winSize)/step)+1

    # Do the work
    for i in range(0,numOfChunks*step,step):
        start = i
        end = i+winSize 
        yield align[:,i:i+winSize]
        #yield start, end #if I uncomment this lines problems appear because this function's output is a multiple seq aln and then I cannot manipulate this object easily

# Define a function that check if a column has common nucleotides or not
def all_same(items):
    return all(x == items[0] for x in items)

# Real process starts here
# List all possible chunks by applying the sliding window on aln
chunks = slidingWindow(align,winSize,step)

for chunk in chunks:
    print chunk
    # Loop on every chunk to see if it is conserved or not
    for p in range(winSize):
        column = chunk[:, p]
        tester  = all_same(column)
        print tester

Then, I wanted to check if the tester variable is True for all position of each chunk. If true, then it means all the chunks is conserved and I could save its position in a record file and continue to examine each chunk. At the end, I could select the first and last line of the record file and extract the corresponding subalignment from the original one.

I clearly see that my solution is very awkward and I would be pleased if anyone could had new (simpler) perspectives!

multiple-sequence-alignment python • 3.2k views

Login before adding your answer.

Traffic: 2373 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6