Hi everyone,
After aligning DNA sequences, I would like to extract a sub alignment as follow:
- scroll the alignment and find the first and last X-bp conserved blocks where X is a number given by the user.
- extract the sub alignment surronded by the two conserved blocks.
I tried this with python wich uses a sliding window to get chunks and analyze each of them:
#! /bin/python
# coding: utf-8
import os, sys, re, glob
# Perform alignment
#--------------------------------------------------------------------------
from Bio.Align.Applications import ClustalwCommandline
cline = ClustalwCommandline("clustalw2", infile = "opuntia2.fasta")
stdout, stderr = cline()
# Print alignment
from Bio import AlignIO
align = AlignIO.read("opuntia2.aln", "clustal")
print(align)
# Calculate alignment length in bp
n = int(align.get_alignment_length())
# Define sliding window parameters
winSize = 100
step = 1
# Define a sliding window
#---------------------------------------------------------------------------
def slidingWindow(align,winSize,step):
# Pre-compute number of chunks to emit
numOfChunks = ((n-winSize)/step)+1
# Do the work
for i in range(0,numOfChunks*step,step):
start = i
end = i+winSize
yield align[:,i:i+winSize]
#yield start, end #if I uncomment this lines problems appear because this function's output is a multiple seq aln and then I cannot manipulate this object easily
# Define a function that check if a column has common nucleotides or not
#---------------------------------------------------------------------------
def all_same(items):
return all(x == items[0] for x in items)
# Real process starts here
#--------------------------------------------------------------------------
# List all possible chunks by applying the sliding window on aln
chunks = slidingWindow(align,winSize,step)
for chunk in chunks:
print chunk
# Loop on every chunk to see if it is conserved or not
for p in range(winSize):
column = chunk[:, p]
tester = all_same(column)
print tester
Then, I wanted to check if the tester variable is True for all position of each chunk. If true, then it means all the chunks is conserved and I could save its position in a record file and continue to examine each chunk. At the end, I could select the first and last line of the record file and extract the corresponding subalignment from the original one.
I clearly see that my solution is very awkward and I would be pleased if anyone could had new (simpler) perspectives!