Question

Pattern Alignment And Report Identity

0

Entering edit mode

10.1 years ago

abhiwash • 0

I have 2 fasta files with sequence's in it.i want to align the sequences in second file to first file and report identity

for example:

file1:

>s1
aaccggactggacatccg 
>s2
gtcgactctcggaattg
....

file2:

 >a1
actg
>a2
tccg
 ....

i want to take the file2 sequences and look in file1 and print the matching with mismatched base in uppercase and identity in csv format

output

name,a1_alignment,a1_identity,a2_alignment,a2_identity
s1,actg,100,tccg,100
s2,aCtg,95,tcCg,95

here what i did so far

import sys
import os,csv
from Bio import SeqIO
from itertools import *
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-m", "--mismatch_threshold", dest="mismatch_threshold", default = 2,
          help="This is the number of differences you'll allow between the   actualread and your sequence of interest.  Default is 2")
(options, args) = parser.parse_args()

if len(sys.argv) != 4:
  print "Usage : python search.py <file1> <file2> <fout>"
sys.exit()

f1 = open(sys.argv[1],'r')
f2 = open(sys.argv[2],'r')
fout = open(sys.argv[3],'w')
writer = csv.writer(fout)

def long(f1):
    for record in SeqIO.parse(f1,'fasta'):
        header = record.name
        sequence = record.seq   
   yield [header, sequence]

def short(f2):
    for record in SeqIO.parse(f2,'fasta'):
        head = record.name
        seq = record.seq  
    return seq

def alignment(sequence,seq,mismatch_threshold):
    l1 = len(sequence)
    l2 = len(seq)
    alignment = []
    for i in range(0,min(l1,l2)):
        if sequence[i] == seq[i]:
            alignment.append(i)
        else:
            mismatch = sum( c1 != c2 for c1,c2 in zip(sequence,seq))                                    
                if mismatch <= mismatch_threshold:                           
                         alignment.append(i)
    k = 0
l = 0
for read in alignment:
    for letter in read:
        if letter == isupper():
            pass
        else:
            if letter == alignment[0].seq[j]:
                l +=1
        k += 1
    k = 0
    length = seq
    percent = 100*l/len(seq)
    #print percent
    yield percent


longsequences = long(open(sys.argv[1],'r'))
shortsequences = short(open(sys.argv[2],'r'))
align = alignment(longsequences,shortsequences,options.mismatch_threshold)

for name in head:
    writer.writerow(( name +'_alignment' , name + '_identity'))
for s in align:
# print to csv file

i am confused and have problem in reading the the sequences in the def function getting this error

what is the best of looping the sequences of file2 and aligning them to sequences in file1 and find an alignment with mismatches?

File "search.py", line 34, in alignment
    l1 = len(sequence)
TypeError: object of type 'generator' has no len()

python biopython alignment • 2.7k views

ADD COMMENT • link updated 7.2 years ago by Biostar 20 • written 10.1 years ago by abhiwash • 0

1

Entering edit mode

Looks like you're re-inventing the wheel here?!? Is there any particular reason, given you're using BioPython already, that you're not using the alignment functions already supplied?

from Bio import pairwise2

ADD REPLY • link 10.1 years ago by george.ry ★ 1.2k

score 0 · Answer 1 · 2014-03-04

Some problems/suggestions:

def long(f1):
    for record in SeqIO.parse(f1,'fasta'):
        header = record.name
        sequence = record.seq   
   yield [header, sequence]

If you don't need/want the SeqRecord object from SeqIO, just use SimpleFastaParser instead? It will be faster.

More generally, you don't seem to be looping properly to compare all against all (the short function looks at only the final sequence in the second file). I would expect to see a nested pair of for loops here.

P.S. Have you tried EMBOSS tool needleall? http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html That or one of the other EMBOSS pairwise alignment tools might do what you want.

Update: Regarding the double loop idea, I'd expect something like this:

for record1 in SeqIO.parse(file1, "fasta"):
    for record2 in SeqIO.parse(file2, "fasta"):
        print("Now compare %s from file one against %s from file two..." % record1.id, record2.id))

If file1 had 10 sequences, this would mean looping over file2 10 times. If file2 had 5 sequences, this would mean doing 50 comparisons in total.