Pattern Alignment And Report Identity
1
0
Entering edit mode
10.1 years ago
abhiwash • 0

I have 2 fasta files with sequence's in it.i want to align the sequences in second file to first file and report identity

for example:

file1:

>s1
aaccggactggacatccg 
>s2
gtcgactctcggaattg
....

file2:

 >a1
actg
>a2
tccg
 ....

i want to take the file2 sequences and look in file1 and print the matching with mismatched base in uppercase and identity in csv format

output

name,a1_alignment,a1_identity,a2_alignment,a2_identity
s1,actg,100,tccg,100
s2,aCtg,95,tcCg,95

here what i did so far

import sys
import os,csv
from Bio import SeqIO
from itertools import *
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-m", "--mismatch_threshold", dest="mismatch_threshold", default = 2,
          help="This is the number of differences you'll allow between the   actualread and your sequence of interest.  Default is 2")
(options, args) = parser.parse_args()

if len(sys.argv) != 4:
  print "Usage : python search.py <file1> <file2> <fout>"
sys.exit()

f1 = open(sys.argv[1],'r')
f2 = open(sys.argv[2],'r')
fout = open(sys.argv[3],'w')
writer = csv.writer(fout)

def long(f1):
    for record in SeqIO.parse(f1,'fasta'):
        header = record.name
        sequence = record.seq   
   yield [header, sequence]

def short(f2):
    for record in SeqIO.parse(f2,'fasta'):
        head = record.name
        seq = record.seq  
    return seq

def alignment(sequence,seq,mismatch_threshold):
    l1 = len(sequence)
    l2 = len(seq)
    alignment = []
    for i in range(0,min(l1,l2)):
        if sequence[i] == seq[i]:
            alignment.append(i)
        else:
            mismatch = sum( c1 != c2 for c1,c2 in zip(sequence,seq))                                    
                if mismatch <= mismatch_threshold:                           
                         alignment.append(i)
    k = 0
l = 0
for read in alignment:
    for letter in read:
        if letter == isupper():
            pass
        else:
            if letter == alignment[0].seq[j]:
                l +=1
        k += 1
    k = 0
    length = seq
    percent = 100*l/len(seq)
    #print percent
    yield percent


longsequences = long(open(sys.argv[1],'r'))
shortsequences = short(open(sys.argv[2],'r'))
align = alignment(longsequences,shortsequences,options.mismatch_threshold)

for name in head:
    writer.writerow(( name +'_alignment' , name + '_identity'))
for s in align:
# print to csv file

i am confused and have problem in reading the the sequences in the def function getting this error

what is the best of looping the sequences of file2 and aligning them to sequences in file1 and find an alignment with mismatches?

File "search.py", line 34, in alignment
    l1 = len(sequence)
TypeError: object of type 'generator' has no len()
python biopython alignment • 2.7k views
ADD COMMENT
1
Entering edit mode

Looks like you're re-inventing the wheel here?!? Is there any particular reason, given you're using BioPython already, that you're not using the alignment functions already supplied?

from Bio import pairwise2
ADD REPLY
0
Entering edit mode
10.1 years ago
Peter 6.0k

Some problems/suggestions:

def long(f1):
    for record in SeqIO.parse(f1,'fasta'):
        header = record.name
        sequence = record.seq   
   yield [header, sequence]

If you don't need/want the SeqRecord object from SeqIO, just use SimpleFastaParser instead? It will be faster.

More generally, you don't seem to be looping properly to compare all against all (the short function looks at only the final sequence in the second file). I would expect to see a nested pair of for loops here.

P.S. Have you tried EMBOSS tool needleall? http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html That or one of the other EMBOSS pairwise alignment tools might do what you want.

Update: Regarding the double loop idea, I'd expect something like this:

for record1 in SeqIO.parse(file1, "fasta"):
    for record2 in SeqIO.parse(file2, "fasta"):
        print("Now compare %s from file one against %s from file two..." % record1.id, record2.id))

If file1 had 10 sequences, this would mean looping over file2 10 times. If file2 had 5 sequences, this would mean doing 50 comparisons in total.

ADD COMMENT
0
Entering edit mode

can i get help in looping the short sequences to long sequences,as i new to this nested loops @Peter

ADD REPLY

Login before adding your answer.

Traffic: 1463 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6