Question: Searching for a sequence within another sequence Biopython
gravatar for bethpearson98
12 months ago by
bethpearson980 wrote:

I want to search for a recognition site sequence within another plasmid sequence. Currently I am trying to use pairwise2 local alignment but I am not sure what parameters to use. For instance, the plasmid seuqnce is something like:


and the recognition site is somwthing like:


I want something to locate the postion of AAGAAG within the first sequence.

sequence alignment • 475 views
ADD COMMENTlink modified 12 months ago by Joe16k • written 12 months ago by bethpearson980

You might want to try the approach input in this thread:

A: Get genome position and sequence with extra bases of a DNA sequence (python)

The pairwise2 approach would probably work too. Default parameters should be fine.

ADD REPLYlink written 12 months ago by Joe16k
It seems you need a simple python script like this one below:
and the recognition site is something like: 
def posit_in_seq(string,site):

    len_site = len(site)
    if site in string:
        for i in range (0, len(string)):
            if site == string[i:i+len_site]:
                return i
    elif site not in string:        
        return 'No such a site in the string'

print(posit_in_seq('AGTGTTAAGAAGTCCGT', 'AAGAAG'))
print(posit_in_seq('AGTGTTAAGAAGTCCGT', 'GTCCGT'))
print(posit_in_seq('AGTGTTAAGAAGTCCGT', 'ATCCGT'))
    No such a site in the string
ADD REPLYlink modified 12 months ago • written 12 months ago by natasha.sernova3.7k

I have something like this atm but the sequences from the plasmids are very error prone and so it is unlikely that the main sequence will contain the sub sequence exactly. I was trying to use pairwise to get the alignment score but it I'm not sure how to get it to find a point in the first sequence that contains the highest alignment score for the subsequence.

ADD REPLYlink written 12 months ago by bethpearson980

The sequence you've provided is not a whole plasmid sequence, right?

It looks like some gene upstream. In this or similar cases you have to find

orthologous gene upstreams (the same gene from close species), find out, how

your site looks like there, then make a PWM (position weight matrix) - it will give you

a site score, since the site may be different in different plasmids and different upstreams.

This approach will take care about single substitutions in your site. There are many posts

about PWM in 'Biostars', read them to learn more.

In particular, see this post. PWM matching alogrithm

MOODS has modern references I've not seen anywhere else.

ADD REPLYlink modified 12 months ago • written 12 months ago by natasha.sernova3.7k

OP doesn’t need to do any of this.

Their approach with alignment is the correct way to go, they just need to extract the hit coordinates. I assume this is possible from pairwise2 but I haven’t had chance to test it myself yet.

OP, if you’re happy with a non python solution my comment above will do exactly what you need.

Alternatively, you can do this with the regex engine, if you know which bits of the sequence are conserved and which are variable in your query?

ADD REPLYlink written 12 months ago by Joe16k

I 've seen this post that should help to find identical residues.

A: index of the residue matched using pairwise2 in biopython

I hope it may help you.

ADD REPLYlink written 12 months ago by natasha.sernova3.7k

Again, I don't think this is what OP is looking for. They want to know the indices of their sequence in the original plasmid; e.g. if their sequence of interest occurs at base 1012 to base 1020 in the reference plasmid sequence, they want those numbers.

The link you've provided is just showing how to enumerate the match itself (which is not connected in any way that I can see to the original numbering of the input sequence.

ADD REPLYlink written 12 months ago by Joe16k

As far as I have understood, a person who asked a question

was going to get a position of some fragment in a larger sequence.

But that particular fragment may differ a little bit from a real plasmid

sequence. The matrix I suggested a few days ago is an ancient approach,

but it would help. The alignment will show if and where the difference exists.

They will decide what is better for them - there is no reaction so far.

ADD REPLYlink written 12 months ago by natasha.sernova3.7k

On closer inspection, I don't think there's any way to get the information you want directly from pairwise2. Do you still specifically want a Biopython solution?

ADD REPLYlink written 12 months ago by Joe16k
gravatar for Joe
12 months ago by
United Kingdom
Joe16k wrote:

Here's one approach. It's a little hacky, and going back to the original sequence to find the indicies might be unecessary, but I'm not 100% sure how pairwise2 behaves with local alignments (if it only returns the local match, and not the whole sequence then it is necessary), otherwise, the numbers you want are in the min and max values of the coords array.

from Bio import SeqIO, pairwise2
from Bio.pairwise2 import format_alignment    #optional
import sys

querystring = sys.argv[2]

for record in SeqIO.parse(sys.argv[1], 'fasta'):
    aln = pairwise2.align.localms(record.seq, querystring,
                                   5, -4,      # match, mismatch
                                   -2, -0.5)   # open, extend

    [print(format_alignment(*a)) for a in aln]     #optional

    coords = [i for i in range(len(aln[0][0])) if aln[0][0][i] == aln[0][1][i]]
    subseq = aln[0][0][coords[0]:coords[-1]+1]
    seqstart = str(record.seq).index(subseq)

    print('Subseq coordinates: {} -> {}'.format(seqstart, seqstart+len(subseq)))


This is hacky as I said, and there may be an off-by-one error or two (ironic huh?) in the output numbering, so please check this.

I've chosen some alignment values that seemed to give me reasonable alignments in my testing, but you may need to tweak these to suit. I suspect gap opening may need to be penalised more strongly to promote a 'tight' local match so that the minimum amount of extraneous sequence is selected.

ADD COMMENTlink modified 12 months ago • written 12 months ago by Joe16k

I just want to add that with the next release of Biopython (ver. 1.74), format_alignmentof pairwise2 will prettily print local alignments (only the aligned part) with start positions (in 1-based notation).

ADD REPLYlink modified 12 months ago • written 12 months ago by Markus270

Thanks Markus!

I thought that might be the case as I mentioned in the post, but as I was only testing with short sequences vs other short sequences (and am not super familiar with what pairwise2 returns in local matches), I wanted to make sure the indices corresponded to the original sequence and not just the aligned regions.

You say its a 'special case', and not necessary in this instance, is that just because this is a short sequence? I'm trying to better wrap my head around the output/limitations of pairwise for future reference.

Good point about the aln[0], I should have mentioned in my post that I was making the assumption that OP would only be interested in the best match.

ADD REPLYlink written 12 months ago by Joe16k

I mean 'special' in contrast to general, we have a well defined use case: it's a local alignment and both sequences are clearly different in size. Because it's a local alignment, there will be no gaps in the 'unmatched' regions, so you don't need to take care of them (e.g. subtract the number of gaps). And because the plasmid is the bigger sequence, it's the one that 'dominates' the numbering of the aligned sequences:

0123456789  position
AAAGGGGCCC  'plasmid'
---GGGG---  'binding-site-motif'

The pairwise2 alignment result would be [('AAAGGGGCCC', '---GGGG----', score, 3, 7)].

ADD REPLYlink modified 12 months ago • written 12 months ago by Markus270

I think the lower part (finding the actual start position within the plasmid sequence) is a little bit too complicated in this special case. pairwise2returns the aligned sequences, the score, and the begin and end values, which are the Python indices for the aligned part of the aligned sequences. However, since you are using a local search, there will be no gaps at the beginning of the alignment and because you are looking for a short sequence (binding site) within a long sequence (plasmid), start = aln[0][3] will give you the start position of your binding site within the plasmid sequence (in 0-based notation, so start = 3 means base number 4). Please note that there are possibly more than one hit in your alignment (e.g.: aln[1]) so you may want to iterate through them (e.g. for ali in align: print(ali[3]) to see all possilbe starting points), however, only those alignments with the same maximum score will be reported.

ADD REPLYlink modified 12 months ago • written 12 months ago by Markus270
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1881 users visited in the last hour