Python - Joining Overlapping DNA sequences
1
0
Entering edit mode
9.0 years ago
st.ph.n ★ 2.7k

Biostars,

I have some DNA sequences, of various lengths where some are forward orientation, and some are reverse. If reverse, they need flipped, and the complement sequence used. Except there's no way to really tell what orientation it is from the start.

I can use the following snippet to join two sequences, by finding the first 21bp of the second sequence in the first sequence, slicing, and joining. The problem is, I have more than 2 sequences, for example 5, there would be many iterations of this snippet looking for all possible combinations. Also, I'm having trouble determining if a sequence needs flipped. The end result needed, is a single super string, taking into account all 5 sequences.

def joinseqs(first, second):
 ""Assumes all input seqs are forward orientation"""
        x = (first, second)
        find_overlap = x[1][:21]
        overlap = x[0].find(find_overlap)
        if overlap < 0:
                return None
        else:
               begin = x[0][:overlap]
               seq = begin + x[1]
               return seq

All help appreciated.

dna python • 5.5k views
ADD COMMENT
0
Entering edit mode

Hi, I have written a pure python package called pydna that implements a sequence Assembly algorithm based on graph theory. Look at the pydna.Assembly class. It was recently published in BMC Bioinformatics.

Hope this helps / Bjorn

ADD REPLY
1
Entering edit mode
9.0 years ago

You might try Dedupe in the BBMap package. There's a thread about it here: dedupe.sh output from BBTools

It can very quickly do an all-to-all comparison of an arbitrary number of sequences and print a file containing all detected overlaps, including the orientation. If possible, it will also flip all the sequences so everything is in the same orientation. Then you could parse that file to decide how to assemble the superstring.

dedupe.sh in=x.fq am=t ac=t fo=t c=t cc=t pc=t fmj=t rc=f rnc=f mcs=2 k=15 mo=21 s=1 pto=f ngn=f dot=graph.dot out=flipped.fq

That command will take the sequences in x.fq and find the overlaps, print them in graph.dot, and print the sequences flipped to the same orientation (when possible) in flipped.fq. It also accepts fasta input. The s=1 flag will allow up to one mismatch; you can set that higher or lower if you want. mo sets the minimum overlap length allowed.

ADD COMMENT
0
Entering edit mode

Thanks for the suggestion Brian, but I want to do this in Python. It's part of a much larger script that I've implemented into a GUI with tkinter for my other lab-mates to utilize.

ADD REPLY

Login before adding your answer.

Traffic: 1470 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6