Question

Python - Joining Overlapping DNA sequences

0

Entering edit mode

9.0 years ago

st.ph.n ★ 2.7k

Biostars,

I have some DNA sequences, of various lengths where some are forward orientation, and some are reverse. If reverse, they need flipped, and the complement sequence used. Except there's no way to really tell what orientation it is from the start.

I can use the following snippet to join two sequences, by finding the first 21bp of the second sequence in the first sequence, slicing, and joining. The problem is, I have more than 2 sequences, for example 5, there would be many iterations of this snippet looking for all possible combinations. Also, I'm having trouble determining if a sequence needs flipped. The end result needed, is a single super string, taking into account all 5 sequences.

def joinseqs(first, second):
 ""Assumes all input seqs are forward orientation"""
        x = (first, second)
        find_overlap = x[1][:21]
        overlap = x[0].find(find_overlap)
        if overlap < 0:
                return None
        else:
               begin = x[0][:overlap]
               seq = begin + x[1]
               return seq

All help appreciated.

dna python • 5.5k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 9.0 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Hi, I have written a pure python package called pydna that implements a sequence Assembly algorithm based on graph theory. Look at the pydna.Assembly class. It was recently published in BMC Bioinformatics.

Hope this helps / Bjorn

ADD REPLY • link updated 14 months ago by Ram 43k • written 9.0 years ago by bjorn_johansson • 0

Ram · Answer 1 · 2015-05-01

You might try Dedupe in the BBMap package. There's a thread about it here: dedupe.sh output from BBTools

It can very quickly do an all-to-all comparison of an arbitrary number of sequences and print a file containing all detected overlaps, including the orientation. If possible, it will also flip all the sequences so everything is in the same orientation. Then you could parse that file to decide how to assemble the superstring.

dedupe.sh in=x.fq am=t ac=t fo=t c=t cc=t pc=t fmj=t rc=f rnc=f mcs=2 k=15 mo=21 s=1 pto=f ngn=f dot=graph.dot out=flipped.fq

That command will take the sequences in x.fq and find the overlaps, print them in graph.dot, and print the sequences flipped to the same orientation (when possible) in flipped.fq. It also accepts fasta input. The s=1 flag will allow up to one mismatch; you can set that higher or lower if you want. mo sets the minimum overlap length allowed.