Hi all, I started using Biopython for DNA sequence analysis a while ago. I am now facing a more complex problem and I wonder if there might be a straightforward solution in Biopython which I have missed so far. I have added a short description of my problem at the end of this post for everyone who is not interested in the long description ;)
Detailed description of my problem:
I would like to reconstruct a DNA sequence in Biopython starting from multiple reads (most likely 3 or 4 reads which together cover the whole DNA sequence of a protein). The reads might be a combination of forward and reverse reads, so some of them might contain the sense strand and others the anti-sense strand. I know the sequencing primer and the position of that primer for every read, so it would easily be possible to put the reads in the right order and convert all of them to the sense strand, for example.
My problem is that there will be overlaps between the different reads, and these overlaps will not always have the same length. So for example, if the sequence of my protein is 999 bases long, I won't get three reads which sum up to 999 bases (3x333). I will most likely get three reads summing up to about 1400 bases. Is there a straightforward way in Python to remove the redundant information in the overlap regions between different reads and combine the reads into one sequence?
I also know the reference sequences (at the DNA level) of all proteins I am looking for in my analytical data. My plan was to first construct a single sequence out of my reads and then compare that sequence to all reference sequences in order to identify whether the sequence (i.e. the reads) in my analytical data set corresponds to one of the reference proteins. Since all reference proteins share a common framework and only differ in several specified positions, it would be possible to align the reads onto the reference sequence(s) in order to construct a single sequence (maybe an alignment is not even necessary, since I know the sequencing primers, their order and their positions). Unfortunately, I have no easy idea how to solve the problem of sequence overlaps between the reads.
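To make the question more concrete, here is a minimal plain-Python sketch of the naive approach described above. It is only an illustration under strong assumptions: the reads are already sorted by primer position, the strand of each read is known, and the overlaps match exactly (no sequencing errors, no low-quality ends). The function names (reverse_complement, overlap_length, merge_reads, mismatch_positions) and the 60-base toy reference are made up for this example and are not Biopython functions. In Biopython itself, a Seq object has a reverse_complement() method that could replace the first helper.

```python
# Naive overlap merge for reads that are already ordered by primer position.
# Assumption: overlaps are exact; real reads with errors need a proper aligner.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def overlap_length(left, right, min_overlap=10):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(reads, min_overlap=10):
    """Merge sense-strand reads given in 5'->3' order into one sequence."""
    merged = reads[0]
    for read in reads[1:]:
        size = overlap_length(merged, read, min_overlap)
        if size == 0:
            raise ValueError("no exact overlap of >= %d bases" % min_overlap)
        merged += read[size:]  # append only the non-redundant part
    return merged


def mismatch_positions(seq, reference):
    """0-based positions where two equal-length sequences differ."""
    if len(seq) != len(reference):
        raise ValueError("sequences differ in length; an alignment is needed")
    return [i for i, (a, b) in enumerate(zip(seq, reference)) if a != b]


# Toy data: a made-up 60-base reference covered by three reads.
reference = "ATGGCTAGCAAGGAGCTGTTCACCGGTGTTGTCCCGATTCTGGTAGAACTGGATGGCTAA"
raw_reads = [
    (reference[:30], False),                        # forward read
    (reverse_complement(reference[20:50]), True),   # reverse read
    (reference[40:], False),                        # forward read
]

# Bring every read onto the sense strand, then merge in primer order.
oriented = [reverse_complement(s) if is_reverse else s
            for s, is_reverse in raw_reads]
merged = merge_reads(oriented)

print(merged == reference)
print(mismatch_positions(merged, reference))
```

With several reference sequences of the same length, the merged sequence could be compared to each one with mismatch_positions, keeping the reference with the fewest differences. What this sketch cannot handle is exactly the part I am unsure about: overlaps that do not match perfectly because of sequencing errors.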
Short version: is there an "easy" or straightforward solution in Biopython for constructing a single sequence from multiple reads with overlapping regions?
Thank you in advance!