Merging a number of overlapping Sanger sequences
2
0
4.2 years ago
thomas.welch ▴ 50

Hi there,

I have 80 DNA samples for which we have sequenced three overlapping sections of a large gene using the Sanger method. I am now looking for a way to merge these sequenced segments into a single sequence per sample so that the samples can be aligned for analysis.

I have come across a couple of tools for doing this with just two overlapping sequences (such as EMBOSS), and I've seen that this can be done with BioEdit for one sample at a time, but is there a tool that allows me to do this in bulk? Or will I have to align and assemble them as I would with NGS data of a genome?

Kind Regards, Tom

merge sequence alignment • 3.9k views
2

You should give tadpole.sh from BBMap a try. It should work with fasta formatted sequences.

0

This would be trivially simple if you had access to Sequencher, DNASTAR, or ContigExpress from Vector NTI, among others (note: these are all commercial software packages and are not free). The Consed suite will work as well, but it requires signing an academic agreement and some effort on your part to install everything.

1
4.2 years ago
Fabio Marroni ★ 2.8k

I would suggest the phred/phrap/consed suite. It was widely used in the "old" Sanger days, and it worked pretty well. Consed is a "finishing" tool, but it is nevertheless pretty useful for visualizing assemblies and correcting errors. The major drawback is that you might have to invest some time in learning how to use these tools.

0
3.2 years ago
ferroao ▴ 20

If you have a FASTA file with all the sequences, you can use this R script:

# install libraries and dependencies
# necessary for sangeranalyseR
# for ex.
# BiocInstaller::biocLite("DECIPHER")

# install sangeranalyseR package
library(devtools)
install_github("roblanf/sangeranalyseR")
library(sangeranalyseR)

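# read fasta file with several sequences
# make DNAstring objects
# ("myfastas.fas" is a placeholder file name)
reads <- Biostrings::readDNAStringSet("myfastas.fas")

# merge sequences
merged.reads <- merge.reads(reads)

# consensus
merged.reads$consensus
BrowseSeqs(merged.reads$alignment)

# write to file ("myout.fas" is a placeholder file name)
Biostrings::writeXStringSet(Biostrings::DNAStringSet(merged.reads$consensus), "myout.fas")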


Or this Python script:

python3.4 combineSequences.py -f myfastas.fas -r myout.fas

Script in: https://gitlab.com/ferroao/msa. Copied from Rosa Tung: https://github.com/rostun/DNA_multiple_sequence_alignment
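
To illustrate what merging in bulk involves, here is a minimal Python sketch. It is not the script linked above and not how any of the tools named in this thread work. It assumes every FASTA header has the hypothetical form `<sample>_<segment number>`, that all segments are in the same orientation, and that neighbouring segments share an exact overlap of at least `min_overlap` bases. Real Sanger reads usually have low-quality ends and mismatches, so an alignment-based tool is the safer choice; the function names here are made up for illustration.

```python
from collections import defaultdict


def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_pair(left, right, min_overlap=20):
    """Join two sequences on their longest exact suffix/prefix overlap."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    raise ValueError("no exact overlap of at least %d bases" % min_overlap)


def merge_samples(records, min_overlap=20):
    """Group records by sample name and merge each group's segments in order."""
    groups = defaultdict(list)
    for header, seq in records:
        # Assumed header convention: "<sample>_<segment number>"
        sample, segment = header.rsplit("_", 1)
        groups[sample].append((int(segment), seq))
    merged = {}
    for sample, segments in groups.items():
        segments.sort()  # order segments by their number
        contig = segments[0][1]
        for _, seq in segments[1:]:
            contig = merge_pair(contig, seq, min_overlap)
        merged[sample] = contig
    return merged
```

With a file opened as `handle`, `merge_samples(read_fasta(handle))` returns a dictionary mapping each sample name to its merged sequence, and a sample whose segments do not overlap exactly raises a `ValueError` instead of being joined silently.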

2

Hi ferroao

You claim you've "forked" your code from https://github.com/rostun/DNA_multiple_sequence_alignment, but you've actually copied their code over to a different git site (GitHub vs GitLab). Also, all you've done is make the input and output files command-line arguments (and add an unnecessary step to strip empty lines). You have not changed any of the underlying algorithm. Have you at least addressed the 50-sequence, 1000-length, ATCG-only limitations?

I'd like to understand why you're spamming old threads with a script you did not author, when the script is 2 years old, has so many limitations, and was written as part of what is definitely a Rosalind challenge.

If you sincerely think the script is performant, please create a Tool type post for it.

0

I think my answer is appropriate to this question. You can use the moderate option if you wish. Most of the limitations you mention apply only to the example.txt, not to the script. Best,

0

No, but I do not appreciate code adapted from repositories without due attribution, especially when the contribution after adaptation is negligible. The code is from a Rosalind challenge by an amateur coder, so I am pretty sure it is not as good as established, tested tools. In addition, going back to year-old posts to add an answer advertising a poor solution that is also ill-adapted to the question is not recommended. I will not use the moderate option, as what you're doing is not inappropriate, just a little ill-advised.