I am looking for a painless method for conducting a very small assembly of short sequences based on exact identity. Simply put, I have an NGS sample that I believe is contaminated with a common sequence. I scanned a few million reads and determined the top 50 most abundant kmers of length 25nt. Browsing these top 50 kmers, it is clear that they are mostly staggered windows of a single sequence, and I would like to assemble these 50 kmers by overlapping identity.
Short of writing a Perl script, does anyone know of a simple way to do this? Thanks!
Hi @Torst, thanks for your descriptive explanation! While it certainly solves the problem, I should explain that I am more interested in finding a simple assembler for this. I would actually like to run such an assembler on the top 500 kmers, which likely derive from about 10 reference seqs and would hopefully assemble into 10 different contigs. Furthermore, the reads may come from either strand, so some of the top kmers could be reverse complements of other kmers, and I would want the assembly to consider both orientations of every kmer. Am I wrong in assuming it would be tedious to complete such a task using Clustal Omega?
CAP3 would do a good job, but it will need a few parameters tweaked for your situation:
http://seq.cs.iastate.edu/
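If tuning an assembler for 25-mers turns out to be more trouble than it is worth, the exact-overlap merging described in the question can be sketched in a few lines of Python. This is only a rough greedy sketch, not what CAP3 does internally: the function names (`revcomp`, `overlap`, `canonical`, `assemble`) and the `min_overlap=15` default are my own assumptions, it only accepts perfect suffix/prefix matches, and it tries both strands of every sequence. Repeats or low-complexity kmers could still cause wrong joins, so treat the output as a starting point.

```python
# Greedy exact-overlap merging of short sequences, strand-aware.
# All names and the min_overlap default are illustrative assumptions.

_COMP = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return "".join(_COMP[base] for base in reversed(seq))


def canonical(seq):
    """Strand-independent form: the lexicographically smaller strand."""
    return min(seq, revcomp(seq))


def overlap(left, right, min_len):
    """Longest exact match between a suffix of left and a prefix of right.

    Returns 0 if no match of at least min_len exists.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def assemble(kmers, min_overlap=15):
    """Repeatedly merge the pair with the longest exact overlap."""
    # Collapse exact duplicates and reverse-complement duplicates.
    contigs = sorted({canonical(k.upper()) for k in kmers})
    while True:
        best_n, best_pair, best_merged = 0, None, None
        for i in range(len(contigs)):
            for j in range(i + 1, len(contigs)):
                a, b = contigs[i], contigs[j]
                # These four arrangements cover every relative order and
                # orientation (the others are reverse complements of these).
                for left, right in ((a, b), (b, a),
                                    (a, revcomp(b)), (revcomp(a), b)):
                    n = overlap(left, right, min_overlap)
                    if n > best_n:
                        best_n = n
                        best_pair = (i, j)
                        best_merged = left + right[n:]
        if best_n == 0:
            break  # nothing left to join
        contigs = [c for idx, c in enumerate(contigs) if idx not in best_pair]
        # Drop anything now fully contained in the merged contig.
        contigs = [c for c in contigs
                   if c not in best_merged and revcomp(c) not in best_merged]
        contigs.append(best_merged)
    return sorted(canonical(c) for c in contigs)


if __name__ == "__main__":
    import sys
    # Expects one kmer per line on stdin; prints one contig per line.
    for contig in assemble(line.strip() for line in sys.stdin if line.strip()):
        print(contig)
```

Saved as, say, `merge_kmers.py` (a hypothetical filename), it could be run as `python merge_kmers.py < top_kmers.txt`. Each round compares all pairs, so the cost grows roughly cubically with the number of kmers: comfortable for 50, noticeably slow for 500 unless you index the prefixes first.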