Question: Need Suggestions For A Greedy Algorithm For Thoroughly Assembling Very Short Reads
1
5.6 years ago by
JacobS890
Cleveland, Ohio
JacobS890 wrote:

I am looking for a painless method for conducting a very small assembly of short sequences based on exact identity. Simply put, I have an NGS sample that I believe is contaminated with a common sequence. I scanned a few million reads and determined the top 50 most abundant kmers of length 25nt. Browsing these top 50 kmers, it is clear that they are mostly staggered windows of a single sequence, and I would like to assemble these 50 kmers by overlapping identity.

Short of writing a Perl script, does anyone know of a simple way to do this? Thanks!

assembly • 1.2k views
modified 5.6 years ago by Torst900 • written 5.6 years ago by JacobS890
2
5.6 years ago by
Torst900
Australia
Torst900 wrote:

So you have 50 sequences, each of 25bp length, and you believe them to be highly overlapping with virtually 100% identity representing a parent sequence of about 75bp or so?

The simplest thing to do is a multiple sequence alignment (MSA) of the 50 sequences. The consensus sequence will be your contaminant sequence. This is a "poor man's" de novo assembly, but it fits your situation well.

To do the MSA you can use clustal-omega:

```
clustalo -i kmers.fasta > kmers.aln
```

To get the consensus, you can use 'cons' from EMBOSS:

```
cons -plurality 0 -sequence kmers.aln -outseq contaminant.fasta
```
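For anyone curious what the consensus step amounts to, here is a minimal Python sketch of a per-column vote over an alignment. It is only an illustration of the idea, not a reimplementation of EMBOSS `cons`; the function name `plurality_consensus` and the toy alignment are made up for this example, and it assumes the aligned rows are already padded with `-` to equal length.

```python
from collections import Counter


def plurality_consensus(aligned, gap="-"):
    """Return the most common non-gap character in each alignment column."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != gap)
        if counts:  # columns made only of gaps contribute nothing
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical toy alignment of three staggered 8-mers (not real data).
rows = [
    "ACGTACGT----",
    "--GTACGTTA--",
    "----ACGTTAGC",
]
print(plurality_consensus(rows))  # ACGTACGTTAGC
```

With staggered, exactly matching k-mers every column is unanimous, so the vote simply stitches the windows back into the parent sequence.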

Hi @Torst, thanks for your descriptive explanation! While it certainly solves the problem, I should explain that I am more interested in finding a simple assembler for this kind of task. I would actually like to use such an assembler on the top 500 kmers, which likely derive from about 10 reference seqs and would hopefully assemble into 10 different contigs. Furthermore, the reads may come from either strand, so some top kmers could be reverse complements of other kmers, and I would want to assemble while considering every possible orientation. Am I wrong in assuming it would be tedious to complete such a task using clustal-omega?

1

CAP3 would do a good job, but it will need a few parameters tweaked for your situation:

http://seq.cs.iastate.edu/