Question: Aligner for super-short sequences
0
gravatar for IV
3.7 years ago by
IV1.2k
USA
IV1.2k wrote:

I have a long list of very short sequences (6-10nt long) that have to be aligned against a very small part of a mammalian genome (2-5Mb).

The region is not continuous but derived from a few thousand smaller fragments. The index file/database is created as the collection of these individual fragments (not concatenated).

The problem is that mismatches + indels (up to two per sequence) are expected and I want to keep all the multimapped positions.

Tha aligner should be quite fast and as exhaustive as possbile.

Already tried bwa, novoalign, blast, gem, bowtie2 and soap + dynamic programming (which works but is very slow).

All the tested aligners have different problems with this task.

Until now bowtie2 seems to be more thorough but takes significant time to run in local mode.

Any suggestions? Since the sequences are quite short the aligner does not have to be derived only from the NGS cosmos.

Some further info based on the received comments:

The sequences are binding conformations and the index sequences are bona fide binding sites. Therefore there are no false positives and we don't expect unique alignments for each sequence.

The optimal aligner should return results residing in the best stratum: i.e. if there are 50 matches with no mismatches then there is no need to report results with e.g. 2 mismatches and an indel.

The number of returned results is not a problem but is actually desired. Therefore the aligner should report all multimaps belonging to the same stratum.

The identified sites will be filtered based on site properties and the true sites can be identified. However, you need the candidate regions to do the filtering.

We have found aligners that can partly solve the problem but some do not report all multimaps, others have issues with indels or are too slow.

We have started coding an in-house aligner for the task but it would save time if there is something ready out there.

--Edit2 based on comments

I would like to thank fellow biostars for trying to point us in another way and help us avoid a mistake or loss of time.

However what has been asked is exactly what we're looking for: "Does anyone know of a (fast) aligner that can handle (many) multimaps and indels for short sequences".

alignment • 1.4k views
ADD COMMENTlink modified 3.7 years ago by dariober10.0k • written 3.7 years ago by IV1.2k
3
gravatar for Brian Bushnell
3.7 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I'm not sure that 6bp sequences will be at all useful, particularly with indels.  A 6bp sequence with 2 edits would map virtually everywhere.  What exactly are you trying to do?

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Brian Bushnell16k

The indexed regions are not many and quite small. Even if a sequence maps in multiple positions, it's not a problem because they can be filtered afterwards. Nevertheless, what is important is to get those positions, despite their relatively large number. We're looking for an aligner that can manage such a task (despite its uniqueness) and that can do it as fast and as exhaustively as possible.

ADD REPLYlink written 3.7 years ago by IV1.2k
1

I don't know of a fast aligner that can solve this problem.  But, BBDuk can sort-of solve it using kmer matching.  It does not produce sam output, but it can mask all locations in the reference that match your criteria (by changing them to lowercase or to a custom symbol), and you could later use a simple script to parse those locations.  I'm not sure if this addresses your problem.  But, for example, if you had these two files -

genome.fa:

>1
AAACCCTTTGGAAACCCTTTTTTTGG

short.fa:

>1
AAACCC

Then you could run this:

bbduk.sh in=genome.fa ref=short.fa out=masked.fa kmask=lc k=6 mm=f

...and the output would be:

>1
aaacccTTTGGaaacccTTTTTTTGG

You can set "edist=2" to allow an edit distance of 2.  It will do this very quickly and absolutely exhaustively, with no heuristics applied.  If you want to be able to tell which features are being masked by which sequence, though, you'd have to run it once per sequence.  It's fast enough that that won't really be a problem, though, just annoying.

ADD REPLYlink written 3.7 years ago by Brian Bushnell16k

Brian BBDuk seems realy impressive (for this issue and also for read decontamination). I will definitely check it out.

ADD REPLYlink written 3.7 years ago by IV1.2k

Is edit distance penalty the same for mismatches and indels?

ADD REPLYlink written 3.7 years ago by IV1.2k

Yes...  there's no penalty for either.  But, actually, I need to correct what I said - I forgot that it does not currently handle more than 1 deletion.  So, for edit distance 2, it handles:

1 mismatch

1 insertion

1 deletion

2 mismatches

1 mismatch + 1 deletion

1 mismatch + 1 insertion

2 insertions

1 insertion + 1 deletion

but not 2 deletions.

I might add that in the future, though, just for completeness.  Incidentally, at edit distance 3, it can do 2 deletions and 1 insertion.  I generally don't use edit distance anyway, just hamming distance, which is more relevant for decontamination of Illumina reads.

ADD REPLYlink written 3.7 years ago by Brian Bushnell16k

After a few days of testing, BBDuk seems to be the fastest solution and it does return tons of results more than anyting else we've tried (which is a good thing). We get as many hits as with a classic dynamic programming approach.

The only issue with this solution is that you don't get where and which kmer of your larger input sequence (for instance a 10nt input vs a 30nt sequence, with a 7mer search) binds. In some cases you get too many lower case results.

Therefore I have to run a local alignment algorithm in the reduced search space following bbduk which increases run time. Is there a way to keep the information of where has a kmer bound (and which) directly from BBDuk?

 

ADD REPLYlink written 3.6 years ago by IV1.2k
1

The only way to get more information, currently, is to do multiple runs.  E.G. Make a set of all your 30bp sequences, and run BBDuk with k=30; then make a set of 29bp sequences, and run BBDuk with K=29, etc.  This will at least eliminate the problem of excessive masking.

I could modify BBDuk (or actually its successor, Seal) to produce sam files, or something similar, indicating which reference sequence hits which location.  I hadn't really thought about it because you're using it backwards from the way I normally use it, but I'll look into it.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Brian Bushnell16k

That's what we're doing at the moment (multiple runs) but it takes longer to run.

A SAM or SAM-type output would be more than ideal.

ADD REPLYlink written 3.6 years ago by IV1.2k

Any luck in adding SAM support to seal/BBDuk? :)

ADD REPLYlink written 3.6 years ago by IV1.2k

Not yet :)  It's on my list!

ADD REPLYlink written 3.6 years ago by Brian Bushnell16k

Even for a 10-mer, my rough estimate is that you will see ~2000 random 2-diff hits in a 2Mbp region. Tens of thousands of false hits for a 6-mer. I would first try my filtering strategy on some quick-and-dirty mapping. If I saw no signal, no point to seek the best possible mapping.

ADD REPLYlink written 3.7 years ago by lh331k

As Brian says, it'd be easier to help here if you could explain the biological motivation behind this questio

ADD REPLYlink written 3.7 years ago by aidan-budd1.9k
2
gravatar for fred.s.kremer
3.7 years ago by
Brazil
fred.s.kremer70 wrote:

Have you tried SSAHA?

 

https://www.sanger.ac.uk/resources/software/ssaha/#t_4

ADD COMMENTlink written 3.7 years ago by fred.s.kremer70

One of our two in-house-developed algorithms is similar to ssaha. I will check out the paper and the algorithm to see if it can be of assistance. Thanks for the pointer!

ADD REPLYlink written 3.7 years ago by IV1.2k
1
gravatar for dariober
3.7 years ago by
dariober10.0k
WCIP | Glasgow | UK
dariober10.0k wrote:

I second the suggestion others have given about better defining the biological question. But for the moment I'd suggest vmatch. It should be fast and sensitive enough. For example:

Index the reference, use -pl small enough to map small queries:

mkvtree -db ref.fa -dna -pl 4 -allout -v

Then map queries; allow alignments to differ by up to 10% and with evalue up to 10:

vmatch -d -p -showdesc 0 -e 10p -evalue 10 -complete -q queries.fa ref.fa

 

ADD COMMENTlink written 3.7 years ago by dariober10.0k
0
gravatar for andrew.j.skelton73
3.7 years ago by
London
andrew.j.skelton735.6k wrote:

If I remember right, bowtie (the original), was designed for pretty short reads. Your other option is probably Velvet - Any reason why your reads are so short?

ADD COMMENTlink written 3.7 years ago by andrew.j.skelton735.6k

Unfortunately bowtie1 cannot handle indels. The experimental design imposes the very short sequence length. Unfortunately it cannot be changed.

ADD REPLYlink written 3.7 years ago by IV1.2k
0
gravatar for mikhail.shugay
3.7 years ago by
mikhail.shugay3.3k
Czech Republic, Brno, CEITEC
mikhail.shugay3.3k wrote:

Are you looking for some motifs, e.g. TFBS in gene promoters? If yes, then you can use MEME suite or similar application for it. +1 to Brian's answer, it looks like we have some sort of XY problem here.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by mikhail.shugay3.3k

MEME is for finding motifs and is not relevant with the posed question.

ADD REPLYlink written 3.7 years ago by IV1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1057 users visited in the last hour