Question: Best local aligner for finding fusion events?
1
gravatar for flamholz
4.5 years ago by
flamholz10
United States
flamholz10 wrote:

I am working with sequencing data from a transposon mutagenesis library. I want to identify the exact site of transposon insertion into a plasmid (i.e. 10 kb genome). I have many millions of 100 bp unpaired reads off the plasmid. I have code that aligns to the plasmid and the transposon sequence and performs some DNA math on reads matching both sequences.

I have tried a number of different aligners - BLAT, bowtie2, subalign - and all of them fail to identify 100% of true positive insertions in a synthetically generated library of reads with 20 bp of homology both to plasmid and transposon. The best I can get is about 92% recall. Is there an aligner out there designed for this task? Identification of perfect or near perfect 15-20 bp matches within short reads? It is important that the software be able to find matches that begin in the middle of the read, of course.

Thanks for your help!

transposon alignment • 1.0k views
ADD COMMENTlink modified 4.5 years ago by Brian Bushnell16k • written 4.5 years ago by flamholz10
1

Recall of 92% sounds pretty good, to be honest.

ADD REPLYlink written 4.5 years ago by Sean Davis25k

I would expect any tool which is actually meant for this kind of task to perfectly map a synthetic dataset with 20bp overlap at the insertion site. Even once you move into real data (with sequencing errors & variable degrees of overlap) an 8% rate of unmapped reads seems like a lot.

ADD REPLYlink written 4.5 years ago by alex.rubinsteyn130

I guess the definition of "recall" isn't clear.  I was assuming that the "recall" was for an aligner to correctly align a read with a synthetic "insertion" in it.  I agree that a mapping rate of 92% for synthetic data would be unexpected.  Perhaps a clarification of "recall" would be useful, but my comment wasn't a very academic one, more of an intuitive comment, so it may not be important to follow up on it.

ADD REPLYlink written 4.5 years ago by Sean Davis25k
2
gravatar for Brian Bushnell
4.5 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

I suggest using BBDuk for this purpose; it's designed for exact and near-exact kmer matches.  For the transposon:

bbduk.sh in=reads.fq outm=matched.fq k=20 hdist=1 mm=f literal=ACGTACGTACGTACGTACGT

For the plasmid, use "ref=plasmid.fa" instead of the "literal" flag.

BBDuk can also trim the reads based on where the transposon is, so you can then then map the remainder to the plasmid, or something like that.

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Brian Bushnell16k

Brian -- this looks great so far. Can I get BBDuk to output the boundaries of the match? Or does it only output the reads that match and not the position of the match within the read?

ADD REPLYlink written 4.5 years ago by flamholz10

In a single pass, it will only filter the reads or trim them or mask them.  But, after first filtering, you can get the boundaries of the match like this:

bbduk.sh in=reads.fq out=matched.fq kmask=lc k=20 hdist=1 mm=f literal=...

That will convert the matching bases to lowercase, rather than doing a filter operation.  You can alternately convert them to N with kmask=N, or another symbol that is easy to parse (like kmask=Z).

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Brian Bushnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1846 users visited in the last hour