Question

Rapid lookup of barcodes

2

Entering edit mode

21 months ago

Michael ▴ 270

I have a list of 90,000 barcodes with length of 15bp.

Now I have to compare another list with millions of 14-16bp long sequences against the barcodes. I want to calculate the levenshtein-distance (edit distance) and keep the matches of distance 0 or 1.

What tool would you ideally use to do this?

I guess it has to be a fast implementation since we have about 90,00 x 3,000,000 = 270 Billion comparisons. So it has to be a method which works with some type of index of the barcodes

levenshtein-distance NGS barcodes edit-distance • 778 views

ADD COMMENT • link updated 21 months ago by ATpoint 82k • written 21 months ago by Michael ▴ 270

1

Entering edit mode

You could try bbduk.sh in filter mode. A guide is available: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/

It can use multiple threads and will be plenty performant as long as you can make it work for your use case.

ADD REPLY • link 21 months ago by GenoMax 141k

0

Entering edit mode

Wouldn't it make sense to generate all possible barcodes based on your 90k file that are within that edit distance, then filter the second list for these (and be it with awk/seen) and then retain those entries that were part of the original 90k file? That way you avoid making these billions of comparisons and it comes down to keeping entries that are matched in a file.

ADD REPLY • link 21 months ago by ATpoint 82k

score 1 · Answer 1 · 2022-07-11

I think bowtie2 can be persuaded to do that with some tweaking of the parameters. I'm not sure if the edit distance it reports is Levenshtein, but maybe you are not too strict about that.

I used bowtie2 to align fastq reads against a library of guide RNAs using these settings:

bowtie2 -x my90k.fa -U my3M.reads.fq -N 1 -L 18 --local -p 8 --score-min L,10,0

Where my90k.fa would be your indexed file of the 90,000 barcodes. In your case, you should reduce -L (seed length) and if you want all matches for each read, add the -a option. Check also the --local option is necessary. If your 3 million reads are in fasta also add the -f option.

(I'd be interested in which solution you go for)