I have a reference FASTA file containing 130,000 unique sequences (barcodes), each 30 nt long. These sequences were synthesized with random incorporation of nucleotides at each position, so the Hamming distance between any two barcodes is very large. I had a pool of cells, each carrying a single barcode; the cells were sorted by FACS and the genomic DNA was extracted. The barcode regions were then amplified by PCR and sent for 50 bp single-end Illumina sequencing. What I want to do now is count how many times each of those 130,000 barcodes appears in the FASTQ file. Because the Hamming distance is large, I would like to treat a read that differs from a barcode by 2 or 3 nucleotides as that barcode. What is the best way to do this? Currently I am testing bwa-mem, but I am wondering whether there are better approaches for this task.
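To make the goal concrete, here is a minimal Python sketch of the kind of counting I mean. It is not a tested pipeline, and it rests on assumptions that are not stated above: the barcode starts at a fixed position in every read (the hypothetical `offset` parameter), only substitutions matter (no indels), and a read that falls within the mismatch limit of more than one barcode is discarded. The function names are made up for illustration. It uses the pigeonhole idea: if a read has at most k mismatches against a barcode, then at least one of k+1 non-overlapping chunks must match exactly, so only barcodes sharing a chunk need a full comparison.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def read_fastq_sequences(path):
    """Yield the sequence line of each 4-line FASTQ record (no validation)."""
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 1:
                yield line.strip()


def build_seed_index(barcodes, n_chunks):
    """Index every barcode by each of its n_chunks non-overlapping chunks."""
    length = len(barcodes[0])
    bounds = [(i * length // n_chunks, (i + 1) * length // n_chunks)
              for i in range(n_chunks)]
    index = defaultdict(list)
    for bc_id, barcode in enumerate(barcodes):
        for chunk_no, (start, end) in enumerate(bounds):
            index[(chunk_no, barcode[start:end])].append(bc_id)
    return index, bounds


def count_barcodes(reads, barcodes, max_mismatches=3, offset=0):
    """Count reads per barcode, allowing up to max_mismatches substitutions.

    Assumes all barcodes have the same length and that the barcode starts
    at position `offset` in every read (an assumption, not a given).
    """
    length = len(barcodes[0])
    # Pigeonhole: max_mismatches + 1 chunks guarantee one exact chunk match.
    index, bounds = build_seed_index(barcodes, max_mismatches + 1)
    counts = Counter()
    for read in reads:
        window = read[offset:offset + length]
        if len(window) < length:
            continue  # read too short to contain a full barcode
        candidates = set()
        for chunk_no, (start, end) in enumerate(bounds):
            candidates.update(index.get((chunk_no, window[start:end]), ()))
        hits = [bc_id for bc_id in candidates
                if hamming(window, barcodes[bc_id]) <= max_mismatches]
        if len(hits) == 1:  # skip unmatched and ambiguous reads
            counts[barcodes[hits[0]]] += 1
    return counts
```

With the real data this would be called roughly as `count_barcodes(read_fastq_sequences("reads.fastq"), barcodes, max_mismatches=3, offset=...)`, where the file name is a placeholder and the offset depends on the amplicon design.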
Thanks, this works, but it is extremely slow. Are there any alternatives?