I am using FASTX barcode splitter. I have 4 barcodes of length 7. How do I figure out how many mismatches to allow? Is 3 too many?
Usually, barcodes are designed to have a minimum Hamming distance of 2. That means at least 2 substitutions are required to get from any barcode A to any other barcode B. Normally, I would only keep reads whose barcodes match with NO mismatches. But if that filters out too many reads, owing to sequencer errors at the beginning of the reads, then I would allow a maximum of 1 mismatch to recover more reads. This is my personal choice, though. Either way, going beyond 2 would be bad; I guess everyone would agree with that. It depends on how many reads are thrown away because of barcode mismatches.
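To see what the minimum distance actually is for your own set, you could compute all pairwise Hamming distances. This is just a minimal Python sketch, not part of FASTX; the function names are mine, and the barcodes are the length-4 ones used for illustration further down (swap in your four 7-mers).

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance over all pairs of barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

# Illustrative barcodes only; replace with your real ones.
barcodes = ["ACGT", "TGCT", "CGTT"]
print(min_pairwise_distance(barcodes))  # prints 2 (TGCT vs CGTT)
```

The smaller this number, the fewer mismatches you can allow before two barcodes start competing for the same read.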
If you're sure about it, then yes, I think so. Maybe to be safe, you can also check whether the smallest Hamming distance for a given read occurs with only 1 of your barcodes. For example:
Barcodes (illustrated for length 4): ACGT, TGCT, CGTT, and the read barcode is AGTT.
Then the mismatches are 2, 2, 1. There is exactly 1 barcode, CGTT, that matches with the lowest Hamming distance. So, with relatively little ambiguity, you can say that this read belongs to the sample with barcode CGTT. Of course, there is a possibility that this is wrong, but it is less probable, I'd say.
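The check in that example could be sketched in Python roughly like this. It is only an illustration of the idea, not how FASTX does it internally; `assign_barcode` and `max_mismatches` are names I made up, and it assumes the read barcode has already been cut to the same length as the barcodes.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(read_barcode, barcodes, max_mismatches=1):
    """Return the single closest barcode, or None if the best match
    exceeds max_mismatches or is tied between several barcodes."""
    distances = {bc: hamming(read_barcode, bc) for bc in barcodes}
    best = min(distances.values())
    if best > max_mismatches:
        return None
    winners = [bc for bc, d in distances.items() if d == best]
    return winners[0] if len(winners) == 1 else None

barcodes = ["ACGT", "TGCT", "CGTT"]
print(assign_barcode("AGTT", barcodes))  # prints CGTT (distances 2, 2, 1)
```

A read that comes back as None is either too far from every barcode or equally close to more than one, so it is dropped rather than guessed.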
Did you try counting the reads that belong to each of the barcodes to see how many you are able to recover? How many get lost with 1 mismatch, 2 mismatches, etc.? Then maybe you can get an idea of a better threshold. Suppose you are able to recover 95% of the reads with 1 mismatch; then maybe it's safer that way, isn't it?
So, basically, yes, it would be okay to push the limit to 3 if the distance between your barcodes is at least 4. Just make sure each read goes to its closest barcode, since a read 3 mismatches away from one barcode can be only 1 away from another. On the other hand, it would be safe to remove the ambiguous reads (those with more than 1 barcode at the same lowest distance). It might be wiser to find a threshold by recovering reads with 0, 1, 2 and 3 mismatches and choosing the level at which you have recovered enough reads.
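One way to do that tally, sketched in Python under some assumptions: you have already pulled the barcode-length prefix of each read into a list (the `read_barcodes` below are invented toy values, not real data), and ambiguous reads are discarded as described above. The function name is hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def recovery_by_threshold(read_barcodes, barcodes, max_threshold=3):
    """For each threshold t in 0..max_threshold, return the fraction of
    reads whose unique closest barcode is within t mismatches."""
    best_counts = Counter()
    for rb in read_barcodes:
        dists = sorted(hamming(rb, bc) for bc in barcodes)
        # Skip reads whose lowest distance is shared by two or more barcodes.
        if len(dists) == 1 or dists[0] < dists[1]:
            best_counts[dists[0]] += 1
    total = len(read_barcodes)
    fractions, recovered = {}, 0
    for t in range(max_threshold + 1):
        recovered += best_counts[t]
        fractions[t] = recovered / total if total else 0.0
    return fractions

barcodes = ["ACGT", "TGCT", "CGTT"]
read_barcodes = ["ACGT", "AGTT", "AAAA", "TGCT"]  # toy input, not real reads
for t, frac in recovery_by_threshold(read_barcodes, barcodes).items():
    print(t, frac)
```

If the fraction barely moves between two thresholds, the extra mismatch buys you almost nothing and the lower one is the safer pick.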
Good luck!
What are the indexes? barcodes?
@Arun: Barcodes. Yes.