How Many Mismatches For Fastx Barcode Splitter?
11.9 years ago
KCC ★ 4.1k

I am using FASTX barcode splitter. I have 4 barcodes of length 7. How do I figure out how many mismatches to allow? Is 3 too many?

fastx • 3.6k views

What are the indexes? Are they barcodes?


@Arun: Barcodes. Yes.

11.9 years ago
Arun 2.4k

Barcodes are usually designed so that any two of them are at least a Hamming distance of 2 apart. That means at least 2 substitutions are needed to get from barcode A to barcode B. Normally, I would only keep reads whose barcode matches with NO mismatches. But if that filters out too many reads, because of sequencer errors at the beginning of the reads, I would allow a maximum of 1 mismatch to recover more reads. That is my personal choice, and it depends on how many reads are thrown away because of barcode mismatches. Either way, I think everyone would agree that allowing more than 2 would be bad.
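To make the distance argument concrete, here is a minimal Python sketch that computes the smallest pairwise Hamming distance in a barcode set. The barcode list is only a placeholder (the 4-base illustration used later in this thread), and the helper names `hamming` and `min_pairwise_distance` are hypothetical, not part of FASTX. The "guaranteed unambiguous" value uses the standard bound floor((d_min - 1) / 2), which is an assumption added here and not something stated in the answer.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance over all pairs of barcodes."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Placeholder barcodes; substitute your own set.
barcodes = ["ACGT", "TGCT", "CGTT"]

d_min = min_pairwise_distance(barcodes)
# Up to this many mismatches, a read can never be equally close
# to two different barcodes (standard bound, added as an assumption).
guaranteed_unambiguous = (d_min - 1) // 2

print(d_min, guaranteed_unambiguous)
```

With a minimum distance of 4, the same bound gives 1 mismatch as the limit below which ties are impossible; above that, ties can occur and need to be handled explicitly.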


In my case, there is an edit distance of at least 4 between all barcodes. Does this change your answer?


If you're sure about it, then yes, I think so. To be safe, you can also check whether the smallest Hamming distance for a given read occurs with only 1 of your barcodes. For example:

 Barcodes (illustrated for length 4): ACGT, TGCT, CGTT; the read barcode is AGTT.

Then the mismatches are 2, 2 and 1. Only 1 barcode, CGTT, matches at the lowest Hamming distance. So, with relatively little ambiguity, you can say that this read belongs to the sample with barcode CGTT. Of course, this could still be wrong, but I'd say it is less likely.
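The check described above can be sketched in Python as follows. This is only an illustration of the idea, not how FASTX barcode splitter itself resolves matches; the function name `assign_barcode` and its `max_mismatches` parameter are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read_barcode, barcodes, max_mismatches):
    """Return the single closest barcode, or None.

    None is returned when the closest barcode is further away than
    max_mismatches, or when two or more barcodes tie for closest.
    """
    distances = {bc: hamming(read_barcode, bc) for bc in barcodes}
    best = min(distances.values())
    if best > max_mismatches:
        return None
    winners = [bc for bc, d in distances.items() if d == best]
    return winners[0] if len(winners) == 1 else None


barcodes = ["ACGT", "TGCT", "CGTT"]

# Distances are 2, 2, 1, so CGTT is the unique best match.
print(assign_barcode("AGTT", barcodes, max_mismatches=1))

# CGCT is 1 mismatch from both TGCT and CGTT, so it is ambiguous.
print(assign_barcode("CGCT", barcodes, max_mismatches=1))
```

The first call prints `CGTT` and the second prints `None`, because a tie is treated as unassignable.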

Did you try counting the reads that belong to each barcode to see how many you are able to assign? How many are lost at 1 mismatch, at 2 mismatches, and so on? That might give you an idea of a better threshold. If you can recover 95% of the reads with 1 mismatch, then maybe it's safer to stay there, isn't it?

So, basically, yes, it would be okay to push the limit to 3 if the distance between your barcodes is at least 4. On the other hand, it would be safe to remove the ambiguous reads (those for which more than 1 barcode is at the same lowest distance). It might be wiser to find a threshold by recovering reads with 0, 1, 2 and 3 mismatches and choosing the level at which you have recovered enough reads.
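One way to run that comparison is sketched below, assuming the barcode portion of each read has already been extracted into a list of strings. The function name `recovery_by_threshold` and the example reads are made up for illustration. Reads that tie between barcodes are counted as unrecovered at every threshold.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def recovery_by_threshold(read_barcodes, barcodes, max_threshold=3):
    """Fraction of reads recovered at each mismatch threshold.

    A read counts as recovered at threshold t when exactly one barcode
    is closest to it and that barcode is within t mismatches.
    Assumes a non-empty barcode list.
    """
    best_counts = Counter()
    for rb in read_barcodes:
        distances = sorted(hamming(rb, bc) for bc in barcodes)
        is_unique = len(distances) == 1 or distances[0] < distances[1]
        if is_unique:
            best_counts[distances[0]] += 1

    total = len(read_barcodes)
    recovered = 0
    fractions = {}
    for t in range(max_threshold + 1):
        recovered += best_counts[t]
        fractions[t] = recovered / total if total else 0.0
    return fractions


barcodes = ["ACGT", "TGCT", "CGTT"]
# Made-up read barcodes: two exact, one off by 1, one ambiguous.
reads = ["ACGT", "TGCT", "AGTT", "CGCT"]

for t, frac in recovery_by_threshold(reads, barcodes).items():
    print(f"{t} mismatches: {frac:.0%} of reads recovered")
```

If the recovered fraction stops growing after 1 mismatch, as it does in this toy input, there is little to gain from a looser setting.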

Good luck!
