I recently received some data from an Illumina HiSeq lane, in which the reads are all supposed to start with one of ten different 6bp barcodes. But when I try to split up the data by barcode, more than half of the reads don't actually match any of the barcodes. I imagine that this is due to the relatively high error rate of HiSeq during the first few cycles.
How do you usually deal with this? I assume one option would be some sort of fuzzy matching between the observed sequence and the barcodes (e.g. allowing one mismatched nucleotide), although I'm afraid this might introduce a whole new kind of bias. Is there any software out there for this?
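To make the fuzzy-matching idea concrete, here is a minimal Python sketch of what I have in mind. It is not taken from any existing tool: the function names (`hamming`, `assign_barcode`), the `max_mismatches` parameter, and the example barcodes are all made up for illustration. It assumes the barcodes all have the same length and sit at the very start of the read, and it only counts substitutions (no indels). A read is assigned only when exactly one barcode lies within the allowed distance; ambiguous reads are dropped.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read, barcodes, max_mismatches=1):
    """Return the single barcode within max_mismatches of the read's prefix.

    Returns None when no barcode is close enough, when more than one is
    (ambiguous), or when the read is shorter than the barcode.
    """
    if not barcodes:
        return None
    length = len(barcodes[0])  # assumption: all barcodes share one length
    prefix = read[:length]
    if len(prefix) < length:
        return None
    hits = [bc for bc in barcodes if hamming(prefix, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcodes, chosen only for the example.
barcodes = ["ACGTAC", "TGCATG", "GATTCA"]
print(assign_barcode("ACGTACGGGG", barcodes))  # exact match
print(assign_barcode("ACGAACGGGG", barcodes))  # one mismatch in the barcode
print(assign_barcode("AAAAAAGGGG", barcodes))  # too far from every barcode
```

Whether one mismatch is safe presumably depends on how far apart the barcodes are from each other: if every pair differs in at least three positions, a read with a single error can only be within one mismatch of one barcode, so the ambiguity check above should never trigger.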