I have recently obtained some Illumina sequencing data (amplicon-seq) where each of the sequences that were amplified is characterised by a distinct (unique) barcode (an 8-mer).
With help from people here at Biostars, I extracted the different barcodes. My question, for anyone who has done this or has seen a paper describing this kind of analysis, is the following:
Suppose you do not know the barcodes beforehand and, like me, you end up with, say, 5000 different 8-mers. Obviously, some of them are very frequent and some are not. So far, my only way of deciding whether a given 8-mer I extracted is an actual barcode is to rely on its frequency and on qPCR estimates of the number of integrations.
But is this actually correct? I will add 2 examples:
The first cell line gave me 40 barcodes with frequencies of 60,000 - 130,000. The next one in line had a frequency of 1,500, for example. I then said, "OK, since the drop from 60,000 to 1,500 is very big, probably all 8-mers with a frequency below 60,000 are PCR artifacts." Do you think this is correct? I would like to know whether there is a publication describing how to select/set such cut-offs.
Another experiment was more ambiguous, since there the barcode frequencies dropped off "smoothly", like 100,000 - 90,000 ... 10,000, 8,000 ... So there was no huge jump. What do you do in that case?
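To make the two cases concrete, here is a minimal Python sketch of the "look for the big jump" reasoning. It ranks the 8-mer counts and reports the largest fold-drop between neighbouring ranks. The function name `largest_fold_drop`, the `min_ratio` threshold of 10 and all counts other than the ones quoted above are made up for illustration; this is not a published or validated method, only a way to see how the heuristic behaves on each kind of data.

```python
def largest_fold_drop(counts):
    """Return (n_kept, ratio) for the biggest fold-drop between neighbouring ranks.

    n_kept is the number of top-ranked 8-mers sitting above the drop.
    """
    ranked = sorted(counts, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two counts")
    best_rank, best_ratio = 0, 0.0
    for i in range(len(ranked) - 1):
        # Guard against division by zero for 8-mers with a count of 0.
        ratio = ranked[i] / max(ranked[i + 1], 1)
        if ratio > best_ratio:
            best_rank, best_ratio = i + 1, ratio
    return best_rank, best_ratio


# Illustrative counts only: 40 abundant 8-mers, then a sharp drop to 1,500.
clear = [130_000 - i * 1_750 for i in range(40)] + [1_500, 900, 300, 40, 5]

# Illustrative counts only: a smooth decline with no obvious jump.
smooth = [100_000, 90_000, 75_000, 60_000, 45_000,
          30_000, 20_000, 14_000, 10_000, 8_000]

min_ratio = 10  # arbitrary assumption: smaller drops are treated as "no clear jump"

for name, counts in [("clear", clear), ("smooth", smooth)]:
    n_kept, ratio = largest_fold_drop(counts)
    verdict = "cut-off looks defensible" if ratio >= min_ratio else "ambiguous"
    print(f"{name}: keep top {n_kept}, fold-drop {ratio:.1f} -> {verdict}")
```

On the first data set the largest drop falls right after the 40th 8-mer, which matches the cut-off chosen by eye. On the smooth data set no drop comes close to the threshold, so a frequency jump alone does not justify a cut-off there and some other evidence (for example the qPCR estimate) would be needed.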
Any help, idea, or publication that you might have come across would be very valuable!