Hi all,
I have a very diverse library of sequencing data. However, because of the nature of the sequencing machine, it is hard to say whether two similar sequences come from the same original sequence (and differ only through technical error) or are genuinely different because of the (random) design.
One could set a threshold on the Hamming distance, so that any pair differing by less than, say, 2 is treated as the result of technical error. But from my understanding, differences at a few bases can also be due to mutation during culture (a biological effect) or many other factors. Also, setting a threshold is subjective.
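To make the threshold idea concrete, here is a minimal Python sketch of what I mean. The function names, the example reads, and the cutoff of 2 are only illustrative assumptions, and it assumes all sequences have the same length.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def pairs_below_threshold(seqs, threshold=2):
    """Return (seq_a, seq_b, distance) for every pair with distance < threshold."""
    close = []
    for a, b in combinations(seqs, 2):
        d = hamming(a, b)
        if d < threshold:
            close.append((a, b, d))
    return close


# Made-up example reads, not real data.
reads = ["ACGTACGT", "ACGTACGA", "TTTTACGT"]
close_pairs = pairs_below_threshold(reads, threshold=2)
print(close_pairs)
```

This compares every pair, so it scales quadratically with the number of reads, and it treats every mismatch the same regardless of its likely cause.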
Can anyone suggest an approach, or a model that takes some parameters, that I could use to check this?