I am working on a project where we sample short (8-20 bp), highly repetitive regions in tissue samples and compare them to detect mutations (primarily indels). One challenge is that our sequencing approach (next-generation sequencing, using molecular tags to correct PCR errors) yields a distribution of alleles for each sample rather than a single consensus allele. What kinds of mathematical models are commonly used for comparing these sorts of distributions? (We have a technique that works reasonably well, but suspect there must be something better.) Beyond this, is there a good methodology for discarding samples whose read counts are too low to be trustworthy? (We want to avoid false positives.)
Below is an example. There are 8 different sources (A-H) and 2 samples from each (1-2). At least one sample from one of the sources carries a real mutation, and in at least one sample from one of the sources a false mutation is usually called because of a low read count.
Genotype A-1 A-2 B-1 B-2 C-1 C-2 D-1 D-2 E-1 E-2 F-1 F-2 G-1 G-2 H-1 H-2
ACCCCCCCCCCC 6 11 4 7 8 18 10 7 3
ACCCCCCCCCCCC 21 4 57 34 29 37 14 59 79 19
ACCCCCCCCCCCCC 4 2 3 3 2
ACCCCCCCCCCTC 2
ACCCGCCCCCCCCGCC 25 2 9 12 27 20 70 33 15 4 14 34 36 16
CCCCCCCCCCCC 2 7 18 2 9 12
CCCCCCCCCCCCC 3 21 11 2 17 23
CCCCCCCCCCCCCC 2
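For concreteness, one commonly used option for the two questions above is a likelihood-ratio (G) test comparing the paired count vectors, combined with a simple total-read-count cutoff. The sketch below is illustrative only: the `MIN_READS` threshold, the pseudocount, and the helper names are my assumptions, not anything from our pipeline.

```python
import math

# MIN_READS is an assumed depth cutoff, not a recommended value; it should be
# tuned on known-negative sample pairs.
MIN_READS = 30

def g_statistic(counts_a, counts_b, pseudocount=0.5):
    """Likelihood-ratio (G) statistic comparing two allele-count vectors
    over the same ordered set of genotypes. Under the null hypothesis that
    both samples draw from one shared allele distribution, G is roughly
    chi-squared distributed with (number of genotypes - 1) degrees of
    freedom. The pseudocount keeps zero counts from blowing up the logs."""
    a = [c + pseudocount for c in counts_a]
    b = [c + pseudocount for c in counts_b]
    n_a, n_b = sum(a), sum(b)
    g = 0.0
    for c_a, c_b in zip(a, b):
        col = c_a + c_b
        # Expected counts if both samples shared one allele distribution.
        e_a = col * n_a / (n_a + n_b)
        e_b = col * n_b / (n_a + n_b)
        g += 2.0 * (c_a * math.log(c_a / e_a) + c_b * math.log(c_b / e_b))
    return g

def enough_reads(counts, min_reads=MIN_READS):
    """Crude depth filter: skip any sample whose total read count falls
    below min_reads before calling a mutation."""
    return sum(counts) >= min_reads

# Usage on two hypothetical count vectors (one count per genotype row,
# same genotype order in both vectors):
s1 = [6, 21, 4, 0, 25, 0, 0, 0]
s2 = [11, 4, 2, 0, 2, 7, 21, 0]
if enough_reads(s1) and enough_reads(s2):
    print("G =", round(g_statistic(s1, s2), 2))
else:
    print("skip: too few reads")
```

Identical vectors give G = 0, and divergence grows with G; the statistic can be compared against a chi-squared quantile for a p-value. This is only a starting point, so I would still like to hear what models others use.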
Thanks for any advice!