I have an experiment with two types of controls and one treatment. For each, I have R1 and R2 NGS reads, which I have mapped to 70,000 short sequences in a BLAST database. I have summarized the results by counting the number of duplicates mapped to each reference sequence.
Example data:
Ref.seq   Cnl.A.R1   Cnl.A.R2   Cnl.B.R1   Cnl.B.R2   Trt.R1   Trt.R2
NM_001    10         9          40         56         323      212
NM_002    36         29         143        70         128      116
NM_003    430        390        3285       1933       112831   102009
Most duplicate counts are close to zero. A few are quite large.
I would like to determine a confidence score for each reference gene reflecting the probability that the number of duplicates in the treatment group is larger than in either of the controls. Can the spread between the R1 and R2 readings be used to make such a score? Obviously, n=2 is very small. Can the mean R1 - R2 spread across all genes be used, even though the spread increases with the magnitude of the count?
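To make the question concrete, here is a minimal Python sketch of the kind of score I have in mind. It is only an illustration, and it rests on assumptions that go beyond what I describe above: counts are moved to a log2(count + 1) scale so that the R1 - R2 spread no longer grows with the magnitude of the count, the spread is pooled across all genes and groups, and the treatment is compared with whichever control has the higher mean. The function names (`log2p1`, `pooled_sd`, `scores`) are made up for this example, and the result is a rough z-like score, not a calibrated probability.

```python
import math

# Example counts copied from the table above, as (R1, R2) per group.
counts = {
    "NM_001": {"Cnl.A": (10, 9), "Cnl.B": (40, 56), "Trt": (323, 212)},
    "NM_002": {"Cnl.A": (36, 29), "Cnl.B": (143, 70), "Trt": (128, 116)},
    "NM_003": {"Cnl.A": (430, 390), "Cnl.B": (3285, 1933), "Trt": (112831, 102009)},
}


def log2p1(x):
    """log2(count + 1); the +1 keeps zero counts finite (an assumption)."""
    return math.log2(x + 1)


def pooled_sd(counts):
    """Pool the R1 - R2 spread over every gene and group on the log scale.

    For a pair of values the sample variance is d**2 / 2, where d is
    their difference, so the per-pair variances are simply averaged.
    """
    variances = []
    for groups in counts.values():
        for r1, r2 in groups.values():
            d = log2p1(r1) - log2p1(r2)
            variances.append(d * d / 2)
    return math.sqrt(sum(variances) / len(variances))


def scores(counts):
    """Z-like score per gene: treatment mean minus the higher control mean."""
    sd = pooled_sd(counts)
    out = {}
    for gene, groups in counts.items():
        means = {g: (log2p1(a) + log2p1(b)) / 2 for g, (a, b) in groups.items()}
        top_control = max(means["Cnl.A"], means["Cnl.B"])
        # Each mean is based on n = 2, so the standard error of the
        # difference is sd * sqrt(1/2 + 1/2) = sd.
        out[gene] = (means["Trt"] - top_control) / sd
    return out


if __name__ == "__main__":
    for gene, z in scores(counts).items():
        print(gene, round(z, 2))
```

Pooling is what lets n = 2 per group give a usable spread estimate at all, but it assumes the log-scale spread is roughly the same for every gene, which is unlikely to hold for counts near zero. Picking the higher of the two controls also biases the score downward, so I would treat it as a ranking heuristic at best.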
I would very much appreciate any suggestions, as well as any references. Thanks.