Hello everyone,
I was wondering if you could help me with an analysis I've been doing on some alleles. My project consists of determining the presence/absence of 5 different alleles in malaria; however, I've found it hard to set a threshold for deciding which alleles are actually present and which are noise. For my analysis, I took a set of 20 .fastq files and ran them through SRST2 (similar to MLST) along with a database of my allele references. I obtained something like this:
Sample #; Allele_name; Coverage; Depth; Diffs; Divergence; Length; maxMAF
17; FC27; 100%; 7743.733; 8snp; 2.439%; 328bp; 0.054
17; MAD20; 100%; 5671.843; 15snp; 4.808%; 312bp; 0.487
17; 3D7; 100%; 647.33; 3snp; 0.488%; 615bp; 0.493
17; R033; 100%; 5.043; no differences; 0%; 162bp; 0.5
As you can see, the results for sample 17 show four potential alleles. However, I am not sure what threshold to use to decide whether allele 3D7 is present, given that it has 647 reads covering the area, or whether that is too low to say it is actually there. Some other helpful information:
Sample 17 has an average of 44,424 reads with a mean size of 207 bp. The sizes of the alleles are as follows:
3D7 = 625 bp; FC27 = 611 bp; MAD20 = 316 bp; R033 = 162 bp
Thanks a lot!
I am not familiar with the program; however, the default for depth appears to be 5:
Taken from: https://github.com/katholt/srst2
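To make the cutoff concrete, here is a minimal sketch that applies a depth threshold of 5 to the sample 17 rows posted in the question. The function names are hypothetical, the rows are typed in by hand rather than parsed from real SRST2 output, and I am assuming the cutoff is inclusive (depth >= 5 passes). The second helper, which expresses each depth as a fraction of the deepest allele, is my own addition for comparison and is not something SRST2 reports.

```python
# Rows copied by hand from the table in the question (sample 17):
# (allele name, mean depth).
ROWS = [
    ("FC27", 7743.733),
    ("MAD20", 5671.843),
    ("3D7", 647.33),
    ("R033", 5.043),
]

MIN_DEPTH = 5.0  # depth cutoff mentioned above; assumed inclusive


def flag_by_depth(rows, min_depth=MIN_DEPTH):
    """Map each allele to True if its mean depth is >= min_depth."""
    return {allele: depth >= min_depth for allele, depth in rows}


def fraction_of_top(rows):
    """Map each allele to its depth divided by the largest depth seen."""
    top = max(depth for _, depth in rows)
    return {allele: depth / top for allele, depth in rows}


if __name__ == "__main__":
    passes = flag_by_depth(ROWS)
    relative = fraction_of_top(ROWS)
    for allele, depth in ROWS:
        print(f"{allele:6s} depth={depth:9.3f} "
              f"passes={passes[allele]} relative={relative[allele]:.4f}")
```

With these numbers every allele passes a cutoff of 5, including R033 at 5.043, so the default alone will not separate a real low-abundance allele from noise.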
Hi! Thanks for the answer! However, what I'm really after is a threshold: how many reads do I need before I can state that allele 'x' is present and is not noise or contamination? I'm not sure that getting more than 5 reads (let's say 6) would be significant. Do you know what sort of statistics I could use, or whether hypothesis testing would be useful for this?
Our empirical data from the National Health Service (NHS) England indicated that the minimum position read depth at which you should be calling a variant is 18. This is for germline variants that would be heterozygous or homozygous. If it's a heterozygous call, then, for example, reads could be distributed 10 and 8.
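One way to see why depth matters is a simple binomial model. The sketch below is only an illustration under stated assumptions, not the method used to derive the figure of 18: it assumes each read at a position independently carries the variant allele with probability p (0.5 for a heterozygous germline site), and the minimum of 4 supporting reads in the example is a hypothetical choice of mine. For a mixed malaria sample, p would be the unknown fraction of that allele rather than 0.5.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p).

    n: read depth at the position
    k: minimum number of reads supporting the variant allele
    p: assumed probability that a single read carries the variant
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical requirement: at least 4 reads must support the variant.
    for depth in (6, 18):
        print(f"depth={depth:2d}  P(>=4 supporting reads)="
              f"{prob_at_least(4, depth):.4f}")
```

Under these assumptions, a true heterozygous variant is much more likely to reach the supporting-read minimum at depth 18 than at depth 6, which is the intuition behind asking for a minimum depth before calling.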