Question

Writing Script That Compares Results From Multiple Sample Snp Callers By Calculating Ti/Tv Ratio

0

Entering edit mode

11.4 years ago

mubozhi ▴ 20

Dear All,

I've ran 9 different multiple-sample SNP callers on 50 exome samples from the 1,000 Genomes Project and I want to compare the results to see which SNP caller is doing the best job. I've already calculated SNP call concordance values with dbSNP and I think the next step is to calculate transition/transversion (Ti/Tv) ratios. I realize that there are many programs (vcftools, snpEff, etc.) that will calculate the Ti/Tv ratio from the VCF output of a SNP caller, but not all the SNP callers I used output VCF files. Thus I want to write my own script, that will be able to compute the Ti/Tv ratio for all the 9 SNP callers.

The output of each SNP caller usually includes the location of the SNPs, the samples which have these SNPs and the SNP genotype (different from reference) of the samples with these SNPs, amongst other information. Counting the exact number of transitions and transversions of a specific SNP caller's output seems complicated for 2 reasons:

What if two samples are called to have SNPs at the same location, yet they are genotyped differently? i.e. Given that the reference base is A and sample 1 was genotyped as a C and sample 2 was genotyped as a G, is that both 1 transition and 1 transversion?
What about heterozygous SNP calls (i.e. given a reference of A, sample 1 was genotyped AC)? Would these count as either a transition or a transversion, or are heterozygous SNPs generally excluded from Ti/Tv calculations?

I'm assuming that once these issues are resolved, then the Ti/Tv ratio is calculated to be just the total number (over all samples) of transition SNPs divided by the total number (again, over all samples) of transversion SNPs of a SNP caller's output. Is this true?

Thank you!

snp • 5.8k views

ADD COMMENT • link updated 3.1 years ago by Ram 43k • written 11.4 years ago by mubozhi ▴ 20

Ram · Answer 1 · 2012-11-12

1) I'm not sure if there is a general practice, but I have always counted this as two events. 2) Yes, you include heterozygous calls. There is no change in counting for them either. If the reference is A and you have AC and CC, both would count as one transversion. 3) There are two ways to do this. One is to simply look at all of your called sites and calculate the Ti/Tv of those sites, regardless of how many samples have any individual variant. I believe this is generally how the quality of calls for a given cohort is determined (note that this makes #2 irrelevant). The other way is to weight each variant by the number of times it was called across the cohort (this is what you have written I think). I don't believe the latter is wrong, but it won't give you as sensitive a read on the calls being generated by your caller as the former method.

Hope this helps.

Ram · Answer 2 · 2014-10-09

Tri-allelic loci are so rare that you should not have to account for these. You would have to have very strong balancing selection (i.e. on an immune gene) to retain three bases at a site and I would cautious of accepting these.
This is the scenario you would always expect to observe because your sites should be bi-allelic (with respect to that base position)