I've ran 9 different multiple-sample SNP callers on 50 exome samples from the 1,000 Genomes Project and I want to compare the results to see which SNP caller is doing the best job. I've already calculated SNP call concordance values with dbSNP and I think the next step is to calculate transition/transversion (Ti/Tv) ratios. I realize that there are many programs (vcftools, snpEff, etc.) that will calculate the Ti/Tv ratio from the VCF output of a SNP caller, but not all the SNP callers I used output VCF files. Thus I want to write my own script, that will be able to compute the Ti/Tv ratio for all the 9 SNP callers.
The output of each SNP caller usually includes the location of the SNPs, the samples which have these SNPs and the SNP genotype (different from reference) of the samples with these SNPs, amongst other information. Counting the exact number of transitions and transversions of a specific SNP caller's output seems complicated for 2 reasons:
1) What if two samples are called to have SNPs at the same location, yet they are genotyped differently? i.e. Given that the reference base is A and sample 1 was genotyped as a C and sample 2 was genotyped as a G, is that both 1 transition and 1 transversion?
2) What about heterozygous SNP calls (i.e. given a reference of A, sample 1 was genotyped AC)? Would these count as either a transition or a transversion, or are heterozygous SNPs generally excluded from Ti/Tv calculations?
I'm assuming that once these issues are resolved, then the Ti/Tv ratio is calculated to be just the total number (over all samples) of transition SNPs divided by the total number (again, over all samples) of transversion SNPs of a SNP caller's output. Is this true?