I'm a PhD student working with whole genome data from multiple individuals from 3 populations. All the sequencing is low-coverage and combined in a vcf file. I have also created population specific vcf files for the 3 popns using GATK select variants.
I would like to compare 2 of the populations to create a file that contains only substitutions (fixed differences) between the populations. So where they are homozygous for different alleles. I have looked for a way to do this in GATK and vcf tools but found none. Unfotuntately nobody in my group has experience doing this sort of thing. If anyone can recommend a program that can do this kind of thing or can write a basic script that would be really helpful, I have only basic scripting experience in python.
It seems to me that custom scripting is the best way to go to do population genetic types of analyses on whole genome data, I haven't found any good packages, but I'll explore biopython in more detail.
Some specific details that might help:
--all samples from each population share same name in sample id, ie. Ta1, Ta2, Ta3 in one pop and Ag1, Ag2, Ag3 in another pop.
--I have low coverage so would be good to have a filter that only calls a variant as a substitution if there is data from atleast 10 samples from each population support this (I have 20 samples in each population). (so not calling a Substitution if only 9 individuals have reads at this site and also not calling it if 15 samples support it but one individual has a different variant-i'd rather exclude these false negatives for now to keep false positive rate low too)
Please let me know if this question is badly phrased so I can improve it.
Any help at all is greatly appreciated!
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Agg50 Agg51 Agg52 Agg53 Agg54 Agg55 Agg56 Agg57 Agg59 Agg61 Agg62 Agg63 Agg64 Agg65 Agg66 Agg70 Agg73 Agg77 Agg78 Agg79 PhiX Tar11 Tar12 Tar13 Tar14 Tar15 Tar16 Tar17 Tar18 Tar19 Tar20 Tar21 Tar22 Tar23 Tar24 Tar25 Tar26 Tar27 Tar28 Tar39 Tar40 Will41 Will42 Will43 Will44 Will45 Will46 Will47 Will48 Will49 Will50 conflict unknown wrong chr10 48 . G T 432.37 PASS AC=30;AF=1.000;AN=30;DP=16;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.1893;MQ=38.84;MQ0=0;QD=27.02 GT:AD:DP:GQ:PL ./. ./. 1/1:0,2:2:5.99:53,6,0 1/1:0,1:1:3:28,3,0