I would like to follow this paper to do the analysis of Genome-wide DNA polymorphisms by using four cultivars re-sequencing data. I would like to make this kind of table and picture in this paper Table 2 Frequency of SNPs and InDels detected in PSRR-1, Bengal, and Nona Bokra., Figure 1 Genome-wide discovery of DNA polymorphisms by whole genome sequencing differentiates weedy and cultivated rice Thus I'm curious about how to compare the SNP or Indel number between two cultivars in a combined VCF file.
If I'm understanding it correctly, this boils down to first identifying all sites that are polymorphic between two cultivars (ie, an allele present in one cultivar that is absent in the other). You can filter a VCF file using expressions in the bcftools filter program based on genotypes and on polymorphism type (SNP or INDEL) to create a file that only contains the differences between cultivars of the type you are looking for. From there you can just count the number of non-header lines in the file to get the number of polymorphisms, or run window-based analyses using something like bedtools makewindows and bedtools count.
Assuming the simple case that you only have two individuals per VCF file, one of each cultivar type, and that these individuals are diploid, you can use something like this to extract all polymorphic sites:
bcftools filter --include '(FORMAT/GT = "hom" && FORMAT/GT = "het") || (FORMAT/GT = "het" && FORMAT/GT = "hom") || (FORMAT/GT = "AA" && FORMAT/GT = "RR") || (FORMAT/GT = "RR" && FORMAT/GT = "AA")' input_file.vcf > output_polymorphic_file.vcf
Separately you can pull out SNPs or INDELs using:
bcftools filter --include 'TYPE="snp"' input_file.vcf > output_file.snps_only.vcf bcftools filter --include 'TYPE="indel"' input_file.vcf > output_file.indels_only.vcf