We have finished downloading the massive Simons Genome Diversity Project dataset after a year's effort. It consists of one VCF per chromosome for 260 samples. Each VCF includes one line per nucleotide position, so reference calls and no-calls are explicitly reported.
It appears to me that indels are not described in these VCFs. Are they described in other files, or are they just not reported?
Below I illustrate with an insertion that is present in the pilot release of 25 genomes, but is absent from the VCF we just downloaded for the same sample.
Insertion 10:118916-118915 is seen in sample HGDP01284 in the VCF for the pilot project (to avoid visual clutter, only the first 7 fields are shown):
tabix HGDP01284.hg19_1000g.10.mod.vcf.gz 10:118915-118919
10 118915 . A . 78.14 .
10 118915 . A AG 299.55 .
10 118916 . G . 78.15 .
10 118917 . G . 75.13 .
10 118918 . G . 78.11 .
10 118919 . A G 8.07 LowQual
Here is the same region for the same sample in the VCF provided with the full dataset. I do not see anything to suggest an insertion at 118915:
zcat HGDP01284.10.filtered.vcf.gz | grep -A4 -m1 118915
10 118915 . A . 38.99 . AN=2;BaseCounts=14,0,0,0;DP=14;GC=35.66;MQ=34.80;MQ0=1;FL=-1 GT:DP 0/0:14
10 118916 . G . 38.99 . AN=2;BaseCounts=0,0,14,0;DP=14;GC=35.66;MQ=34.80;MQ0=1;FL=-1 GT:DP 0/0:14
10 118917 . G . 35.99 . AN=2;BaseCounts=0,0,14,0;DP=14;GC=35.66;MQ=34.80;MQ0=1;FL=-1 GT:DP 0/0:14
10 118918 . G . 38.99 . AN=2;BaseCounts=0,0,15,0;DP=15;GC=35.41;MQ=34.95;MQ0=1;FL=-1 GT:DP 0/0:15
10 118919 rs201347354 A G 41.01 . AC=1;AF=0.500;AN=2;BaseCounts=11,0,3,0;BaseQRankSum=0.623;DB;DP=15;Dels=0.07;FS=0.000;GC=35.16;HaplotypeScore=7.9923;MLEAC=1;MLEAF=0.500;MQ=34.95;MQ0=1;MQRankSum=-0.934;QD=2.73;ReadPosRankSum=0.311;FL=-1 GT:AD:DP:GQ:PL 0/1:11,3:14:41:41,0,348
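To check the whole file rather than a single region, here is a rough sketch of a filter that prints only indel records. It treats a record as an indel when any ALT allele differs in length from REF, and it assumes tab-separated fields with REF and ALT in columns 4 and 5, as in the VCF specification. The function name find_indels is made up for illustration.

```shell
# Hypothetical helper: print VCF records where REF and any ALT allele
# differ in length (i.e. insertions or deletions).
# Reads VCF text on stdin; assumes tab-separated fields (REF = $4, ALT = $5).
find_indels() {
    awk -F'\t' '
        /^#/      { next }   # skip header lines
        $5 == "." { next }   # skip reference calls and no-calls
        {
            n = split($5, alts, ",")                        # handle multi-allelic sites
            for (i = 1; i <= n; i++) {
                if (substr(alts[i], 1, 1) == "<") continue  # ignore symbolic alleles
                if (length(alts[i]) != length($4)) { print; break }
            }
        }'
}
```

For example, the following should print nothing if the file really contains no indels:
zcat HGDP01284.10.filtered.vcf.gz | find_indels | head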
Many thanks for any help.