I have annotated the 1000 Genomes phase 3 data using ANNOVAR, and separately using the GENCODE annotation (available here: http://www.internationalgenome.org/category/annotation/), and the two approaches give wildly different numbers of variants. Here are the steps I used to parse the data:
Filter out MNPs and Indels using bcftools.
Use vcftools to remove all variants that do not lie within the exon coordinates listed in BED files (obtained using biomaRt).
At this point I annotated using ANNOVAR. Then, for both the ANNOVAR and GENCODE annotations, I filtered the variants to keep only stop lost, stop gained, synonymous and non-synonymous variants, taking care to use the correct terminology in each case (GENCODE uses the Ensembl consequence terms (https://www.ensembl.org/info/genome/variation/predicted_data.html), whilst ANNOVAR does not).
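The term matching in the last step can be sketched roughly as follows. This is only an illustration: the ANNOVAR labels and the Ensembl (Sequence Ontology) terms below are assumed from each tool's usual output and are not necessarily the exact strings used in the filtering described here. The function names are hypothetical.

```python
# Assumed correspondence between ANNOVAR exonic-function labels and
# Ensembl / Sequence Ontology consequence terms (illustrative only).
ANNOVAR_TO_SO = {
    "synonymous SNV": "synonymous_variant",
    "nonsynonymous SNV": "missense_variant",
    "stopgain": "stop_gained",
    "stoploss": "stop_lost",
}
KEPT_SO_TERMS = set(ANNOVAR_TO_SO.values())


def keep_annovar(exonic_func: str) -> bool:
    """Return True if an ANNOVAR exonic-function label is one of the kept classes."""
    return exonic_func in ANNOVAR_TO_SO


def keep_so(consequence: str) -> bool:
    """Return True if any term in an Ensembl-style consequence string is kept.

    Several terms for one transcript are assumed to be joined with '&'.
    """
    return any(term in KEPT_SO_TERMS for term in consequence.split("&"))
```

Matching on whole terms rather than substrings avoids, for example, "nonsynonymous SNV" being caught by a search for "synonymous".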
At this point I would expect to have two annotated data sets of roughly the same size. However, the ANNOVAR-annotated data set contains roughly 900,000 variants, whilst the GENCODE data set contains only 55,400.
This is a huge discrepancy, and I am not sure which of the two data sets, if either, I should use for my analysis.
As an aside, for the GENCODE data set I filtered variants by the consequence reported for the first listed transcript only.
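One way to check how much the first-transcript choice affects the count is to compare it against keeping a variant when any transcript has a matching consequence. The sketch below assumes a VEP-style annotation string in which transcripts are separated by commas, subfields by "|", and multiple terms by "&". The position of the consequence subfield (`consequence_index`) and the example records are hypothetical and would need to be adjusted to the actual file header.

```python
KEPT = {"synonymous_variant", "missense_variant", "stop_gained", "stop_lost"}


def transcript_consequences(csq_value, consequence_index=4):
    """Yield the set of consequence terms for each transcript entry.

    consequence_index is an assumed subfield position, not taken from the file.
    """
    for entry in csq_value.split(","):
        fields = entry.split("|")
        if len(fields) > consequence_index:
            yield set(fields[consequence_index].split("&"))


def passes_first(csq_value, consequence_index=4):
    """Keep the variant only if the first listed transcript matches."""
    first = next(transcript_consequences(csq_value, consequence_index), set())
    return bool(first & KEPT)


def passes_any(csq_value, consequence_index=4):
    """Keep the variant if any listed transcript matches."""
    return any(
        terms & KEPT
        for terms in transcript_consequences(csq_value, consequence_index)
    )


# Made-up annotation strings, for illustration only.
records = [
    "A|GENE1|T1|Transcript|intron_variant,A|GENE1|T2|Transcript|missense_variant",
    "G|GENE2|T3|Transcript|synonymous_variant",
    "C|GENE3|T4|Transcript|upstream_gene_variant",
]
n_first = sum(passes_first(r) for r in records)
n_any = sum(passes_any(r) for r in records)
print(n_first, n_any)
```

If the two counts differ substantially on the real data, the first-transcript rule accounts for at least part of the gap. If they are similar, the difference lies elsewhere.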