I performed multisample variant calling on 81 non-human mammal samples using GATK. I want to do allele-specific expression analysis on the 4 of those 81 samples for which I have RNA-seq data. To do that, I extracted the biallelic heterozygous variants for each of those 4 samples from the multisample VCF file into separate per-sample VCFs, and removed variants with a genotype quality (GQ) Phred score < 40.
When I compared the statistics of the remaining variants, one sample had between 300,000 and 400,000 variants remaining on chromosome 1, whereas each of the other samples had >500,000 variants remaining on chromosome 1.
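For concreteness, the selection and per-chromosome count described above can be sketched roughly as follows. This is a minimal illustration rather than the exact commands that were run: the function name `count_het_sites` is hypothetical, "biallelic" is assumed to mean a single ALT allele at the site in the multisample VCF, and GQ is assumed to be present as an integer FORMAT field.

```python
from collections import Counter


def count_het_sites(vcf_lines, sample, min_gq=40):
    """Count biallelic heterozygous calls with GQ >= min_gq per chromosome.

    vcf_lines: iterable of text lines from an uncompressed multisample VCF.
    sample:    sample name exactly as it appears in the #CHROM header line.
    """
    counts = Counter()
    sample_col = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            sample_col = fields.index(sample)  # column holding this sample
            continue
        if sample_col is None:
            raise ValueError("no #CHROM header line before the records")
        chrom, alt, fmt = fields[0], fields[4], fields[8]
        # Assumption: biallelic means exactly one ALT allele at the site.
        if "," in alt or alt in (".", "*"):
            continue
        call = dict(zip(fmt.split(":"), fields[sample_col].split(":")))
        genotype = call.get("GT", ".").replace("|", "/")
        if genotype not in ("0/1", "1/0"):
            continue  # keep heterozygous ref/alt calls only
        gq = call.get("GQ", ".")
        if gq == "." or int(gq) < min_gq:
            continue  # drop missing or low genotype quality
        counts[chrom] += 1
    return counts
```

With a file handle opened in text mode (for a bgzipped VCF, `gzip.open(path, "rt")`), calling `count_het_sites(handle, "sampleA")["1"]` would give the chromosome 1 count for a sample named `sampleA`; both the sample name and the chromosome label `"1"` are placeholders here.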
Can there be a biological explanation for this? What additional analyses could help explain this difference? Please let me know if any more information is needed.