I received a PLINK/VCF file containing many samples that were genotyped on a range of SNP chips, differing both in size (from 50K to 1.5 million SNPs) and in platform (different companies).
I need to find common SNPs across samples.
The SNP IDs in the file are in CHR:BP format, so I cannot use them to infer which SNP comes from which platform. My first idea was to filter out every SNP with a missing genotype call (./.), but when I did this I ended up with a very small number of SNPs in common. Also, some individuals may have been genotyped for a SNP but ended up with a missing call, so I don't think this is a good way to do it. I also tried PLINK's --missing option, which reports the missing genotype rate per individual and per SNP. This is informative, but I need to know exactly which SNPs are common across individuals.
Is there a way to find this out?
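To make the filtering idea concrete, here is a minimal Python sketch of it, not a definitive implementation. It assumes an uncompressed, tab-delimited VCF with GT as the first FORMAT field, and the names `site_call_rates`, `common_sites` and `min_call_rate` are made up for this example. With `min_call_rate=1.0` it reproduces the strict "no ./. anywhere" filter; lowering the threshold tolerates a few failed calls per SNP.

```python
import io

# Tiny made-up VCF used only to illustrate the idea (4 samples, 3 sites).
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    "1\t1000\t1:1000\tA\tG\t.\t.\t.\tGT\t0/0\t0/1\t1/1\t0/0",
    "1\t2000\t1:2000\tC\tT\t.\t.\t.\tGT\t0/0\t./.\t0/1\t0/0",
    "1\t3000\t1:3000\tG\tA\t.\t.\t.\tGT\t./.\t./.\t0/1\t./.",
])


def is_missing(sample_field):
    """True if every allele in the GT sub-field is '.' (e.g. './.' or '.')."""
    gt = sample_field.split(":")[0]  # assumes GT is the first FORMAT key
    alleles = gt.replace("|", "/").split("/")
    return all(allele == "." for allele in alleles)


def site_call_rates(vcf_handle):
    """Yield (CHR:BP, fraction of samples with a called genotype) per record."""
    for line in vcf_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        genotypes = fields[9:]  # sample columns start at column 10
        if not genotypes:
            continue
        called = sum(not is_missing(g) for g in genotypes)
        yield f"{fields[0]}:{fields[1]}", called / len(genotypes)


def common_sites(vcf_handle, min_call_rate=1.0):
    """Return the sites called in at least min_call_rate of the samples."""
    return [site for site, rate in site_call_rates(vcf_handle)
            if rate >= min_call_rate]


if __name__ == "__main__":
    print(common_sites(io.StringIO(EXAMPLE_VCF)))        # strict filter
    print(common_sites(io.StringIO(EXAMPLE_VCF), 0.75))  # tolerate some ./.
```

For a real file, the same functions can be given an open file handle instead of `io.StringIO`.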
If you can share part of the data, along with what you have tried and the final output you expect, that would help people troubleshoot.
The vcf file looks like this:
It sounds like you already have the answer: there are very few SNPs that are genotyped in all samples. Or perhaps I am missing something?
I am interested in whether the reason might be not that individuals were genotyped on different SNP chips, but that genotype calls for genotyped SNPs are missing. In both cases I would see ./. in the genotype column, as I do now, right? How can I tell whether a missing call (./.) means the sample was never genotyped for that SNP, or that the sample was genotyped for it but the call failed (for example, because of low genotyping quality)? Is there a way to find this out from the VCF file?
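One heuristic that might help separate the two cases, sketched in Python under loud assumptions: samples from the same chip should share almost the same set of called sites, so samples can be grouped by the similarity of their missingness patterns, and a ./. at a site that the rest of the group has called is more likely a failed call than an absent probe. This is only an illustration, not an established method. The function names, the greedy grouping, and the thresholds (`min_similarity`, `min_group_rate`) are all invented for this example, and the VCF is assumed to be uncompressed with GT as the first FORMAT field.

```python
# Made-up data: S1/S2 on "chip A", S3/S4 on "chip B"; S2 has one failed call.
CHIP_EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\tS4",
    "1\t100\t1:100\tA\tG\t.\t.\t.\tGT\t0/0\t0/1\t0/0\t1/1",
    "1\t200\t1:200\tA\tG\t.\t.\t.\tGT\t0/1\t./.\t0/0\t0/1",
    "1\t300\t1:300\tA\tG\t.\t.\t.\tGT\t0/0\t0/0\t./.\t./.",
    "1\t400\t1:400\tA\tG\t.\t.\t.\tGT\t1/1\t0/1\t./.\t./.",
    "1\t500\t1:500\tA\tG\t.\t.\t.\tGT\t./.\t./.\t0/1\t0/0",
    "1\t600\t1:600\tA\tG\t.\t.\t.\tGT\t./.\t./.\t0/0\t0/1",
])


def is_called(sample_field):
    """True if at least one allele in the GT sub-field is not '.'."""
    gt = sample_field.split(":")[0]  # assumes GT is the first FORMAT key
    return any(a != "." for a in gt.replace("|", "/").split("/"))


def called_sites_per_sample(vcf_text):
    """Return ({sample: set of called CHR:BP sites}, list of all sites)."""
    samples, called, sites = [], {}, []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            called = {s: set() for s in samples}
            continue
        site = f"{fields[0]}:{fields[1]}"
        sites.append(site)
        for sample, field in zip(samples, fields[9:]):
            if is_called(field):
                called[sample].add(site)
    return called, sites


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_samples(called, min_similarity=0.7):
    """Greedy grouping: join the first group whose first member is similar."""
    groups = []
    for sample, sites in called.items():
        for group in groups:
            if jaccard(sites, called[group[0]]) >= min_similarity:
                group.append(sample)
                break
        else:
            groups.append([sample])
    return groups


def classify_missing(called, groups, sites, min_group_rate=0.5):
    """Label each missing (sample, site) pair using the sample's own group."""
    labels = {}
    for group in groups:
        for sample in group:
            others = [s for s in group if s != sample]
            for site in sites:
                if site in called[sample]:
                    continue
                if not others:
                    labels[(sample, site)] = "unknown"
                    continue
                rate = sum(site in called[o] for o in others) / len(others)
                labels[(sample, site)] = ("likely failed call"
                                          if rate >= min_group_rate
                                          else "likely not on chip")
    return labels


if __name__ == "__main__":
    called, sites = called_sites_per_sample(CHIP_EXAMPLE_VCF)
    groups = group_samples(called)
    for key, label in classify_missing(called, groups, sites).items():
        print(key, label)
```

With real data the thresholds would need tuning, and a sample that is the only one from its chip cannot be classified this way.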