Question

Distribution of assayed SNPs per sample

5 days ago

am29 ▴ 60

I received a PLINK/VCF file containing many samples genotyped on different SNP chips. The chips differ both in size (from 50K to 1.5 million SNPs) and in platform (different companies).

I need to find common SNPs across samples.

The SNP IDs in the file are in CHR:BP format, so I cannot use them to infer which SNP comes from which platform. According to my logic, one could do this by filtering out SNPs with missing genotype calls (./.); however, when I did this, I ended up with a very small number of SNPs in common. Also, an individual may have been genotyped for a SNP but still ended up with a missing call, so I don't think this is a good way to do it. I tried PLINK's --missing command, which reports the missing genotype call rate per individual and per SNP. This is informative, but I need to know exactly which SNPs are common across individuals.

Is there a way to find this out?
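To make the filtering idea above concrete, here is a minimal Python sketch comparing the strict filter (keep a SNP only if every sample has a call) with a relaxed call-rate threshold. The function name `common_snps`, the nested-dict layout, and the toy genotypes are all made up for illustration; this is not PLINK's implementation, and the 0.6 threshold is arbitrary.

```python
def common_snps(calls, min_call_rate=1.0):
    """Return SNP IDs whose call rate across samples is >= min_call_rate.

    calls: dict mapping sample ID -> dict mapping SNP ID -> genotype string.
    A genotype of "./." (or a SNP absent from a sample's dict) counts as missing.
    """
    samples = list(calls)
    if not samples:
        return []
    # Collect every SNP ID seen in any sample.
    all_snps = set()
    for genotypes in calls.values():
        all_snps.update(genotypes)
    kept = []
    for snp in sorted(all_snps):
        called = sum(1 for s in samples if calls[s].get(snp, "./.") != "./.")
        if called / len(samples) >= min_call_rate:
            kept.append(snp)
    return kept


# Toy data, invented for this example.
calls = {
    "SAMPLE_1": {"1:5234": "./.", "1:6000": "0/1", "1:7000": "0/0"},
    "SAMPLE_2": {"1:5234": "0/1", "1:6000": "1/1", "1:7000": "0/0"},
    "SAMPLE_3": {"1:5234": "0/0", "1:6000": "0/0", "1:7000": "./."},
}
strict = common_snps(calls)         # every sample must have a call
relaxed = common_snps(calls, 0.6)   # tolerate some missing calls
print(strict)
print(relaxed)
```

With the strict filter a single missing call in any one sample removes the SNP, which is one reason the intersection can shrink so quickly when many samples are involved.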

distribution PLINK missingness • 412 views

ADD COMMENT • link 3 days ago by am29 ▴ 60

If you can share part of the data, along with what you have tried and the final output you expect, that would help people troubleshoot.

ADD REPLY • link 5 days ago by 1769mkc ★ 1.3k

The VCF file looks like this:

CHR   BP      SNP ID   GENOTYPE    SAMPLE ID
1     5234    1:5234    ./.        SAMPLE_1
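Assuming the data really is laid out as above, with one whitespace-separated row per SNP and sample (which is not the standard VCF layout), the called and missing counts per SNP could be tallied with something like the sketch below. The helper name `count_calls` is hypothetical, and every row except the SAMPLE_1 one is invented for illustration.

```python
from collections import Counter


def count_calls(lines):
    """Tally called and missing genotypes per SNP ID.

    Expects rows with exactly five whitespace-separated fields:
    CHR, BP, SNP ID, GENOTYPE, SAMPLE ID. Other rows (such as the
    header, which splits into more than five fields) are skipped.
    """
    called = Counter()
    missing = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue
        _chrom, _bp, snp_id, genotype, _sample = fields
        if genotype == "./.":
            missing[snp_id] += 1
        else:
            called[snp_id] += 1
    snps = set(called) | set(missing)
    return {snp: (called[snp], missing[snp]) for snp in snps}


# First data row is from the post; the rest are made up.
text = """CHR   BP      SNP ID   GENOTYPE    SAMPLE ID
1     5234    1:5234    ./.        SAMPLE_1
1     5234    1:5234    0/1        SAMPLE_2
1     6000    1:6000    1/1        SAMPLE_1
"""
counts = count_calls(text.splitlines())
for snp, (n_called, n_missing) in sorted(counts.items()):
    print(snp, n_called, n_missing)
```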

ADD REPLY • link 3 days ago by am29 ▴ 60

According to my logic, one could do this by filtering out SNPs with missing genotype calls (./.); however, when I did this, I ended up with a very small number of SNPs in common.

It sounds like you already have the answer? There are very few SNPs that are genotyped in all samples. Or perhaps I am missing something?

ADD REPLY • link 5 days ago by mbyvcm ▴ 460

What I am interested in is whether the missing calls arise not because individuals were genotyped on different SNP chips, but because genotype calls failed for SNPs that were actually assayed. In both cases I would see ./. in the genotype column, as I do now, right? How can I tell whether a missing call (./.) means the sample was never genotyped for that SNP, or that the SNP was assayed but the call failed (for example, because of low genotyping quality)? Is there a way to find this out from the VCF file?
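As far as I can tell, a ./. entry on its own does not record why the call is missing, so any distinction would have to be inferred from the pattern of missingness. One possible heuristic, sketched below in Python, assumes that samples run on the same chip share most of their called SNPs: a SNP that is missing in one sample but called in most similar samples is probably a failed call, while one missing in all of them is probably not on that chip. The function names, the 0.5 thresholds, and the toy data are all invented for illustration, and the guess may break down when chips overlap heavily.

```python
def called_set(genotypes):
    """SNP IDs with a non-missing genotype in one sample."""
    return {snp for snp, gt in genotypes.items() if gt != "./."}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_missing(calls, sample, snp, min_similarity=0.5):
    """Guess why `snp` is missing in `sample` (heuristic, not a proof)."""
    own = called_set(calls[sample])
    # Peers: other samples with a similar set of called SNPs,
    # taken here as a stand-in for "same chip".
    peers = [
        s for s in calls
        if s != sample and jaccard(own, called_set(calls[s])) >= min_similarity
    ]
    if not peers:
        return "unknown"
    peer_called = sum(1 for s in peers if calls[s].get(snp, "./.") != "./.")
    if peer_called / len(peers) > 0.5:
        return "likely failed call"
    return "likely not on chip"


# Toy data: A1-A3 mimic one chip, B1-B2 another; A1 has one failed call.
calls = {
    "A1": {"1:100": "0/1", "1:200": "./.", "1:300": "0/0", "1:400": "./.", "1:500": "./."},
    "A2": {"1:100": "0/0", "1:200": "0/1", "1:300": "0/0", "1:400": "./.", "1:500": "./."},
    "A3": {"1:100": "1/1", "1:200": "0/0", "1:300": "0/1", "1:400": "./.", "1:500": "./."},
    "B1": {"1:100": "./.", "1:200": "./.", "1:300": "0/0", "1:400": "0/1", "1:500": "1/1"},
    "B2": {"1:100": "./.", "1:200": "./.", "1:300": "0/1", "1:400": "0/0", "1:500": "0/1"},
}
print(classify_missing(calls, "A1", "1:200"))
print(classify_missing(calls, "A1", "1:400"))
```

If the chip manifests (the SNP lists supplied by the array vendors) are available, comparing positions against those would be more reliable than guessing from the data.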

ADD REPLY • link 3 days ago by am29 ▴ 60