Question

A question about genotyping rate

0

6 months ago

abedkurdi10 ▴ 190

Hello everyone,

I have four PLINK datasets. I harmonized them using Genotype Harmonizer with a reference panel. The genotyping rate of each dataset is around 0.98-0.99. When I merge the four datasets, the genotyping rate drops to 0.76 on average.

Does anyone know what could affect the genotyping rate?

Thank you!

genotyping snps harmonizer array • 908 views

ADD COMMENT • link 6 months ago by abedkurdi10 ▴ 190

0

Can you please explain what you did before merging the datasets? Are your datasets in the same genomic build? Were there any allele flips? How do you deal with A/T and G/C SNPs? Did you merge only the SNPs common to all datasets? The genotyping rate should not drop that low after merging.

ADD REPLY • link 6 months ago by bk11 ★ 2.4k

0

Yes, my datasets are in the same genomic build. Of course, there were some allele flips.

I am using Genotype Harmonizer (https://github.com/molgenis/systemsgenetics/wiki/Genotype-Harmonizer), which is supposed to handle A/T and G/C SNPs and correct the flips. However, it does not seem to flip all the SNPs based on the reference panel I provided. When merging with bcftools, I found some variants that had not been flipped in one sample although they had been flipped in the others.

It seems I am facing issues with the merging process. If I merge the common SNPs, I would lose a lot of SNPs, am I right? Unless I am missing something.
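
One way to look for such leftover flips is to compare the allele pairs recorded for the same variant ID in two .bim files. The sketch below is only an illustration: `allele_mismatches` is a hypothetical helper name, and it assumes standard six-column .bim files with the variant ID in column 2 and the two alleles in columns 5 and 6. It cannot detect flips at A/T or G/C SNPs, because the flipped allele pair is identical to the original one, and it will also report variants that are monomorphic in one dataset.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report variant IDs present in both .bim files
# whose allele pairs differ (the order of the two alleles is ignored).
allele_mismatches() {
    awk '
        # Build an order-independent key such as "A/G" from two alleles
        function key(a, b) { return (a < b) ? a "/" b : b "/" a }
        # First file: remember the allele pair of every variant ID
        NR == FNR { alleles[$2] = key($5, $6); next }
        # Second file: print shared IDs whose allele pair is different
        ($2 in alleles) && alleles[$2] != key($5, $6) {
            print $2, alleles[$2], key($5, $6)
        }
    ' "$1" "$2"
}
```

Calling `allele_mismatches data1.bim data2.bim` prints one line per suspicious variant: the ID, the allele pair in the first file, and the allele pair in the second.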

ADD REPLY • link 6 months ago by abedkurdi10 ▴ 190

0

I have not tried Genotype Harmonizer yet. If the datasets were genotyped on the same array, you will not lose many SNPs when merging. If they were genotyped on different arrays, you will lose some. However, you can recover the lost SNPs when you impute your data. I would first check how many SNPs your datasets have in common, as follows:

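# Column 2 of each .bim file is the variant ID; write one sorted, unique ID list per dataset
awk '{print $2}' data1.bim | sort -u > 1.list
awk '{print $2}' data2.bim | sort -u > 2.list
awk '{print $2}' data3.bim | sort -u > 3.list
awk '{print $2}' data4.bim | sort -u > 4.list

# Keep only the IDs present in all four lists (comm needs sorted input, which sort -u provides)
comm -12 1.list 2.list | comm -12 - 3.list | comm -12 - 4.list > common_SNPs.list
# Count the shared variants
wc -l common_SNPs.list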
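
The same check can be generalized to any number of datasets by counting in how many files each variant ID occurs. This is a minimal sketch, not part of PLINK: `common_snps` is a hypothetical helper name, and it assumes the variant ID is in column 2 of every .bim file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the variant IDs shared by ALL .bim files
# passed as arguments.
common_snps() {
    n=$#    # number of .bim files given
    for bim in "$@"; do
        # Deduplicate within each file so that an ID repeated inside
        # one file is counted only once for that file
        awk '{print $2}' "$bim" | sort -u
    done | sort | uniq -c | awk -v n="$n" '$1 == n {print $2}'
    # uniq -c prefixes each ID with the number of files containing it;
    # the final awk keeps the IDs seen in every file
}
```

For example, `common_snps data*.bim > common_SNPs.list` followed by `wc -l common_SNPs.list` gives the number of shared variants. Placeholder IDs such as `.` would be counted as shared if every file contains them, so they should be checked separately.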

ADD REPLY • link 6 months ago by bk11 ★ 2.4k

0

The datasets were not genotyped on the same array. I got ~207000 variants in common, while the individual datasets contain:

Dataset 1: 492592
Dataset 2: 324282
Dataset 3: 291611
Dataset 4: 398343
Dataset 5: 518396
Dataset 6: 387083
Dataset 7: 551603
Dataset 8: 532975
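
Differences like these are a likely explanation for the drop in genotyping rate. By default a PLINK merge keeps the union of all variants, and every sample from a dataset that lacks a variant gets a missing call there. A rough estimate of the merged rate is therefore the sum over datasets of (samples × variants × call rate), divided by (total samples × variants in the union). The sketch below only illustrates that arithmetic: `expected_rate` is a hypothetical helper, and the sample counts and union size it needs are not given in this thread, so they have to be supplied from the .fam files and the merged .bim file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the genotyping rate after a union merge.
# stdin:    one line per dataset: "n_samples n_variants call_rate"
# argument: number of variants in the union of all datasets
expected_rate() {
    awk -v union="$1" '
        # Genotypes actually called in this dataset
        { called += $1 * $2 * $3; samples += $1 }
        # Every sample has one slot per union variant after merging
        END { printf "%.3f\n", called / (samples * union) }
    '
}
```

As a made-up example, `printf '100 500 1.0\n100 500 1.0\n' | expected_rate 800` describes two datasets of 100 samples and 500 fully called variants each, with 800 variants in the union, and prints 0.625.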

ADD REPLY • link 6 months ago by abedkurdi10 ▴ 190

1

You now mention 8 datasets (your original post said 4). The smallest dataset is dataset 3 (n=291611 SNPs), so having ~207000 SNPs in common across all datasets is not bad. I would suggest considering three different approaches for your data:

Common SNPs approach: Find the SNPs common to all 8 datasets and merge on those. The SNPs you drop can be recovered during imputation. The drawback is that you lose some genotyped SNPs, but you will still have enough SNPs for principal component analysis.
Impute each dataset separately and merge: If each dataset has a good proportion of cases and controls, you can QC and impute the datasets separately. After imputation, find the common SNPs again and merge. You will find many more SNPs shared between the datasets this way, and you can then perform association analysis on the combined dataset.
Meta-analysis: QC and impute each dataset separately, then perform a meta-analysis.

You can try each of these approaches and then choose the one that works best for your data or goal.
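
The common SNPs approach could be sketched with PLINK 1.9 roughly as follows. This is an illustration under assumptions, not a tested pipeline: `restrict_and_merge` is a hypothetical helper, the dataset prefixes and the output names (`*_common`, `prefixes.txt`, `merge_list.txt`, `merged_common`) are placeholders, and the datasets are assumed to be binary filesets that are already on the same build and strand.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restrict each binary fileset to the shared variants,
# then merge the restricted filesets.
# Usage: restrict_and_merge common_SNPs.list data1 data2 ... dataN
restrict_and_merge() {
    snplist=$1; shift            # file of shared variant IDs, one per line
    : > prefixes.txt             # start with an empty list of prefixes
    for prefix in "$@"; do
        # Keep only the shared variants in this dataset
        plink --bfile "$prefix" --extract "$snplist" \
              --make-bed --out "${prefix}_common" || return 1
        echo "${prefix}_common" >> prefixes.txt
    done
    # The first fileset is the base; all others go into the merge list
    first=$(head -n 1 prefixes.txt)
    tail -n +2 prefixes.txt > merge_list.txt
    plink --bfile "$first" --merge-list merge_list.txt \
          --make-bed --out merged_common
}
```

After a run such as `restrict_and_merge common_SNPs.list data1 data2 data3 data4`, the total genotyping rate reported in the PLINK log for `merged_common` should be close to the per-dataset rates, because every retained variant was genotyped in every dataset.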

ADD REPLY • link 6 months ago by bk11 ★ 2.4k

0

Thank you very much for your suggestions! Yeah, it was my mistake to say four! Thanks again!!

ADD REPLY • link 6 months ago by abedkurdi10 ▴ 190