I have three VCFs. Two of these VCFs were generated using the Precision Medicine Research Array (PMRA) and refer to SNPs as AX numbers. I was able to merge the two PMRA VCFs together.
Merged PMRA VCFs (Total genotyping rate is 0.924427):
1 AX-150343089 0 837711 T C
1 AX-149471710 0 837756 T G
1 AX-40234919 0 844647 G T
1 AX-114086366 0 846320 A G
However, my Global Screening Array (GSA) VCF uses rsIDs:
1 rs6605059 0 984039 T C
1 rs4970414 0 984121 G T
1 rs116781904 0 984475 A G
1 GSA-rs61770779 0 984547 A G
I was able to merge them using:
bcftools merge pmra.vcf.gz gsa.vcf.gz -o gsa_pmra.vcf --force-samples
gsa_pmra.vcf (Total genotyping rate is 0.534432):
1 AX-29796323 0 1117607 C G
1 GSA-rs61766344 0 1118711 T C
1 AX-29797251 0 1120162 A G
1 AX-29797373 0 1120521 A C
1 AX-38925889;rs9442373 0 1127258 C A
1 AX-29801021;rs4072537 0 1129916 T C
1 AX-29801231;GSA-rs11260598 0 1130346 C T
1 AX-29801717 0 1131207 T C
1 AX-107792172 0 1133289 GT G
1 AX-107792172 0 1133290 G TT
1 rs61766346 0 1133503 A G
1 AX-29803231 0 1134155 A G
I have 4681 samples genotyped on PMRA and 40 samples genotyped on GSA.
Total Positions: 944562 (PMRA), 692246 (GSA)
- PMRA Only: 805321
- GSA only: 553005
- Overlap: 139241
My issue is that the genotyping rate (0.534432) is low and I am afraid that when I do any sort of quality control filtering, it will remove too many SNPs/samples. Does anyone have any advice/comments?
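A rough check on the counts above suggests that much of the missingness comes from the merge itself: by default bcftools merge writes missing genotypes for samples whose input file lacks a site, so every GSA-only site is missing in all PMRA samples and vice versa. The sketch below computes the resulting ceiling on the genotyping rate, assuming one record per position and perfect call rates within each array. The function name `max_call_rate` is made up for illustration.

```shell
# Upper bound on the genotyping rate of a merged file, assuming every
# genotype present in an input file is called and one record per position.
# Args: n_samples_a n_samples_b sites_only_a sites_only_b sites_shared
max_call_rate() {
    awk -v na="$1" -v nb="$2" -v oa="$3" -v ob="$4" -v shared="$5" 'BEGIN {
        # Cells with no source genotype: B-only sites in A samples,
        # plus A-only sites in B samples.
        missing = ob * na + oa * nb
        total   = (oa + ob + shared) * (na + nb)
        printf "%.4f\n", 1 - missing / total
    }'
}

# PMRA = A (4681 samples), GSA = B (40 samples), counts from the post.
max_call_rate 4681 40 805321 553005 139241
```

With these counts the ceiling is about 0.63, before counting any genotypes that are genuinely uncalled within either array.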
Are all of your genotyping data on the same build? Whether the SNP IDs are AX numbers or rsIDs is not a problem here: you can replace them with chr:position and annotate them with rsIDs later. I would recommend using PLINK for quality control before you merge your datasets. Provided that your two datasets are on the same build, you can follow these steps.
Convert your VCF files into PLINK binary files:
plink --vcf pmra.vcf.gz --make-bed --out pmra
plink --vcf gsa.vcf.gz --make-bed --out gsa
Quality control your gsa.bed/bim/fam and pmra.bed/bim/fam files: I would subset the data to chromosomes 1-23, get rid of AT, CG, GC and TA SNPs, and replace the AX numbers and rsIDs with chrom:position.
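The two list-building parts of this step can be sketched with awk on the .bim file (columns: chromosome, variant ID, cM, position, allele 1, allele 2). This is a minimal sketch, not a full QC pipeline. The function names are hypothetical, and AT/TA/CG/GC is read here as SNPs whose two alleles are complementary, so the strand cannot be told from the alleles alone.

```shell
# Print IDs of strand-ambiguous SNPs (A/T, T/A, C/G, G/C) from a .bim file.
ambiguous_snps() {
    awk '{
        pair = toupper($5) toupper($6)
        if (pair == "AT" || pair == "TA" || pair == "CG" || pair == "GC")
            print $2
    }' "$1"
}

# Print "old_id new_id" pairs that map each variant ID to chrom:position.
chrpos_names() {
    awk '{ print $2, $1 ":" $4 }' "$1"
}
```

Assuming PLINK 1.9, the first list could be passed to `--exclude` together with `--chr 1-23`, and the second to `--update-name`, for example `ambiguous_snps pmra.bim > pmra.ambiguous.txt` followed by `plink --bfile pmra --chr 1-23 --exclude pmra.ambiguous.txt --make-bed --out pmra.qc`. Variant IDs or positions that occur more than once within one array may need to be resolved before renaming.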
Find common SNPs and merge your QCed data: I would find the SNPs common to your GSA and PMRA data and merge on those. While merging, you should make sure that the alleles match for each SNP.
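One way to check this before merging is to compare the two .bim files directly. The sketch below keys each variant by chrom:position and reports, for every site present in both files, whether the two alleles agree in either order. The function name `compare_alleles` is hypothetical, and the sketch assumes at most one variant per position in the first file.

```shell
# Compare two .bim files by chrom:position.
# Output for each shared site: key, then "match" or "mismatch".
compare_alleles() {
    awk '
        # First file: remember the allele pair for each chrom:position.
        NR == FNR { a[$1 ":" $4] = toupper($5) SUBSEP toupper($6); next }
        # Second file: report only sites that were seen in the first file.
        {
            key = $1 ":" $4
            if (!(key in a)) next
            split(a[key], x, SUBSEP)
            b1 = toupper($5); b2 = toupper($6)
            same = (x[1] == b1 && x[2] == b2) || (x[1] == b2 && x[2] == b1)
            print key, (same ? "match" : "mismatch")
        }
    ' "$1" "$2"
}
```

Sites reported as `match` could be written to a file and given to PLINK's `--extract` for both datasets (assuming the IDs have already been replaced with chrom:position), and the two subsets then combined with `--bmerge`. A `mismatch` such as A/G versus T/C may be a strand flip rather than a different variant, so those sites are worth inspecting before they are dropped.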
Thanks for your reply. Yes, the genotyping data are on the same build.
I am trying to understand this: "get rid of AT,CG GC and TA SNPs". Also, regarding "While merging you should make sure that each alleles match for the SNPs", do you mean that I should ensure that the ref and alt alleles of the GSA and PMRA SNPs match?