Plink-handling multiallelic variant to merge two datasets
1
0
Entering edit mode
4.8 years ago
khn ▴ 130

Hello,

I am merging two datasets. I used "--flip" and could merge almost all chromosomes except for some chromosomes. One of the examples is as below.

Bim file in my dataset1
1   rs2169610   0   103099201   T   A

Bim file in my dataset2
1   rs2169610   0   103326613   C   A


So I think that this is an error due to multiallelic variant. The error message said that PLINK is not yet suited to handling them. So anybody know how to handle this??? Thank you in advance!!!

Error: 1 variant with 3+ alleles present.

* If you believe this is due to strand inconsistency, try --flip with
Myfile_Chr1-merge.missnp.
(Warning: if this seems to work, strand errors involving SNPs with A/T or C/G
alleles probably remain in your data.  If LD between nearby SNPs is high,
--flip-scan should detect them.)
* If you are dealing with genuine multiallelic variants, we recommend exporting
that subset of the data to VCF (via e.g. '--recode vcf'), merging with
another tool/script, and then importing the result; PLINK is not yet suited
to handling them.

SNP genome gene snp • 10k views
2
Entering edit mode
4.8 years ago

From where did you obtain this data? The entry in your dataset1 is actually erroneous because the true position for rs2169610 is chr1:103326613 (in GRCh37 / hg19). So, this is not a typical multi-allelic call as plink suggests; rather, it's alluding to a potential error in the annotation.

Sometimes dbSNP entries can refer to more than 1 position and this is due to the fact that they are being referenced to more than one genome assembly, and this can be independent of the differences that one would expect between the official GRCh builds. For example, rs1234 could reference the same allele but have 2 different positions: one on the official GRCh37 assembly and the other on the Celera assembly.

You should read through the NCBI's FAQ on all of the 'intricacies' that can arise with the dbSNP annotation: Why does rsXYZ have two different chromosome positions?

As to what your options are with regard to this, I would consider excluding the variant at position chr1:103099201 or re-encoding it with an ID of chr1:103099201. To completely avoid all of these issues in the future, you can encode all of your variant IDs as CHR:POS:REF:ALT in order to ensure uniqueness.

Kevin

0
Entering edit mode

Hello Kevin,

Thank you so much in detal. I really appreciate it!

1. Excluding the variant at position chr1:103099201... If I choose this option, I think I can just use "--exclude Myfile.missnp".

2. Re-encoding it with an ID of chr1:103099201... If I want to choose this, can I just do it manually? And considering this as below, https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=2169610

How can I adjust the alleles? One is T and A, and the other is C and A.

Bim file in my dataset1 1 rs2169610 0 103326613 T A

Bim file in my dataset2 1 rs2169610 0 103326613 C A

KHN

1
Entering edit mode

I think that you should first aim to understand the background to this particular rs ID. Did you get this data from a particular genotyping array? Perhaps the fault is with the manufacturer of the array.

To exclude this variant, because it has the same ID in 2 different locations, you would have to use --extract myrange.txt --range. See the section entitled 'Extract a subset of SNPs: file-list options' in Data management tools

According to the dbSNP, the position has 3 possible genotypes on the forward strand, i.e., A, G, or T. This would imply that the genotype at 'dataset2 1 rs2169610 0 103326613 C A' is referring to the reverse strand, and would need to be flipped. I don't know about the dataset1 variant because I don't know if it's an error or not.

0
Entering edit mode

Thank you so much again. I would like to explain more in detail of what I did.

Initially when I first merged the two datasets, it said as below. For chromosome 1,

Random number seed: 1526805743 39453 MB RAM detected; reserving 19726 MB for main workspace.

5 people to be merged from Mydataset2.fam.

Of these, 5 are new, while 0 are present in the base dataset.

Warning: Multiple positions seen for variant 'rs**'.

Warning: Multiple positions seen for variant 'rs2169610'.

Warning: Multiple positions seen for variant 'rs**'.

Warning: Multiple positions seen for variant 'rs**'.

.......

Warning: Multiple positions seen for variant 'rs**'.

45628 markers to be merged from Mydataset2.bim.

Of these, 7036 are new, while 38592 are present in the base dataset.

Error: 18935 variants with 3+ alleles present. * If you believe this is due to strand inconsistency, try --flip with Merged-dataset1-2-merge.missnp. (Warning: if this seems to work, strand errors involving SNPs with A/T or C/G alleles probably remain in your data. If LD between nearby SNPs is high, --flip-scan should detect them.) * If you are dealing with genuine multiallelic variants, we recommend exporting that subset of the data to VCF (via e.g. '--recode vcf'), merging with another tool/script, and then importing the result; PLINK is not yet suited to handling them.

So, I flipped the Mydataset1, and then merged. At that time, for some of the chromosomes, including chromosome 1, created the missnp files, the error message of which was in my first post.

Thus, I think that I should not have flipped the Mydataset1 initally, but I needed to adjust the positions first.

Though there were 18935 variants had multiple positions in chromosome 1 before flipping, but after flipping the Mydataset1, I could merge all for chromosome 1, except for rs2169610.

I think I needed to map the positions first.

0
Entering edit mode

I eventually got back. Please take a look at my comment to Phoenix Mu

0
Entering edit mode

I am actually having the same problem. But, I am trying to merge files from different chromosomes, i.e. chr1-22. Therefore, there should be no duplicated markers between these files, but I still get this error. How could that be?

0
Entering edit mode

I later found that removing these problematic SNPs with BCFtools (prior to the conversion to PLINK format) is better. If you take a look at what I do in Step 4, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2