Question: How to convert plink.raw to map and ped files in any software
0
gravatar for bha
12 days ago by
bha0
bha0 wrote:

I have a data files contains imputed genotypes with informations in the columns as:

FamilyID  IndividualID  DAD MUM SEX Phenotype snp1 snp 2 .........snp60k

this file is in plink.raw format, any idea how i can convert this in .map and ped plink format?

sequencing snp plink R • 256 views
ADD COMMENTlink modified 11 days ago by Kevin Blighe7.2k • written 12 days ago by bha0

From where did you get the raw file? These raw files are usually produced via the recode command line parameter to plink, meaning that they have to have been produced from an existing plink dataset. This question has been posted time and time again across the WWW but there's never a concrete solution, and I keep wondering from where did the person obtain the raw file in the first place(?).

If you have those columns, it is already virtually a PED file, but just one without the associated MAP. How are your alleles encoded? Are you sure that you don't already have a PED or BED file in your directory?

ADD REPLYlink modified 12 days ago • written 12 days ago by Kevin Blighe7.2k

Kevin, many thanks for replying. Yes, I got plink.raw file from recode in plink. Actually I used (plink.raw after removing FID dad,mum,sex and pheno) file for imputing missing genotypes in an other programme called AlphaImpute. The output which got after imputation is in text file (which contain columns; ID , snp1, snp2,...snp60k). I merged that out after imputation (cbind in R), and get DAD, MUM,SEX and Pheno. Now i want to convert this file back to plink. this how it looks

FID IID PAT MAT SEX Pheno snp1 snp2 snp3 snp4 snp5
1  117  12  51  1  -9  0  2  0  0   9
1  112  11  78  1  -9  1  1  0  9   0
1  109  79  19  2  -9  0  0  2  0   1
1  108  20  15  1  -9  2  1  1  9   2
1  107  32  17  1  -9  1  0  2  9   9
1   98  44  71  2  -9  2  2  2  0   9

he snps coded as 0, 1, and 2 stand for the homozygous aa, the heterozygous aA or Aa, and the homozygous AA cases, respectively, and 9 is missing genotype. I also posted in stackoverflow, but haven't got any answer so far.

ADD REPLYlink modified 12 days ago • written 12 days ago by bha0

In that case, you may still be able to coerce this back into a PLINK object provided the original map file is there (if you have it?). Take a look here: https://www.cog-genomics.org/plink/1.9/input#ped

Something like: plink --ped MyImputedData.txt --map MyOriginalMap.map (or possibly --tped)

I have also added the 'plink' tag to your original question in the hope that the developer will pick up on this issue.

Generally it looks like there's nothing specific to do this, and that you'll have to work with the data you've got in order t coerce it back into PLINK format. Also double check your impute program to see if it doesn't already have a function to produce data in PLINK-ready format.

ADD REPLYlink written 11 days ago by Kevin Blighe7.2k

I do have map file, I give a go with this

plink --ped MyImputedData.txt --map MyOriginalMap.map

but ends up with error "token does not match". As I mentioned earlier, I got plink.raw (MyImputedData.txt ) from --recodeA (which convert alleles as 0, 1 and 2). So I think, plink expect 2 columns for one allele. Do you know, is there any way to re-convert single column into two columns? I am afraid imputed program does not produce PLINK-ready format

ADD REPLYlink written 10 days ago by bha0

Can you paste a few lines from each here (the raw and map file), for matching SNPs? Then I will try to re-create the PLINK dataset here and update you afterwards.

ADD REPLYlink written 10 days ago by Kevin Blighe7.2k

first 10 rows and columns plink.raw file:

FID IID PAT MAT SEX PHENOTYPE   snp1    snp2    snp3    snp4
1   6   4747    0   1   -9  1   1   1   1
1   7   3211    0   2   -9  0   1   0   0
1   8   4561    0   2   -9  0   0   0   0
1   11  4745    0   1   -9  0   2   0   2
1   12  4551    0   1   -9  2   2   0   NA
1   13  4538    0   1   -9  NA  0   1   0
1   14  10880   0   1   -9  2   0   2   0
1   15  11016   0   1   -9  0   NA  2   0
1   18  4348    0   1   -9  NA  1   0   0
1   20  5695    0   1   -9  0   2   1   2

And the map file:

21  oar3_OAR21_19607    0   19607
21  oar3_OAR21_22363    0   22363
21  oar3_OAR21_28966    0   28966
21  oar3_OAR21_29561    0   29561
21  oar3_OAR21_33633    0   33633
21  oar3_OAR21_38408    0   38408
21  oar3_OAR21_50220    0   50220
21  oar3_OAR21_57475    0   57475
21  oar3_OAR21_62772    0   62772
21  oar3_OAR21_64887    0   64887
21  oar3_OAR21_66447    0   66447
21  oar3_OAR21_74279    0   74279
21  oar3_OAR21_88952    0   88952
21  oar3_OAR21_99105    0   99105

Many thanks, in advance!

ADD REPLYlink written 10 days ago by bha0

I did some testing and it is possible to convert these back to PLINK, but you are still missing key information. Yes, PLINK expects two columns for each SNP genotype, and, if numbers are provided as genotypes in the PED/RAW file, it will assume that these are in the 1234 format for ACGT.

I was able to coerce your data to PLINK with:

sed 's/ \+/ /g' test.raw > test.ped

Then remove the header in test.ped and only include 4 columns for genotypes (for 2 SNPs). I also modified the missing genotypes that were NA, and other genotypes, just or testing.

cat test.ped 
1 6 4747 0 1 -9 1 1 1 1
1 7 3211 0 2 -9 2 1 3 1
1 8 4561 0 2 -9 2 1 1 2
1 11 4745 0 1 -9 2 2 2 2
1 12 4551 0 1 -9 2 2 2 1
1 13 4538 0 1 -9 2 1 1 4
1 14 10880 0 1 -9 2 2 2 3
1 15 11016 0 1 -9 2 2 2 3
1 18 4348 0 1 -9 2 1 2 1
1 20 5695 0 1 -9 2 2 1 2

cat test.map 
21  oar3_OAR21_19607    0   19607
21  oar3_OAR21_22363    0   22363

plink --ped test.ped --map test.map 
PLINK v1.90b3.38 64-bit (7 Jun 2016)       https://www.cog-genomics.org/plink2
(C) 2005-2016 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --map test.map
  --ped test.ped

15037 MB RAM detected; reserving 7518 MB for main workspace.
.ped scan complete (for binary autoconversion).
Warning: Variant 2 quadallelic; setting rarest alleles missing.
Performing single-pass .bed write (2 variants, 10 people).
--file: plink.bed + plink.bim + plink.fam written.

The conclusion? We need another mapping in order to connect 012 back to bases, and convert these (using awk, possibly) to ACGT or 1234 in the PED file. Do you have such mapping from the original data?

The order of genotypes in the MAP files has to agree with the order in the RAW/PED file too.

Almost there.

ADD REPLYlink modified 10 days ago • written 10 days ago by Kevin Blighe7.2k

Thanks a lot for great help. I do have original data for mapping. your methods is also works, however, i tried something else which works more swiftly. As, PLINK wants two columns for each SNP, so I use this:

library(HapEstXXR)

Markers2 <- allele1to2(d_dimputed[, 7:61865])

which converts each column into two alleles, and this should be .ped file. And map is already there. Do you think it make sense?

ADD REPLYlink modified 8 days ago by Kevin Blighe7.2k • written 9 days ago by bha0

I have never used this program. Are you sure that it's output will be accepted by PLINK?

Type 1 : 1-column genotype matrix : minor allele count (0,1,2)

Type 2 : 2-column genotype matrix : each marker has a pair of two columns (1/1, 1/2, 2/2)

Type R : 1-column genotype matrix : code (1 = 1/1, 3 = 1/2, 2 = 2/2)

[source: https://cran.r-project.org/web/packages/HapEstXXR/HapEstXXR.pdf]

If you have the original genotype at each SNP (ATGC), then I can probably convert your imputed data to PLINK format using awk and sed.

ADD REPLYlink written 8 days ago by Kevin Blighe7.2k

I tried with my data with (because my plink.raw is in 0,1,2):

Type 1 : 1-column genotype matrix : minor allele count (0,1,2)

and the output got accepted by PLINK. I have original genotype at each SNP (ATGC) - Just for learning; could you please write an example code, how to convert imputed data to PLINK format using awk and sed?

ADD REPLYlink written 8 days ago by bha0

I would need the lookup tables, i.e., for linking the minor allele numbers in 012 format to the actual alleles, in either 1234 or ACGT.

Can you paste the output of plink when you input this information? Are you sure that it has interpreted it correctly?

ADD REPLYlink written 8 days ago by Kevin Blighe7.2k

I got this ped file using above R package (FID, IID, PAT,MAT,SEX,PHENO, and snps);

1   -6  4747    0   1   -9  1   1   2   2   1   1
1   -7  3211    0   2   -9  0   0   0   0   0   0
1   -8  4561    0   2   -9  1   1   2   1   1   1
1   -11 4745    0   1   -9  1   1   2   2   1   1
1   -12 4551    0   1   -9  0   0   0   0   0   0
1   -13 4538    0   1   -9  1   1   1   1   1   1
1   -14 10880   0   1   -9  2   2   1   1   2   2
1   -15 11016   0   1   -9  1   1   1   1   1   1
1   -18 4348    0   1   -9  2   2   1   1   2   2
1   -20 5695    0   1   -9  1   1   1   1   2   2
1   -21 4329    0   1   -9  1   1   1   1   1   1
1   -22 11004   0   2   -9  1   1   1   1   1   1
1   -25 2709    0   2   -9  2   1   2   1   2   1
1   -26 61  0   1   -9  1   1   2   1   1   1
1   -27 4205    0   1   -9  2   1   2   1   2   1
1   -31 4556    0   1   -9  2   1   2   1   2   1
1   -32 4539    0   2   -9  1   1   2   2   1   1
1   -35 2971    0   1   -9  2   1   2   1   2   1

Well, to me it looks good, but can't be 100% sure.

ADD REPLYlink written 8 days ago by bha0

I think that you still need to be careful here because plink will still assume that each 2 columns relates to a single genotype.

For example, the first line:

1   -6  4747    0   1   -9  1   1   2   2   1   1

Plink will assume that the genotypes are 11, 22, and 11 (AA, CC, and TT).

As mentioned, you will have to link the 012 imputed raw file back to your original MAP file, and then encoding 012 and 1234 or ACTG.

If in doubt, I would contact the developers of plink (assuming their email is on the website). I believe that it's now under the license/maintenance of COG Genomics.

ADD REPLYlink written 8 days ago by Kevin Blighe7.2k

Do you meant to use this:

plink --ped plink.raw --map plink.map --out newData

plink.raw is the file after converting 012 to 11, 22, and 11 (AA,CC,T)?

ADD REPLYlink written 8 days ago by bha0

No, I'm talking about editing your raw file manually, using something like awk.

Currently it's in 012 format (and also the format from allele1to2), which contains no information on genotype calls. Plink cannot read this format directly (it outputs to this format using --recode, as we know).

We want 1234 or ACTG format. For that, we need to know the actual genotype at each SNP position in your data. We lost this connection by using the --recode command.

plink.raw is the file after converting 012 to 11, 22, and 11 (AA,CC,T)?

As much as I'm aware, 11, 22, and 11 do not equal AA, CC, and T.


If you compare your original and current data:

Original (from imputation; 012 encoding):

1   6   4747    0   1   -9  1   1   1

Current (from allele1to2):

1   -6  4747    0   1   -9  1   1   2   2   1   1

Looking at the original, I see 3 heterozygous SNPs. Looking at the current file, I see conflicting information, as the 11, 22, and 11, indicate homozygosity for AA, CC, AA.


Just be careful about how you proceed with this. Genotype data can be very messy and good data management is therefore of paramount importance. I don't believe that allele1to2 is what you want to use. You need THIS. You may already have all required information.

ADD REPLYlink written 8 days ago by Kevin Blighe7.2k

Yes, you are absolutely right. In original file, i have 3 heterozygous SNPs,but in current file (plink.raw 012) -this conflicting with original. I can't see 'G'.

ADD REPLYlink written 8 days ago by bha0

I got two files 1) plink.raw (imputed genotypes as 012 format), and an other original map file say 2) plink.map file (which contain info for CHR, SNPIdentifier genetic distance and bpPOS). this is how i am doing;

> famid <- FID patid 
<- IID fid <- PAT
 mid <- MAT
 sex <- SEX
 trait <-PHENOTYPE 
CHR <- plink.map$CHR
 SNP <- plink.map$SNPIdentifier
 POS <- plink.map$bpPOS

geno.matrix <- plink.raw (after removing first column)????

BUT, where i can get these 3 files? linkage.file, map.file,cov.file

geno.matrix (n,m) genotype matrix (n=number of individuals, m=number of marker, 1-column for every marker, R-code: 1 = 1/1, 3 = 1/2, 2 = 2/2); All markers should be biallelic.

BUT mine is in 0,1,2, should i change to 1,2,3?

ADD REPLYlink modified 8 days ago • written 8 days ago by bha0

You may have them somewhere on your disk. Essentially, all you need is a list of SNPs that has, for each, the minor and major allele. Sometimes these are output as standard if your data has passed through a bioinformatics service. I output it too when analysing genotype data:

ID    Minor Major
SNP1  A     G
SNP2  T     G
....

I've just read somewhere else that, yes, it's not possible to revert back to what we need:

Some association or linkage software requires that alleles be coded as nuemric values, therefore one may want to convert ACGT alleles into 1/2 alleles. Softwares such as PLINK can perform this task, using --recode12 argument. However, the 1/2 coding by PLINK cannot be decoded back to ACGT, since 1 and 2 denote minor/major allele, respectively, and these annotations differ between data sets. When getting a genotype data file with 1/2 allele coding from other people, it is always necessary to ask whether the alleles are coded as Illumina's A/B alleles or simply recoded using PLINK; if it's the latter, it is best to ask for original genotype data before doing anything with the data.

[source: http://gengen.openbioinformatics.org/en/latest/tutorial/coding/]


As some final guidance to you:

  • Post the issue on the PLINK Google Groups page, which is where they recommend directing issues with PLINK
  • It is still possible to perform the association testing with the data that you've got. You can run a binomial logistic regression model that predicts for your outcome of interest and treat 0, 1, and 2 as categorical variables for Reference, Heterozygous, and Homozygous respectively. I have done this genome-wide from PLINK data for a conditional logistic regression model with interaction terms on trios. Take a look here: R functions edited for parallel processing and also my GitHub page (go to my Biostars profile).
  • If you can still find the information that connects 012 back to actual genotypes, then I can still help here.
ADD REPLYlink written 8 days ago by Kevin Blighe7.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1359 users visited in the last hour