Question: Transform Genomic Data
2.2 years ago, admamajdi wrote:


I have a genomic data file with this format:

1=A, 2=C, 3=G, 0=missing (No "T's")

How should I transform this data into SNP (0 1 2 and 5=missing) data?



written 2.2 years ago by admamajdi

Hey Adam, you have not provided enough information such that anyone can give a reliable answer.

What type of file is it?; Is it binary or plain text?; In what exact format is it? - you should paste an example of your data.

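For direct conversions of plain text characters in bash, you can use the tr command after you've piped from cat, for example: cat MyData | tr '1234' 'ATGC' converts 1/2/3/4 to A/T/G/C, respectively. The character sets are quoted so that the shell passes them to tr unchanged; square brackets are not needed and, left unquoted, can be expanded by the shell as a filename pattern.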

It looks like you want to convert your data into allelic numerical encoding, but you have not stated this specifically. For example,

  • major allele | major allele = 0
  • major allele | minor allele = 1
  • minor allele | minor allele = 2
  • Missing = 5

To do this, you need to know the minor allele (or whatever allele in question whose effects you are researching)
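As a rough illustration of the encoding listed above, here is a minimal Python sketch. The function name encode_genotype, the use of "0" as the missing-allele symbol, and the example SNP (G major, A minor) are all assumptions made for illustration, not something taken from the poster's file.

```python
MISSING_CODE = 5  # code used for missing genotypes in this thread


def encode_genotype(allele1, allele2, minor, missing="0"):
    """Return the number of minor alleles (0, 1 or 2), or 5 if either allele is missing."""
    if allele1 == missing or allele2 == missing:
        return MISSING_CODE
    # Each comparison is True (1) or False (0), so the sum is the minor-allele count.
    return (allele1 == minor) + (allele2 == minor)


# Hypothetical SNP where G is the major allele and A is the minor allele.
genotypes = [("G", "G"), ("G", "A"), ("A", "A"), ("0", "0")]
codes = [encode_genotype(a, b, minor="A") for a, b in genotypes]
print(codes)  # [0, 1, 2, 5]
```

The point of the sketch is that the same genotype can receive a different code depending on which allele is treated as the minor one, which is why that allele has to be known first.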

modified 2.2 years ago • written 2.2 years ago by Kevin Blighe

Hi Kevin,

Thanks for your reply. Here is the format of my file; it is plain text:


Yes, I want to convert the file into allelic numerical encoding.


modified 2.2 years ago • written 2.2 years ago by admamajdi

Hey Adam, there is still some doubt about what exactly you wish to do...

What is the exact encoding that you wish to use?

  • 1 = A = 0
  • 2 = C = 1
  • 3 = G = 2
  • 0 = missing = 5

If this is the correct encoding, then use cat MyData | tr '1230' '0125' (the character sets are quoted so that the shell passes them to tr unchanged).
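For anyone who prefers to stay out of the shell, the same character-for-character substitution can be sketched in Python with str.maketrans. The function name recode_line and the sample input are made up for illustration. Like tr, this recodes every character independently, so it would also change digits in sample IDs and it does not count alleles per genotype.

```python
# Same mapping as the tr command above: 1 -> 0, 2 -> 1, 3 -> 2, 0 -> 5.
# All four substitutions are applied in a single pass, so a freshly written
# "0" is never re-mapped to "5".
TABLE = str.maketrans("1230", "0125")


def recode_line(line):
    """Apply the per-character mapping to one line of text."""
    return line.translate(TABLE)


print(recode_line("1 3 0 2"))  # 0 2 5 1
```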

written 2.2 years ago by Kevin Blighe

Kevin, sorry my question was not clear enough. Actually, this is my first time seeing such data.

The coding I have is that (for example first row):

11331133113 = AAGGGGAAGGAAGG. In other words, 11 = AA; 33 = GG and so on ...

So, I want to know how I should transform this data into the allelic numerical encoding (0 1 2 and 5 for missing)
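A small Python sketch of the pairing described here, assuming each row is one unbroken string of digits with no separators (the real file layout was not shown, so this is a guess). The names DIGIT_TO_BASE, split_genotypes and decode are hypothetical. A row with an odd number of digits, such as the 11-digit sample above, cannot be split into pairs and is rejected.

```python
# Coding taken from the question: 1 = A, 2 = C, 3 = G, 0 = missing.
DIGIT_TO_BASE = {"1": "A", "2": "C", "3": "G", "0": "0"}


def split_genotypes(row):
    """Cut a digit string into consecutive two-digit genotypes."""
    if len(row) % 2 != 0:
        raise ValueError("odd number of allele codes; genotypes cannot be paired")
    return [row[i:i + 2] for i in range(0, len(row), 2)]


def decode(row):
    """Translate each two-digit genotype into its two-letter form."""
    return ["".join(DIGIT_TO_BASE[d] for d in pair) for pair in split_genotypes(row)]


print(decode("113311331133"))  # ['AA', 'GG', 'AA', 'GG', 'AA', 'GG']
```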

written 2.2 years ago by admamajdi


It would still help to know the following:

  • what is the source of the data (from where did you obtain it)?

Your indication is that it's genotyping data (like data obtained from PLINK), where every two bases are paired, but the number of columns in your example is odd, so as shown it cannot be paired genotyping data. Genotyping data would be like this:

A A   T T   G T   C A   T T

A T   T A   G T   C C   A A

If the reference alleles at these positions were A, T, G, C, and A, respectively, then I would encode them as:

0      0     1     1     2
1      1     1     0     0

[counting non-reference bases]
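The small example above can be reproduced with a few lines of Python. This is only a sketch: the variable and function names are invented here, and the genotypes and reference alleles are the ones from the example, not real data.

```python
# Reference allele at each of the five positions in the example.
reference = ["A", "T", "G", "C", "A"]

# The two example samples, one two-letter genotype per position.
samples = [
    ["AA", "TT", "GT", "CA", "TT"],
    ["AT", "TA", "GT", "CC", "AA"],
]


def count_non_reference(genotype, ref):
    """Count how many of the two bases differ from the reference allele."""
    return sum(base != ref for base in genotype)


encoded = [
    [count_non_reference(g, r) for g, r in zip(row, reference)]
    for row in samples
]
for row in encoded:
    print(*row)
# 0 0 1 1 2
# 1 1 1 0 0
```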

written 2.2 years ago by Kevin Blighe


The columns are even in the data file. What I had included here is just a sample (as an example). Yes, the data is genotype data for dairy cattle. I think it is a PLINK format as you mentioned.


modified 2.2 years ago • written 2.2 years ago by admamajdi

Hey again Adam,

Thanks for providing the information - I think that we're getting somewhere.

So, it looks like your data was produced from PLINK using the --allele1234 parameter, which encodes as A=1, C=2, G=3, T=4, as you've mentioned.

If you want to convert this to the 012 format where the numbers relate to the number of minor alleles, then you just need to use the --recodeA parameter. See the original PLINK documentation hosted on Brigham & Women's web-domain, here: (search for '--recodeA'). Also take a look at --recodeAD

I'm going to assume that you're going to come back to say that you don't have access to PLINK or the original PLINK files, in which case you will have to calculate the minor allele manually for each SNP, and then convert it to 012 manually, too. I could do this for you quite easily, but I would need access to all of your data.
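For reference, the manual route could look roughly like the Python sketch below. It assumes one list of two-digit genotypes per animal, biallelic SNPs, and "0" for a missing allele; every function name and the toy data are hypothetical. Ties in allele frequency are broken alphabetically here, which is an arbitrary choice, and PLINK's own handling may differ.

```python
from collections import Counter

MISSING = "0"
MISSING_CODE = 5


def minor_allele(genotypes):
    """Return the less frequent allele at one SNP, or None if only one allele is seen."""
    counts = Counter(a for g in genotypes for a in g if a != MISSING)
    if len(counts) < 2:
        return None
    return min(counts, key=lambda a: (counts[a], a))  # ties broken alphabetically


def encode_snp(genotypes):
    """Encode one SNP column as minor-allele counts, with 5 for missing genotypes."""
    minor = minor_allele(genotypes)
    return [
        MISSING_CODE if MISSING in g else sum(a == minor for a in g)
        for g in genotypes
    ]


def encode_matrix(rows):
    """rows: one list of two-character genotypes per animal; returns 0/1/2/5 codes."""
    columns = [encode_snp(list(col)) for col in zip(*rows)]
    return [list(r) for r in zip(*columns)]


# Toy data: four animals, three SNPs, in the 1/2/3/0 coding from the question.
rows = [
    ["11", "33", "00"],
    ["13", "33", "22"],
    ["11", "31", "22"],
    ["11", "11", "23"],
]
for encoded_row in encode_matrix(rows):
    print(*encoded_row)
# 0 0 5
# 1 0 0
# 0 1 0
# 0 2 1
```

Note that the minor allele is estimated from whatever animals are in the file, so a small or unrepresentative sample can flip which allele is called minor.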

modified 2.2 years ago • written 2.2 years ago by Kevin Blighe

admamajdi: Please don't delete threads once they have received a comment or answer. If a particular comment has helped address your question, point it out so that we can promote it to an answer and the thread can receive closure.

modified 20 months ago • written 20 months ago by genomax

Hello admamajdi,

Did you delete this post? If you did, could you please give us a reason why you chose to delete it?

Thank you!

written 20 months ago by RamRS