I have a genomic data file with this format:
1=A, 2=C, 3=G, 0=missing (No "T's")
How should I transform this data into SNP (0 1 2 and 5=missing) data?
Hey Adam, you have not provided enough information for anyone to give a reliable answer.
What type of file is it? Is it binary or plain text? What exact format is it in? You should paste an example of your data.
For direct conversions of plain text characters in bash, you can pipe the file through the tr command. For example, the following converts 1/2/3/4 to A/T/G/C, respectively:
cat MyData | tr '1234' 'ATGC'
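As a small sketch of the same idea applied to the coding given in the question (1=A, 2=C, 3=G, with 0 left as missing), the following wraps tr in a helper. The function name decode_alleles and the sample line are made up for illustration. Note that tr translates every matching character on every line, so this is only safe if the file contains nothing but genotype codes (no numeric sample IDs, for instance).

```shell
# decode_alleles: read coded genotypes on stdin and translate 1/2/3 to A/C/G.
# The mapping follows the question (1=A, 2=C, 3=G); 0 (missing) is left as-is.
decode_alleles() {
    tr '123' 'ACG'
}

# Hypothetical one-line sample, not taken from the real data file.
printf '1 1 3 3 2 2 0 0\n' | decode_alleles
```

For the sample line this prints A A G G C C 0 0.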
It looks like you want to convert your data into allelic numerical encoding, but you have not stated this specifically. For example,
To do this, you need to know the minor allele (or whichever allele's effects you are researching).
Thanks for your reply. Here is the format of my file; it is plain text:
Yes, I want to convert the file into allelic numerical encoding.
Hey Adam, there is still some doubt about what exactly you wish to do...
What is the exact encoding that you wish to use?
If this is the correct encoding, then use cat MyData | tr  
Kevin, sorry my question was not clear enough. Actually, this is my first time seeing such data.
The coding I have is as follows (taking the first row as an example):
11331133113 = AAGGGGAAGGAAGG. In other words, 11 = AA; 33 = GG and so on ...
So, I want to know how I should transform this data into the allelic numerical encoding (0, 1, 2, and 5 for missing).
It would still help to know the following:
You indicate that it is genotyping data (like data obtained from PLINK), where every two bases are paired, but the number of columns is not even, and therefore it cannot be genotyping data. Genotyping data would look like this:
A A T T G T C A T T
A T T A G T C C A A
If the reference alleles at these positions were A, T, G, C, and A, respectively, then I would encode them as:
0 0 1 1 2
1 1 1 0 0
[counting non-reference bases]
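The counting above can be sketched with awk. This is only an illustration under assumptions: the file name genotypes.txt and the helper count_nonref are hypothetical, the file is taken to hold one sample per line with two space-separated allele columns per SNP, and the reference alleles are supplied by hand in SNP order.

```shell
# Hypothetical example file: the two samples shown above, one allele per column.
printf '%s\n' 'A A T T G T C A T T' 'A T T A G T C C A A' > genotypes.txt

# count_nonref REF_ALLELES FILE
# REF_ALLELES is a space-separated list with one reference allele per SNP.
# For every SNP (a pair of columns), print how many alleles differ from it.
count_nonref() {
    awk -v ref="$1" '
    BEGIN { nsnp = split(ref, r, " ") }
    {
        for (s = 1; s <= nsnp; s++) {
            a1 = $(2*s - 1); a2 = $(2*s)
            n = (a1 != r[s]) + (a2 != r[s])   # 0, 1 or 2 non-reference alleles
            printf "%s%s", (s > 1 ? " " : ""), n
        }
        print ""
    }' "$2"
}

count_nonref 'A T G C A' genotypes.txt
```

With the reference alleles A, T, G, C, and A, this reproduces the two encoded rows given above (0 0 1 1 2 and 1 1 1 0 0).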
The number of columns is even in the data file; what I included here was just a sample. Yes, the data is genotype data for dairy cattle. I think it is in PLINK format, as you mentioned.
Hey again Adam,
Thanks for providing the information - I think that we're getting somewhere.
So, it looks like your data was produced from PLINK using the --allele1234 parameter, which encodes as A=1, C=2, G=3, T=4, as you've mentioned.
If you want to convert this to the 012 format where the numbers relate to the number of minor alleles, then you just need to use the --recodeA parameter. See the original PLINK documentation hosted on the Brigham & Women's web domain, here: http://zzz.bwh.harvard.edu/plink/dataman.shtml (search for '--recodeA'). Also take a look at --recodeAD.
I'm going to assume that you're going to come back to say that you don't have access to PLINK or the original PLINK files, in which case you will have to calculate the minor allele manually for each SNP, and then convert it to 012 manually, too. I could do this for you quite easily, but I would need access to all of your data.
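For anyone who does end up doing it by hand, here is a rough awk sketch of the manual route, not a substitute for PLINK. It rests on several assumptions that are not confirmed by the thread: the file (here the made-up coded_genotypes.txt) holds one animal per line with two space-separated allele codes per SNP and no ID columns; alleles are coded 1 to 4 with 0 for missing; SNPs are biallelic; the minor allele is simply the least frequent allele in this file (ties go to the lower code); monomorphic SNPs are encoded as 0; and a genotype with either allele missing is encoded as 5. The helper name encode_minor_counts is hypothetical.

```shell
# Hypothetical input file: one animal per line, two allele codes per SNP,
# coded 1=A, 2=C, 3=G, 4=T, with 0 for a missing allele.
cat > coded_genotypes.txt <<'EOF'
1 1 3 3 2 4
1 3 1 3 2 2
1 1 0 0 2 2
3 3 3 3 2 2
EOF

# encode_minor_counts FILE
# Reads FILE twice: pass 1 tallies alleles per SNP, pass 2 prints, per SNP,
# the number of minor-allele copies (0, 1 or 2), or 5 for a missing genotype.
encode_minor_counts() {
    awk '
    # Return the least frequent observed allele at a SNP, or -1 when fewer
    # than two alleles were observed. Ties go to the lower allele code.
    function minor_allele(snp,    a, n, best, bestn, kinds) {
        best = -1; kinds = 0
        for (a = 1; a <= 4; a++) {
            n = tally[snp, a]
            if (n > 0) {
                kinds++
                if (best == -1 || n < bestn) { best = a; bestn = n }
            }
        }
        return (kinds >= 2) ? best : -1
    }
    NR == FNR {                       # pass 1: tally alleles of complete genotypes
        for (s = 1; s <= NF / 2; s++) {
            a1 = $(2*s - 1); a2 = $(2*s)
            if (a1 != 0 && a2 != 0) { tally[s, a1]++; tally[s, a2]++ }
        }
        next
    }
    {                                 # pass 2: encode each genotype
        for (s = 1; s <= NF / 2; s++) {
            if (!(s in minor)) minor[s] = minor_allele(s)
            a1 = $(2*s - 1); a2 = $(2*s)
            if (a1 == 0 || a2 == 0) code = 5
            else code = (a1 == minor[s]) + (a2 == minor[s])
            printf "%s%s", (s > 1 ? " " : ""), code
        }
        print ""
    }' "$1" "$1"
}

encode_minor_counts coded_genotypes.txt
```

For the four made-up animals this prints the rows 0 0 1, 1 1 0, 0 5 0 and 2 0 0. On real data, the minor allele should be checked against the whole population rather than a small sample, since allele frequencies from a handful of animals can easily flip which allele is "minor".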
admamajdi: Please don't delete threads once they have received a comment or answer. If a particular comment has helped address your question, point it out so we can promote it to an answer and the thread can receive closure.
Did you delete this post? If you did, could you please give us a reason why you chose to delete it?