I have a large PED file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.ped.gz) and a large MAP file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz) that I am trying to work with.
My slightly modified MAP file has more than 600,000 rows:
1 | rs3094315 | 0 | 742429 |
1 | rs7419119 | 0 | 831876 |
1 | rs13302957 | 0 | 880884 |
1 | rs6696609 | 0 | 893289 |
1 | rs8997 | 0 | 939517 |
1 | rs9442372 | 0 | 1008567 |
1 | rs147606383 | 0 | 1035194 |
1 | rs4970405 | 0 | 1038818 |
1 | rs11807848 | 0 | 1051029 |
1 | rs4970421 | 0 | 1098500 |
1 | rs1320571 | 0 | 1110294 |
1 | rs2887286 | 0 | 1145994 |
1 | rs79118541 | 0 | 1147410 |
1 | rs3813199 | 0 | 1148140 |
1 | rs113791678 | 0 | 1151643 |
1 | rs78424188 | 0 | 1160450 |
1 | rs12073590 | 0 | 1195018 |
1 | rs6685064 | 0 | 1201155 |
1 | rs61559999 | 0 | 1225655 |
1 | rs60785581 | 0 | 1225708 |
The slightly modified PED file has 942 rows. Each row is an individual, and each genotype column corresponds to the matching row of the MAP file:
HGDP00001 | HGDP00001 | 0 | 0 | 1 | 0 | AG | GT | AA | CC | GG | AG | GG | AA |
HGDP00003 | HGDP00003 | 0 | 0 | 1 | 0 | AA | GT | AA | TT | GG | GG | GG | AA |
HGDP00005 | HGDP00005 | 0 | 0 | 1 | 0 | AA | TT | AA | CC | GG | GG | GG | AA |
HGDP00007 | HGDP00007 | 0 | 0 | 1 | 0 | AA | TT | AA | CC | GG | GG | GG | AA |
HGDP00011 | HGDP00011 | 0 | 0 | 1 | 0 | AG | GT | AA | CT | GG | AG | GG | AA |
HGDP00013 | HGDP00013 | 0 | 0 | 1 | 0 | AG | TT | AA | CC | AG | AG | GG | AA |
HGDP00015 | HGDP00015 | 0 | 0 | 1 | 0 | AG | GT | AA | CT | AG | GG | GG | AA |
HGDP00017 | HGDP00017 | 0 | 0 | 1 | 0 | AG | GT | AA | CC | GG | GG | GG | AA |
HGDP00019 | HGDP00019 | 0 | 0 | 1 | 0 | AA | TT | AG | CT | GG | GG | GG | AA |
HGDP00021 | HGDP00021 | 0 | 0 | 1 | 0 | AA | GT | AA | TT | GG | AA | GG | AA |
HGDP00023 | HGDP00023 | 0 | 0 | 1 | 0 | AA | GT | AA | TT | GG | AG | GG | AA |
HGDP00025 | HGDP00025 | 0 | 0 | 1 | 0 | AA | GT | AA | CT | GG | AG | GG | AA |
HGDP00027 | HGDP00027 | 0 | 0 | 1 | 0 | AG | GT | AA | CT | AG | AG | GG | AA |
HGDP00029 | HGDP00029 | 0 | 0 | 1 | 0 | AA | TT | AA | CC | GG | GG | GG | AA |
HGDP00031 | HGDP00031 | 0 | 0 | 1 | 0 | AG | GT | AG | CT | GG | GG | GG | AA |
HGDP00033 | HGDP00033 | 0 | 0 | 1 | 0 | AA | TT | AG | CC | GG | AA | GG | AG |
HGDP00035 | HGDP00035 | 0 | 0 | 1 | 0 | AG | TT | AG | CT | GG | AG | GG | AA |
HGDP00037 | HGDP00037 | 0 | 0 | 1 | 0 | AG | GT | AA | TT | AG | GG | GG | AG |
I am trying to get them into a single table formatted like some Affy array data I have (which has SNP IDs as rows and individuals as columns).
I was wondering if anyone could help figure out a Python or Bash scripting solution to transpose the PED file so that the 1st row of the PED file becomes the 5th column of the MAP file, the 2nd row of the PED file becomes the 6th column of the MAP file, and so on.
Basically, I want it to look like this (I presume that subsequently removing the 0/1 rows and the location columns will be fairly simple?):
| | | | HGDP00001 | HGDP00003 | HGDP00005 | HGDP00007 | HGDP00011 | HGDP00013 | HGDP00015 | HGDP00017 | HGDP00019 | HGDP00021 | HGDP00023 | HGDP00025 | HGDP00027 | HGDP00029 | HGDP00031 | HGDP00033 | HGDP00035 | HGDP00037 | HGDP00039 |
| | | | HGDP00001 | HGDP00003 | HGDP00005 | HGDP00007 | HGDP00011 | HGDP00013 | HGDP00015 | HGDP00017 | HGDP00019 | HGDP00021 | HGDP00023 | HGDP00025 | HGDP00027 | HGDP00029 | HGDP00031 | HGDP00033 | HGDP00035 | HGDP00037 | HGDP00039 |
| | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | | | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | rs3094315 | 0 | 742429 | AG | AA | AA | AA | AG | AG | AG | AG | AA | AA | AA | AA | AG | AA | AG | AA | AG | AG | AA |
1 | rs7419119 | 0 | 831876 | GT | GT | TT | TT | GT | TT | GT | GT | TT | GT | GT | GT | GT | TT | GT | TT | TT | GT | TT |
1 | rs13302957 | 0 | 880884 | AA | AA | AA | AA | AA | AA | AA | AA | AG | AA | AA | AA | AA | AA | AG | AG | AG | AA | AA |
1 | rs6696609 | 0 | 893289 | CC | TT | CC | CC | CT | CC | CT | CC | CT | TT | TT | CT | CT | CC | CT | CC | CT | TT | CT |
1 | rs8997 | 0 | 939517 | GG | GG | GG | GG | GG | AG | AG | GG | GG | GG | GG | GG | AG | GG | GG | GG | GG | AG | GG |
1 | rs9442372 | 0 | 1008567 | AG | GG | GG | GG | AG | AG | GG | GG | GG | AA | AG | AG | AG | GG | GG | AA | AG | GG | GG |
1 | rs147606383 | 0 | 1035194 | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG | GG |
1 | rs4970405 | 0 | 1038818 | AA | AA | AA | AA | AA | AA | AA | AA | AA | AA | AA | AA | AA | AA | AA | AG | AA | AG | AA |
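For illustration, the layout above could be produced with something like the following Python sketch. It is only a sketch under some assumptions: both files are whitespace-delimited, the PED file has six leading metadata columns, and each genotype is a single token such as AG (as in the modified file shown here, not the standard two-allele PED layout). The function name `transpose_ped` and the parameter `n_meta` are made up for this example. It also holds everything in memory, which may need a lot of RAM for 942 individuals by more than 600,000 SNPs, so processing the genotype columns in chunks may be necessary at full size.

```python
def transpose_ped(map_lines, ped_lines, n_meta=6):
    """Combine MAP rows with transposed PED rows.

    Returns a list of rows (lists of strings): first one row per PED
    metadata column (left-padded with empty cells), then one row per SNP.
    Assumes whitespace-delimited input and one genotype token per SNP.
    """
    map_rows = [line.split() for line in map_lines if line.strip()]
    ped_rows = [line.split() for line in ped_lines if line.strip()]
    n_map_cols = len(map_rows[0]) if map_rows else 0

    # Every individual must have exactly one genotype per MAP row.
    for row in ped_rows:
        if len(row) != n_meta + len(map_rows):
            raise ValueError(
                f"{row[0]}: expected {n_meta + len(map_rows)} fields, got {len(row)}"
            )

    out = []
    # Header block: PED metadata column i becomes output row i.
    for i in range(n_meta):
        out.append([""] * n_map_cols + [row[i] for row in ped_rows])

    # zip(*rows) transposes: the k-th tuple holds SNP k for all individuals.
    genotypes_by_snp = zip(*(row[n_meta:] for row in ped_rows))
    for map_row, snp_genotypes in zip(map_rows, genotypes_by_snp):
        out.append(map_row + list(snp_genotypes))
    return out


# Tiny demo using the first two SNPs and first two individuals shown above.
demo_map = ["1 rs3094315 0 742429", "1 rs7419119 0 831876"]
demo_ped = [
    "HGDP00001 HGDP00001 0 0 1 0 AG GT",
    "HGDP00003 HGDP00003 0 0 1 0 AA GT",
]
for out_row in transpose_ped(demo_map, demo_ped):
    # Tab-separated output keeps the empty leading cells in the header rows.
    print("\t".join(out_row))
```

With real files, the two arguments could simply be open file handles, since iterating over a file yields its lines. Dropping the 0/1 rows and the location columns afterwards is then a matter of skipping the corresponding output rows and list positions. If the files are still in standard PLINK format, PLINK itself can also write a transposed fileset (tped/tfam), which may be easier than a custom script.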
R would make this super easy, no? Does the solution *have* to be in bash/python?
It doesn't have to be; it's just that I have more experience with Bash/Python. I do have some background in R.