Need help transposing MAP file and PED file into one large table
2
1
Entering edit mode
8.9 years ago
devenvyas ▴ 740

I have a large PED file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.ped.gz) and a large MAP file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz) that I am trying to work with.

The slightly modified map file I have has >600,000 rows

1   rs3094315     0   742429
1   rs7419119     0   831876
1   rs13302957    0   880884
1   rs6696609     0   893289
1   rs8997        0   939517
1   rs9442372     0   1008567
1   rs147606383   0   1035194
1   rs4970405     0   1038818
1   rs11807848    0   1051029
1   rs4970421     0   1098500
1   rs1320571     0   1110294
1   rs2887286     0   1145994
1   rs79118541    0   1147410
1   rs3813199     0   1148140
1   rs113791678   0   1151643
1   rs78424188    0   1160450
1   rs12073590    0   1195018
1   rs6685064     0   1201155
1   rs61559999    0   1225655
1   rs60785581    0   1225708

The slightly modified ped has 942 rows, where each row is an individual and each column is a genotype correspond to the associated row of map file

HGDP00001   HGDP00001   0   0   1   0   AG   GT   AA   CC   GG   AG   GG   AA
HGDP00003   HGDP00003   0   0   1   0   AA   GT   AA   TT   GG   GG   GG   AA
HGDP00005   HGDP00005   0   0   1   0   AA   TT   AA   CC   GG   GG   GG   AA
HGDP00007   HGDP00007   0   0   1   0   AA   TT   AA   CC   GG   GG   GG   AA
HGDP00011   HGDP00011   0   0   1   0   AG   GT   AA   CT   GG   AG   GG   AA
HGDP00013   HGDP00013   0   0   1   0   AG   TT   AA   CC   AG   AG   GG   AA
HGDP00015   HGDP00015   0   0   1   0   AG   GT   AA   CT   AG   GG   GG   AA
HGDP00017   HGDP00017   0   0   1   0   AG   GT   AA   CC   GG   GG   GG   AA
HGDP00019   HGDP00019   0   0   1   0   AA   TT   AG   CT   GG   GG   GG   AA
HGDP00021   HGDP00021   0   0   1   0   AA   GT   AA   TT   GG   AA   GG   AA
HGDP00023   HGDP00023   0   0   1   0   AA   GT   AA   TT   GG   AG   GG   AA
HGDP00025   HGDP00025   0   0   1   0   AA   GT   AA   CT   GG   AG   GG   AA
HGDP00027   HGDP00027   0   0   1   0   AG   GT   AA   CT   AG   AG   GG   AA
HGDP00029   HGDP00029   0   0   1   0   AA   TT   AA   CC   GG   GG   GG   AA
HGDP00031   HGDP00031   0   0   1   0   AG   GT   AG   CT   GG   GG   GG   AA
HGDP00033   HGDP00033   0   0   1   0   AA   TT   AG   CC   GG   AA   GG   AG
HGDP00035   HGDP00035   0   0   1   0   AG   TT   AG   CT   GG   AG   GG   AA
HGDP00037   HGDP00037   0   0   1   0   AG   GT   AA   TT   AG   GG   GG   AG

I trying to get them into a single table with a similar formatting to some Affy array data I have (which has rows as SNP ids and columns as individuals).

I was wondering if anyone could help figure out a Python or Bash scripting solution to transpose the ped file such that the 1st row of the ped file becomes the 5th column of the map file, and the 2nd row of the ped becomes the 6th of the map file, and so on...

Basically, I want it to look like this (I presume subsequently taking out the 0/1 rows and location columns will be fairly simple?)

                  HGDP00001   HGDP00003   HGDP00005   HGDP00007   HGDP00011   HGDP00013   HGDP00015   HGDP00017   HGDP00019   HGDP00021   HGDP00023   HGDP00025   HGDP00027   HGDP00029   HGDP00031   HGDP00033   HGDP00035   HGDP00037   HGDP00039
                  HGDP00001   HGDP00003   HGDP00005   HGDP00007   HGDP00011   HGDP00013   HGDP00015   HGDP00017   HGDP00019   HGDP00021   HGDP00023   HGDP00025   HGDP00027   HGDP00029   HGDP00031   HGDP00033   HGDP00035   HGDP00037   HGDP00039
                  0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
                  0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
                  1           1           1           1           1           1           1           1           1           1           1           1           1           1           1           1           1           1           1
                  0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
1   rs3094315     0           742429      AG          AA          AA          AA          AG          AG          AG          AG          AA          AA          AA          AA          AG          AA          AG          AA          AG         AG   AA
1   rs7419119     0           831876      GT          GT          TT          TT          GT          TT          GT          GT          TT          GT          GT          GT          GT          TT          GT          TT          TT         GT   TT
1   rs13302957    0           880884      AA          AA          AA          AA          AA          AA          AA          AA          AG          AA          AA          AA          AA          AA          AG          AG          AG         AA   AA
1   rs6696609     0           893289      CC          TT          CC          CC          CT          CC          CT          CC          CT          TT          TT          CT          CT          CC          CT          CC          CT         TT   CT
1   rs8997        0           939517      GG          GG          GG          GG          GG          AG          AG          GG          GG          GG          GG          GG          AG          GG          GG          GG          GG         AG   GG
1   rs9442372     0           1008567     AG          GG          GG          GG          AG          AG          GG          GG          GG          AA          AG          AG          AG          GG          GG          AA          AG         GG   GG
1   rs147606383   0           1035194     GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG         GG   GG
1   rs4970405     0           1038818     AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AG          AA         AG   AA
SNP Python Bash • 6.0k views
ADD COMMENT
0
Entering edit mode

R would make this super easy, no? Does the solution *have* to be in bash/python?

ADD REPLY
0
Entering edit mode

Doesn't have to be, it is just that I have more experience with bash/python. I do have some background in R

ADD REPLY
2
Entering edit mode
8.9 years ago
Ram 43k

Here's my (quick and dirty) solution in R:

##read files
map_file<-read.csv('./mapfile.csv')
ped_file<-read.csv('./pedfile.csv')

##transform and merge data

#create matrix of blanks
matrix_of_blanks<-matrix(data="",nrow=6,ncol=4)

#bind matrix of blanks to map file
tr_map<-rbind(matrix_of_blanks,as.matrix(map_file))

#combine transformed map file and transposed ped file
cbind(tr_map,t(ped_file))
ADD COMMENT
0
Entering edit mode

I ran this R code

setwd("/scratch/lfs/vyas/HGDPcomp")

##read files
map_file<-read.csv('all_snp_out.map')
ped_file<-read.csv('all_snp_nospace.ped')

##transform and merge data

#create matrix of blanks
matrix_of_blanks<-matrix(data="",nrow=6,ncol=4)

#bind matrix of blanks to map file
tr_map<-rbind(matrix_of_blanks,as.matrix(map_file))

#combine transformed map file and transposed ped file
cbind(tr_map,t(ped_file)) -> transposed
write.csv(transposed, file='table.csv')

and then I got this result back from the cluster:

Error in rbind(matrix_of_blanks, as.matrix(map_file)) :
  number of columns of matrices must match (see arg 2)
Execution halted

I think I will try to transpose the ped file first, and then manually transform the map file in excel

ADD REPLY
0
Entering edit mode
ped_file<-read.csv('all_snp_nospace.ped', sep = '\t')
t(ped_file) -> transposed
write.csv(transposed, sep = '\t', quote=FALSE, file='table.csv')

The R method is not working. It is writing with comma-delimiters and the first row is a whole bunch of Vs and the first column appears to merging some columns

This is a bit of my output... (copied out of Excel)

              V1          V2          V3          V4          V5          V6          V7          V8          V9          V10         V11         V12         V13         V14         V15
HGDP00001     HGDP00003   HGDP00005   HGDP00007   HGDP00011   HGDP00013   HGDP00015   HGDP00017   HGDP00019   HGDP00021   HGDP00023   HGDP00025   HGDP00027   HGDP00029   HGDP00031   HGDP00033
HGDP00001.1   HGDP00003   HGDP00005   HGDP00007   HGDP00011   HGDP00013   HGDP00015   HGDP00017   HGDP00019   HGDP00021   HGDP00023   HGDP00025   HGDP00027   HGDP00029   HGDP00031   HGDP00033
X0            0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
X0.1          0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
X1            1           1           1           1           1           1           1           1           1           1           1           1           1           1           1
X0.2          0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
AG            AA          AA          AA          AG          AG          AG          AG          AA          AA          AA          AA          AG          AA          AG          AA
GT            GT          TT          TT          GT          TT          GT          GT          TT          GT          GT          GT          GT          TT          GT          TT
AA            AA          AA          AA          AA          AA          AA          AA          AG          AA          AA          AA          AA          AA          AG          AG
CC            TT          CC          CC          CT          CC          CT          CC          CT          TT          TT          CT          CT          CC          CT          CC
GG            GG          GG          GG          GG          AG          AG          GG          GG          GG          GG          GG          AG          GG          GG          GG
AG.1          GG          GG          GG          AG          AG          GG          GG          GG          AA          AG          AG          AG          GG          GG          AA
GG.1          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG
AA.1          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AG
CT            TT          TT          CT          CT          TT          TT          CT          TT          CC          CT          CT          CT          TT          TT          CC
GG.2          GG          GG          GG          GG          AG          AG          GG          AG          GG          GG          GG          GG          GG          AG          GG
AG.2          GG          GG          AG          GG          AG          GG          GG          AG          GG          GG          GG          GG          GG          AG          GG
TT            TT          TT          CT          CT          CT          CC          CT          CT          TT          CT          TT          TT          CT          CC          CC
TT.1          TT          TT          TT          TT          TT          TT          TT          TT          TT          TT          TT          TT          TT          TT          TT
GG.3          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG
CC.1          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC
CC.2          CC          CC          CC          CC          0           CC          CC          CC          CC          CC          CC          CC          CC          CC          CC
AA.2          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AA          AC          AA          AA          AA
CT.1          CC          CC          CC          CT          CT          CC          CC          CC          CC          CC          CC          CT          CT          CT          CC
CC.3          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC          CC
CC.4          CC          CC          0           CC          CC          CC          CC          CC          CC          CC          CC          0           CC          CC          CC
GG.4          GG          GG          GG          GG          GG          AA          GG          GG          GG          GG          GG          GG          GG          GG          GG
GG.5          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG          GG
TT.2          TT          TT          TT          TT          TT          CC          TT          TT          TT          TT          TT          CT          TT          TT          CT
TT.3          TT          TT          TT          TT          TT          CC          TT          TT          TT          TT          TT          TT          TT          TT          CT
GG.6          GG          GG          GG          GG          GG          AA          GG          GG          GG          GG          GG          GG          GG          GG          GG
AG.3          AG          AA          AG          AG          AG          GG          AG          AG          AA          AA          AA          AG          AG          AG          AG

I will try the Plink method later today.

ADD REPLY
0
Entering edit mode

write.csv will write csv. Not sure why you're writing csv here in the first place.

ADD REPLY
0
Entering edit mode

Please note that I converted map and ped files to CSV before reading them in. If you're reading them directly, you're better off using read.table(file,sep="\t",header=FALSE,stringsAsFactors=FALSE)

This might solve the unequal number of rows problem. Also, please run step by step so you know which rbind is erroring out.

ADD REPLY
0
Entering edit mode
ped_file<-read.table('all_snp_nospace.ped', sep="\t", header=FALSE, stringsAsFactors=FALSE)
t(ped_file) -> awesome
write.table(awesome, sep = '\t', quote=FALSE, file='table.csv')

Yielded this:

V1    V2    V3    V4    V5
V1    HGDP00001    HGDP00003    HGDP00005    HGDP00007
V2    HGDP00001    HGDP00003    HGDP00005    HGDP00007
V3    0    0    0    0
V4    0    0    0    0
V5    1    1    1    1
V6    0    0    0    0
V7    AG    AA    AA    AA
V8    GT    GT    TT    TT
V9    AA    AA    AA    AA
V10    CC    TT    CC    CC
V11    GG    GG    GG    GG
V12    AG    GG    GG    GG
V13    GG    GG    GG    GG
V14    AA    AA    AA    AA
V15    CT    TT    TT    CT
V16    GG    GG    GG    GG
V17    AG    GG    GG    AG
V18    TT    TT    TT    CT
V19    TT    TT    TT    TT
V20    GG    GG    GG    GG
V21    CC    CC    CC    CC
V22    CC    CC    CC    CC
V23    AA    AA    AA    AA
V24    CT    CC    CC    CC
V25    CC    CC    CC    CC
V26    CC    CC    CC    00
V27    GG    GG    GG    GG
V28    GG    GG    GG    GG
V29    TT    TT    TT    TT
V30    TT    TT    TT    TT
V31    GG    GG    GG    GG
V32    AG    AG    AA    AG
V33    AA    AA    AA    AA
V34    TT    TT    TT    TT
V35    CC    CC    CC    CC
V36    AA    AA    GG    AG
V37    AG    AG    GG    GG
V38    CT    CT    TT    CT
V39    AG    AG    GG    GG
V40    CC    CC    CT    CT
V41    GG    GG    AG    AG
V42    AA    AA    AC    AA
V43    GG    GG    GG    GG
V44    CC    CC    AA    AC
V45    CT    TT    CC    CT
V46    CT    CC    CC    CC
V47    CT    CC    TT    CT
V48    AG    GG    GG    GG
V49    CC    CC    CC    00

Which is actually close to what I need. Given the massiveness of the file, a trivial thing such as removing the V column and the V row in Excel will not be doable. Any suggestion on how to do so in R? I can manually add the empty spaces to the map file, so once the Vs are gone, I will be able to join them and have the table I need.

ADD REPLY
0
Entering edit mode

The Vs are temporary headers and row numbers, they should not be bound to the data itself. If you're using RStudio (you should, if you're not), use View(awesome) to check out the actual data. You should see headers and a column called row.names (or some such) with the Vxx values.

ADD REPLY
0
Entering edit mode

The Vs showed up in Terminal and Excel (where it barely opens), so they are not temporary.

I redid it and added a row.names and col.names equals false in the write.table command, and the Vs were gone.

I will attempt to bind the tables later today.

I would use RStudio, but I am running my jobs on UF's High Performance Computing cluster, so everything is command-line. The files are sufficiently large that my i7 and my 8 gigs of DDR3 can't handle them.

ADD REPLY
0
Entering edit mode

I see. That makes sense. he bind should work fine now.

ADD REPLY
1
Entering edit mode
8.9 years ago
Uma A ▴ 230

.tped file is what you want. Just use plink to transpose your .ped and .map files into .tped and .tmap files. .tped file contains the data in the format that you want. Use the following command to get the tped:

plink --file filename --recode --transpose --out transposedfilename

To open the transposed file you just created,

plink --tfile transposedfilename

In the end, you will have to append the first row containing sample names by using any desired method.

The following links contain the help available on the .tped and .tfam formats:

ADD COMMENT
0
Entering edit mode

I don't need to transpose my map file though, just the ped file, so that I can combine them into one big table.

ADD REPLY
0
Entering edit mode
The .tped file is a combination of ped and map files and your original files do not get affected in any way.
ADD REPLY
0
Entering edit mode

For the --file option, wouldn't I need to do that twice since there are two files?

ADD REPLY
0
Entering edit mode
plink --file file --recode --transpose --out transposed.tped

I renamed my files to file.map and file.ped. This did not work like it was supposed to

Look at this output.

cut -f 1 transposed.tped.tped | head -1

1 rs3094315 0 742429 G A A A A A A A G A G A G A G A A A A A A A A A G A A A G A A A G A G A A A A A G A A A G G G A A A G A G A A A A A A A A A G A A A A A A A A A G A A A G A G A A A A A A A A A A A G A A A G A A A A A G G A A A A A A G A G A A A G A G A G A A A A A A A A A A A A A A A G A A A A A A A A A G A G G G A A A A A G A G A A A A A G A G A A A G A A A A A G A 0 0 G A A A A A A A G A G A G A G A G A G A A A G A G G A A G G A A G A G A A A G A G A G A G A A A G G A A A A G A A A A A A A A A A A G A G A A A A A G A G A G A A A A A G A G A A A G A A A A A A A A A G G A A G A A A G A G G A A 0 0 G A G G G A A A A A G A A A G A G A A A A A G A A A A A A A G A A A A A A A G A A A G A A A A A A A G A A A A A A A G A A A A A A A A A G A G A A A A A G A G A G A G A A A G A G A G A G A A A G G G A G A G G G A G G G G G G G G G A G G G A G A G G G G G G G A G A G A G A G A A A A A A A A A A A G A G A G A G A G A A A A A A A A A G A A A G A A A A A A A A A G A 0 0 A A A A A A A A G A A A A A A A A A A A A A A A G A A A G A A A A A A A A A A A A A A A A A G A G A A A A A A A A A G A 0 0 G A A A G A A A A A A A A A G A A A A A A A G G A A G A A A A A A A G A G A A A G A A A G A G A G A A A A A A A A A G A G A G A A A A A G A G A A A G A A A A A A A A A G G G G A A G A G A A A 0 0 G A G A A A A A G A G A A A A A A A G G G A G A A A A A A A 0 0 G A G A G A G A A A G A A A G A G A A A A A A A A A A A G A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A A A A A G A G A G A A A A A G A G A 0 0 A A G A A A G A A A A A G A A A A A G A A A A A G A G A G A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A G A 0 0 A A G A G A A A G A A A A A A A A A G A G A A A G A A A A A A A A A G G A A A A A A A A G A A A A A A A A A G A G A A A A A A A A A G A G A A A A A A A A A A A A A A A A A A A G A A A A A A A A A G A A A A A A A A A A A A A A A A A 0 0 G A A A A A A A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A A A G A G A A A A A A A A A A A A A A A A A G A A A A A A A G A G G G A A A A A A A G A A A A A G A A A A A G A A A G G A A A A A A G A A A G A G A A A A A A A A A A A A A A A A A G A A A A A G A A A G A G A G A G A G A G A A A G G G G G G G G G G G G G G G G G A A A G A G A G A G G G G G A G A G G A A G G G G G G G A G A G G G A G A G G G A G A G G A A A A A A G A G A A A A A G A A A A A A A G A G A A A A A A A A A A A A A A A G A A A A A A A A A A A A A A A A A A A A A A A G A G G G A A A G A G A A A G A A A G A A A G A A A A A A A G A A A A A A A G A A A A A A A A A A A A A A A A A A A A A G A A A A A 0 0 G A G A G A G A G A G A G A G A A A G G G A G G G G G G G G G A A A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A A A A A G A A A G A G A A A A A A A A A A A A A A A A A A A G A A A A A A A A A A A A A A A G A G A A A A A A A A A A A A A G A A A A A A A G G A A G G A A A A A A A A A A G G A A A A G A A A A A G A G A A A A A A A A A A A A A G A G G G G G G A A A A A A A A G A G A A A A A A A A A G A G A A A A A A A A A A A A A A A A A A A A A A A A A A A G A A A A A A A A A G G A A G A G A A A A A A A G A A A A A G A A A A A A A A A G A A A G A A A G A A A A A A A G A G A G A G A A A A A A A G G G A G G G A A A A A G A G A G G A A G A A A G A G G G A G G G A A A A A A A A A A A A A A A G A A A G A A A G A A A G A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A G A A A G A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A G A A A G A A A G A G A A A A A A A A A G A A A G A A A A A A A G A A A A A G A G A G A G A A A A A G A G A G A A A A A A A G A G A A A A A A A G A A A A A A A A A A A G A G A A A A A A A A A A A A A A A A A G A A A G A G G G A G A G G G G A A G G G G G A G A G G 0 0 0 0 G G 0 0 0 0 0 0 0 0
ADD REPLY
0
Entering edit mode

Firstly, to get your genotypes together, get your original ped/map set into a tab delimited format by using the following command:

plink --file original_filename --recode --tab --output tabSeparated

For the --file flag, you don't need to do that twice, because your .ped and .map files are a pair, and thus they should be of the same name like data.ped and data.map. Thus, your file name (for the --file flag) then becomes only "data"

Also, you don't need to give any extension to the output file. The command automatically creates a .tped and .tfam file pair for the name you have provided. Thus your output file option should only be "transposed". Thus, final command should be:

plink --file tabSeparated --recode --transpose --out transposed

When you load this file in Excel, you can get the individual genotypes in one column only.

Finally, you can merge the first 2 columns of your output file to get the first column of out file containing chromosome number and dbSNP ID together. For that matter, you can do any type of minor formatting in any text editor or Excel

ADD REPLY

Login before adding your answer.

Traffic: 2846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6