Question: Need help transposing MAP file and PED file into one large table
1
gravatar for devenvyas
3.9 years ago by
devenvyas570
Stony Brook
devenvyas570 wrote:

I have a large PED file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.ped.gz) and a large MAP file (ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz) that I am trying to work with.

The slightly modified map file I have has >600,000 rows

1 rs3094315 0 742429
1 rs7419119 0 831876
1 rs13302957 0 880884
1 rs6696609 0 893289
1 rs8997 0 939517
1 rs9442372 0 1008567
1 rs147606383 0 1035194
1 rs4970405 0 1038818
1 rs11807848 0 1051029
1 rs4970421 0 1098500
1 rs1320571 0 1110294
1 rs2887286 0 1145994
1 rs79118541 0 1147410
1 rs3813199 0 1148140
1 rs113791678 0 1151643
1 rs78424188 0 1160450
1 rs12073590 0 1195018
1 rs6685064 0 1201155
1 rs61559999 0 1225655
1 rs60785581 0 1225708

The slightly modified ped has 942 rows, where each row is an individual and each column is a genotype correspond to the associated row of map file

HGDP00001 HGDP00001 0 0 1 0 AG GT AA CC GG AG GG AA
HGDP00003 HGDP00003 0 0 1 0 AA GT AA TT GG GG GG AA
HGDP00005 HGDP00005 0 0 1 0 AA TT AA CC GG GG GG AA
HGDP00007 HGDP00007 0 0 1 0 AA TT AA CC GG GG GG AA
HGDP00011 HGDP00011 0 0 1 0 AG GT AA CT GG AG GG AA
HGDP00013 HGDP00013 0 0 1 0 AG TT AA CC AG AG GG AA
HGDP00015 HGDP00015 0 0 1 0 AG GT AA CT AG GG GG AA
HGDP00017 HGDP00017 0 0 1 0 AG GT AA CC GG GG GG AA
HGDP00019 HGDP00019 0 0 1 0 AA TT AG CT GG GG GG AA
HGDP00021 HGDP00021 0 0 1 0 AA GT AA TT GG AA GG AA
HGDP00023 HGDP00023 0 0 1 0 AA GT AA TT GG AG GG AA
HGDP00025 HGDP00025 0 0 1 0 AA GT AA CT GG AG GG AA
HGDP00027 HGDP00027 0 0 1 0 AG GT AA CT AG AG GG AA
HGDP00029 HGDP00029 0 0 1 0 AA TT AA CC GG GG GG AA
HGDP00031 HGDP00031 0 0 1 0 AG GT AG CT GG GG GG AA
HGDP00033 HGDP00033 0 0 1 0 AA TT AG CC GG AA GG AG
HGDP00035 HGDP00035 0 0 1 0 AG TT AG CT GG AG GG AA
HGDP00037 HGDP00037 0 0 1 0 AG GT AA TT AG GG GG AG

 

I trying to get them into a single table with a similar formatting to some Affy array data I have (which has rows as SNP ids and columns as individuals).

I was wondering if anyone could help figure out a Python or Bash scripting solution to transpose the ped file such that the 1st row of the ped file becomes the 5th column of the map file, and the 2nd row of the ped becomes the 6th of the map file, and so on...

Basically, I want it to look like this (I presume subsequently taking out the 0/1 rows and location columns will be fairly simple?)

        HGDP00001 HGDP00003 HGDP00005 HGDP00007 HGDP00011 HGDP00013 HGDP00015 HGDP00017 HGDP00019 HGDP00021 HGDP00023 HGDP00025 HGDP00027 HGDP00029 HGDP00031 HGDP00033 HGDP00035 HGDP00037 HGDP00039
        HGDP00001 HGDP00003 HGDP00005 HGDP00007 HGDP00011 HGDP00013 HGDP00015 HGDP00017 HGDP00019 HGDP00021 HGDP00023 HGDP00025 HGDP00027 HGDP00029 HGDP00031 HGDP00033 HGDP00035 HGDP00037 HGDP00039
        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
        1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
        0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 rs3094315 0 742429 AG AA AA AA AG AG AG AG AA AA AA AA AG AA AG AA AG AG AA
1 rs7419119 0 831876 GT GT TT TT GT TT GT GT TT GT GT GT GT TT GT TT TT GT TT
1 rs13302957 0 880884 AA AA AA AA AA AA AA AA AG AA AA AA AA AA AG AG AG AA AA
1 rs6696609 0 893289 CC TT CC CC CT CC CT CC CT TT TT CT CT CC CT CC CT TT CT
1 rs8997 0 939517 GG GG GG GG GG AG AG GG GG GG GG GG AG GG GG GG GG AG GG
1 rs9442372 0 1008567 AG GG GG GG AG AG GG GG GG AA AG AG AG GG GG AA AG GG GG
1 rs147606383 0 1035194 GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG
1 rs4970405 0 1038818 AA AA AA AA AA AA AA AA AA AA AA AA AA AA AA AG AA AG AA
bash snp python table • 2.1k views
ADD COMMENTlink modified 3.9 years ago by Uma A220 • written 3.9 years ago by devenvyas570

R would make this super easy, no? Does the solution *have* to be in bash/python?

ADD REPLYlink written 3.9 years ago by RamRS21k

Doesn't have to be, it is just that I have more experience with bash/python. I do have some background in R

ADD REPLYlink written 3.9 years ago by devenvyas570
2
gravatar for RamRS
3.9 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

Here's my (quick and dirty) solution in R:

##read files
map_file<-read.csv('./mapfile.csv')
ped_file<-read.csv('./pedfile.csv')

##transform and merge data

#create matrix of blanks
matrix_of_blanks<-matrix(data="",nrow=6,ncol=4)

#bind matrix of blanks to map file
tr_map<-rbind(matrix_of_blanks,as.matrix(map_file))

#combine transformed map file and transposed ped file
cbind(tr_map,t(ped_file))
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by RamRS21k

I ran this R code

setwd("/scratch/lfs/vyas/HGDPcomp")

##read files
map_file<-read.csv('all_snp_out.map')
ped_file<-read.csv('all_snp_nospace.ped')

##transform and merge data

#create matrix of blanks
matrix_of_blanks<-matrix(data="",nrow=6,ncol=4)

#bind matrix of blanks to map file
tr_map<-rbind(matrix_of_blanks,as.matrix(map_file))

#combine transformed map file and transposed ped file
cbind(tr_map,t(ped_file)) -> transposed
write.csv(transposed, file='table.csv')

and then I got this result back from the cluster:

Error in rbind(matrix_of_blanks, as.matrix(map_file)) :
  number of columns of matrices must match (see arg 2)
Execution halted

I think I will try to transpose the ped file first, and then manually transform the map file in excel

ADD REPLYlink written 3.9 years ago by devenvyas570
ped_file<-read.csv('all_snp_nospace.ped', sep = '\t')
t(ped_file) -> transposed
write.csv(transposed, sep = '\t', quote=FALSE, file='table.csv')

The R method is not working. It is writing with comma-delimiters and the first row is a whole bunch of Vs and the first column appears to merging some columns

This is a bit of my output... (copied out of Excel)

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15
HGDP00001 HGDP00003 HGDP00005 HGDP00007 HGDP00011 HGDP00013 HGDP00015 HGDP00017 HGDP00019 HGDP00021 HGDP00023 HGDP00025 HGDP00027 HGDP00029 HGDP00031 HGDP00033
HGDP00001.1 HGDP00003 HGDP00005 HGDP00007 HGDP00011 HGDP00013 HGDP00015 HGDP00017 HGDP00019 HGDP00021 HGDP00023 HGDP00025 HGDP00027 HGDP00029 HGDP00031 HGDP00033
X0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
X0.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
X1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
X0.2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
AG AA AA AA AG AG AG AG AA AA AA AA AG AA AG AA
GT GT TT TT GT TT GT GT TT GT GT GT GT TT GT TT
AA AA AA AA AA AA AA AA AG AA AA AA AA AA AG AG
CC TT CC CC CT CC CT CC CT TT TT CT CT CC CT CC
GG GG GG GG GG AG AG GG GG GG GG GG AG GG GG GG
AG.1 GG GG GG AG AG GG GG GG AA AG AG AG GG GG AA
GG.1 GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG
AA.1 AA AA AA AA AA AA AA AA AA AA AA AA AA AA AG
CT TT TT CT CT TT TT CT TT CC CT CT CT TT TT CC
GG.2 GG GG GG GG AG AG GG AG GG GG GG GG GG AG GG
AG.2 GG GG AG GG AG GG GG AG GG GG GG GG GG AG GG
TT TT TT CT CT CT CC CT CT TT CT TT TT CT CC CC
TT.1 TT TT TT TT TT TT TT TT TT TT TT TT TT TT TT
GG.3 GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG
CC.1 CC CC CC CC CC CC CC CC CC CC CC CC CC CC CC
CC.2 CC CC CC CC 0 CC CC CC CC CC CC CC CC CC CC
AA.2 AA AA AA AA AA AA AA AA AA AA AA AC AA AA AA
CT.1 CC CC CC CT CT CC CC CC CC CC CC CT CT CT CC
CC.3 CC CC CC CC CC CC CC CC CC CC CC CC CC CC CC
CC.4 CC CC 0 CC CC CC CC CC CC CC CC 0 CC CC CC
GG.4 GG GG GG GG GG AA GG GG GG GG GG GG GG GG GG
GG.5 GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG
TT.2 TT TT TT TT TT CC TT TT TT TT TT CT TT TT CT
TT.3 TT TT TT TT TT CC TT TT TT TT TT TT TT TT CT
GG.6 GG GG GG GG GG AA GG GG GG GG GG GG GG GG GG
AG.3 AG AA AG AG AG GG AG AG AA AA AA AG AG AG AG

I will try the Plink method later today.

 

 

 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by devenvyas570

write.csv will write csv. Not sure why you're writing csv here in the first place.

ADD REPLYlink written 3.9 years ago by RamRS21k

Please note that I converted map and ped files to CSV before reading them in. If you're reading them directly, you're better off using read.table(file,sep="\t",header=FALSE,stringsAsFactors=FALSE)

This might solve the unequal number of rows problem. Also, please run step by step so you know which rbind is erroring out.

ADD REPLYlink written 3.9 years ago by RamRS21k
ped_file<-read.table('all_snp_nospace.ped', sep="\t", header=FALSE, stringsAsFactors=FALSE)
t(ped_file) -> awesome
write.table(awesome, sep = '\t', quote=FALSE, file='table.csv')

Yielded this:

V1    V2    V3    V4    V5
V1    HGDP00001    HGDP00003    HGDP00005    HGDP00007
V2    HGDP00001    HGDP00003    HGDP00005    HGDP00007
V3    0    0    0    0
V4    0    0    0    0
V5    1    1    1    1
V6    0    0    0    0
V7    AG    AA    AA    AA
V8    GT    GT    TT    TT
V9    AA    AA    AA    AA
V10    CC    TT    CC    CC
V11    GG    GG    GG    GG
V12    AG    GG    GG    GG
V13    GG    GG    GG    GG
V14    AA    AA    AA    AA
V15    CT    TT    TT    CT
V16    GG    GG    GG    GG
V17    AG    GG    GG    AG
V18    TT    TT    TT    CT
V19    TT    TT    TT    TT
V20    GG    GG    GG    GG
V21    CC    CC    CC    CC
V22    CC    CC    CC    CC
V23    AA    AA    AA    AA
V24    CT    CC    CC    CC
V25    CC    CC    CC    CC
V26    CC    CC    CC    00
V27    GG    GG    GG    GG
V28    GG    GG    GG    GG
V29    TT    TT    TT    TT
V30    TT    TT    TT    TT
V31    GG    GG    GG    GG
V32    AG    AG    AA    AG
V33    AA    AA    AA    AA
V34    TT    TT    TT    TT
V35    CC    CC    CC    CC
V36    AA    AA    GG    AG
V37    AG    AG    GG    GG
V38    CT    CT    TT    CT
V39    AG    AG    GG    GG
V40    CC    CC    CT    CT
V41    GG    GG    AG    AG
V42    AA    AA    AC    AA
V43    GG    GG    GG    GG
V44    CC    CC    AA    AC
V45    CT    TT    CC    CT
V46    CT    CC    CC    CC
V47    CT    CC    TT    CT
V48    AG    GG    GG    GG
V49    CC    CC    CC    00

Which is actually close to what I need. Given the massiveness of the file, a trivial thing such as removing the V column and the V row in Excel will not be doable. Any suggestion on how to do so in R? I can manually add the empty spaces to the map file, so once the Vs are gone, I will be able to join them and have the table I need.

 

ADD REPLYlink written 3.9 years ago by devenvyas570

The Vs are temporary headers and row numbers, they should not be bound to the data itself. If you're using RStudio (you should, if you're not), use View(awesome) to check out the actual data. You should see headers and a column called row.names (or some such) with the Vxx values.

ADD REPLYlink written 3.9 years ago by RamRS21k

The Vs showed up in Terminal and Excel (where it barely opens), so they are not temporary.

I redid it and added a row.names and col.names equals false in the write.table command, and the Vs were gone.

I will attempt to bind the tables later today.

I would use RStudio, but I am running my jobs on UF's High Performance Computing cluster, so everything is command-line. The files are sufficiently large that my i7 and my 8 gigs of DDR3 can't handle them.

 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by devenvyas570

I see. That makes sense. he bind should work fine now.

ADD REPLYlink written 3.9 years ago by RamRS21k
1
gravatar for Uma A
3.9 years ago by
Uma A220
India
Uma A220 wrote:

.tped file is what you want. Just use plink to transpose your .ped and .map files into .tped and .tmap files. .tped file contains the data in the format that you want. Use the following command to get the tped:

plink --file filename --recode --transpose --out transposedfilename

To open the transposed file you just created,

plink --tfile transposedfilename

 

In the end, you will have to append the first row containing sample names by using any desired method.

The following links contain the help available on the .tped and .tfam formats:

http://pngu.mgh.harvard.edu/~purcell/plink/dataman.shtml#recode

http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#tr

 

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Uma A220

I don't need to transpose my map file though, just the ped file, so that I can combine them into one big table.

ADD REPLYlink written 3.9 years ago by devenvyas570
The .tped file is a combination of ped and map files and your original files do not get affected in any way.
ADD REPLYlink written 3.9 years ago by Uma A220

For the --file option, wouldn't I need to do that twice since there are two files?

ADD REPLYlink written 3.9 years ago by devenvyas570
plink --file file --recode --transpose --out transposed.tped

I renamed my files to file.map and file.ped. This did not work like it was supposed to

Look at this output.

cut -f 1 transposed.tped.tped | head -1

1 rs3094315 0 742429 G A A A A A A A G A G A G A G A A A A A A A A A G A A A G A A A G A G A A A A A G A A A G G G A A A G A G A A A A A A A A A G A A A A A A A A A G A A A G A G A A A A A A A A A A A G A A A G A A A A A G G A A A A A A G A G A A A G A G A G A A A A A A A A A A A A A A A G A A A A A A A A A G A G G G A A A A A G A G A A A A A G A G A A A G A A A A A G A 0 0 G A A A A A A A G A G A G A G A G A G A A A G A G G A A G G A A G A G A A A G A G A G A G A A A G G A A A A G A A A A A A A A A A A G A G A A A A A G A G A G A A A A A G A G A A A G A A A A A A A A A G G A A G A A A G A G G A A 0 0 G A G G G A A A A A G A A A G A G A A A A A G A A A A A A A G A A A A A A A G A A A G A A A A A A A G A A A A A A A G A A A A A A A A A G A G A A A A A G A G A G A G A A A G A G A G A G A A A G G G A G A G G G A G G G G G G G G G A G G G A G A G G G G G G G A G A G A G A G A A A A A A A A A A A G A G A G A G A G A A A A A A A A A G A A A G A A A A A A A A A G A 0 0 A A A A A A A A G A A A A A A A A A A A A A A A G A A A G A A A A A A A A A A A A A A A A A G A G A A A A A A A A A G A 0 0 G A A A G A A A A A A A A A G A A A A A A A G G A A G A A A A A A A G A G A A A G A A A G A G A G A A A A A A A A A G A G A G A A A A A G A G A A A G A A A A A A A A A G G G G A A G A G A A A 0 0 G A G A A A A A G A G A A A A A A A G G G A G A A A A A A A 0 0 G A G A G A G A A A G A A A G A G A A A A A A A A A A A G A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A A A A A G A G A G A A A A A G A G A 0 0 A A G A A A G A A A A A G A A A A A G A A A A A G A G A G A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A G A 0 0 A A G A G A A A G A A A A A A A A A G A G A A A G A A A A A A A A A G G A A A A A A A A G A A A A A A A A A G A G A A A A A A A A A G A G A A A A A A A A A A A A A A A A A A A G A A A A A A A A A G A A A A A A A A A A A A A A A A A 0 0 G A A A A A A A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A A A G A G A A A A A A A A A A A A A A A A A G A A A A A A A G A G G G A A A A A A A G A A A A A G A A A A A G A A A G G A A A A A A G A A A G A G A A A A A A A A A A A A A A A A A G A A A A A G A A A G A G A G A G A G A G A A A G G G G G G G G G G G G G G G G G A A A G A G A G A G G G G G A G A G G A A G G G G G G G A G A G G G A G A G G G A G A G G A A A A A A G A G A A A A A G A A A A A A A G A G A A A A A A A A A A A A A A A G A A A A A A A A A A A A A A A A A A A A A A A G A G G G A A A G A G A A A G A A A G A A A G A A A A A A A G A A A A A A A G A A A A A A A A A A A A A A A A A A A A A G A A A A A 0 0 G A G A G A G A G A G A G A G A A A G G G A G G G G G G G G G A A A A A A A A A A A A A A A A A A A A A G A G A A A A A A A A A A A A A G A A A G A G A A A A A A A A A A A A A A A A A A A G A A A A A A A A A A A A A A A G A G A A A A A A A A A A A A A G A A A A A A A G G A A G G A A A A A A A A A A G G A A A A G A A A A A G A G A A A A A A A A A A A A A G A G G G G G G A A A A A A A A G A G A A A A A A A A A G A G A A A A A A A A A A A A A A A A A A A A A A A A A A A G A A A A A A A A A G G A A G A G A A A A A A A G A A A A A G A A A A A A A A A G A A A G A A A G A A A A A A A G A G A G A G A A A A A A A G G G A G G G A A A A A G A G A G G A A G A A A G A G G G A G G G A A A A A A A A A A A A A A A G A A A G A A A G A A A G A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A G A A A G A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A G A A A G A A A G A G A A A A A A A A A G A A A G A A A A A A A G A A A A A G A G A G A G A A A A A G A G A G A A A A A A A G A G A A A A A A A G A A A A A A A A A A A G A G A A A A A A A A A A A A A A A A A G A A A G A G G G A G A G G G G A A G G G G G A G A G G 0 0 0 0 G G 0 0 0 0 0 0 0 0


 

ADD REPLYlink written 3.9 years ago by devenvyas570

Firstly, to get your genotypes together, get your original ped/map set into a tab delimited format by using the following command:

plink --file original_filename --recode --tab --output tabSeparated

For the --file flag, you don't need to do that twice, because your .ped and .map files are a pair, and thus they should be of the same name like data.ped and data.map. Thus, your file name (for the --file flag) then becomes only "data"

Also, you don't need to give any extension to the output file. The command automatically creates a .tped and .tfam file pair for the name you have provided. Thus your output file option should only be "transposed". Thus, final command should be:

plink --file tabSeparated --recode --transpose --out transposed

When you load this file in Excel, you can get the individual genotypes in one column only.

Finally, you can merge the first 2 columns of your output file to get the first column of out file containing chromosome number and dbSNP ID together. For that matter, you can do any type of minor formatting in any text editor or Excel

 

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Uma A220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 669 users visited in the last hour