Question: Reshape tabular data frame to wide format
1
gravatar for friasoler
2.0 years ago by
friasoler20
Germany
friasoler20 wrote:

Hello everyone! I would appreciate some help with this data frame 'df':

df

Marker Sample allele
FlaHDF11    CP26    h
FlaHDF11    CP26    a
FlaHDF12    CP26    e
FlaHDF12    CP26    f
FlaHDF11    CP27    g
FlaHDF11    CP27    h
FlaHDF12    CP27    t
FlaHDF12    CP27    z
I would like something tlike this:
    FlaHDF11    FlaHDF11    FlaHDF12    FlaHDF12
CP26    h   a   e   f
CP27    g   h   t   z

#This is the code Im using and I suspect the problem is in the line (6). Im always expecting to have just two rows for the same sample, but the code is optimized to add extra columns in case some samples have more than 2 alleles.

a = read.table('df', stringsAsFactors=F,header=T)
a = a[-1,]
a1 = a[!duplicated(a[,1:2]),]
rownames(a1)  = paste(a1$Marker,a1$Sample)
t=1;
while(sum(duplicated(a[,1:2])) >0)   #### I think the problem is here
{   
    a1 = cbind(a1,rep(NA, dim(a1)[[1]]))
    a = a[duplicated(a[,1:2]),]
    a2 = a[!duplicated(a[,1:2]),]
    a1[paste(a2$Marker,a2$Sample),t+3] = a2$Size
    t = t+1
}
 colnames(a1) = c(colnames(a), paste('Size',1:(t-1),sep='.') )
a2 = a1[!is.na(a1$Size.2),]
m = matrix(NA,length(unique(a1$Sample)), length(unique(a1$Marker))*t)
rownames(m) = unique(a1$Sample)
colnames(m) = paste(rep(unique(a1$Marker), each=t), rep(1:4,length(unique(a1$Marker))), sep='.')
for (i in rownames(m))
{
    tt = a1[a1$Sample %in% i,]
    m[ i, paste(rep(tt$Marker,each=4), rep(1:4, dim(tt)[[1]]), sep='.')] = as.vector(t(tt[,3:6])) 
}
 m = m[,colSumsis.na(m)) < dim(m)[[1]]]

#### Error in [<-.data.frame(*tmp*, paste(a2$Marker, a2$Sample), t + 3, : replacement has length zero

Kind regards! Roberto

R • 713 views
ADD COMMENTlink modified 2.0 years ago by cmdcolin1.2k • written 2.0 years ago by friasoler20

Just in case this is a "XY problem" (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) it could be worth contextualizing why you want to reformat your data this way

ADD REPLYlink written 2.0 years ago by cmdcolin1.2k

It is also unclear why you have multiple columns named the same thing in your "desired output". that complicates any potential code solution as it doesn't really make sense why the same thing would have two different columns

ADD REPLYlink written 2.0 years ago by cmdcolin1.2k

It has sense...the columns are loci...each one with two alleles by sample ... Thanks

ADD REPLYlink written 2.0 years ago by friasoler20

See my answer that tries to explicitly codify this assumption

ADD REPLYlink written 2.0 years ago by cmdcolin1.2k
3
gravatar for cmdcolin
2.0 years ago by
cmdcolin1.2k
United States
cmdcolin1.2k wrote:

If we go ahead and "codify the assumption that there are two alleles" by renaming the marker column, then you can get the desired output similar to this

library(reshape2)
x=read.table(text='Marker Sample allele
       FlaHDF11    CP26    h
       FlaHDF11    CP26    a
       FlaHDF12    CP26    e
       FlaHDF12    CP26    f
       FlaHDF11    CP27    g
       FlaHDF11    CP27    h
       FlaHDF12    CP27    t
       FlaHDF12    CP27    z',header=T,stringsAsFactors=F)

x[seq(1,nrow(x),by=2),]$Marker=paste0(x[seq(1,nrow(x),by=2),]$Marker,'a')
x[seq(2,nrow(x),by=2),]$Marker=paste0(x[seq(2,nrow(x),by=2),]$Marker,'b')

acast(x,Sample~Marker)




#Output
#FlaHDF11a FlaHDF11b FlaHDF12a FlaHDF12b
#CP26 "h"       "a"       "e"       "f"      
#CP27 "g"       "h"       "t"       "z"
ADD COMMENTlink written 2.0 years ago by cmdcolin1.2k
1
gravatar for igor
2.0 years ago by
igor7.6k
United States
igor7.6k wrote:

You don't need to do it manually. Try the tidyr spread() function: http://tidyr.tidyverse.org/reference/spread.html

ADD COMMENTlink written 2.0 years ago by igor7.6k
0
gravatar for jrj.healey
2.0 years ago by
jrj.healey12k
United Kingdom
jrj.healey12k wrote:

melt or tidyr is usually used to long format dataframes

ADD COMMENTlink written 2.0 years ago by jrj.healey12k
1

melt comes from the reshape2 package, but that converts to long format. the question is about converting to wide format. in that case, if using reshape2, you'd most likely use the acast function

ADD REPLYlink written 2.0 years ago by cmdcolin1.2k

Ah yes sorry misread that bit.

ADD REPLYlink written 2.0 years ago by jrj.healey12k

Thanks a lot Healey!

ADD REPLYlink written 24 months ago by friasoler20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1115 users visited in the last hour