Reshape tabular data frame to wide format
3
1
Entering edit mode
7.0 years ago
friasoler ▴ 50

Hello everyone! I would appreciate some help with this data frame 'df':

df

Marker Sample allele
FlaHDF11    CP26    h
FlaHDF11    CP26    a
FlaHDF12    CP26    e
FlaHDF12    CP26    f
FlaHDF11    CP27    g
FlaHDF11    CP27    h
FlaHDF12    CP27    t
FlaHDF12    CP27    z
I would like something tlike this:
    FlaHDF11    FlaHDF11    FlaHDF12    FlaHDF12
CP26    h   a   e   f
CP27    g   h   t   z

#This is the code Im using and I suspect the problem is in the line (6). Im always expecting to have just two rows for the same sample, but the code is optimized to add extra columns in case some samples have more than 2 alleles.

a = read.table('df', stringsAsFactors=F,header=T)
a = a[-1,]
a1 = a[!duplicated(a[,1:2]),]
rownames(a1)  = paste(a1$Marker,a1$Sample)
t=1;
while(sum(duplicated(a[,1:2])) >0)   #### I think the problem is here
{   
    a1 = cbind(a1,rep(NA, dim(a1)[[1]]))
    a = a[duplicated(a[,1:2]),]
    a2 = a[!duplicated(a[,1:2]),]
    a1[paste(a2$Marker,a2$Sample),t+3] = a2$Size
    t = t+1
}
 colnames(a1) = c(colnames(a), paste('Size',1:(t-1),sep='.') )
a2 = a1[!is.na(a1$Size.2),]
m = matrix(NA,length(unique(a1$Sample)), length(unique(a1$Marker))*t)
rownames(m) = unique(a1$Sample)
colnames(m) = paste(rep(unique(a1$Marker), each=t), rep(1:4,length(unique(a1$Marker))), sep='.')
for (i in rownames(m))
{
    tt = a1[a1$Sample %in% i,]
    m[ i, paste(rep(tt$Marker,each=4), rep(1:4, dim(tt)[[1]]), sep='.')] = as.vector(t(tt[,3:6])) 
}
 m = m[,colSumsis.na(m)) < dim(m)[[1]]]

#### Error in [<-.data.frame(*tmp*, paste(a2$Marker, a2$Sample), t + 3, : replacement has length zero

Kind regards! Roberto

R • 2.2k views
ADD COMMENT
0
Entering edit mode

Just in case this is a "XY problem" (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) it could be worth contextualizing why you want to reformat your data this way

ADD REPLY
0
Entering edit mode

It is also unclear why you have multiple columns named the same thing in your "desired output". that complicates any potential code solution as it doesn't really make sense why the same thing would have two different columns

ADD REPLY
0
Entering edit mode

It has sense...the columns are loci...each one with two alleles by sample ... Thanks

ADD REPLY
0
Entering edit mode

See my answer that tries to explicitly codify this assumption

ADD REPLY
3
Entering edit mode
7.0 years ago
cmdcolin ★ 3.8k

If we go ahead and "codify the assumption that there are two alleles" by renaming the marker column, then you can get the desired output similar to this

library(reshape2)
x=read.table(text='Marker Sample allele
       FlaHDF11    CP26    h
       FlaHDF11    CP26    a
       FlaHDF12    CP26    e
       FlaHDF12    CP26    f
       FlaHDF11    CP27    g
       FlaHDF11    CP27    h
       FlaHDF12    CP27    t
       FlaHDF12    CP27    z',header=T,stringsAsFactors=F)

x[seq(1,nrow(x),by=2),]$Marker=paste0(x[seq(1,nrow(x),by=2),]$Marker,'a')
x[seq(2,nrow(x),by=2),]$Marker=paste0(x[seq(2,nrow(x),by=2),]$Marker,'b')

acast(x,Sample~Marker)




#Output
#FlaHDF11a FlaHDF11b FlaHDF12a FlaHDF12b
#CP26 "h"       "a"       "e"       "f"      
#CP27 "g"       "h"       "t"       "z"
ADD COMMENT
1
Entering edit mode
7.0 years ago
igor 13k

You don't need to do it manually. Try the tidyr spread() function: http://tidyr.tidyverse.org/reference/spread.html

ADD COMMENT
0
Entering edit mode
7.0 years ago
Joe 21k

melt or tidyr is usually used to long format dataframes

ADD COMMENT
1
Entering edit mode

melt comes from the reshape2 package, but that converts to long format. the question is about converting to wide format. in that case, if using reshape2, you'd most likely use the acast function

ADD REPLY
0
Entering edit mode

Ah yes sorry misread that bit.

ADD REPLY
0
Entering edit mode

Thanks a lot Healey!

ADD REPLY

Login before adding your answer.

Traffic: 2023 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6