Question

Reshaping / Re-arranging the data

6.0 years ago

3335098459 ▴ 30

I am new to R, so this question may seem like a piece of cake to you. I have data in a text file. The first column has the cluster number and the second column has the names of different organisms. For example:

0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374

I need to rearrange/reshape the data into the following format:

Cluster No.  org1  org2  org3  org4

0            0     0     0     1
1            1     0     0     0

I could not figure out how to do it in R. Thanks

R genome • 1.0k views

updated 6.0 years ago by munizmom ▴ 60 • written 6.0 years ago by 3335098459 ▴ 30

The example input has 11 rows covering 10 distinct clusters (0 to 10, with 8 missing and 10 repeated), but the expected output shows only two clusters. The input also contains only 2 organisms (org1 and org4), while the expected output has 4. Can you post matching input and output?

This can also be done outside R. I replaced | with tabs using sed, renumbered the duplicated cluster (the second 10 became 11), and changed a few organisms to org2 and org3 in the OP's example data (given below).

In R, output with the modified OP data:

> xtabs(~V1+V2, test)
    V2
V1   org1 org2 org3 org4
  0     0    0    0    1
  1     0    1    0    0
  2     1    0    0    0
  3     0    0    0    1
  4     0    0    1    0
  5     1    0    0    0
  6     1    0    0    0
  7     0    0    1    0
  9     1    0    0    0
  10    0    0    0    1
  11    0    1    0    0
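
The `test` object above is not shown being created. As a minimal sketch, it could be built entirely in R from the OP's original `cluster org|gene` lines, without the sed step. This assumes whitespace separates the cluster number from the `org|gene` field; the inline `raw` vector and the four organism levels (taken from the OP's expected header) are illustrative, and in a real session `readLines()` on the file would replace `raw`.

```r
# OP's example lines, inlined so the sketch is self-contained
raw <- c("0 org4|gene759", "1 org1|gene992", "2 org1|gene1101",
         "3 org4|gene757", "4 org1|gene1702", "5 org1|gene989",
         "6 org1|gene990", "7 org1|gene1699", "9 org1|gene1102",
         "10 org4|gene2439", "10 org1|gene1374")

# Split each line on whitespace or "|" into cluster, organism, gene
parts <- do.call(rbind, strsplit(raw, "[[:space:]|]+"))
test <- data.frame(V1 = as.integer(parts[, 1]),
                   V2 = parts[, 2],
                   V3 = parts[, 3],
                   stringsAsFactors = FALSE)

# Assumption: four organisms exist, so organisms absent from the
# input (org2, org3) still get an all-zero column
test$V2 <- factor(test$V2, levels = paste0("org", 1:4))

# Count genes per cluster (rows) and organism (columns)
counts <- xtabs(~ V1 + V2, data = test)
print(counts)
```

With the original data, cluster 10 keeps both of its entries, so no renumbering is needed: it simply gets a 1 under both org1 and org4.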

Outside R (with datamash and Miller), output with the modified OP data:

$ datamash -s crosstab 1,2 --filler 0 < test2.txt  |sed '1 s/^/Cluster No\./g' | mlr --itsv --otsv sort -n "Cluster No."

Cluster No. org1    org2    org3    org4
0   0   0   0   1
1   0   1   0   0
2   1   0   0   0
3   0   0   0   1
4   0   0   1   0
5   1   0   0   0
6   1   0   0   0
7   0   0   1   0
9   1   0   0   0
10  0   0   0   1
11  0   1   0   0

modified data:

$ cat test2.txt 
0   org4    gene759
1   org2    gene992
2   org1    gene1101
3   org4    gene757
4   org3    gene1702
5   org1    gene989
6   org1    gene990
7   org3    gene1699
9   org1    gene1102
10  org4    gene2439
11  org2    gene1374

6.0 years ago by cpad0112 21k

score 0 · Answer 1 · 2018-04-28

Hi, this is one way of reshaping it in R:

# Create a data frame similar to yours
df <- data.frame(cluster = 1:10, org = paste0("o", c(1, 1, 2, 3, 3, 3, 3, 4, 4, 4)), gene = paste0("g", 1:10), stringsAsFactors = FALSE)

# Reshape to wide format: one column per organism, with the gene name as the cell value
df_reshape <- tidyr::spread(df, key = "org", value = "gene", fill = NA, convert = TRUE)

# Replace missing values with 0
df_reshape[is.na(df_reshape)] <- 0

# If you want a 1 in place of the gene name (presence/absence), set the gene column to 1 before reshaping:

df$gene <- 1  # run this line before the tidyr::spread() call
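
If a cluster can hold several genes from the same organism, counting the rows first and then collapsing the counts to 0/1 may be more robust than spreading. The following is a base R sketch of that idea, not the answer's method; the small data frame is hypothetical and only mimics the column names used above.

```r
# Hypothetical example data: cluster 2 holds two genes from the same organism
df <- data.frame(cluster = c(1, 2, 2, 3),
                 org = c("o1", "o2", "o2", "o4"),
                 gene = c("g1", "g2", "g3", "g4"),
                 stringsAsFactors = FALSE)

# Count genes per cluster/organism, then collapse counts to 0/1
counts <- table(df$cluster, df$org)
presence <- as.data.frame.matrix((counts > 0) + 0L)

# Move the cluster numbers from row names into a regular column
presence <- cbind(cluster = as.integer(rownames(presence)), presence)
rownames(presence) <- NULL
print(presence)
```

Keeping `counts` instead of `presence` gives the number of genes per organism in each cluster, which matches what `xtabs()` reports.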