Reshaping / Re-arranging the data
6.0 years ago
3335098459 ▴ 30

As I am new to R, this question may seem like a piece of cake to you. I have data in a txt file. The first column holds the cluster number and the second column holds the names of different organisms (followed by a gene ID). For example:

0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374

I need to re-arrange/reshape the data into the following format:

Cluster No.  org1  org2  org3  org4
0            0     0     0     1
1            1     0     0     0

I could not figure out how to do it in R. Thanks

R genome • 1.0k views

The example input has 11 rows covering 10 distinct cluster numbers (cluster 10 is repeated), but the expected output has only two clusters. Likewise, the input contains 2 organisms (org1 and org4), while the expected output has 4. Can you post matching input and output?

This can also be done outside R. I replaced | with tabs using sed, renumbered the rows where a cluster number was duplicated, and modified the OP's example data so that more organisms appear (given below).

in R: output with modified OP data:

> xtabs(~V1+V2, test)
    V2
V1   org1 org2 org3 org4
  0     0    0    0    1
  1     0    1    0    0
  2     1    0    0    0
  3     0    0    0    1
  4     0    0    1    0
  5     1    0    0    0
  6     1    0    0    0
  7     0    0    1    0
  9     1    0    0    0
  10    0    0    0    1
  11    0    1    0    0
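For reference, here is a minimal sketch of how a data frame like `test` could be built directly in R from the original `|`-separated lines, without the sed step. The inline sample lines, the choice of `org1` to `org4` as factor levels (taken from the header the OP asked for), and the file name mentioned in the comment are assumptions for illustration, not part of the original reply.

```r
# Hypothetical inline copy of a few lines from the question's example;
# for a real file, readLines("clusters.txt") (assumed file name) could replace this.
raw <- c("0 org4|gene759",
         "1 org1|gene992",
         "10 org4|gene2439",
         "10 org1|gene1374")

# Split each line on runs of spaces or "|" into cluster, organism, gene
parts <- do.call(rbind, strsplit(raw, "[ |]+"))

test <- data.frame(
  V1 = as.integer(parts[, 1]),
  # Fixing the levels keeps columns for organisms that never occur in the data
  V2 = factor(parts[, 2], levels = paste0("org", 1:4)),
  V3 = parts[, 3]
)

# Cross-tabulate cluster number against organism
tab <- xtabs(~ V1 + V2, data = test)
print(tab)
```

Because `V2` is a factor with all four levels declared, organisms that are absent from the input still get an all-zero column, and a cluster that appears twice with different organisms (cluster 10 here) ends up as a single row.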

outside R (with datamash and miller), output with modified OP data:

$ datamash -s crosstab 1,2 --filler 0 < test2.txt  |sed '1 s/^/Cluster No\./g' | mlr --itsv --otsv sort -n "Cluster No."

Cluster No. org1    org2    org3    org4
0   0   0   0   1
1   0   1   0   0
2   1   0   0   0
3   0   0   0   1
4   0   0   1   0
5   1   0   0   0
6   1   0   0   0
7   0   0   1   0
9   1   0   0   0
10  0   0   0   1
11  0   1   0   0

modified data:

$ cat test2.txt 
0   org4    gene759
1   org2    gene992
2   org1    gene1101
3   org4    gene757
4   org3    gene1702
5   org1    gene989
6   org1    gene990
7   org3    gene1699
9   org1    gene1102
10  org4    gene2439
11  org2    gene1374
6.0 years ago
munizmom ▴ 60

Hi, this is one way of reshaping it in R:

# Create a data frame similar to yours
df <- data.frame(cluster = 1:10,
                 org = paste0("o", c(1, 1, 2, 3, 3, 3, 3, 4, 4, 4)),
                 gene = paste0("g", 1:10))

# Reshape to wide format: one column per organism, holding the gene name
df_reshape <- tidyr::spread(df, key = "org", value = "gene", fill = NA, convert = TRUE)

# Replace missing values with 0
df_reshape[is.na(df_reshape)] <- 0

# If you want 1 instead of the gene name, run this line before the reshaping:
df$gene <- 1
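If installing tidyr is not an option, a similar 0/1 layout can probably be obtained with base R alone. The sketch below reuses the toy data frame from this answer; the names `counts` and `wide` are made up for illustration. Note that it counts rows per cluster/organism pair rather than carrying the gene names over.

```r
# Same toy data as in the answer above
df <- data.frame(cluster = 1:10,
                 org = paste0("o", c(1, 1, 2, 3, 3, 3, 3, 4, 4, 4)),
                 gene = paste0("g", 1:10))

# Count how many rows fall into each cluster/organism combination
counts <- table(df$cluster, df$org)

# Turn the contingency table into an ordinary wide data frame
wide <- as.data.frame.matrix(counts)

# Move the cluster numbers from the row names into a regular column
wide <- cbind(cluster = as.integer(rownames(wide)), wide)
rownames(wide) <- NULL

print(wide)
```

Since this tabulates counts, a cluster/organism pair that occurs more than once would show a value above 1; `wide[-1] > 0` could be used to reduce it to presence/absence if that is what is needed.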