Question

Split the taxonomy column

4.2 years ago

Mengying ▴ 10

Hi

I am having difficulty splitting my taxonomy column into separate ranks, i.e. "domain", "phylum", "class", "order", "family", "genus".

The biggest problem is that the format of my taxonomy column is not uniform. Some rows have all taxonomy levels, while others only have the "domain", "phylum" and "genus" levels.

My data has a few thousand rows and looks something like this:

 OTUID  Taxonomy
OTU1    d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas
OTU20   d:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera
OTU774  d:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4

I'm not familiar with R, so I spent a whole day searching the Internet for relevant solutions, and I tried the separate function in the tidyr package, like this:

library(tidyr)
x <- read.csv("annotation.csv")
y <- x %>% separate(Taxonomy, c("domain", "phylum", "class", "order", "family", "genus"), ",[a-z]:")
write.csv(y,"tax_split.csv",row.names = TRUE)

But the result let me down. It does not split my taxonomy by rank: the values are filled in from left to right, so when a rank is missing, the later values end up in the wrong columns.

OTUID   domain  phylum  class   order   family  genus
OTU1    d:Bacteria  "Proteobacteria"    Gammaproteobacteria Pseudomonadales Pseudomonadaceae    Pseudomonas
OTU20   d:Archaea   "Thaumarchaeota"    Nitrososphaerales   Nitrososphaeraceae  Nitrososphaera  NA
OTU774  d:Bacteria  "Armatimonadetes"   Armatimonadetes_gp4 NA  NA  NA

In the end, I had to use Excel's filtering function to deal with this, but that method is very time-consuming (╯︵╰). I would still like to ask: is there an elegant way to solve this problem in R?

Thanks for your help!

R • 1.5k views

updated 4.2 years ago by zx8754 11k • written 4.2 years ago by Mengying ▴ 10

score 1 · Answer 1 · 2020-03-02

It is probably better to use the right tool for the job; as Ram suggested, maybe treat that column as JSON.

Here is a solution using data.table: split on the delimiters, reshape, split again, and reshape again. See the comments, and run the commands line by line to see what each step is doing:

library(data.table)

x <- fread('OTUID  Taxonomy
OTU1    d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas
OTU20   d:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera
OTU774  d:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4')

# split on "," add as new columns
x[, paste0("col", 1:6) := tstrsplit(Taxonomy, ",") ]

# reshape wide-to-long
x <- melt(x[, -2], id.vars = "OTUID")

# and again split on ":"
x <- x[, -2][, c("id", "type") := tstrsplit(value, ":")]

# finally reshape from long-to-wide
x <- dcast(x[!is.na(id), .(OTUID, id, type)], OTUID ~ id, value.var = "type")

# ta-da!
x
#     OTUID                   c        d                  f                   g                 o                 p
# 1:   OTU1 Gammaproteobacteria Bacteria   Pseudomonadaceae         Pseudomonas   Pseudomonadales  "Proteobacteria"
# 2:  OTU20                <NA>  Archaea Nitrososphaeraceae      Nitrososphaera Nitrososphaerales  "Thaumarchaeota"
# 3: OTU774                <NA> Bacteria               <NA> Armatimonadetes_gp4              <NA> "Armatimonadetes"
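
If you would rather have the columns named by full rank, in taxonomic order, and without the quotes around the phylum, here is a minimal base R sketch of the same key:value idea. It is not the data.table approach above, just an alternative under a few assumptions: the prefixes are always one of d, p, c, o, f, g, fields are separated by "," and values never contain a comma. The helper name parse_taxonomy and the ranks lookup are made up for illustration.

```r
# Assumed mapping from the one-letter prefixes to full rank names.
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# Example data copied from the question.
x <- data.frame(
  OTUID = c("OTU1", "OTU20", "OTU774"),
  Taxonomy = c(
    'd:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas',
    'd:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera',
    'd:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one taxonomy string into a vector with one
# slot per rank, leaving NA where the rank is absent.
parse_taxonomy <- function(tax) {
  parts  <- strsplit(tax, ",", fixed = TRUE)[[1]]
  keys   <- sub(":.*$", "", parts)      # text before the first ":"
  values <- sub("^[^:]*:", "", parts)   # text after the first ":"
  values <- gsub('"', "", values, fixed = TRUE)  # drop the quotes
  out <- setNames(rep(NA_character_, length(ranks)), names(ranks))
  known <- keys %in% names(ranks)       # ignore unexpected prefixes
  out[keys[known]] <- values[known]
  out
}

# One column per input row comes back from vapply, so transpose it.
tax <- t(vapply(x$Taxonomy, parse_taxonomy,
                character(length(ranks)), USE.NAMES = FALSE))
colnames(tax) <- unname(ranks)

y <- data.frame(OTUID = x$OTUID, tax, stringsAsFactors = FALSE)
y
```

Because each value is placed by its prefix rather than by its position, a missing rank simply stays NA instead of shifting the later values to the left. For real data you would replace the example data frame with the result of read.csv("annotation.csv").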

score 0 · Answer 2 · 2020-03-02

4.2 years ago

Ram 43k

I did some searching online and stumbled upon this package that seems pretty relevant to your task: https://www.rdocumentation.org/packages/roomba/versions/0.1.0

It cleans up JSON-formatted data, leaving room for a variable number of "columns" per entry.