Split the taxonomy column
4.2 years ago
Mengying ▴ 10

Hi

I am having difficulty splitting my taxonomy column into separate ranks, i.e. "domain", "phylum", "class", "order", "family", "genus".

The biggest problem is that the format of the taxonomy column is not uniform. Some entries have every taxonomic level, while others only have the "domain", "phylum", and "genus" levels.

My data has a few thousand rows and looks something like this:

OTUID   Taxonomy
OTU1    d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas
OTU20   d:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera
OTU774  d:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4

I'm not familiar with R, so I spent a whole day searching online for solutions, and I tried the separate function from the tidyr package, like this:

library(tidyr)
x <- read.csv("annotation.csv")
y <- x %>% separate(Taxonomy, c("domain", "phylum", "class", "order", "family", "genus"), ",[a-z]:")
write.csv(y,"tax_split.csv",row.names = TRUE)

But the result was disappointing. The values are assigned to columns by position rather than by rank, so entries with missing levels end up in the wrong columns:

OTUID   domain  phylum  class   order   family  genus
OTU1    d:Bacteria  "Proteobacteria"    Gammaproteobacteria Pseudomonadales Pseudomonadaceae    Pseudomonas
OTU20   d:Archaea   "Thaumarchaeota"    Nitrososphaerales   Nitrososphaeraceae  Nitrososphaera  NA
OTU774  d:Bacteria  "Armatimonadetes"   Armatimonadetes_gp4 NA  NA  NA

In the end I had to fall back on Excel's filter function, but that is very time-consuming (╯︵╰). Is there a more elegant way to solve this in R?

Thanks for your help!

4.2 years ago
zx8754 11k

It is probably better to use the right tool for the job; as Ram suggested, you could treat that column as JSON.

Here is a solution using data.table: split on the delimiters, reshape, split again, and reshape again. See the comments, and run the commands line by line to see what each step does:

library(data.table)

x <- fread('OTUID  Taxonomy
OTU1    d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas
OTU20   d:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera
OTU774  d:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4')

# split on "," add as new columns
x[, paste0("col", 1:6) := tstrsplit(Taxonomy, ",") ]

# reshape wide-to-long
x <- melt(x[, -2], id.vars = "OTUID")

# and again split on ":"
x <- x[, -2][, c("id", "type") := tstrsplit(value, ":")]

# finally reshape from long-to-wide
x <- dcast(x[!is.na(id), .(OTUID, id, type)], OTUID ~ id, value.var = "type")

# ta-da!
x
#     OTUID                   c        d                  f                   g                 o                 p
# 1:   OTU1 Gammaproteobacteria Bacteria   Pseudomonadaceae         Pseudomonas   Pseudomonadales  "Proteobacteria"
# 2:  OTU20                <NA>  Archaea Nitrososphaeraceae      Nitrososphaera Nitrososphaerales  "Thaumarchaeota"
# 3: OTU774                <NA> Bacteria               <NA> Armatimonadetes_gp4              <NA> "Armatimonadetes"
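
If full rank names and a fixed column order are wanted, the same split-and-reshape idea can be pushed a little further. This is only a sketch: the `ranks` lookup (d = domain, p = phylum, and so on) is my assumption about what the one-letter prefixes mean, and the names `long` and `wide` are just illustrative. It also assumes no taxon name contains a ":" or ",".

```r
library(data.table)

# The same three example rows, built directly so the input is unambiguous
x <- data.table(
  OTUID = c("OTU1", "OTU20", "OTU774"),
  Taxonomy = c(
    'd:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas',
    'd:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera',
    'd:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4'
  )
)

# Assumed mapping from the one-letter prefixes to rank names
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# One row per OTUID and "prefix:name" pair
long <- x[, .(pair = unlist(strsplit(Taxonomy, ",", fixed = TRUE))), by = OTUID]

# Split each pair into prefix and name, and drop the stray double quotes
long[, c("prefix", "name") := tstrsplit(pair, ":", fixed = TRUE)]
long[, name := gsub('"', "", name, fixed = TRUE)]

# A factor with levels in rank order makes dcast emit the columns in that order
long[, rank_name := factor(unname(ranks[prefix]), levels = unname(ranks))]

# Reshape long-to-wide; a rank that an OTU lacks is left as NA
wide <- dcast(long, OTUID ~ rank_name, value.var = "name")
wide
```

Because each value is matched to its column by prefix rather than by position, OTU20 gets NA under class instead of having its order shifted left.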

Thank you for taking the time to solve this problem; it helped a lot. As you suggested, I ran the code step by step. I would like a version that doesn't discard the NA values, so I tried modifying your last line of code to x <- dcast(x[, .(OTUID, id, type)], OTUID ~ id, value.var = "type"), but R always reports an error. I think x[!is.na(id), .(OTUID, id, type)] selects the data that is passed to dcast, so I also tried x <- dcast(x[, c("OTUID", "id", "type")], OTUID ~ id, value.var = "type"), but that failed too. What is the right change to make? Have I misunderstood something? Thanks again~


It is good practice to specify the exact error message when you encounter an error. If the message has sensitive information, mask it with some placeholder that makes sense, but just saying "I see an error" does not help us figure out what could be going on.
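
Without the exact message this is only a guess at what is wanted, but it may help to check what the !is.na(id) filter actually removes. The sketch below repeats the steps from the answer on the three example rows and counts two things: the NA rows in the long table and the NA cells in the final wide table. The variable names (`long`, `wide`, `n_padding`, `n_missing`) are illustrative, not from the original answer.

```r
library(data.table)

x <- data.table(
  OTUID = c("OTU1", "OTU20", "OTU774"),
  Taxonomy = c(
    'd:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas',
    'd:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera',
    'd:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4'
  )
)

# Steps from the answer: split on ",", then reshape wide-to-long
x[, paste0("col", 1:6) := tstrsplit(Taxonomy, ",", fixed = TRUE)]
long <- melt(x[, !"Taxonomy"], id.vars = "OTUID")

# NA rows here are only padding for OTUs with fewer than six entries
n_padding <- long[is.na(value), .N]

# Split on ":" and reshape long-to-wide, dropping the padding rows
long[, c("id", "type") := tstrsplit(value, ":", fixed = TRUE)]
wide <- dcast(long[!is.na(id)], OTUID ~ id, value.var = "type")

# Missing ranks are still reported as NA in the wide table
n_missing <- sum(is.na(wide))
c(padding_rows = n_padding, na_cells = n_missing)
```

On this example both counts are 4. The filter only drops padding rows that carry no rank prefix, and every OTU still gets an NA in the columns for the ranks it lacks, so the filtered version may already be the table that keeps the NA values.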

4.2 years ago
Ram 43k

I did some searching online and stumbled upon this package that seems pretty relevant to your task: https://www.rdocumentation.org/packages/roomba/versions/0.1.0

It cleans up JSON-format data, leaving room for a variable number of "columns" per entry.
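
To make the JSON idea concrete, here is a rough sketch that does not use roomba at all. It rewrites each taxonomy string as a JSON object with a regular expression and parses it with jsonlite::fromJSON. The helper `to_json` and the `prefixes` vector are hypothetical names introduced for illustration, and the regular expression assumes no taxon name contains a ":" or ",".

```r
library(jsonlite)

taxonomy <- c(
  OTU1   = 'd:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas',
  OTU20  = 'd:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera',
  OTU774 = 'd:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4'
)

# Hypothetical helper: 'd:Bacteria,p:"X"' becomes '{"d":"Bacteria","p":"X"}'
to_json <- function(s) {
  s <- gsub('"', "", s, fixed = TRUE)              # drop the existing quotes
  s <- gsub("([^,:]+):([^,]+)", '"\\1":"\\2"', s)   # quote every key and value
  paste0("{", s, "}")
}

# Assumed prefix order; a prefix missing from an entry becomes NA
prefixes <- c("d", "p", "c", "o", "f", "g")

rows <- lapply(taxonomy, function(s) {
  parsed <- unlist(fromJSON(to_json(s)))  # named character vector
  unname(parsed[prefixes])                # look up by key, not by position
})

# One row per OTU, one column per prefix
tax <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(tax) <- prefixes
tax
```

The lookup by key is what leaves room for a variable number of entries per row: a missing prefix simply comes back as NA.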


Thanks a lot! This gives me a new idea and a new tool for solving this problem.
