Question

Merging values in a single data frame

1

6.4 years ago

Mozart ▴ 330

Hello everyone, I am trying to convert Ensembl gene IDs to NCBI (Entrez) gene IDs. At the end of the process, I merged 2 data frames (one containing the converted values and another one with log fold change and padj values for each Ensembl ID). Unfortunately, some values look like this:

ENSMUSG00000000562  c("11542"

rather than simply

ENSMUSG00000000562  11542

Here is the code:

list <- read.csv(file='~/list.csv')
list <- as.character(list)

a <- sapply(list, function(x) exists(x, org.Mm.egENSEMBL2EG))
my.list <- list[a]

xx <- as.list(org.Mm.egENSEMBL2EG)
xx[my.list]

then

xx_table <- as.array(xx)
xx_table<- as.data.frame(xx_table)
xx_table<- as.matrix(xx_table)
write.csv(xx_table, file='~/ens2ncbi.csv')

ens2ncbi<- read.csv(file='~/ens2ncbi.csv')
ens2ncbi<-ens2ncbi[, 2:1]

merge.data <- merge(ens2ncbi, OTHER_dataframe, by="en")
write.csv(merge.data, file='~/ens2ncbi_merged.csv')

R • 1.6k views

ADD COMMENT • link 6.1 years ago by Mozart ▴ 330

score 3 · Accepted Answer · 2017-11-27

3

6.4 years ago

Kevin Blighe 87k

Those problematic Ensembl IDs are ones that map to multiple Entrez Gene IDs. You also make things harder for yourself by converting your list to an array, then a data-frame, and then a matrix. If you inspect the data just before the matrix conversion step, you'll notice that these multiple mappings are separated by a comma in each entry. Using your code:

xx <- as.list(org.Mm.egENSEMBL2EG)
xx_table <- as.array(xx)
xx_table <- as.data.frame(xx_table)
tail(xx_table, 51)
                               xx_table
ENSMUSG00000090744            102642386
ENSMUSG00000095717 102642706, 102642868
ENSMUSG00000072915            102642717

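Thus, the subsequent conversion of this to a matrix and the export to CSV trip up on the entries that hold more than one ID: they are written out as deparsed vectors, which is where the stray c("11542" in your output comes from. Note that an R matrix is not restricted to numerical values; it only requires every element to be of the same type. The as.matrix function does not give a warning here either, although it arguably should.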
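A quick way to check which Ensembl IDs are affected is to count the Entrez IDs per entry. This is only a minimal sketch: xx_toy is a hypothetical stand-in for as.list(org.Mm.egENSEMBL2EG), built from three of the IDs shown above so that it runs without Bioconductor.

```r
# Toy stand-in for as.list(org.Mm.egENSEMBL2EG): a named list of
# character vectors (IDs copied from the output above).
xx_toy <- list(
  ENSMUSG00000090744 = "102642386",
  ENSMUSG00000095717 = c("102642706", "102642868"),
  ENSMUSG00000072915 = "102642717"
)

# lengths() gives the number of Entrez IDs attached to each Ensembl ID
n_ids <- lengths(xx_toy)

# Ensembl IDs that map to more than one Entrez ID
multi_mapped <- names(n_ids)[n_ids > 1]
print(multi_mapped)
```

On the real annotation list, the same two lines (with xx in place of xx_toy) should list every Ensembl ID that will cause trouble later.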

A more reliable way to convert a list to a data-frame is with do.call, as you'll see in my code below:

xx <- as.list(org.Mm.egENSEMBL2EG)
xx_table <- do.call(rbind, lapply(xx, data.frame, stringsAsFactors=FALSE))
xx_table <- data.frame(rownames(xx_table), xx_table)

You'll then begin to see the issue:

tail(xx_table,55)
                       rownames.xx_table.    X..i..
ENSMUSG00000094556     ENSMUSG00000094556 102641780
ENSMUSG00000095508     ENSMUSG00000095508 102641863
ENSMUSG00000103587     ENSMUSG00000103587 102642162
ENSMUSG00000090744     ENSMUSG00000090744 102642386
ENSMUSG00000095717.1 ENSMUSG00000095717.1 102642706
ENSMUSG00000095717.2 ENSMUSG00000095717.2 102642868
ENSMUSG00000072915     ENSMUSG00000072915 102642717

Here, ENSMUSG00000095717 has a mapping to 2 Entrez IDs and do.call (coupled with data.frame) has renamed the IDs to make them unique. We can tidy these up with gsub and then finish the remainder of the code:

xx_table[,1] <- gsub("\\.[0-9]*$", "", xx_table[,1])
write.csv(xx_table, "~/ens2ncbi.csv")
ens2ncbi <- read.csv(file="~/ens2ncbi.csv")
ens2ncbi <-ens2ncbi[, 3:2]
colnames(ens2ncbi) <- c("Entrez", "Ensembl")
head(ens2ncbi)
  Entrez            Ensembl
1  11287 ENSMUSG00000030359
2  11298 ENSMUSG00000020804
3  11302 ENSMUSG00000025375
4  11303 ENSMUSG00000015243
5  11304 ENSMUSG00000028125
6  11305 ENSMUSG00000026944

tail(ens2ncbi,52)
         Entrez            Ensembl
24186 102642386 ENSMUSG00000090744
24187 102642706 ENSMUSG00000095717
24188 102642868 ENSMUSG00000095717
24189 102642717 ENSMUSG00000072915
24190 102902673 ENSMUSG00000096370
24191 103164605 ENSMUSG00000102424
24192 104795665 ENSMUSG00000092765
24193 104795666 ENSMUSG00000093246
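As an alternative sketch (not the approach used above), the list can be flattened directly into a two-column data-frame, which avoids the suffixed row names and the gsub step. Here xx_toy and ens2ncbi_toy are hypothetical names, and the toy list stands in for as.list(org.Mm.egENSEMBL2EG).

```r
# Toy stand-in for as.list(org.Mm.egENSEMBL2EG)
xx_toy <- list(
  ENSMUSG00000090744 = "102642386",
  ENSMUSG00000095717 = c("102642706", "102642868"),
  ENSMUSG00000072915 = "102642717"
)

# Repeat each Ensembl ID once per Entrez ID it maps to, and
# unlist the Entrez IDs in the same order: one row per mapping.
ens2ncbi_toy <- data.frame(
  Entrez  = unlist(xx_toy, use.names = FALSE),
  Ensembl = rep(names(xx_toy), lengths(xx_toy)),
  stringsAsFactors = FALSE
)
print(ens2ncbi_toy)
```

The multi-mapped Ensembl ID simply appears on two rows, one per Entrez ID, with its name left untouched.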

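You'll just have to be wary of this going forward. Lookups based on match() will only take the first match that they find, which may be just fine, whereas base merge() keeps every matching row, so a multi-mapped ID will appear on more than one row of the merged table.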
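To see how a duplicated ID behaves in practice, here is a small sketch comparing base merge() with a match()-based lookup. The data-frames and the log2FoldChange values are made up for illustration; only the IDs come from the output above.

```r
# Mapping table with one Ensembl ID that has two Entrez IDs
ens2ncbi_toy <- data.frame(
  Ensembl = c("ENSMUSG00000090744", "ENSMUSG00000095717", "ENSMUSG00000095717"),
  Entrez  = c("102642386", "102642706", "102642868"),
  stringsAsFactors = FALSE
)

# Hypothetical results table (fold changes are invented)
results_toy <- data.frame(
  Ensembl        = c("ENSMUSG00000090744", "ENSMUSG00000095717"),
  log2FoldChange = c(1.5, -2.0),
  stringsAsFactors = FALSE
)

# merge() keeps every matching pair, so the multi-mapped gene is duplicated
merged <- merge(ens2ncbi_toy, results_toy, by = "Ensembl")
print(nrow(merged))

# match() returns only the position of the first hit per ID
first_only <- results_toy
first_only$Entrez <- ens2ncbi_toy$Entrez[match(results_toy$Ensembl,
                                               ens2ncbi_toy$Ensembl)]
print(first_only)
```

Which behaviour is preferable depends on the downstream analysis; the point is only to know which one you are getting.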

Kevin

ADD COMMENT • link 6.4 years ago by Kevin Blighe 87k

0

Thank you very much Kevin, and sorry for always being a pain. As a musician (or should I use "as.musician"??) I always find it difficult to deal with this kind of problem, and for me bioinformatics is an ongoing, stepwise process of learning by doing and trial and error. I am working on the code and I will let you know if any issues still occur.

ADD REPLY • link 6.4 years ago by Mozart ▴ 330

0

Hi Mozart, so, a new symphony is being released soon?! will it be available on Bioconductor?

No problem. The learning process even for me and the most senior Professors is never ending. Should one assume that they already know everything, then they just highlight how little they truly know.

I encounter bugs on an almost daily basis. It is very difficult to account for all eventualities. Systems like air traffic control obviously do have to account for all eventualities, but they undergo different levels of testing than our standard bioinformatics tools.

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

I see; the thing is, at some point... I mean, it's quite frustrating to be blocked by what for you expert guys is just a simple issue. Anyway, I come from a totally different background and it's like going back to university, spending hours and hours on "silly things". For example, I spent the whole day trying to run ReactomePA... and I am still not able to solve the problem. Generally speaking, my strategy is to reproduce the tutorial in order to understand where I "fall". This time it was pretty easy, as the example uses

data(geneList)

returning a table like that (so please notice the grey column with the NCBI name and the white column with p value)

and I tried my best, but I was not able to produce anything better than that (where the grey column just gives the order of both values... I am not even sure whether the grey column is a column or just the row names...)

but I will probably sort it out tomorrow!

ADD REPLY • link 6.1 years ago by Mozart ▴ 330

0

If you have a separate question, it would be a good idea to open a new thread.

It also sounds like you need a guide for the lonely bioinformatician, written by my professional colleague Mick Watson (he's Scottish; I'm Irish - same thing).

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

That's just an automatically-assigned row number. You can ignore it.

In R:

data-frames must have unique rownames
data-matrices do not require unique rownames

You could most likely set your rownames to Entrez IDs with:

rownames(ens2ncbi) <- ens2ncbi$Entrez
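The difference between the two rules can be checked on a toy example. This is a minimal sketch with made-up objects (m, df) and a deliberately repeated ID; it only illustrates that a matrix accepts duplicated rownames while a data-frame refuses them.

```r
# A matrix accepts duplicated rownames without complaint
m <- matrix(1:4, nrow = 2)
rownames(m) <- c("11287", "11287")
print(rownames(m))

# A data-frame refuses them: the assignment raises an error,
# which tryCatch() captures here so the script keeps running.
# (R may additionally print a warning about non-unique values.)
df <- data.frame(value = 1:2)
res <- tryCatch({
  rownames(df) <- c("11287", "11287")
  "ok"
}, error = function(e) conditionMessage(e))
print(res)
```

So the rownames assignment above only works if every Entrez ID in the column occurs exactly once.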

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

0

Thanks so much for your reply, and sorry that I didn't open another thread for another problem; I will keep it in mind for next time (or am I supposed to create a new post for this?). I tried to do so, but something unexpected happened because I got this kind of error:

Error in `row.names<-.data.frame`(`*tmp*`, value = c( 54192L, : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': '23789'............ecc

So I tried with 'make.names', creating a matrix from scratch and putting the Entrez IDs as row.names, but I think it's way too messy for me. Any suggestions, guys?

ADD REPLY • link 6.4 years ago by Mozart ▴ 330

0

I guess that I was just giving an example of how to set rownames. You do not actually have to set rownames in this case. You already have the data-frame with Ensembl-to-Entrez mappings, with Entrez in the first column and Ensembl in the second.

The error is produced here because, evidently, we also have the situation where more than one Ensembl ID maps to the same Entrez ID. Working across annotations, these issues always occur.
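Both directions of multi-mapping can be listed with duplicated(). The sketch below uses a made-up table: the two ENSMUSG_HYPOTHETICAL_* IDs are placeholders invented for illustration (the thread does not say which Ensembl IDs share Entrez ID 23789), while the other IDs come from the output earlier in the thread.

```r
# Toy mapping table; the ENSMUSG_HYPOTHETICAL_* IDs are invented placeholders
ens2ncbi_toy <- data.frame(
  Entrez  = c("11287", "23789", "23789", "102642706", "102642868"),
  Ensembl = c("ENSMUSG00000030359", "ENSMUSG_HYPOTHETICAL_A",
              "ENSMUSG_HYPOTHETICAL_B", "ENSMUSG00000095717",
              "ENSMUSG00000095717"),
  stringsAsFactors = FALSE
)

# Entrez IDs shared by more than one Ensembl ID
dup_entrez <- unique(ens2ncbi_toy$Entrez[duplicated(ens2ncbi_toy$Entrez)])

# Ensembl IDs that map to more than one Entrez ID
dup_ensembl <- unique(ens2ncbi_toy$Ensembl[duplicated(ens2ncbi_toy$Ensembl)])

print(dup_entrez)
print(dup_ensembl)
```

Running the same two checks on the real ens2ncbi table should show how many IDs are affected before deciding how to handle them.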

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

1

Thanks very much Kevin, it seems to work smoothly now; I can definitely come up with a new composition now, thanks for giving me the right inspiration!

ADD REPLY • link 6.4 years ago by Mozart ▴ 330