Question

Merging Matrices From Across Platforms


13.0 years ago

matthew.tung.1 • 0

Hey BioStar,

I'm trying to combine expression data in R from several experiments performed on different platforms. I have converted all of the probe set names into their corresponding Entrez Gene IDs and then used LIMMA to identify the most powerful probe set for each ID (when there are multiple probe sets mapping to the same gene).

I am planning to construct a single data frame for all of the experiments that contains all of the Entrez Gene IDs examined and all of the samples from the experiments. I thus need a function that will extend existing Entrez IDs' rows and add new rows when I add a new experiment's data set. I had planned to use merge(data, new.data, by=0, all=TRUE) and deal with the NAs later, but I am getting an error telling me "Error in match.names(clabs, names(xi)) : names do not match previous names." A colleague suggested that I cast the data frame as a matrix so that I could merge two matrices, but this doesn't seem to have solved the problem.

Any ideas? Thanks!


Updated 13.0 years ago by seidel 11k • written 13.0 years ago by matthew.tung.1

score 0 · Answer 1 · 2012-06-29


13.0 years ago

Sean Davis 27k

I'm not answering the question you asked, but I think the question you are trying to ask might be answered by an approach like this one:

http://bioconductor.jp/packages/2.11/bioc/html/virtualArray.html


score 0 · Answer 2 · 2012-06-30

I don't know of an elegant or easy way to do this, so I usually do it the brute-force way: collect all unique Entrez Gene IDs so you can declare a data frame with that many rows, then loop through your experiments, adding columns to the data frame and matching each experiment's rows to the correct positions:

# uniqueIDs is assumed to be a vector of every Entrez Gene ID seen in any experiment
# create a data frame of NAs that can be populated with data
myData <- as.data.frame(matrix(NA, ncol = ncol(first_experiment), nrow = length(uniqueIDs)))
rownames(myData) <- uniqueIDs

# assuming your first experiment has 5 samples (thus 5 columns), add the data to the data frame
# by creating an index vector to match names between the data sets
iv <- match(rownames(myData), rownames(first_experiment))

# only a subset of first_experiment rows will match, thus there will be NAs
# only update the positions of myData with a match to first_experiment
update_positions <- which(!is.na(iv))

# get rid of NAs in iv
iv <- iv[update_positions]

# populate myData with matching rows from first_experiment
myData[update_positions,] <- first_experiment[iv,]
colnames(myData) <- colnames(first_experiment)

# repeat as needed by appending columns
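The "repeat as needed" step can be sketched as a loop that pre-allocates the full table and fills in one block of columns per experiment. This is only a rough sketch, assuming each experiment is a numeric matrix with unique Entrez IDs as row names and unique sample names as column names; exp_a, exp_b and experiments are made-up toy objects, not anything from the original post.

```r
# Hypothetical toy data: two experiments with partially overlapping Entrez IDs
exp_a <- matrix(c(1, 2, 3, 4), nrow = 2,
                dimnames = list(c("100", "200"), c("a1", "a2")))
exp_b <- matrix(c(5, 6, 7, 8, 9, 10), nrow = 3,
                dimnames = list(c("200", "300", "400"), c("b1", "b2")))
experiments <- list(exp_a, exp_b)

# Every ID seen in any experiment becomes a row of the combined table
uniqueIDs <- unique(unlist(lapply(experiments, rownames)))

# Pre-allocate an all-NA matrix wide enough to hold every sample
total_cols <- sum(sapply(experiments, ncol))
myData <- matrix(NA_real_, nrow = length(uniqueIDs), ncol = total_cols,
                 dimnames = list(uniqueIDs,
                                 unlist(lapply(experiments, colnames))))

next_col <- 1
for (e in experiments) {
  # Columns of the combined table reserved for this experiment
  cols <- seq(next_col, length.out = ncol(e))
  # Row positions of this experiment's IDs in the combined table
  iv <- match(rownames(e), uniqueIDs)
  myData[iv, cols] <- e
  next_col <- next_col + ncol(e)
}

myData <- as.data.frame(myData)
```

IDs that an experiment never measured simply stay NA in that experiment's columns, which matches the "deal with the NAs later" plan in the question.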

Since it's easy to add columns to a data frame, you could write a loop for this kind of thing. Remember that in R you can subscript the target of a replacement function, so to update column names as you build out your data frame you can do something like colnames(myData)[17:21] <- colnames(experiment5). I think it's ugly, and it's too easy to make an error if you don't check everything explicitly, but when joining a whole bunch of matrices of different sizes, I don't know of a way other than building with the end (the total number of final rows) in mind. On the other hand, you could probably write a function that, given two matrices, adds to each any rows found only in the other, joins them, and returns the result, and then apply it repeatedly. I'm sure I could make that look ugly too :)
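The two-matrix function idea might look something like the following. It is a sketch under the same assumptions as before (numeric matrices, unique Entrez IDs as row names); merge_by_rownames and the toy matrices are hypothetical names invented for illustration.

```r
# Hypothetical helper: column-bind two matrices on the union of their row names
merge_by_rownames <- function(x, y) {
  ids <- union(rownames(x), rownames(y))
  # match() gives NA for IDs missing from a matrix, and indexing a matrix
  # with an NA row index yields a row of NAs, which pads the missing genes
  out <- cbind(x[match(ids, rownames(x)), , drop = FALSE],
               y[match(ids, rownames(y)), , drop = FALSE])
  rownames(out) <- ids
  out
}

# Hypothetical toy data: three experiments with partially overlapping IDs
exp_a <- matrix(c(1, 2, 3, 4), nrow = 2,
                dimnames = list(c("100", "200"), c("a1", "a2")))
exp_b <- matrix(c(5, 6, 7, 8, 9, 10), nrow = 3,
                dimnames = list(c("200", "300", "400"), c("b1", "b2")))
exp_c <- matrix(c(11, 12), nrow = 2,
                dimnames = list(c("100", "500"), "c1"))

# Reduce() applies the helper pairwise: ((exp_a + exp_b) + exp_c)
combined <- Reduce(merge_by_rownames, list(exp_a, exp_b, exp_c))
```

Using Reduce() means a new experiment only has to be appended to the list; the row set grows as needed, so nothing has to be sized up front.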