R - Pairwise Combinations Of Rows From Table?
3
1
Entering edit mode
11.9 years ago

Hi,

Assume a table as below:

X =

col1    col2    col3
row1    "A"      "0"     "1"
row2    "B"      "2"     "NA"
row3    "C"      "1"     "2"


I select combinations of two rows, using the code below:

pair <- apply(X, 2, combn, m=2)


This returns a matrix of the form:

pair =

[,1] [,2] [,3]
[1,] "A"  "0"  "1"
[2,] "B"  "2"  NA
[3,] "A"  "0"  "1"
[4,] "C"  "1"  "2"
[5,] "B"  "2"  NA
[6,] "C"  "1"  "2"


I wish to iterate over pair, taking two rows at a time, i.e. first isolate [1,] and [2,], then [3,] and [4,] and finally, [5,] and [6,]. These rows will then be passed as arguments to regression models, i.e. lm(Y ~ row[i]*row[j]).

I am dealing with a large dataset. Can anybody advise how to iterate over a matrix two rows at a time, assign those rows to variables and pass as arguments to a function?

Thanks,
S ;-)

Edit: In response to the comments, I should specify that my problem concerns SNP and expression data where I aim to do a pairwise multiple regression analysis (first order regression) in order to assess any possible SNP-SNP interactions that may effect the expression phenotype.

r • 21k views
1
Entering edit mode

I disagree, but being a newbie, I will accept your ruling.

1
Entering edit mode

ok, I re-opened the question

1
Entering edit mode

I suggest to be rather permissive with border-line off-topic questions. What Biostars needs are more good questions. If we close too aggressively, than in particular new people might be discouraged from even asking.

1
Entering edit mode

I think generic methods questions are fine, provided that they are linked to a problem in bioinformatics which is clearly stated (as was added in the edit to the question in this case).

0
Entering edit mode

sorry, this question was off-topic to me. May be http://stats.stackexchange.com/ would be a better place to ask this.

0
Entering edit mode

This is only mildly off-topic, and I don't think this kind of question should be closed. I would guess this question is directly related to a bioinformatics problem being addressed by the poster.

0
Entering edit mode

We shouldn't be rather permissive with border-line on-topic questions. What Biostars needs are more good questions. If we close too aggressively, than in particular new people might be discouraged from even asking.

4
Entering edit mode
11.9 years ago

Using mapply you can map a function over more than one sequence (list or vector) in one pass e.g. the odd and even rows of a matrix

fn <- function(i, j) paste("i =", i, "j =", j)

odds <- seq(1, 9, by = 2)
evens <- odds + 1

mapply(fn, odds, evens)

1
Entering edit mode

+1 for the apply instead of loop approach

1
Entering edit mode

David is right. However, given that Seafoid had gone down that route already, I assumed the dataset was tractable. If the number of combinations were huge, I'd pull them from a generator function in the style of a list comprehension, rather than make them up front.

0
Entering edit mode

I agree that apply is in general a great approach. However, the method suggested by Seafoid and Keith of pre-creating all combinations with massively redundant numbers of rows might not scale to genome-scale numbers of SNPs if you're doing all combinations (say, N=500,000 or more).

0
Entering edit mode

Yes, I applied some stringent filters on my data and reduced from ~700, 000 to a mere 1,800. This is a first pass analysis so in time the filter stringency will be more lenient and thus the number of SNPs is envisaged to be much larger.

2
Entering edit mode
11.9 years ago

A simple idiom for "all pairs of a matrix" M of dim(N, C)

for( i in 1:(N-1) ){
for( j in (i+1):N ){
print(paste(i,j))
}
}


I've had to do this eQTL analysis. For any non-trivial number of genotypes or expression measurements, the naive approach (e.g. building a new lm for each comparison) is too slow. If PHENO is a matrix (not data.frame) of expression values and GENO the set of genotypes for a single locus at all of the samples in PHENO, you want something like

for each GENO
model = lm( t(PHENO)~GENO )


Extend the linear model to match whatever interactions you want to test. You can extract the p-values of coefficients of a solved linear model with

calculate.lm.pval=function(linear.model){
# Given a solved linear model, return the p-values of the coefficients
n.obs = dim(linear.model$residuals)[1] n.models = dim(linear.model$coefficients)[2]
cc = matrix(coef(linear.model), nrow=linear.model$rank, ncol=n.models) se = calculate.linear.model.se(linear.model) t.stat = cc/se rdf = n.obs - linear.model$rank
dm=data.frame( 2*pt(abs(t.stat), rdf, lower.tail = FALSE)  )
names(dm) = as.vector(dimnames(linear.model$coefficients)[[2]]) rownames(dm) = dimnames(linear.model$coefficients)[[1]]
dm
}

0
Entering edit mode
11.9 years ago

If you want to analyze your SNP-SNP interaction in genomic scale, I think the combinations would be too huge for your computer to fulfill. Luckily, there is now a standard package named MDR, with which your can get the most possible SNP-SNP interaction affecting the phenotype. You can download MDR from here http://sourceforge.net/projects/mdr/, and further references about MDR are redundant on internet.