Question

R - Pairwise Combinations Of Rows From Table?

1

Entering edit mode

14.7 years ago

Darren J. Fitzpatrick ★ 1.1k

Hi,

Assume a table as below:

X = 

        col1    col2    col3
row1    "A"      "0"     "1"
row2    "B"      "2"     "NA"
row3    "C"      "1"     "2"

I select combinations of two rows, using the code below:

pair <- apply(X, 2, combn, m=2)

This returns a matrix of the form:

pair = 

     [,1] [,2] [,3]
[1,] "A"  "0"  "1" 
[2,] "B"  "2"  NA  
[3,] "A"  "0"  "1" 
[4,] "C"  "1"  "2" 
[5,] "B"  "2"  NA  
[6,] "C"  "1"  "2"

I wish to iterate over pair, taking two rows at a time, i.e. first isolate [1,] and [2,], then [3,] and [4,] and finally, [5,] and [6,]. These rows will then be passed as arguments to regression models, i.e. lm(Y ~ row[i]*row[j]).

I am dealing with a large dataset. Can anybody advise how to iterate over a matrix two rows at a time, assign those rows to variables and pass as arguments to a function?

Thanks,
S ;-)

Edit: In response to the comments, I should specify that my problem concerns SNP and expression data where I aim to do a pairwise multiple regression analysis (first order regression) in order to assess any possible SNP-SNP interactions that may effect the expression phenotype.

r • 23k views

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 14.7 years ago by Darren J. Fitzpatrick ★ 1.1k

1

Entering edit mode

I disagree, but being a newbie, I will accept your ruling.

ADD REPLY • link 14.7 years ago by Darren J. Fitzpatrick ★ 1.1k

1

Entering edit mode

ok, I re-opened the question

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

I suggest to be rather permissive with border-line off-topic questions. What Biostars needs are more good questions. If we close too aggressively, than in particular new people might be discouraged from even asking.

ADD REPLY • link 14.7 years ago by Michael 56k

1

Entering edit mode

I think generic methods questions are fine, provided that they are linked to a problem in bioinformatics which is clearly stated (as was added in the edit to the question in this case).

ADD REPLY • link 14.7 years ago by Neilfws 49k

0

Entering edit mode

sorry, this question was off-topic to me. May be http://stats.stackexchange.com/ would be a better place to ask this.

ADD REPLY • link 14.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

This is only mildly off-topic, and I don't think this kind of question should be closed. I would guess this question is directly related to a bioinformatics problem being addressed by the poster.

ADD REPLY • link 14.7 years ago by Bio_X2Y ★ 4.4k

0

Entering edit mode

We shouldn't be rather permissive with border-line on-topic questions. What Biostars needs are more good questions. If we close too aggressively, than in particular new people might be discouraged from even asking.

ADD REPLY • link 14.7 years ago by Michael 56k

0

Entering edit mode

14.7 years ago

Cheng Zhongshan ▴ 400

If you want to analyze your SNP-SNP interaction in genomic scale, I think the combinations would be too huge for your computer to fulfill. Luckily, there is now a standard package named MDR, with which your can get the most possible SNP-SNP interaction affecting the phenotype. You can download MDR from here http://sourceforge.net/projects/mdr/, and further references about MDR are redundant on internet.

ADD COMMENT • link 14.7 years ago by Cheng Zhongshan ▴ 400

Ram · Accepted Answer · 2010-11-16

4

Entering edit mode

14.7 years ago

biobot 0.0.77.a.1099 6.2k

Using mapply you can map a function over more than one sequence (list or vector) in one pass e.g. the odd and even rows of a matrix

fn <- function(i, j) paste("i =", i, "j =", j)

odds <- seq(1, 9, by = 2)
evens <- odds + 1

mapply(fn, odds, evens)

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 14.7 years ago by biobot 0.0.77.a.1099 6.2k

1

Entering edit mode

+1 for the apply instead of loop approach

ADD REPLY • link 14.7 years ago by Neilfws 49k

1

Entering edit mode

David is right. However, given that Seafoid had gone down that route already, I assumed the dataset was tractable. If the number of combinations were huge, I'd pull them from a generator function in the style of a list comprehension, rather than make them up front.

ADD REPLY • link 14.7 years ago by biobot 0.0.77.a.1099 6.2k

0

Entering edit mode

I agree that apply is in general a great approach. However, the method suggested by Seafoid and Keith of pre-creating all combinations with massively redundant numbers of rows might not scale to genome-scale numbers of SNPs if you're doing all combinations (say, N=500,000 or more).

ADD REPLY • link 14.7 years ago by David Quigley 11k

0

Entering edit mode

Yes, I applied some stringent filters on my data and reduced from ~700, 000 to a mere 1,800. This is a first pass analysis so in time the filter stringency will be more lenient and thus the number of SNPs is envisaged to be much larger.

ADD REPLY • link 14.7 years ago by Darren J. Fitzpatrick ★ 1.1k

Ram · Accepted Answer · 2010-11-16

A simple idiom for "all pairs of a matrix" M of dim(N, C)

for( i in 1:(N-1) ){
  for( j in (i+1):N ){
    print(paste(i,j))
  }
}

I've had to do this eQTL analysis. For any non-trivial number of genotypes or expression measurements, the naive approach (e.g. building a new lm for each comparison) is too slow. If PHENO is a matrix (not data.frame) of expression values and GENO the set of genotypes for a single locus at all of the samples in PHENO, you want something like

for each GENO
  model = lm( t(PHENO)~GENO )

Extend the linear model to match whatever interactions you want to test. You can extract the p-values of coefficients of a solved linear model with

calculate.lm.pval=function(linear.model){
    # Given a solved linear model, return the p-values of the coefficients
    n.obs = dim(linear.model$residuals)[1]
    n.models = dim(linear.model$coefficients)[2]
    cc = matrix(coef(linear.model), nrow=linear.model$rank, ncol=n.models)
    se = calculate.linear.model.se(linear.model)
    t.stat = cc/se
    rdf = n.obs - linear.model$rank
    dm=data.frame( 2*pt(abs(t.stat), rdf, lower.tail = FALSE)  )
    names(dm) = as.vector(dimnames(linear.model$coefficients)[[2]])
    rownames(dm) = dimnames(linear.model$coefficients)[[1]]
    dm
}