Question: Binding several large matrices by column
Asked 2.0 years ago by yogacacarolom40 (Brazil):

I know the 'large matrix' issue is a recurring topic here, but I would like to explain my specific problem in detail.

Strictly speaking, I want to `cbind` several large matrices with a specific name pattern in R. The code below shows my best attempt so far.

First, let's produce files that mimic my real matrices:

    # The df1
    df1 <- '######## infx infx infx
    ######## infx infx infx
    probeset_id sample1 sample2 sample3
    PR01           1       2       0
    PR02           -1      2       0
    PR03            2      1       1
    PR04           1       2       1
    PR05           2       0       1'
    df1 <- read.table(text=df1, header=T, skip=2)
    write.table(df1, "df1.txt", col.names=T, row.names=F, quote=F, sep="\t")
    
    # The df2
    df2 <- '######## infx infx infx
    ######## infx infx infx
    probeset_id sample4 sample5 sample6
    PR01           2       2       1
    PR02           2      -1       0
    PR03            2      1       1
    PR04           1       2       1
    PR05           0       0       1'
    df2 <- read.table(text=df2, header=T, skip=2)
    write.table(df2, "df2.txt", col.names=T, row.names=F, quote=F, sep="\t")
    
    # The dfn
    dfn <- '######## infx infx infx
    ######## infx infx infx
    probeset_id samplen1 samplen2 samplen3
    PR01           2       -1       1
    PR02           1      -1       0
    PR03            2      1       1
    PR04           1       2       -1
    PR05           0       2       1'
    dfn <- read.table(text=dfn, header=T, skip=2)
    write.table(dfn, "dfn.txt", col.names=T, row.names=F, quote=F, sep="\t")

Then import them into R and write my expected `output` file:

    ### Importing and excluding the duplicated 'probeset_id' columns
    calls = list.files(pattern="*.txt")
    library(data.table)
    calls = lapply(calls, fread, header=T)
    mycalls <- as.data.frame(calls)
    probenc <- as.data.frame(mycalls[,1])
    mycalls <- mycalls[, -grep("probe", colnames(mycalls))]
    output <- cbind(probenc, mycalls)
    names(output)[1] <- "probeset_id"
    write.table(output, "output.txt", col.names=T, row.names=F, quote=F, sep="\t")

What the output looks like:

    > head(output)
      probeset_id sample1 sample2 sample3 sample4 sample5 sample6 samplen1 samplen2 samplen3
    1        PR01       1       2       0       2       2       1        2       -1        1
    2        PR02      -1       2       0       2      -1       0        1       -1        0
    3        PR03       2       1       1       2       1       1        2        1        1
    4        PR04       1       2       1       1       2       1        1        2       -1
    5        PR05       2       0       1       0       0       1        0        2        1

This code works perfectly for what I want to do; however, I hit the well-known R memory limitation with my real data (more than 30 "`df`" objects of ~1.3 GB each, roughly 600k rows by 100 columns).

I read about a very interesting SQL approach (http://stackoverflow.com/questions/4761073/r-how-to-rbind-two-huge-data-frames-without-running-out-of-memory), but I am inexperienced with SQL and have not found a way to adapt it to my case.
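For illustration, this is the kind of adaptation I imagine (a sketch assuming the RSQLite package; the table names and the two-table join are illustrative, and I have not gotten this to work on my real data):

    # a sketch assuming the RSQLite package; untested on the real data
    library(DBI)
    library(RSQLite)
    library(data.table)

    con <- dbConnect(SQLite(), "calls.sqlite")

    # load each matrix into its own table, keyed by probeset_id
    files <- list.files(pattern = "*.txt")
    for (i in seq_along(files)) {
        dbWriteTable(con, paste0("calls", i), fread(files[i], header = TRUE),
                     overwrite = TRUE)
    }

    # let SQLite do the column-binding on disk via a join on probeset_id,
    # so the combined matrix never sits in R's memory all at once;
    # dbSendQuery() plus dbFetch(n = ...) could stream the result in chunks
    output <- dbGetQuery(con, "
        SELECT c1.*, c2.sample4, c2.sample5, c2.sample6
        FROM calls1 AS c1
        JOIN calls2 AS c2 USING (probeset_id)")
    dbDisconnect(con)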

Cheers,

matrix big-data R

A big part of bioinformatics, or of computer science in general, is finding the right way to look at a problem so that it becomes solvable with the tools we have. As you have seen, the simple solution in R starts to fail when the matrices get too large. No matter how much system RAM you buy, a bigger data file will always come along sooner or later.

Below, Pierre has suggested concatenating the files outside of R. I think it will solve your issue today. But if those matrices take up half of your storage capacity, making a duplicate in the new form is not possible; and if this is a repetitive task, that slow intermediate step might be a showstopper. As I said, you will always get a bigger data file eventually. What you have to do is outsmart the problem: think about what you are trying to do with the data, and it may be possible to skip the large-matrix stage entirely.

If you want to know how many elements are greater than X, you can do that on each submatrix separately. If you want the row means, those can also be computed separately. I see these are genotypes, so you could stratify the rows by chromosome, reducing the matrix size by roughly 20x. Maybe you are doing an imputation that needs all columns together, but that is really a local process that could be windowed over every thousand or so rows. If you are trying to do a PCA on the whole population, try a few thousand randomly selected rows from each matrix (to get it small enough to cbind in R); the result should be substantially the same in the first few dominant PCs. That sort of thing.
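For concreteness, a rough sketch of that per-file accumulation in R (assuming the toy files above, which all share the same probeset order; the threshold of 1 is illustrative):

    library(data.table)

    files <- list.files(pattern = "*.txt")

    row_sums <- NULL
    n_cols   <- 0
    n_gt_x   <- 0

    for (f in files) {
        # read one submatrix at a time, so only one is ever in memory
        dt <- fread(f, header = TRUE)
        m  <- as.matrix(dt[, -1, with = FALSE])   # drop the probeset_id column
        n_gt_x   <- n_gt_x + sum(m > 1)           # elements greater than X
        row_sums <- if (is.null(row_sums)) rowSums(m) else row_sums + rowSums(m)
        n_cols   <- n_cols + ncol(m)
    }

    row_means <- row_sums / n_cols    # row means over all columns, no cbind needed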

written 2.0 years ago by karl.stamm3.2k
Answered 2.0 years ago by Pierre Lindenbaum98k (France/Nantes/Institut du Thorax - INSERM UMR1087):

Just a simple bash loop with Linux `join`?

    sort -k1,1n file1.txt > sorted.txt &&
    for F in file2 file3; do
        sort -k1,1n ${F}.txt | join -1 1 -2 1 sorted.txt - > tmp && mv tmp sorted.txt
    done &&
    cat sorted.txt

PR01 2 2 1 2 2 1 2 -1 1
PR02 2 -1 0 2 -1 0 1 -1 0
PR03 2 1 1 2 1 1 2 1 1
PR04 1 2 1 1 2 1 1 2 -1
PR05 0 0 1 0 0 1 0 2 1
probeset_id sample1 sample2 sample3 sample4 sample5 sample6 samplen1 samplen2 samplen3

 

written 2.0 years ago by Pierre Lindenbaum98k

To clarify your Bash solution, Pierre: yogacacarolom would modify the "file2 file3" portion of "for F in file2 file3" to use wildcards matching all files (except file1), so she could join an arbitrarily large number of files, correct?

Also, that "cat sorted.txt" is solely there to print the output at the end, so if one does not want to print the entire file, one can omit the trailing "&& cat sorted.txt", right?

written 2.0 years ago by Joseph Pearson360

'yes' and 'yes' :-)
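For concreteness, the wildcard version might look like this (a sketch; it assumes the files are named file1.txt, file2.txt, ..., and the explicit skip of file1.txt, which already seeds sorted.txt, is illustrative):

    sort -k1,1n file1.txt > sorted.txt &&
    for F in file*.txt; do
        [ "$F" = "file1.txt" ] && continue   # file1.txt already seeds sorted.txt
        sort -k1,1n "$F" | join -1 1 -2 1 sorted.txt - > tmp && mv tmp sorted.txt
    done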

written 2.0 years ago by Pierre Lindenbaum98k

Hi Pierre, this works perfectly, and thanks for the solution. I am running into a similar issue to the one in this post, except that I want to match the files on the chr and bp position across different files and build the matrix from that. So I took the liberty of tweaking your solution; here is the syntax. In each input file, the first field holds the chr number, the second field the bp position, and the third field the values:

sort -n -k1.4n -k2,2n DNase_E003.hits.epigen.input > sorted.txt && for F in *.hits.epigen.input; do sort -n -k1.4n -k2,2n ${F} | join -1 1 -2 1 sorted.txt - >> tmp && mv tmp sorted.txt; done;

Running this script gives me the output, but it also prints all the coordinate fields (chr, bp_pos) from each input file. Is there a way to omit these? Also, the command repeatedly prompts to ask whether sorted.txt may be overwritten, so for a large list of files I get a long list of prompts. It would be great to hear your input on this.

written 9 months ago by muralinmars40
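One hedged sketch of a way around both issues (these are assumptions, not from the thread: the prompts most likely come from an `mv -i` alias, so `mv -f` suppresses them, and building a composite chr:bp key lets `join` match on both coordinates while carrying only the value column from each file):

    # assumption: each file has fields chr, bp, value; DNase_E003 seeds the matrix
    awk '{print $1":"$2, $3}' DNase_E003.hits.epigen.input | sort -k1,1 > sorted.txt
    for F in *.hits.epigen.input; do
        [ "$F" = "DNase_E003.hits.epigen.input" ] && continue   # already the seed
        awk '{print $1":"$2, $3}' "$F" | sort -k1,1 |
            join -1 1 -2 1 sorted.txt - > tmp && mv -f tmp sorted.txt
    done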