Struggling to work with a large count matrix
1
0
20 months ago

Hello there,

I'm struggling to load/import my large count matrix into RStudio in order to analyze it. It's only medium-sized data (3 GB), but R crashes every time and my PC wants to commit seppuku.

So importing it with a plain read.table call doesn't work. I tried storing it as a big.matrix, but that doesn't work either; R crashes again.

What can I do? I can't find any nice tutorial for this kind of problem.

R countMatrix single-cell • 656 views
2

Is it crashing due to running out of RAM? Have you tried a sparse matrix?
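To illustrate why a sparse matrix helps here, below is a minimal sketch using the Matrix package. The data are simulated, and the sizes and the roughly 95% share of zeros are assumptions chosen for illustration, not properties of the OP's file.

```r
library(Matrix)

set.seed(1)
n_genes <- 2000
n_cells <- 500

# Simulated counts (assumption: about 95% of the entries are zero)
vals <- rbinom(n_genes * n_cells, size = 1, prob = 0.05) *
  rpois(n_genes * n_cells, lambda = 3)
dense <- matrix(as.numeric(vals), nrow = n_genes, ncol = n_cells)

# The sparse representation stores only the non-zero entries
sparse <- Matrix(dense, sparse = TRUE)

dense_bytes <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))
cat("dense: ", dense_bytes, "bytes\n")
cat("sparse:", sparse_bytes, "bytes\n")
```

The saving depends entirely on how many entries are zero, so it is worth checking on a small slice of the real data first.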

0

Yep, the RAM can't keep up.

I was thinking of that, but I can't find a way to read the file directly into a sparseMatrix without the read.table step. read.matrix maybe?

2

How about doing that sequentially, like in chunks of 10%?
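A rough sketch of the chunked approach, converting each block to a sparse matrix before reading the next one so the full dense table never sits in memory. The helper name `read_counts_chunked`, the tab-separated layout (first row = cell names, first column = gene names) and the chunk size are all assumptions for illustration; the real file may be laid out differently.

```r
library(Matrix)

# Hypothetical helper: read a tab-separated count table in blocks of
# `chunk_rows` lines and keep each block only in sparse form.
read_counts_chunked <- function(path, chunk_rows = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Header line: first field labels the gene column, the rest are cell names
  header <- strsplit(readLines(con, n = 1L), "\t", fixed = TRUE)[[1]]
  cells <- header[-1]

  chunks <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break

    fields <- strsplit(lines, "\t", fixed = TRUE)
    genes <- vapply(fields, function(f) f[1], character(1))
    counts <- do.call(rbind, lapply(fields, function(f) as.numeric(f[-1])))
    dimnames(counts) <- list(genes, cells)

    # Only the non-zero entries of this block are kept from here on
    chunks[[length(chunks) + 1L]] <- as(counts, "CsparseMatrix")
  }

  # Stack the sparse blocks row-wise into one sparse matrix
  do.call(rbind, chunks)
}

# Small demonstration on a temporary file
demo_path <- tempfile(fileext = ".tsv")
writeLines(c("gene\tcell1\tcell2\tcell3",
             "g1\t0\t2\t0",
             "g2\t0\t0\t0",
             "g3\t5\t0\t1",
             "g4\t0\t0\t0",
             "g5\t0\t7\t0"), demo_path)
counts <- read_counts_chunked(demo_path, chunk_rows = 2L)
```

Peak memory is then roughly one dense chunk plus the accumulated sparse blocks, so a smaller `chunk_rows` trades speed for a lower peak.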

3

By the way, don't bother with read.table; it is very slow. Use, for example (among many good options), data.table::fread() or readr::read_delim(). The speed gains are notable.
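For reference, a minimal sketch of the fread() route followed by a conversion to a sparse matrix. The tiny demo file and its layout (tab-separated, gene names in the first column) are assumptions for illustration. Note that this still holds the whole table in memory once, so it only helps if the file fits in RAM after parsing.

```r
library(data.table)
library(Matrix)

# Tiny demo file standing in for the real count table
demo_path <- tempfile(fileext = ".tsv")
writeLines(c("gene\tcell1\tcell2\tcell3",
             "g1\t0\t2\t0",
             "g2\t0\t0\t0",
             "g3\t5\t0\t1"), demo_path)

dt <- fread(demo_path, sep = "\t", header = TRUE)

# First column holds the gene names; the remaining columns are counts
gene_names <- dt[[1]]
count_mat <- as.matrix(dt[, -1, with = FALSE])
storage.mode(count_mat) <- "double"
rownames(count_mat) <- gene_names

# Keep only the sparse version and drop the dense intermediates
sparse_counts <- as(count_mat, "CsparseMatrix")
rm(dt, count_mat)
```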

1

Might be a job for {disk.frame} https://github.com/xiaodaigh/disk.frame

0

Do you mean "load/import" a big file?

charging my large count Matrix on R

0

Yeah, sorry, I did mean importing or loading the data.

0

I struggle to see how this is related to bioinformatics, or why it has attracted so many answers. Loads of questions get closed for asking about biology that is only tangentially related to bioinformatics. I don't see how this question is related to either.

2

Dealing with large data sets has become a more common issue although it is not specific to bioinformatics. However, for bioinformatics data types, there may exist specific tools. Here we're dealing with a count matrix and although replies currently suggest generic solutions, maybe someone has a more specific solution for count matrices as part of their analysis pipeline that they can share.

0

Agreed. The single-cell packages are starting to output counts in sparse matrices inside hdf5 containers for this reason, so if one could go back a step in OP's workflow there are likely some tweaks that could be made there to make life easier.

0

Maybe I am missing your point. Do you see anything in the question that implies biology or bioinformatics application of what this poster is trying to do?

My point was that lots of posters are turned away even though they sometimes have a legitimate biology question that may be related to bioinformatics. To me, that is closer to the intended purpose of this site than the current post.

0

Your point is valid, but as it does not add to the content of this thread I suggest we discuss things like that in our Slack, which you are invited to join:

biostar.slack.com: Chat for the biostars community -- [ feel free to join ]

0

Well, since it didn't seem important to say why I needed it, I didn't mention it. But I need to find out how to import this large, complex data set into R because I have to analyze a large count matrix produced by single-cell sequencing (SPLiT-seq, to be precise).

The count matrices produced by the pipelines that process the raw single-cell data are huge, and as a result R has trouble working with them and needs a lot of RAM.

So I was just asking around, as I can't find the right method for using R with such large files, and I want to do it properly.

2
20 months ago

Two R packages that may be of interest: