Analyzing big Hi-C data: is there a way to find significant interactions?
3
0
6.4 years ago
Baptiste ▴ 90

Hey everyone.

I have just started an internship in bioinformatics and I have to deal with big data from Hi-C. I want to analyze my data with R.
The data look like this: chrom  start  end  count
I want to build a matrix where each bin is filled with its count, then select the significant interactions and, if possible, plot a heat map.

It works at low resolution (500 kb, 100 kb), but when I run my code at high resolution (10 kb, 5 kb), problems occur and R cannot handle the amount of data.
So I tried to use a sparse matrix, but I can't run all of my code on it; at some point I have to convert it into a regular matrix.

If you have already dealt with this kind of problem and have a solution, let me know.
If you have a method to find significant interactions at high resolution, that would be great. =)
Thank you very much,

Baptiste

HiC BigData R Analysis Resolution • 2.8k views
1

Can you be a bit more specific than "R doesn't want to compute"? Are you running out of memory? Or getting an error message (which one)? Is it a problem with a package from CRAN, or with your own code? Or is it a problem with the data itself? What kind of operation are you trying to apply?

0

hey

So this is my code:

• file1: the raw Hi-C data (chrom start end count)
• file2: the file used to normalize the raw data
• binSize: the resolution, e.g. 500 kb = 5e5
• dimension: the size of the matrix
library(Matrix)
# Sparse contact matrix with 1-based bin indices
MyMatrix <- sparseMatrix(i = file1$V1 / binSize + 1, j = file1$V2 / binSize + 1, x = file1$V3, dims = c(dimension, dimension))
vector <- file2$V1
MatrixVector <- vector %o% vector
MatrixNorm <- MyMatrix / MatrixVector
as.matrix(MatrixNorm)
# I want a symmetric matrix for the heatmap
MatrixNorm1 <- as.matrix(forceSymmetric(MatrixNorm))


The real problem is not really my code, but how to deal with big data in R and find significant interactions between the two sides of the DNA. I am sorry if my request was not clear.

The real question is: is there a way to find significant interactions at high resolution?

0

The problem is that I don't have the mapped data. I looked at other software, such as:

• SeqMonk
• HOMER
• HiClib
• HiBrowse

But these tools only accept mapped data as input, and converting my data to go that way seems like a lot of work.

What do you think?

3
6.4 years ago
Fidel ★ 2.0k

To solve your problem with large matrices, I recommend doing your analysis per chromosome. This will dramatically reduce the size of the matrix.

Moreover, which method are you using to identify enriched contacts?

Apart from the problem of handling large matrices in R, I would be concerned that, as resolution increases, the statistical power to discern significant contacts is reduced. Be sure that you have sufficient counts per cell in your matrix. This of course depends on the depth of sequencing, the final number of usable reads, and the size of the genome.
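
One rough way to check for sufficient counts per cell is to re-bin the same contacts at several bin sizes and look at the typical count in a non-empty cell. The Python sketch below is only an illustration: it assumes the contacts of one chromosome are available as (position 1, position 2, count) triplets, and the function names and toy values are hypothetical.

```python
from collections import Counter
from statistics import median


def rebin(contacts, bin_size):
    """Sum counts into (bin_i, bin_j) cells of the upper triangle."""
    cells = Counter()
    for pos1, pos2, count in contacts:
        i, j = sorted((pos1 // bin_size, pos2 // bin_size))
        cells[(i, j)] += count
    return cells


def coverage_summary(contacts, bin_size):
    """Return (number of non-empty cells, median count per non-empty cell)."""
    cells = rebin(contacts, bin_size)
    return len(cells), median(cells.values())


# Toy contacts on one chromosome (hypothetical values).
contacts = [
    (1_000, 4_000, 3),
    (2_000, 12_000, 1),
    (6_000, 14_000, 2),
    (11_000, 13_000, 5),
    (1_500, 18_000, 1),
]

for size in (5_000, 10_000, 20_000):
    n_cells, med = coverage_summary(contacts, size)
    print(f"{size:>6} bp bins: {n_cells} non-empty cells, median count {med}")
```

If the median count per non-empty cell drops to 1 or 2 at the resolution you want, a threshold on those cells is mostly separating noise from noise.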

0

Hi Fidel,

Yes, I forgot to specify that I only work with intrachromosomal interactions, on a single chromosome.

To identify enriched contacts, I use "quantile" to find a threshold, then I select the values above that threshold.

Yes, this is a real problem because there are a lot of "NaN" values (meaning it did not converge) and I have to deal with them. Unfortunately, I can't replace NaN with 0.

0

I work with Python, and so far I haven't had a problem with the matrix size. Maybe you can try Python.

Have you checked the methods to compute long-range contacts by Job Dekker (Sanyal et al., Nature 2012), Victor Corces (Hou et al., Mol. Cell 2012), Bing Ren (same as Corces's) (Jin et al., Nature 2013) and Lieberman-Aiden (Rao et al. 2014)?

1
6.1 years ago
Bryan Lajoie ▴ 10

You might want to try python + numpy for this.

Though, why do you need to hold the entire genome-wide matrix in memory? Do you need the trans data as well? Or can you do as Fidel suggests and perform your calculations on each chromosome separately? Can you do your calculations in blocks/chunks?

In fact, the matrix format, while useful for visualization, is not ideal as a data structure. What about subsetting by genomic distance? You can then drop the outer n diagonals from the matrix; in sparse format they simply take up no memory.
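
As a minimal Python sketch of subsetting by genomic distance, assume the contacts are kept in sparse form as a dict mapping (bin_i, bin_j) pairs to counts. That layout and the function name are assumptions made for illustration, not something from the original data.

```python
def within_distance(cells, bin_size, max_distance):
    """Keep only cells whose two bins are at most max_distance bp apart."""
    max_offset = max_distance // bin_size  # number of diagonals to keep
    return {
        (i, j): count
        for (i, j), count in cells.items()
        if abs(i - j) <= max_offset
    }


# Toy sparse matrix at 10 kb resolution (hypothetical values).
cells = {(0, 1): 12, (0, 4): 3, (2, 9): 1, (10, 300): 2, (7, 7): 20}

# Keep contacts between bins no more than 50 kb apart (5 diagonals).
kept = within_distance(cells, bin_size=10_000, max_distance=50_000)
print(sorted(kept))
```

Nothing outside the chosen band is ever stored, so memory grows with the number of bins times the band width rather than with the square of the number of bins.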

Some recent papers perform local peak calling, which in effect allows each submatrix to be 'peak-called' independently; this reduces memory requirements and lets you compute in parallel! You do need to think carefully about what you hope to achieve when calling peaks and what your definition of a 'peak/loop' actually is. Global and local peak calling will produce vastly different results.
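
To make the idea of a local comparison concrete, here is a deliberately naive Python sketch. It is not the method of any of the papers mentioned in this thread: it simply flags a cell when its count is several times the mean of the surrounding cells. The dict layout, the function name, and the `window` and `min_ratio` parameters are all assumptions.

```python
def local_peaks(cells, window=2, min_ratio=2.0):
    """Flag cells whose count is at least min_ratio times the local background.

    cells maps (i, j) bin pairs to counts; missing cells count as 0.
    The background is the mean over the (2 * window + 1) ** 2 - 1 cells
    around the candidate. Cells with an empty neighbourhood are skipped.
    """
    n_neighbours = (2 * window + 1) ** 2 - 1
    peaks = []
    for (i, j), count in cells.items():
        background = sum(
            cells.get((i + di, j + dj), 0)
            for di in range(-window, window + 1)
            for dj in range(-window, window + 1)
            if (di, dj) != (0, 0)
        ) / n_neighbours
        if background > 0 and count / background >= min_ratio:
            peaks.append((i, j))
    return sorted(peaks)


# Toy 5 x 5 patch with a flat background of 2 and one enriched cell.
patch = {(i, j): 2 for i in range(10, 15) for j in range(20, 25)}
patch[(12, 22)] = 10

print(local_peaks(patch, window=1, min_ratio=3.0))
```

Because each decision only looks `window` bins around the candidate, a submatrix padded by `window` bins on each side can be processed on its own, which is what makes chunked or parallel computation possible.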

Also be aware of the distance bias in any interaction data. Loci close in the linear genome will also be close in the 3D genome and will have the strongest interaction signals. Depending on how you implement your peak calling, you may have to normalize for genomic distance before performing any quantile-based peak calling.
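
A minimal Python sketch of that order of operations, under stated assumptions: the contacts are a dict of (bin_i, bin_j) to count, the expected value at each distance is taken as the mean of the non-empty cells on that diagonal (a simplification, since empty cells are ignored), and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, quantiles


def observed_over_expected(cells):
    """Divide each count by the mean count of its diagonal |i - j|."""
    by_offset = defaultdict(list)
    for (i, j), count in cells.items():
        by_offset[abs(i - j)].append(count)
    expected = {offset: mean(v) for offset, v in by_offset.items()}
    return {(i, j): c / expected[abs(i - j)] for (i, j), c in cells.items()}


def above_quantile(values_by_cell, q=0.9):
    """Return the cells whose value is strictly above the q-th quantile."""
    cuts = quantiles(values_by_cell.values(), n=100, method="inclusive")
    threshold = cuts[round(q * 100) - 1]
    return {cell for cell, v in values_by_cell.items() if v > threshold}


# Toy data: strong but uniform short-range signal, and one long-range
# cell that stands out only relative to its own distance.
cells = {
    (0, 1): 100, (1, 2): 100, (2, 3): 100, (3, 4): 100,
    (0, 5): 2, (1, 6): 2, (2, 7): 2, (3, 8): 10,
}

normalized = observed_over_expected(cells)
print(above_quantile(normalized, q=0.9))
```

On the raw counts the four short-range cells dominate any quantile cut; after dividing by the per-distance expectation, only the long-range cell at (3, 8) remains above the threshold.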

0
6.4 years ago
Asaf 8.6k

Have you tried the Bioconductor packages GOTHiC and/or HiTC?

Regardless of these packages, you can bin your data according to the restriction enzyme recognition sites, which should reduce its complexity (if it's not already binned).
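
Binning by restriction fragment can be sketched in Python with a binary search over the sorted cut positions. The cut sites and contacts below are made-up values, and the function names are hypothetical; a real analysis would take the cut sites from an in-silico digest of the reference genome.

```python
from bisect import bisect_right
from collections import Counter


def fragment_index(cut_sites, position):
    """Return the index of the restriction fragment containing position.

    cut_sites is a sorted list of cut positions on one chromosome;
    fragment 0 runs from the chromosome start up to the first cut site.
    """
    return bisect_right(cut_sites, position)


def bin_by_fragment(cut_sites, contacts):
    """Sum contact counts per (fragment_i, fragment_j) pair, with i <= j."""
    cells = Counter()
    for pos1, pos2, count in contacts:
        i, j = sorted(
            (fragment_index(cut_sites, pos1), fragment_index(cut_sites, pos2))
        )
        cells[(i, j)] += count
    return cells


# Hypothetical cut sites and contacts on one chromosome.
cut_sites = [1_200, 4_800, 5_100, 9_700]
contacts = [(300, 5_000, 2), (900, 4_900, 1), (6_000, 10_000, 4), (5_000, 300, 1)]

fragment_cells = bin_by_fragment(cut_sites, contacts)
print(dict(fragment_cells))
```

Every read pair that falls in the same two fragments collapses into one cell, so the number of distinct cells is bounded by the number of fragment pairs rather than by the number of distinct read positions.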