Question

Help required to understand Hi-C data and its Usage

5.0 years ago

rohitsatyam102 ▴ 940

Hi Everyone!

I need some help understanding Hi-C data and how it is used. I am trying hard to learn it on my own, without any prior training, but there are parts I simply can't follow. For instance, I don't understand the matrix obtained after Hi-C data analysis, as shown here.

I also have a list of enhancer coordinates and the coordinates of their associated gene promoters, and I have been asked to check whether each enhancer and its promoter fall within the same TAD boundaries, to validate the association.

I would greatly appreciate any help or direction, whether in the form of links, diagrams, or anything else.

Hi-C • 1.4k views

updated 2.3 years ago by Ram 45k • written 5.0 years ago by rohitsatyam102 ▴ 940

score 1 · Answer 1 · 2020-07-17

Hi rohitsatyam102,

It is quite simple. You first need to understand the Hi-C protocol:

Initial conditions: Distant sequences form 3D loop structures, stabilized by proteins.

Cross-linking: Protein-DNA interactions are fixed.
Intra-molecular ligation: Distant sequences are joined together in the same DNA molecule.
Extra processing steps and sequencing.

Hi-C protocol

As a result, each sequenced read is composed of 2 sub-parts that map to 2 different regions in the genome. For example, assume you have only 4 regions: regA, regB, regC and regD. During data processing, you count how many reads include regA and regB, regA and regC, regA and regD, and so on. This way, you end up with a 4x4 matrix of counts.

AA AB AC AD
BA BB BC BD
CA CB CC CD
DA DB DC DD
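To make the counting concrete, here is a minimal sketch in Python. The read pairs are made up for illustration, and `contact_matrix` is a hypothetical helper, not a function from any Hi-C tool. Real pipelines bin the genome into fixed-size windows instead of using named regions, but the idea is the same.

```python
import numpy as np

# Made-up example: each tuple names the two regions one read maps to.
read_pairs = [
    ("regA", "regB"), ("regA", "regC"), ("regB", "regA"),
    ("regC", "regD"), ("regA", "regA"), ("regB", "regD"),
]

regions = ["regA", "regB", "regC", "regD"]
index = {name: i for i, name in enumerate(regions)}


def contact_matrix(pairs, index):
    """Count how many reads join each pair of regions (hypothetical helper)."""
    n = len(index)
    counts = np.zeros((n, n), dtype=int)
    for left, right in pairs:
        i, j = index[left], index[right]
        counts[i, j] += 1
        if i != j:
            # A read joining A and B is also a read joining B and A.
            counts[j, i] += 1
    return counts


matrix = contact_matrix(read_pairs, index)
print(matrix)
```

Row and column order follow `regions`, so `matrix[0, 1]` is the AB cell of the matrix above.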

As you can see, the matrix is symmetric: the counts of AC are the same as the counts of CA. So instead of displaying the whole matrix, you may as well show only the upper or lower triangular matrix. If you rotate that triangle by 45 degrees, so that the diagonal lies flat along the genome axis, you end up with the typical Hi-C triangle-like display.
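A short sketch of why one triangle is enough, using a made-up symmetric 4x4 count matrix (the numbers are assumptions for illustration only). The upper triangle, diagonal included, carries all the information, and the full matrix can be rebuilt from it.

```python
import numpy as np

# Made-up symmetric counts for regA..regD (rows and columns in that order).
counts = np.array([
    [1, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

# Symmetric means the matrix equals its own transpose.
is_symmetric = np.array_equal(counts, counts.T)

# Keep the diagonal and everything above it; zero out the rest.
upper = np.triu(counts)

# Mirror the upper triangle back; the diagonal would be counted twice,
# so subtract it once.
restored = upper + upper.T - np.diag(np.diag(counts))

print(is_symmetric)
print(upper)
```

Genome browsers draw this upper triangle rotated, so each cell sits above the midpoint of the two regions it connects.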

Also, the link you shared mentions alternatives to raw read counts: "There are four score values available in this display: NONE, VC, VC_SQRT, and KR. NONE provides raw, un-normalized counts for the number of interactions between regions. VC, or Vanilla Coverage, normalization (Lieberman-Aiden et al., 2009) and the VC_SQRT variant normalize these count values based on the overall count values for each of the two interacting regions".
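As a rough illustration of what "normalize based on the overall count values for each of the two interacting regions" can mean, here is a simplified sketch. It assumes VC divides each cell by the product of the two regions' total counts, and VC_SQRT by the square root of that product. The function name `vanilla_coverage` and the example counts are made up, and real tools may apply extra rescaling, so do not expect these numbers to match a browser track.

```python
import numpy as np


def vanilla_coverage(counts, sqrt=False):
    """Simplified VC / VC_SQRT-style normalization (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    # Overall count for each region: the sum of its row.
    coverage = counts.sum(axis=1)
    # Scale for cell (i, j) is coverage[i] * coverage[j].
    scale = np.outer(coverage, coverage)
    if sqrt:
        scale = np.sqrt(scale)
    normalized = np.zeros_like(counts)
    # Leave cells at zero where a region has no reads at all.
    np.divide(counts, scale, out=normalized, where=scale > 0)
    return normalized


# Made-up symmetric counts for regA..regD.
raw = np.array([
    [1, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])

vc = vanilla_coverage(raw)
vc_sqrt = vanilla_coverage(raw, sqrt=True)
print(vc)
print(vc_sqrt)
```

The point of either variant is that a region sequenced more deeply overall no longer looks like it interacts more with everything.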