1)What is the correct matrix format before converting to BEDPE?
The format you're getting is correct. See below for details.
2)Should the output include chromosome names, or do I need to add them manually?
You need to add them manually. See below for an example of how to do this.
3)Are there any additional steps I should take to format the data correctly for BEDPE conversion?
Yeah. There are a bunch of ways to do this. See below for an example.
The Juicer Tools (juicer_tools_2.13.06.jar
) dump
command outputs observed matrices in sparse matrix format with three columns:
start1 start2 contact_frequency
So, your output is expected and correctly formatted.
Let's break down your call to Juicer Tools dump
:
java -jar juicer_tools_2.13.06.jar dump \
observed NONE contact_matrix_GRCh38.hic \
1 2 BP 1000000 \
chr1_chr2.matrix
You specified the following:
- Observed contacts:
observed
- Unadjusted (non-normalized) contacts:
NONE
- Interchromosomal contacts between chromosomes 1 and 2:
1 2
- Binned at 1,000,000-bp resolution:
BP 1000000
- Input file:
contact_matrix_GRCh38.hic
- Output file:
chr1_chr2.matrix
The first four lines of output look like this:
0 0 13.0
1000000 0 79.0
2000000 0 99.0
3000000 0 153.0
...
Each row represents a contact between two bins:
- Column 1 (
start1
): Start position of a bin in chromosome 1
- Column 2 (
start2
): Start position of a bin in chromosome 2
- Column 3 (
contact_frequency
): Number of Hi-C contacts (unadjusted because NONE
) between these two bins
Using this info, you can convert the sparse matrix into a BEDPE file with AWK; for example:
# Set some parameters
hic="contact_matrix_GRCh38.hic"
chr_a="1"
chr_b="2"
bin=1000000
# Extract interchromosomal contacts between chromosomes 1 and 2
java -jar juicer_tools.jar dump \
observed NONE "${hic}" \
"${chr_a}" "${chr_b}" BP "${bin}" \
"${chr_a}_${chr_b}.matrix"
# Convert sparse matrix to BEDPE format
awk \
-v chr_a="${chr_a}" \
-v chr_b="${chr_b}" \
-v bin="${bin}" \
'BEGIN { OFS="\t" } {
start_1 = $1;
end1 = start_1 + bin;
start_2 = $2;
end2 = start_2 + bin;
score = $3;
print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
}' \
"${chr_a}_${chr_b}.matrix" \
> "${chr_a}_${chr_b}.bedpe"
The above code uses the chromosome and bin information to fill in and wrangle the sparse matrix data, generating a BEDPE file with the following format:
chrom_a start_a end_a chrom_b start_b end_b name score strand_a strand_b
It is standard to assign .
(periods) to name
, strand_a
, and strand_b
when those column are unused. See here for more details on the BEDPE format.
If you want to generate BEDPE files for all interchromosomal pairs while avoiding redundant ones (e.g., skipping 2
by 1
since 1
by 2
is already processed), you can do this through Shell scripting; for example:
# Function for sparse matrix generation
function generate_sparse() {
local jar="${1}" # Path to the Juicer Tools .jar file
local cntc="${2}" # Contact type, e.g., "observed" or "expected"
local nrm="${3}" # Normalization type, e.g., "NONE", "SCALE", etc.‡
local hic="${4}" # Path to .hic input file
local chr_a="${5}" # First chromosome
local chr_b="${6}" # Second chromosome
local typ="${7}" # "BP" or "FRAG"
local bin="${8}" # Bin size in bp
local mat="${9}" # Path to .matrix output file
java -jar "${jar}" dump \
"${cntc}" "${nrm}" "${hic}" \
"${chr_a}" "${chr_b}" "${typ}" "${bin}" \
"${mat}"
} # ‡For "balanced" contacts, use "KR" for Juicer v1, "SCALE" for Juicer v2
# Function for AWK conversion from sparse matrix to BEDPE
function convert_sparse_bedpe() {
local chr_a="${1}" # First chromosome
local chr_b="${2}" # Second chromosome
local bin="${3}" # Bin size in bp
local mat="${4}" # Path to .matrix input file
local bedpe="${5}" # Path to .bedpe output file
awk \
-v chr_a="${chr_a}" \
-v chr_b="${chr_b}" \
-v bin="${bin}" \
'BEGIN { OFS="\t" } {
start_1 = $1;
end1 = start_1 + bin;
start_2 = $2;
end2 = start_2 + bin;
score = $3;
print chr_a, start_1, end1, chr_b, start_2, end2, ".", score, ".", "."
}' \
"${mat}" \
> "${bedpe}"
}
# Function to check file existence (e.g., MATRIX, BEDPE files)
function check_file() {
local fil="${1}"
local nam="${2}"
if [[ ! -s "${fil}" ]]; then
echo "Error: ${nam} file was not generated: '${fil}'." >&2
return 1
fi
}
# Define parameters (change as needed)
jar="${HOME}/path/to/juicer_tools_2.13.06.jar"
hic="${HOME}/path/to/contact_matrix_GRCh38.hic"
cntc="observed"
nrm="NONE"
typ="BP"
bin=1000000
out_mat="${HOME}/path/to/output/matrices"
out_bdp="${HOME}/path/to/output/bedpes"
# Create an array of chromosomes to process
chrs=( $(seq 1 22) X )
# Create output directories
mkdir -p "${out_mat}" "${out_bdp}"
# Run serial BEDPE file generation, avoiding redundant pairings
for ((i = 0; i < ${#chrs[@]}; i++)); do
for ((j = i + 1; j < ${#chrs[@]}; j++)); do
chr_a="${chrs[i]}"
chr_b="${chrs[j]}"
fil_mat="${out_mat}/${chr_a}_${chr_b}.matrix"
fil_bdp="${out_bdp}/${chr_a}_${chr_b}.bedpe"
# Generate sparse matrix
generate_sparse \
"${jar}" "${cntc}" "${nrm}" "${hic}" \
"${chr_a}" "${chr_b}" "${typ}" "${bin}" \
"${fil_mat}"
# Check that MATRIX file was created
check_file "${fil_mat}" "MATRIX"
# Convert sparse matrix to BEDPE
convert_sparse_bedpe \
"${chr_a}" "${chr_b}" "${bin}" \
"${fil_mat}" "${fil_bdp}"
# Check that BEDPE file was created
check_file "${fil_bdp}" "BEDPE"
done
done
To speed things up, you can use GNU Parallel to distribute execution across multiple CPU cores. Note that the below _does not_ avoid redundant pairs (e.g., 1
by 2
and 2
by 1
will both be processed).
## OPTIONAL: Run parallel generation of MATRIX and BEDPE with GNU Parallel ##
# Export functions for use in GNU Parallel
export -f generate_sparse
export -f convert_sparse_bedpe
job=4 ## Adjust based on system resources ##
## WARNING: Redundant pairs will be processed, e.g., both 1 by 2 and 2 by 1 ##
parallel --jobs "${job}" \
'generate_sparse {1} {2} {3} {4} {5} {6} {7} {8} {9}/{5}_{6}.matrix'
::: "${jar}" \
::: "${cntc}" \
::: "${nrm}" \
::: "${hic}" \
::: "${chrs[@]}" \
:::+ "${chrs[@]}" \
::: "${typ}" \
::: "${bin}" \
::: "${out_mat}"
parallel --jobs "${job}" \
'convert_sparse_bedpe {1} {2} {3} {4}/{1}_{2}.matrix {5}/{1}_{2}.bedpe' \
::: "${chrs[@]}" \
:::+ "${chrs[@]}" \
::: "${bin}" \
::: "${out_mat}" \
::: "${out_bdp}"
helpful I meant **