I am new to the field of bioinformatics, and have been learning about Hi-C data processing for a while now. In order to get familiar with the topic, I am processing the dataset from a publication currently under review by some collaborators of ours, expecting to get similar results. Unfortunately, the person in charge of the Hi-C section is currently unavailable, so I'm reaching to you hoping to get some advice. Here's some information about my goals:
- I'm trying to process capture Hi-C data from human chromosome 12, of a 2.3 Mb capture window between 6.1 Mb and 8.4 Mb.
- Following what was done in the publication, I'm mapping the reads to the GRCh37 assembly (I know it's outdated, but it's the one they used).
- I'm using GEM v3 to index the genome and map the reads, and the software TADbit. Specifically, I'm basing my work on the following tutorial: https://3dgenomes.github.io/TADbit/tutorial.html
And about the specific pipeline I follow, it's based on the tutorial above:
- Download the Hi-C .fasta files to process.
- Download the assembly for the reference genome, specifically that of the chromosome 12 (no contigs), from: https://www.ncbi.nlm.nih.gov/nuccore/CM000674.1 .
- Changed the header of the genome assembly to >chr12 (do not know if it's mandatory, but shown in tutorial).
- Indexed the reference genome using gem-indexer as follows: gem-indexer -i genome_assembly.fa -o genome_assembly_indexed -t 4
- Used the full_mapping function in TADbit to perform fragmented mapping of each end of the Hi-C reads to the indexed genome, specifying the restriction enzyme DpnII that was used for the experiment. I believe TADbit also uses gem-mapper in this function.
Then, keeping the work on TADbit:
- Used parsed_fasta to parse the indexed genome.
- Used parse_map to map each read end to the parsed indexed genome. I believe that at this point my process is already having some kind of problem.
- Used get_intersection to identify which read has both ends uniquely mapped to different places in the genome (i.e. valid Hi-C reads).
Then, additional steps of filtering are performed.
After all of this, my interaction matrices look visually similar to that in the publication, with TADs that can be identified similarly to what is seen in the article. However, when I compare the number of mapped reads after step 8, mine is, by very far, much smaller. While the raw data for different samples has between 80M and 120M reads, my uniquely mapped reads range between 5M and 6M, which is just too small. However, in the publication they were able to map between 50M to 80M reads, which I believe are more reasonable numbers. These are numbers for the reads mapped to the whole chromosome, then the capture region is extracted, resulting in a fraction of these mapped reads.
Is any of you familiar with Hi-C data processing or, at least, gem mapping? Or TADbit? If you have encountered similar problems or have an idea of which mistakes a beginner bioinformatician could be making, any suggestion would be more than welcome.
Thanks a lot,
Manuel F. Merino
I would suggest retitling your post since this isn't about sparse matrices (ie matrices with mostly 0s), but about mapping efficiency. Maybe something like "low mapping percentage in Hi-C data" -- that would help you reach the appropriate audience.
Also, are you sure you're not just getting fewer mapped reads because you're focusing on one part of one chromosome? Or did the original paper get 50-60M reads for just that one segment (which feels quite excessive)?
I followed your suggestion regarding the title, thanks a lot! Probably I also wasn't clear about the mapping, too. The numbers that I'm including here are for the whole chromosome 12 (the capture region is enriched but many reads are mapped to the rest of the chromosome, too). That means that my final mapped reads in the capture region are but a fraction of the numbers that I described.