I am new to the field of bioinformatics, and have been learning about Hi-C data processing for a while now. In order to get familiar with the topic, I am processing the dataset from a publication currently under review by some collaborators of ours, expecting to get similar results. Unfortunately, the person in charge of the Hi-C section is currently unavailable, so I'm reaching to you hoping to get some advice. Here's some information about my goals:
- I'm trying to process capture Hi-C data from human chromosome 12, of a 2.3 Mb capture window between 6.1 Mb and 8.4 Mb.
- Following what was done in the publication, I'm mapping the reads to the GRCh37 assembly (I know it's outdated, but it's the one they used).
- I'm using GEM v3 to index the genome and map the reads, and the software TADbit. Specifically, I'm basing my work on the following tutorial: https://3dgenomes.github.io/TADbit/tutorial.html
And about the specific pipeline I follow, it's based on the tutorial above:
- Download the Hi-C .fasta files to process.
- Download the assembly for the reference genome, specifically that of the chromosome 12 (no contigs), from: https://www.ncbi.nlm.nih.gov/nuccore/CM000674.1 .
- Changed the header of the genome assembly to >chr12 (do not know if it's mandatory, but shown in tutorial).
- Indexed the reference genome using gem-indexer as follows: gem-indexer -i genome_assembly.fa -o genome_assembly_indexed -t 4
- Used the full_mapping function in TADbit to perform fragmented mapping of each end of the Hi-C reads to the indexed genome, specifying the restriction enzyme DpnII that was used for the experiment. I believe TADbit also uses gem-mapper in this function.
Then, keeping the work on TADbit:
- Used parsed_fasta to parse the indexed genome.
- Used parse_map to map each read end to the parsed indexed genome. I believe that at this point my process is already having some kind of problem.
- Used get_intersection to identify which read has both ends uniquely mapped to different places in the genome (i.e. valid Hi-C reads).
Then, additional steps of filtering are performed.
After all of this, my interaction matrices look visually similar to that in the publication, with TADs that can be identified similarly to what is seen in the article. However, when I compare the number of mapped reads after step 8, mine is, by very far, much smaller. While the raw data for different samples has between 80M and 120M reads, my uniquely mapped reads range between 5M and 6M, which is just too small. However, in the publication they were able to map between 50M to 80M reads, which I believe are more reasonable numbers. These are numbers for the reads mapped to the whole chromosome, then the capture region is extracted, resulting in a fraction of these mapped reads.
Is any of you familiar with Hi-C data processing or, at least, gem mapping? Or TADbit? If you have encountered similar problems or have an idea of which mistakes a beginner bioinformatician could be making, any suggestion would be more than welcome.
Thanks a lot,
Manuel F. Merino