Hello,
I am trying to reconstruct the reference information for CHM13 HiFi reads. In more detail, given a FASTQ file of real human data and a reference file, I want to construct a FASTA file that contains the same reads but includes additional information about their position in the reference. My workflow is as follows:
I download the reads, in my case 'SRR11292120_3_subreads.fastq.gz' from the CHM13 website, and the full assembly 'chm13.draft_v1.1.fasta' as the reference.
I use hifiasm (https://github.com/chhylp123/hifiasm) to error-correct the reads:
./hifiasm -o real_corrected -t 32 --write-paf --write-ec SRX5633451.fastq
I align the corrected reads to the reference with minimap2 (https://github.com/lh3/minimap2); I also tested Winnowmap2, which gives similar results:
minimap2 -d ref.mmi chm13.draft_v1.1.fasta
minimap2 -cx map-hifi ref.mmi real_corrected.ec.fa > alignment.paf
I wrote a script that takes the resulting PAF file (alignment.paf) and the error-corrected FASTA file and creates a new FASTA file annotated with the start, end, and strand information from the minimap2 alignment. For every read with one or more alignments, I keep only the best alignment and annotate the read with the corresponding reference position.
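For illustration, the annotation step could look roughly like the following Python sketch. It is not the actual script: the function names and the header format (`ref=... start=... end=... strand=...`) are made up here, and "best" is assumed to mean the alignment with the most matching bases (PAF column 10). Other criteria, such as mapping quality, would work the same way.

```python
def best_alignments(paf_lines):
    """Map each read name to (ref_name, ref_start, ref_end, strand) of its best hit.

    Assumption: 'best' = largest number of matching bases (PAF column 10).
    """
    best = {}
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:  # PAF has 12 mandatory columns; skip malformed lines
            continue
        name, strand, ref = f[0], f[4], f[5]
        start, end, matches = int(f[7]), int(f[8]), int(f[9])
        if name not in best or matches > best[name][0]:
            best[name] = (matches, ref, start, end, strand)
    # Drop the score, keep only the position information
    return {name: hit[1:] for name, hit in best.items()}


def annotate_fasta(fasta_lines, best):
    """Yield FASTA lines, appending position info to headers of mapped reads."""
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = line[1:].split()[0]  # read ID is the first word of the header
            if name in best:
                ref, start, end, strand = best[name]
                # Hypothetical header format, chosen only for this sketch
                line = f"{line} ref={ref} start={start} end={end} strand={strand}"
        yield line
```

With real files, `paf_lines` and `fasta_lines` would simply be the open file handles for alignment.paf and real_corrected.ec.fa. Reads without any alignment keep their original header.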
I check whether the resulting alignments cover each chromosome completely. Some chromosomes are completely covered, but others have between 1 and 4 gaps, each a few thousand base pairs long.
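The coverage check can be sketched as an interval sweep over the annotated positions of one chromosome. This is only a minimal example under stated assumptions: `find_gaps` is a hypothetical helper, intervals are taken to be 0-based and half-open as in PAF, and the chromosome length would come from PAF column 7 or from the reference index.

```python
def find_gaps(intervals, chrom_len):
    """Return the uncovered [start, end) regions of one chromosome.

    intervals: iterable of (start, end) reference positions of the reads
    chrom_len: total length of the chromosome
    """
    gaps = []
    covered_to = 0  # everything in [0, covered_to) is covered so far
    for start, end in sorted(intervals):
        if start > covered_to:
            gaps.append((covered_to, start))
        covered_to = max(covered_to, end)
    if covered_to < chrom_len:
        gaps.append((covered_to, chrom_len))
    return gaps


# Toy example: two gaps on a chromosome of length 500
print(find_gaps([(0, 100), (50, 200), (300, 400)], 500))
# prints [(200, 300), (400, 500)]
```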
What do you think about this workflow? Is there any way to improve the results, maybe with different parameters or with different or additional tools?