Question

Several questions about generating a human genome assembly

10 weeks ago

tungsega ▴ 10

Hello,

I am still a beginner in genome assembly. I am currently working on generating a haplotype-resolved assembly (two haplotypes) of a male human genome. Below is a description of the data I currently have:

~45× Nanopore R10.4.1 simplex reads (basecalled with Dorado sup v5.2.0 models; planning to increase to ~70–80×)
~75× Illumina WGS reads
~20× Illumina Hi-C reads
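
As a rough aid for planning the increase from ~45× to ~70–80×, depth can be estimated as total sequenced bases divided by genome size. The sketch below assumes a haploid genome size of about 3.1 Gb; that value and the function names are illustrative assumptions, not something stated in the post.

```python
# Assumed haploid human genome size in bp (an approximation; adjust as needed).
GENOME_SIZE_BP = 3_100_000_000


def depth(total_bases: int, genome_size: int = GENOME_SIZE_BP) -> float:
    """Mean sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


def extra_bases_needed(current_depth: float, target_depth: float,
                       genome_size: int = GENOME_SIZE_BP) -> int:
    """Additional bases required to go from current_depth to target_depth."""
    return max(0, round((target_depth - current_depth) * genome_size))


if __name__ == "__main__":
    # Going from ~45x to ~75x (midpoint of the planned 70-80x range).
    extra = extra_bases_needed(45, 75)
    print(f"Extra Nanopore yield needed: {extra / 1e9:.1f} Gb")
```

The same arithmetic applies to the Illumina and Hi-C libraries, since it only depends on total bases and the assumed genome size.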

Workflow in use:

Assembly: hifiasm with the --ont and Hi-C options
Scaffolding and phasing: HapHiC
Polishing: Medaka + NextPolish (two rounds)

Recently, I noticed that Dorado can perform correction on R10.4.1 simplex reads. I also came across several other tools, such as RAFT, which can improve assembly continuity, and NextPolish2, which allows polishing using both short reads and HiFi reads.

I would be grateful for your advice on the following questions:

1. Can Dorado-corrected reads be considered of high enough quality to be used as HiFi reads in tools that specifically require HiFi data (such as NextPolish2)?
2. Since hifiasm with the --ont option requires FASTQ input with quality scores for read correction, is it feasible to use Dorado-corrected reads in this mode? If so, would it be appropriate to assign artificial quality scores (such as "I") to these reads?
3. In your experience, what level of sequencing depth is generally sufficient for Hi-C phasing?
4. Do Nanopore ultra-long reads substantially improve genome assembly? At present, our assembly still lacks completeness in the acrocentric chromosomes. I am considering whether ultra-long reads could help resolve long repetitive regions, such as rDNA clusters within acrocentric chromosomes, thereby improving the overall assembly quality.
5. Is it recommended to perform haplotype-aware genome polishing for each haplotype? My plan is to separate the Nanopore and Illumina reads using a phased VCF generated from the following steps:
   1. Conduct variant calling for each haplotype (Clair3 for Nanopore reads; GATK HaplotypeCaller for Illumina reads).
   2. Merge variants by retaining all those identified from Illumina reads, along with INDELs reported from Nanopore reads.
   3. Generate phased variants using whatshap.
   4. Use whatshap to separate both Nanopore and Illumina reads into their respective haplotypes based on the phased variants.
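
The merge rule in the second step (keep every Illumina call, add only INDELs from the Nanopore calls) can be sketched as below. This is a minimal illustration on simplified records, not a VCF parser: the `Variant` tuple, `is_indel`, and `merge_calls` are hypothetical names, and a real merge would also need to handle genotypes, multi-allelic sites, and normalization.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Simplified variant record (hypothetical; a real VCF record has more fields)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def is_indel(v: Variant) -> bool:
    """Treat any call whose REF and ALT lengths differ as an INDEL."""
    return len(v.ref) != len(v.alt)


def merge_calls(illumina: Iterable[Variant],
                nanopore: Iterable[Variant]) -> List[Variant]:
    """Keep all Illumina calls, plus Nanopore calls that are INDELs."""
    merged = set(illumina)
    merged.update(v for v in nanopore if is_indel(v))
    # Sorted by (chrom, pos, ref, alt); chromosome names sort as plain strings.
    return sorted(merged)


if __name__ == "__main__":
    illumina = [Variant("chr1", 100, "A", "G")]
    nanopore = [Variant("chr1", 50, "AT", "A"), Variant("chr1", 200, "C", "T")]
    for v in merge_calls(illumina, nanopore):
        print(v)
```

In this toy input the Nanopore SNV at position 200 is dropped, while the Nanopore deletion at position 50 is retained alongside the Illumina SNV.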

Thank you! Regards.

Dorado Genome Assembly hifiasm • 11k views

ADD COMMENT • link updated 6 days ago by colindaven 8.1k • written 10 weeks ago by tungsega ▴ 10

Can Dorado-corrected reads be considered of high enough quality

If you have SUP reads then they should be of high quality.

At present, our assembly still lacks completeness in the acrocentric chromosomes.

Sounds like you have already done/tried some of the things you mention.

It would be interesting to know if you manage to get a T2T-like assembly (which is what you must have in mind). Research papers don't capture the mountain of work (not just the programs run) that goes into the final assembly. It looks like you are trying to accumulate the various types of data mentioned in such papers.

ADD REPLY • link 10 weeks ago by GenoMax 154k

Yes, I tried running hifiasm with the --ont option using both the raw reads and the Dorado-corrected reads (with artificial quality "I"). In both cases, almost all chromosomes, except for the acrocentric ones, were assembled to T2T. I did not observe any significant differences between the two results in terms of continuity, QV, or Compleasm evaluation. But I’m not quite sure whether my sequencing depth is sufficient to highlight the differences between these two strategies.
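
For anyone wanting to reproduce the artificial-quality step, here is a minimal sketch of turning FASTA reads into FASTQ with a constant quality character. It assumes the corrected reads are in FASTA and that standard Phred+33 encoding is wanted, in which "I" (ASCII 73) decodes to Q40. The function names are illustrative, and this is not necessarily how the reads in this thread were converted.

```python
import io


def _write_record(out, name: str, seq: str, qual_char: str) -> None:
    """Write one four-line FASTQ record with a uniform quality string."""
    out.write(f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n")


def fasta_to_fastq(fasta, fastq, qual_char: str = "I") -> None:
    """Convert FASTA (possibly line-wrapped) to FASTQ with constant quality."""
    name, chunks = None, []
    for line in fasta:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                _write_record(fastq, name, "".join(chunks), qual_char)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        _write_record(fastq, name, "".join(chunks), qual_char)


if __name__ == "__main__":
    src = io.StringIO(">read1\nACGT\nAC\n>read2\nGG\n")
    dst = io.StringIO()
    fasta_to_fastq(src, dst)
    print(dst.getvalue(), end="")
```

Both arguments are ordinary text file handles, so the same function works with files opened via `open(...)`.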

ADD REPLY • link 10 weeks ago by tungsega ▴ 10

with artificial quality "I"

It is a quality score Dorado feels confident to assign based on correction. Original quality scores are an estimate based on how confident the basecaller is, so this would be no different.

I did not observe any significant differences between the two results in terms of continuity

So your data may already be of very good quality and the error correction did not result in additional improvement.

In both cases, almost all chromosomes, except for the acrocentric ones, were assembled to T2T.

Did you mean to say aligned to T2T reference or were the reads assembled using "T2T reference guided" assembly?

ADD REPLY • link 10 weeks ago by GenoMax 154k

So your data may already be of very good quality and the error correction did not result in additional improvement.

Yes. However, I’m still struggling with the incompleteness of the acrocentric chromosomes and trying to find a way to address this issue.

Did you mean to say aligned to T2T reference or were the reads assembled using "T2T reference guided" assembly?

What I meant is that in my assemblies, I can observe telomere sequences at both ends of most chromosomes.
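
A quick way to make that check reproducible is to count the human telomeric repeat (TTAGGG, or CCCTAA on the opposite strand) in a window at each contig end. The sketch below is only illustrative: the 1 kb window and the 50-repeat threshold are arbitrary assumptions, and the function names are hypothetical.

```python
# Human telomeric repeat and its reverse complement.
TELOMERE_MOTIFS = ("TTAGGG", "CCCTAA")


def telomere_repeat_count(window: str) -> int:
    """Highest non-overlapping count of either telomere motif in the window."""
    window = window.upper()
    return max(window.count(motif) for motif in TELOMERE_MOTIFS)


def telomere_ends(seq: str, window: int = 1000, min_repeats: int = 50):
    """Return (start_has_telomere, end_has_telomere) for one contig.

    window and min_repeats are assumed thresholds, not established cutoffs.
    """
    start_ok = telomere_repeat_count(seq[:window]) >= min_repeats
    end_ok = telomere_repeat_count(seq[-window:]) >= min_repeats
    return start_ok, end_ok


if __name__ == "__main__":
    toy_contig = "CCCTAA" * 100 + "ACGT" * 5000 + "TTAGGG" * 100
    print(telomere_ends(toy_contig))
```

A contig flagged at both ends is a candidate telomere-to-telomere sequence, though internal gaps or misjoins still need to be ruled out separately.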

ADD REPLY • link 10 weeks ago by tungsega ▴ 10

Hopefully these clarifications will help elicit an answer from someone familiar with human assemblies.

Do you have reads in your data that map to the missing regions (indicating that assembly may be the problem, even though the data is there)? If the data is missing then no amount of assembly magic is going to work.

ADD REPLY • link 10 weeks ago by GenoMax 154k

Do you have reads in your data that map to the missing regions (indicating that assembly may be the problem, even though the data is there)?

Yes, mapping to CHM13 shows that the missing regions are indeed covered by reads.
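
One way to quantify this is to count alignments that overlap a region of interest, and how many of them span it end to end. The sketch below assumes the read-to-CHM13 alignments are available in PAF format (0-based, end-exclusive target coordinates in columns 8 and 9, mapping quality in column 12). The function name, region, and MAPQ filter are illustrative assumptions.

```python
from typing import Iterable, Tuple


def region_support(paf_lines: Iterable[str], target: str, start: int, end: int,
                   min_mapq: int = 0) -> Tuple[int, int]:
    """Count alignments overlapping [start, end) on target, and those spanning it.

    Counts alignment records, so a read with several alignments is counted
    once per overlapping alignment.
    """
    overlapping = spanning = 0
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        tname, tstart, tend = fields[5], int(fields[7]), int(fields[8])
        mapq = int(fields[11])
        if tname != target or mapq < min_mapq:
            continue
        if tend <= start or tstart >= end:
            continue  # no overlap with the region
        overlapping += 1
        if tstart <= start and tend >= end:
            spanning += 1
    return overlapping, spanning


if __name__ == "__main__":
    toy = [
        "r1\t50000\t0\t50000\t+\tchr13\t1000000\t1000\t51000\t48000\t50000\t60",
        "r2\t15000\t0\t15000\t-\tchr13\t1000000\t15000\t30000\t14000\t15000\t3",
    ]
    print(region_support(toy, "chr13", 5000, 20000))
```

In repetitive regions many alignments have low mapping quality, so comparing the counts with and without a `min_mapq` filter gives a sense of how much of the apparent coverage is ambiguous.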

In fact, the assembly results contain some small contigs that are almost entirely repetitive sequences, which failed to scaffold into the main contigs.

My concern is that Illumina Hi-C reads are not sufficient to resolve these regions because of their length limitations. Perhaps Nanopore ultra-long reads could address this issue. We have not yet tested them, so I would like to ask whether anyone has experience using ultra-long reads in this context and can confirm their effectiveness.
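
Before committing to an ultra-long run, it may help to check how much of the existing Nanopore data is already long. The sketch below computes read N50 and the fraction of bases in reads above a length cutoff from a list of read lengths. The 100 kb cutoff is a commonly used convention for "ultra-long" that is assumed here, not taken from this thread.

```python
from typing import Sequence


def read_n50(lengths: Sequence[int]) -> int:
    """Length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def fraction_bases_at_least(lengths: Sequence[int], min_len: int = 100_000) -> float:
    """Fraction of all bases contained in reads of length >= min_len."""
    total = sum(lengths)
    if total == 0:
        return 0.0
    return sum(length for length in lengths if length >= min_len) / total


if __name__ == "__main__":
    toy_lengths = [20_000, 35_000, 60_000, 120_000, 250_000]
    print("N50:", read_n50(toy_lengths))
    print("Fraction of bases in reads >= 100 kb:",
          round(fraction_bases_at_least(toy_lengths), 3))
```

If only a small fraction of bases sits in reads longer than the repeat arrays of interest, that supports the idea that read length rather than depth is the limiting factor.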

ADD REPLY • link 10 weeks ago by tungsega ▴ 10

score 0 · Answer 1 · 2025-11-19

Hello.

I will address each of your questions in order.

1. Dorado-corrected Nanopore reads do not reach the accuracy of PacBio HiFi reads, which typically exceed Q30. NextPolish2 specifically requires HiFi mapping files and does not support Nanopore reads. Therefore, you cannot use Dorado-corrected reads as HiFi equivalents in NextPolish2.
2. Hifiasm in --ont mode can process Dorado-corrected reads, as it accepts FASTQ files with quality scores. Assigning a uniform artificial quality score such as "I" (Phred 40) is a workable placeholder, although unlike basecaller estimates it carries no per-base confidence. Your tests showed no major differences, which suggests your data quality is already high.
3. For Hi-C phasing in human genomes, a sequencing depth of 20–30× is generally sufficient, based on standard workflows for haplotype resolution. Your 20× coverage should work, but increasing it to 30× may improve phasing in repetitive regions.
4. Nanopore ultra-long reads can substantially improve genome assembly by resolving long repetitive regions, including rDNA clusters in acrocentric chromosomes. They are reported to improve contiguity and completeness near telomeres and centromeres, which should help with the incompleteness you observed.
5. Haplotype-aware polishing is recommended for each haplotype to correct errors specific to each phase. Your plan to generate a phased VCF with Clair3, GATK HaplotypeCaller, and whatshap, then separate reads with whatshap, is appropriate. Consider using Hapo-G for the polishing step, as it incorporates phasing information.
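
To put the Q30 and Phred 40 figures above in perspective, a Phred score Q corresponds to a per-base error probability of 10^(-Q/10). The short sketch below applies that standard formula; the function names and the 20 kb read length are illustrative assumptions.

```python
def phred_to_error(q: float) -> float:
    """Per-base error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def expected_errors(read_length: int, q: float) -> float:
    """Expected number of errors in a read if every base had quality q."""
    return read_length * phred_to_error(q)


if __name__ == "__main__":
    for q in (20, 30, 40):
        print(f"Q{q}: error rate {phred_to_error(q):.4%}, "
              f"about {expected_errors(20_000, q):.0f} errors per 20 kb read")
```

Note that a uniform "I" string asserts Q40 for every base, so these numbers describe what the placeholder claims rather than what has been measured for the corrected reads.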

Kevin