Several questions about generating a human genome assembly
10 weeks ago
tungsega ▴ 10

Hello,

I am still a beginner in genome assembly. Currently, I am working on generating two haplotype-resolved assemblies of a male human genome. Below is a description of the data I currently have:

  • ~45× Nanopore R10.4.1 simplex reads (basecalled with Dorado sup v5.2.0 models; planning to increase to ~70–80×)
  • ~75× Illumina WGS reads
  • ~20× Illumina Hi-C reads

Workflow in use:

  • Assembly: hifiasm with the --ont and Hi-C option
  • Scaffolding and phasing: HapHiC
  • Polishing: Medaka + NextPolish (two rounds)

Recently, I noticed that Dorado can perform correction on R10.4.1 simplex reads. I also came across several other tools, such as RAFT, which can improve assembly continuity, and NextPolish2, which allows polishing using both short reads and HiFi reads.

I would be grateful for your advice on the following questions:

  1. Can Dorado-corrected reads be considered of high enough quality to be used as HiFi reads in tools that specifically require HiFi data (such as NextPolish2)?
  2. Since hifiasm with the --ont option requires FASTQ input with quality scores for read correction, is it feasible to use Dorado-corrected reads in this mode? If so, would it be appropriate to assign artificial quality scores (such as "I") to these reads?
  3. In your experience, what level of sequencing depth is generally sufficient for Hi-C phasing?
  4. Do Nanopore ultra-long reads substantially improve genome assembly? At present, our assembly still lacks completeness in the acrocentric chromosomes. I am considering whether ultra-long reads could help resolve long repetitive regions, such as rDNA clusters within acrocentric chromosomes, thereby improving the overall assembly quality.
  5. Is it recommended to perform haplotype-aware genome polishing for each haplotype? My plan is to separate the Nanopore and Illumina reads using a phased VCF generated from the following steps:
    1. Conduct variant calling for each haplotype (Clair3 for Nanopore reads; GATK HaplotypeCaller for Illumina reads)
    2. Merge variants by retaining all those identified from Illumina reads, along with INDELs reported from Nanopore reads.
    3. Generate phased variants using whatshap.
    4. Use whatshap to separate both Nanopore and Illumina reads into their respective haplotypes based on the phased variants.
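
The merge rule in step 2 (keep every Illumina call, add only INDELs from Nanopore) can be sketched in a few lines. This is a minimal illustration, not the poster's pipeline: the `(chrom, pos, ref, alt)` tuple is a simplified stand-in for a VCF record, the function names `is_indel` and `merge_calls` are hypothetical, and conflicting calls at the same position are not resolved. In practice a VCF-aware tool would do this on the real files.

```python
from typing import Iterable, List, Tuple

# A variant is (chrom, pos, ref, alt): a simplified stand-in for a VCF record,
# chosen for illustration only.
Variant = Tuple[str, int, str, str]


def is_indel(variant: Variant) -> bool:
    """Treat a variant as an INDEL when REF and ALT differ in length."""
    _, _, ref, alt = variant
    return len(ref) != len(alt)


def merge_calls(illumina: Iterable[Variant], nanopore: Iterable[Variant]) -> List[Variant]:
    """Keep every Illumina call, then add Nanopore INDELs not already present."""
    merged = set(illumina)
    merged.update(v for v in nanopore if is_indel(v))
    # Sort by chromosome, position and alleles so the output order is deterministic.
    return sorted(merged)


# Toy calls: the Nanopore SNV at position 300 is dropped, the insertion at 400 is kept.
illumina_calls = [("chr1", 100, "A", "G"), ("chr1", 250, "AT", "A")]
nanopore_calls = [("chr1", 100, "A", "G"), ("chr1", 300, "C", "T"), ("chr1", 400, "G", "GAC")]
merged = merge_calls(illumina_calls, nanopore_calls)
```

Using a set keeps a call that both technologies report from appearing twice; it does not decide between two different alleles reported at the same site.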

Thank you! Regards.

Dorado Genome Assembly hifiasm • 11k views

Can Dorado-corrected reads be considered of high enough quality

If you have SUP reads then they should be of high quality.

At present, our assembly still lacks completeness in the acrocentric chromosomes.

Sounds like you have already done/tried some of the things you mention.

It would be interesting to know if you manage to get a T2T-like assembly (which is what you must have in mind). Research papers don't capture the mountain of work (not just the programs run) that goes into the final assembly. It looks like you are trying to accumulate the various types of data mentioned in such papers.


Yes, I tried running hifiasm with the --ont option using both the raw reads and the Dorado-corrected reads (with artificial quality "I"). In both cases, almost all chromosomes, except for the acrocentric ones, were assembled to T2T. I did not observe any significant differences between the two results in terms of continuity, QV, or Compleasm evaluation. But I’m not quite sure whether my sequencing depth is sufficient to highlight the differences between these two strategies.
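
For readers who want to reproduce the "artificial quality" step, here is a minimal sketch, assuming the corrected reads are in FASTA format without quality values. The function name `fasta_to_fastq` is hypothetical; the only fixed fact used is the Phred+33 encoding, under which the character "I" (ASCII 73) stands for Q40.

```python
import io
from typing import Iterator, TextIO


def fasta_to_fastq(handle: TextIO, quality_char: str = "I") -> Iterator[str]:
    """Yield one FASTQ record per FASTA record, with a constant quality character."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                seq = "".join(chunks)
                yield f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # Emit the final record.
    if name is not None:
        seq = "".join(chunks)
        yield f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"


# Toy input with a sequence wrapped over two lines.
example = io.StringIO(">read1\nACGT\nAC\n>read2\nGG\n")
records = list(fasta_to_fastq(example))

# Phred+33: quality value = ASCII code of the character minus 33.
phred_of_I = ord("I") - 33
```

A constant quality string carries no per-base information, so any step that weights bases by quality will treat every base of every read as equally reliable.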


with artificial quality "I"

It is a quality score dorado feels confident to assign based on correction. Original quality scores are an estimate based on how confident the basecaller is so this would be no different.

I did not observe any significant differences between the two results in terms of continuity

So your data may already be of very good quality and the error correction did not result in additional improvement.

In both cases, almost all chromosomes, except for the acrocentric ones, were assembled to T2T.

Did you mean to say aligned to T2T reference or were the reads assembled using "T2T reference guided" assembly?


So your data may already be of very good quality and the error correction did not result in additional improvement.

Yes. However, I’m still struggling with the incompleteness of the acrocentric chromosomes and trying to find a way to address this issue.

Did you mean to say aligned to T2T reference or were the reads assembled using "T2T reference guided" assembly?

What I meant is that in my assemblies, I can observe telomere sequences at both ends of most chromosomes.
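
One simple way to make "telomere sequences at both ends" checkable is to count the canonical human telomere repeat (TTAGGG, or its reverse complement CCCTAA) near each end of a contig. The sketch below is only an illustration of that idea: the function names, the window size and the repeat threshold are hypothetical defaults, not values taken from this thread or from any tool.

```python
from typing import Tuple


def count_motif(seq: str, motif: str) -> int:
    """Count non-overlapping occurrences of motif in seq (case-insensitive)."""
    return seq.upper().count(motif)


def telomere_ends(seq: str, window: int = 10_000, min_repeats: int = 100) -> Tuple[bool, bool]:
    """Return (start_has_telomere, end_has_telomere) for one contig sequence.

    window and min_repeats are illustrative thresholds and should be tuned.
    """
    head, tail = seq[:window], seq[-window:]
    # Accept either orientation of the repeat at each end.
    start_hit = max(count_motif(head, "CCCTAA"), count_motif(head, "TTAGGG")) >= min_repeats
    end_hit = max(count_motif(tail, "CCCTAA"), count_motif(tail, "TTAGGG")) >= min_repeats
    return start_hit, end_hit


# Toy contigs: one capped by repeats at both ends, one capped only at the start.
capped = "CCCTAA" * 200 + "ACGT" * 5000 + "TTAGGG" * 200
half_capped = "CCCTAA" * 200 + "ACGT" * 5000
both_ends = telomere_ends(capped, window=1000, min_repeats=100)
one_end = telomere_ends(half_capped, window=1000, min_repeats=100)
```

A contig that passes at both ends is a candidate telomere-to-telomere sequence, but the check says nothing about gaps or misjoins in between.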


Hopefully these clarifications will help elicit an answer from someone familiar with human assemblies.

Do you have reads in your data that map to the missing regions (indicating that assembly may be the problem, even though the data is there)? If the data is missing then no amount of assembly magic is going to work.


Do you have reads in your data that map to the missing regions (indicating that assembly may be the problem, even though the data is there)?

Yes, mapping to CHM13 shows that the missing regions are indeed covered by reads.

In fact, the assembly results contain some small contigs that are almost entirely repetitive sequences, which failed to scaffold into the main contigs.

My concern is that Illumina Hi-C reads are not sufficient to resolve these regions because of their length limitations. Perhaps Nanopore ultra-long reads could address this issue. We have not yet tested them, so I would like to ask whether anyone has experience using ultra-long reads in this context and can confirm their effectiveness.

7 days ago
Kevin Blighe ★ 90k

Hello.

I will address each of your questions in order.

  1. Dorado-corrected Nanopore reads do not reach the accuracy of PacBio HiFi reads, which typically exceed Q30. NextPolish2 specifically requires HiFi mapping files and does not support Nanopore reads. Therefore, you cannot use Dorado-corrected reads as HiFi equivalents in NextPolish2.

  2. Hifiasm in --ont mode can process Dorado-corrected reads, as it accepts FASTQ files with quality scores. Assigning artificial quality scores like "I" (Phred 40) is acceptable, since these reflect the correction confidence, similar to basecaller estimates. Your tests showed no major differences, which suggests your data quality is already high.

  3. For Hi-C phasing in human genomes, a sequencing depth of 20-30x is generally sufficient, based on standard workflows for haplotype resolution. Your 20x coverage should work, but increasing it to 30x may improve phasing in repetitive regions.

  4. Nanopore ultra-long reads substantially improve genome assembly by resolving long repetitive regions, including rDNA clusters in acrocentric chromosomes. Studies show they enhance contiguity and completeness in telomeres and centromeres, addressing the incompleteness you observed.

  5. Haplotype-aware polishing is recommended for each haplotype to correct errors specific to each phase. Your plan to generate a phased VCF with Clair3, GATK HaplotypeCaller, and whatshap, then separate reads with whatshap, is appropriate. Consider using Hapo-G for the polishing step, as it incorporates phasing information.
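
To put the depths in point 3 into sequencing terms, depth can be converted into a number of read pairs with depth × genome size / (2 × read length). The sketch below assumes a genome size of about 3.1 Gb and 2×150 bp paired-end Hi-C reads; both values are assumptions for illustration and are not stated in this thread.

```python
def read_pairs_for_depth(depth: int, genome_size_bp: int, read_length_bp: int) -> int:
    """Read pairs needed for a target depth: depth * genome size / (2 * read length)."""
    return round(depth * genome_size_bp / (2 * read_length_bp))


GENOME_SIZE_BP = 3_100_000_000  # assumed approximate human genome size
READ_LENGTH_BP = 150            # assumed paired-end read length

pairs_20x = read_pairs_for_depth(20, GENOME_SIZE_BP, READ_LENGTH_BP)
pairs_30x = read_pairs_for_depth(30, GENOME_SIZE_BP, READ_LENGTH_BP)
print(f"20x: {pairs_20x:,} read pairs; 30x: {pairs_30x:,} read pairs")
```

Under these assumptions, going from 20× to 30× means roughly 100 million additional read pairs. This is nominal depth only; the fraction of pairs that are informative long-range contacts depends on the library.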

Kevin


Regarding your point 1, Dorado SUP-called ONT R10.4.1 reads are now typically Q26. This is pretty close to Q30. That is prior to Dorado correct.

Previously (early 2024, so with Q20+ data, not Q26), I measured the accuracy of Dorado-corrected reads (SUP-called) by aligning them to a genome of the same genotype, using tools such as BEST from Google and others. This gave me estimates of Q33 for the accuracy distribution. It is likely higher now.

So all this empirical and direct evidence makes me think your point 1 is incorrect, at least in terms of direct measurement of quality score distributions.
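
For scale, the Q values quoted here can be converted to per-base error rates with the standard Phred definition, Q = -10 · log10(p). The sketch below is just that arithmetic; the 10 kb read length used for the "expected errors per read" figure is an arbitrary illustrative choice.

```python
import math


def phred_to_error(q: float) -> float:
    """Per-base error probability for a Phred quality value: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality value for a per-base error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


READ_LENGTH_BP = 10_000  # illustrative read length only

for q in (26, 30, 33):
    p = phred_to_error(q)
    print(f"Q{q}: error rate {p:.5f}, about {p * READ_LENGTH_BP:.0f} expected errors per 10 kb")
```

On this scale the gap between Q26 and Q30 is a factor of about 2.5 in error rate, and Q33 is half the error rate of Q30. A single average Q does not capture error type, which is why the motif-specific accuracy mentioned in the next paragraph still matters.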

Would I use NextPolish2 to polish an assembly if it requires HiFi specifically? Probably not. Would it work in terms of quality? Maybe, but that may depend on motif accuracy more than general accuracy.

ONT has made great leaps and bounds in quality in the last year (again). With the long read lengths, it now definitely has the advantage over PacBio for complex genome assemblies (human genomes are not particularly complex).
