Several questions about generating a human genome assembly
7 days ago
tungsega • 0

Hello,

I am still a beginner in genome assembly. Currently, I am working on generating two haplotype assemblies of a male human genome. Below is a description of the data I currently have:

  • ~45× Nanopore R10.4.1 simplex reads (basecalled with Dorado sup v5.2.0 models; planning to increase to ~70–80×)
  • ~75× Illumina WGS reads
  • ~20× Illumina Hi-C reads

Workflow in use:

  • Assembly: hifiasm with the --ont option and Hi-C reads
  • Scaffolding and phasing: HapHiC
  • Polishing: Medaka + NextPolish (two rounds)

Recently, I noticed that Dorado can perform correction on R10.4.1 simplex reads. I also came across several other tools, such as RAFT, which can improve assembly continuity, and NextPolish2, which allows polishing using both short reads and HiFi reads.

I would be grateful for your advice on the following questions:

  1. Can Dorado-corrected reads be considered of high enough quality to be used as HiFi reads in tools that specifically require HiFi data (such as NextPolish2)?
  2. Since hifiasm with the --ont option requires FASTQ input with quality scores for read correction, is it feasible to use Dorado-corrected reads in this mode? If so, would it be appropriate to assign artificial quality scores (such as "I") to these reads?
  3. In your experience, what level of sequencing depth is generally sufficient for Hi-C phasing?
  4. Do Nanopore ultra-long reads substantially improve genome assembly? At present, our assembly still lacks completeness in the acrocentric chromosomes. I am considering whether ultra-long reads could help resolve long repetitive regions, such as rDNA clusters within acrocentric chromosomes, thereby improving the overall assembly quality.
  5. Is it recommended to perform haplotype-aware genome polishing for each haplotype? My plan is to separate the Nanopore and Illumina reads using a phased VCF generated from the following steps:
    1. Conduct variant calling for each haplotype (Clair3 for Nanopore reads; GATK HaplotypeCaller for Illumina reads)
    2. Merge variants by retaining all those identified from Illumina reads, along with INDELs reported from Nanopore reads.
    3. Generate phased variants using whatshap.
    4. Use whatshap to separate both Nanopore and Illumina reads into their respective haplotypes based on the phased variants.
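
The merge rule in step 2 above can be sketched in a few lines. This is only a minimal illustration, assuming each call has already been reduced to a (chrom, pos, ref, alt) record; the `Variant` type and `merge_calls` function are hypothetical names, and genotypes, multi-allelic sites, and indel normalization are ignored.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a real VCF parser type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def is_indel(v: Variant) -> bool:
    # An insertion or deletion changes the allele length.
    return len(v.ref) != len(v.alt)


def merge_calls(illumina: Iterable[Variant], nanopore: Iterable[Variant]) -> List[Variant]:
    """Keep every Illumina call, and add only the INDELs called from Nanopore."""
    merged = set(illumina)
    merged.update(v for v in nanopore if is_indel(v))
    # Sorted by (chrom, pos, ref, alt); chrom order is plain string order here.
    return sorted(merged)


illumina_calls = [Variant("chr1", 100, "A", "G"), Variant("chr1", 200, "AT", "A")]
nanopore_calls = [
    Variant("chr1", 100, "A", "T"),    # SNV from Nanopore: dropped
    Variant("chr1", 150, "C", "CAA"),  # INDEL from Nanopore: kept
    Variant("chr1", 200, "AT", "A"),   # INDEL already in the Illumina set
]
merged_calls = merge_calls(illumina_calls, nanopore_calls)
```

In practice the same rule would be applied to normalized VCF files rather than in-memory tuples.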

Thank you! Regards.

Dorado Genome Assembly hifiasm • 9.7k views

Can Dorado-corrected reads be considered of high enough quality

If you have SUP reads then they should be of high quality.

At present, our assembly still lacks completeness in the acrocentric chromosomes.

Sounds like you have already done/tried some of the things you mention.

It would be interesting to know if you manage to get a T2T-like assembly (which is what you must have in mind). Research papers don't capture the mountain of work (not just the programs run) that goes into the final assembly. It looks like you are trying to accumulate the various types of data that are mentioned in such papers.


Yes, I tried running hifiasm with the --ont option using both the raw reads and the Dorado-corrected reads (with artificial quality "I"). In both cases, almost all chromosomes, except for the acrocentric ones, were assembled to T2T. I did not observe any significant differences between the two results in terms of continuity, QV, or Compleasm evaluation. But I’m not quite sure whether my sequencing depth is sufficient to highlight the differences between these two strategies.
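
For readers who want to reproduce the artificial-quality step, a minimal sketch follows. It assumes the corrected reads are in FASTA format and that a constant quality character is acceptable to the downstream tool; `fasta_to_fastq` is a hypothetical helper, not part of Dorado or hifiasm. In Phred+33 encoding, "I" corresponds to Q40.

```python
import io
from typing import TextIO


def _write_record(out: TextIO, name: str, seq: str, qual_char: str) -> None:
    # One four-line FASTQ record with a constant quality string.
    out.write(f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n")


def fasta_to_fastq(fasta: TextIO, out: TextIO, qual_char: str = "I") -> None:
    """Convert (possibly multi-line) FASTA to FASTQ with one fixed quality character."""
    name = None
    chunks = []
    for line in fasta:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                _write_record(out, name, "".join(chunks), qual_char)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        _write_record(out, name, "".join(chunks), qual_char)


demo_out = io.StringIO()
fasta_to_fastq(io.StringIO(">r1 desc\nACGT\nAC\n>r2\nGG\n"), demo_out)
```

The constant score carries no per-base information, so any tool that weights bases by quality will treat every base as equally reliable.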


with artificial quality "I"

It is a quality score that Dorado feels confident to assign based on correction. The original quality scores are an estimate based on how confident the basecaller is, so this would be no different.

I did not observe any significant differences between the two results in terms of continuity

So your data may already be of very good quality and the error correction did not result in additional improvement.

In both cases, almost all chromosomes, except for the acrocentric ones, were assembled to T2T.

Did you mean to say aligned to T2T reference or were the reads assembled using "T2T reference guided" assembly?


So your data may already be of very good quality and the error correction did not result in additional improvement.

Yes. However, I’m still struggling with the incompleteness of the acrocentric chromosomes and trying to find a way to address this issue.

Did you mean to say aligned to T2T reference or were the reads assembled using "T2T reference guided" assembly?

What I meant is that in my assemblies, I can observe telomere sequences at both ends of most chromosomes.
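
A check of this kind can be sketched as below. It is only an illustration, assuming the canonical human telomere repeat TTAGGG (CCCTAA on the opposite strand); the window size and repeat threshold are hypothetical values chosen for the example, not established cut-offs.

```python
TELOMERE_MOTIFS = ("TTAGGG", "CCCTAA")  # canonical repeat and its reverse complement


def terminal_repeat_count(window_seq: str) -> int:
    """Highest non-overlapping count of either telomere motif in the given window."""
    window_seq = window_seq.upper()
    return max(window_seq.count(motif) for motif in TELOMERE_MOTIFS)


def has_telomeres_at_both_ends(seq: str, window: int = 1000, min_repeats: int = 50) -> bool:
    """True if both the first and last `window` bases contain at least `min_repeats` motif copies.

    Either motif is accepted at either end, because a contig may be
    reported in reverse orientation.
    """
    start_hits = terminal_repeat_count(seq[:window])
    end_hits = terminal_repeat_count(seq[-window:])
    return start_hits >= min_repeats and end_hits >= min_repeats


# Toy contigs: one capped by telomere repeats on both ends, one without any.
capped = "CCCTAA" * 100 + "ACGT" * 500 + "TTAGGG" * 100
uncapped = "ACGT" * 1000
capped_result = has_telomeres_at_both_ends(capped, window=600)
uncapped_result = has_telomeres_at_both_ends(uncapped, window=600)
```

Telomeric repeats at both ends show that a contig reaches the chromosome ends, but they do not by themselves show that the interior is gap-free.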


Hopefully these clarifications will help elicit an answer from someone familiar with human assemblies.

Do you have reads in your data that map to the missing regions (indicating that assembly may be the problem, even though the data is there)? If the data is missing, then no amount of assembly magic is going to work.


Do you have reads in your data that map to the missing regions (indicating that assembly may be the problem, even though the data is there)?

Yes, mapping to CHM13 shows that the missing regions are indeed covered by reads.

In fact, the assembly results contain some small contigs that are almost entirely repetitive sequences, which failed to scaffold into the main contigs.

My concern is that Illumina Hi-C reads are not sufficient to resolve these regions because of their length limitations. Perhaps Nanopore ultra-long reads could address this issue. We have not yet tested them, so I would like to ask whether anyone has experience using ultra-long reads in this context and can confirm their effectiveness.
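
One way to quantify the read-length concern is to count how many alignments merely overlap a missing region versus how many span it end to end. The sketch below assumes the reads were mapped to CHM13 with output in PAF format (the 12 standard tab-separated columns); the function name, the example region, and the `min_mapq` parameter are hypothetical. It counts alignments rather than unique reads, and mapping quality is often low inside repeats, so the result is only a rough indicator.

```python
from typing import Iterable, Tuple


def count_overlapping_and_spanning(
    paf_lines: Iterable[str], chrom: str, start: int, end: int, min_mapq: int = 0
) -> Tuple[int, int]:
    """Return (alignments overlapping the region, alignments spanning all of it).

    PAF target coordinates are 0-based with an exclusive end.
    """
    overlapping = 0
    spanning = 0
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        tname, tstart, tend, mapq = fields[5], int(fields[7]), int(fields[8]), int(fields[11])
        if tname != chrom or mapq < min_mapq:
            continue
        if tstart < end and tend > start:
            overlapping += 1
            if tstart <= start and tend >= end:
                spanning += 1
    return overlapping, spanning


# Toy PAF records: a partial overlap, a spanning read, another chromosome, a low-MAPQ hit.
toy_paf = [
    "r1\t50000\t0\t50000\t+\tchr13\t100000000\t1000\t51000\t49000\t50000\t60",
    "r2\t200000\t0\t200000\t+\tchr13\t100000000\t0\t200000\t190000\t200000\t60",
    "r3\t10000\t0\t10000\t-\tchr14\t100000000\t1000\t11000\t9000\t10000\t60",
    "r4\t10000\t0\t10000\t+\tchr13\t100000000\t40000\t50000\t9000\t10000\t5",
]
strict_counts = count_overlapping_and_spanning(toy_paf, "chr13", 40000, 60000, min_mapq=20)
loose_counts = count_overlapping_and_spanning(toy_paf, "chr13", 40000, 60000)
```

If many alignments overlap a region but few or none span it, longer reads are the kind of data that could change the picture, although this count alone does not confirm that ultra-long reads would resolve the region.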
