Hardware requirements for de novo assembly of a small plant genome (<1 Gb) with PacBio HiFi
3 months ago
antmantras ▴ 10

Hi all.

I would like to know the hardware requirements for de novo assembly of a relatively small plant genome (around 800 Mb). I have seen older posts, but given the new assemblers and the improvements they have made in the last few years, I think run times and RAM requirements may have come down. So far I have considered several de novo assemblers for PacBio data: Canu, Flye, SMARTdenovo and wtdbg2. Do you have any idea how time-consuming it would be to try different pipelines with those algorithms to obtain the best assembly? I will also use other tools such as Quickmerge or Pilon to correct the assembly against the reference (or against assemblies generated by the other algorithms).

Additionally, I would like to hear your opinions on the coverage the sequencing service has offered us: around 15x with PacBio HiFi. However, according to what I have read, the minimum recommended coverage is usually around 20x. The plant has a reference genome, and they will also sequence our samples with high-quality short reads from Illumina. Could problems arise from the coverage suggested by the sequencing service? The purpose of the study is to better understand different varieties of the same plant. Thank you in advance.

sequencing genome assembly plant de-novo
3 months ago
Peter ▴ 10

wtdbg2 is the least hardware-hungry (in my experience) - maybe less than 24 hours per assembly, and you would want to parameter-scan this.
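For reference, a minimal wtdbg2 run for HiFi data might look like the following (read file name, genome size from the question, and thread count are placeholders to adapt; `-x ccs` is wtdbg2's preset for PacBio CCS/HiFi reads):

```shell
# Build the layout from HiFi reads for an ~800 Mb genome
wtdbg2 -x ccs -g 800m -t 32 -i hifi.fastq.gz -fo asm
# Derive consensus contigs from the layout
wtpoa-cns -t 32 -i asm.ctg.lay.gz -fo asm.ctg.fa
```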

Canu has a HiFi-specific mode. I haven't used it, but I have heard good things.
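For the record, Canu (version 2.0 and later) takes HiFi reads directly via the `-pacbio-hifi` option; the file name and output directory below are placeholders:

```shell
# Canu picks HiFi-appropriate parameters itself in this mode
canu -p asm -d canu_out genomeSize=800m -pacbio-hifi hifi.fastq.gz
```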

15X is low - can you get more?

I don't think correcting an assembly against other generated assemblies is a good approach. You need to polish it using Arrow (although you should check whether this applies to HiFi reads) and/or Pilon.

Pilon - you will need Illumina data (which you are getting), and LOTS of resources: ~2 TB RAM for it to run in a reasonable time frame, over maybe 3-5 iterations.
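One Pilon iteration, sketched with placeholder file names (set `-Xmx` to whatever RAM your machine actually has; the alignments come from the Illumina reads):

```shell
# Align Illumina reads to the draft assembly
bwa index draft.fa
bwa mem -t 32 draft.fa reads_R1.fastq.gz reads_R2.fastq.gz \
  | samtools sort -@ 8 -o illumina.bam -
samtools index illumina.bam
# Polish; --changes records every correction Pilon makes
java -Xmx200G -jar pilon.jar --genome draft.fa --frags illumina.bam \
  --output pilon_r1 --changes
# For further iterations, re-align the reads against pilon_r1.fasta and repeat
```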


Hi Peter. Thank you for your response! I think we cannot get more coverage, at least for now, due to our budget, but if it is strictly necessary we could ask for more. That is why I was considering using Quickmerge with the draft assembly and the published reference genome. Additionally, we will get Illumina reads to correct the assembly. Do you think the low coverage of the PacBio reads could still be a problem?

3 months ago
colindaven ★ 3.8k

Flye is very useful; wtdbg2 is fast but tends to give poorer, less contiguous results; Canu is good but resource-hungry (it might need a week or more on a cluster - the read-correction phase takes ages).
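For comparison, a basic Flye command for HiFi reads (the `--pacbio-hifi` mode exists in Flye 2.8 and later; file name and thread count are placeholders):

```shell
# Flye with the HiFi read type; genome size helps it pick parameters
flye --pacbio-hifi hifi.fastq.gz --genome-size 800m \
     --out-dir flye_out --threads 56
```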

Hardware for this is difficult to say, but it will probably take less than 24-48 hours on a modern 56-core, 512 GB RAM server. Try it.

For assembly correction, use 2 x racon then Medaka.
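One racon round, as a sketch (racon expects reads, overlaps, and the draft in that order; the minimap2 `map-hifi` preset requires minimap2 2.19+, with older versions use `map-pb`). Note this recipe comes from long-read polishing workflows generally; HiFi consensus accuracy is already high, so the gain may be small:

```shell
# Round 1: map reads to the draft, then polish with racon
minimap2 -x map-hifi -t 32 draft.fa hifi.fastq.gz > ovl1.paf
racon -t 32 hifi.fastq.gz ovl1.paf draft.fa > racon1.fa
# Round 2: repeat against the round-1 output
minimap2 -x map-hifi -t 32 racon1.fa hifi.fastq.gz > ovl2.paf
racon -t 32 hifi.fastq.gz ovl2.paf racon1.fa > racon2.fa
```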

Disagree - Pilon does not need 2 TB RAM. I have used it many times for similarly sized genomes without ever needing that much.


Hi colindaven, thanks for your reply! My main concern, as I mentioned to Peter, is our budget. We do not have access to a powerful enough cluster, so we must use external computing services such as GCloud or AWS. Do you think 256 GB RAM and 32 CPU cores could work, at least for testing with Canu? The other algorithms should use fewer resources.

Thanks for the suggestion for the correction step. I have been reading a bit about Medaka; it seems it was developed to derive consensus sequences from Nanopore reads. In our case the sequencing technology will be PacBio - would Medaka fit our data? Should I run it with the default model? It seems to have several models, depending on the type of Nanopore device used.

Finally, what do you think about the coverage suggested by the sequencing service? Could it be too low? Initially, it is the one that fits our budget but I can try to increase it. To address it, I have considered the use of Quickmerge (with our draft assembly and the reference genomes available) and the use of Illumina reads to correct the assembly.


You're correct - Medaka should not be used with PacBio; I missed that.

Assemblers such as hifiasm from Heng Li's group will also be worth a try. I haven't had HiFi data myself, but I hear it is very good. I'm not sure what coverage is required, though. You will definitely get a better assembly with more HiFi coverage, but check HiFi assembly papers for exact details. Why not try the assembly first and see if you are happy with it before increasing the coverage?
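A basic hifiasm run, sketched with placeholder names (the output GFA naming, e.g. `asm.bp.p_ctg.gfa` for the primary contigs, can vary between hifiasm versions - check your run's output):

```shell
hifiasm -o asm -t 32 hifi.fastq.gz
# Convert the primary-contig GFA to FASTA: S lines carry name and sequence
awk '/^S/{print ">"$2"\n"$3}' asm.bp.p_ctg.gfa > asm.p_ctg.fa
```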
