Hi all!
For a plant genome assembly project, I have 55x CLR, 28x HiFi and 14x Illumina reads.
Which workflow would get the most out of this data?
Hifiasm alone gave 417 contigs with an L90 of 58, an N50 of 20 Mb and a BUSCO score of 96.8%.
The best scaffolded result so far is 307 scaffolds with an L90 of 46, an N50 of 30 Mb, 0.015% gaps and a BUSCO score of 96.7%. To get there, I corrected and trimmed the CLR reads with Canu, polished them with the HiFi reads using Racon, and then used them to scaffold the hifiasm assembly mentioned above with LRScaf. Finally, I polished the scaffolded assembly with Racon and Pilon.
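For anyone comparing numbers across tools, here is a minimal sketch of how N50 and L90 style statistics are usually derived from a list of contig or scaffold lengths. The function name `nx_lx` and the toy lengths are made up for illustration; they are not taken from the assembly above, and the assemblers' own reports may handle edge cases differently.

```python
def nx_lx(lengths, percent):
    """Return (Nx, Lx) for a list of sequence lengths.

    percent=50 gives (N50, L50), percent=90 gives (N90, L90).
    Nx is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches x% of the assembly.
    Lx is how many sequences were needed to get there.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating point rounding at the threshold.
        if running * 100 >= percent * total:
            return length, count


# Hypothetical toy lengths in Mb, for illustration only.
toy_lengths = [30, 20, 20, 10, 10, 5, 5]
n50, l50 = nx_lx(toy_lengths, 50)
n90, l90 = nx_lx(toy_lengths, 90)
print(f"N50={n50} Mb, L50={l50}, N90={n90} Mb, L90={l90}")
```

A lower L90 with a higher N50, as in the scaffolded result, means fewer and longer sequences cover the same fraction of the assembly. It says nothing about whether the joins are correct.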
Is this already the best approach? Or is it possible that I introduced errors into the hifiasm assembly by adding the CLR information? I know that Hi-C would improve the hifiasm assembly much more, but I was given the task of testing everything that is possible with the data provided.
I would be very thankful for your advice and thoughts on this!
Thank you for your reply lieven.sterck!
It is nice to know that I am on the right track.
Yes, the scaffold length histogram shows that I have around 30 large scaffolds (10 Mb to 60 Mb) and the rest are small (<10 Mb).
The estimated genome size is 1.2 Gb, with 15 chromosomes.
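As a rough sanity check on these figures, the sketch below turns the estimated genome size, chromosome count and coverages from this thread into a mean chromosome length and an approximate sequencing yield per data type. The variable names are illustrative, and the assumption that all chromosomes are the same length is a deliberate simplification.

```python
# Values quoted in this thread.
GENOME_SIZE_BP = 1_200_000_000          # estimated genome size, 1.2 Gb
N_CHROMOSOMES = 15
COVERAGE = {"CLR": 55, "HiFi": 28, "Illumina": 14}   # fold coverage
SCAFFOLD_N50_BP = 30_000_000            # best scaffolded result so far

# Mean chromosome length, ASSUMING equally sized chromosomes (a simplification).
mean_chrom_bp = GENOME_SIZE_BP // N_CHROMOSOMES

# Approximate total bases per data type: coverage x genome size.
yield_bp = {name: fold * GENOME_SIZE_BP for name, fold in COVERAGE.items()}

# How far the scaffold N50 is from that mean chromosome length.
n50_fraction = SCAFFOLD_N50_BP / mean_chrom_bp

print(f"mean chromosome length: {mean_chrom_bp / 1e6:.0f} Mb")
for name, bases in yield_bp.items():
    print(f"{name}: about {bases / 1e9:.1f} Gb of reads")
print(f"scaffold N50 is {n50_fraction:.1%} of the mean chromosome length")
```

Under that assumption, a chromosome averages 80 Mb, so a 30 Mb scaffold N50 still leaves most chromosomes split into several pieces.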
Okay, I am running a CLR-only Flye assembly right now to test it. Or do you mean a Flye assembly with the HiFi reads?
I think Flye accepts all those inputs (or can be tricked into accepting them all ;) ), so I would personally run an assembly with all possible data to start off with.
Flye has quite decent documentation and 'protocols', so go and have a look at it.
Ah interesting, thanks for this tip lieven.sterck!
After reading through some closed issues about hybrid assemblies, the FAQ and the docs, I am running this hybrid approach:
Then I resume the Flye polishing with the HiFi reads, as recommended in the FAQ (under hybrid assembly with HiFi and ONT).