Hi Everyone,
A bit of background.
I am working with fungal genomes, where I have to generate de novo genome assemblies for roughly 45 Illumina samples and 12 Oxford Nanopore long-read samples. These 12 ONT (Oxford Nanopore Technologies) samples are also included in the 45 Illumina samples.
Now, as soon as I get the sequences, I will be running multiple assemblers:
Illumina: SPAdes + others for testing and comparison
ONT: Canu, Flye, SPAdes for hybrid assembly and also ONT-only assembly.
The question that came up is: what would mark a genome assembly as a "failed" assembly for a particular assembler? Are there any specific guidelines or a set of criteria that need to be met?
Your kind suggestions/thoughts on this would mean a lot. Thank you.
I would urge you to consider different approaches for benchmarking before deciding on one. In particular, I found that assembling long reads and short reads together (hybrid) doesn't necessarily give you a better assembly.
For the samples where you have long+short reads, I would suggest,
For the samples with no long reads, SPAdes usually produces good assemblies for the size of genome you are dealing with. For more heterozygous genomes, Platanus has also produced good assemblies for me in the past. Once you have finished assembling the short-read-only samples, you should run steps 5 and 6 on them as well.
In terms of assessing your assembly and whether it has "failed", there is no single metric from the stats that would tell you that. I don't like judging assemblies based on stats alone, because they can be misleading and they tell you nothing about how well the reads assembled, i.e. the biology of the assembly. That is where BUSCO comes in; in my case I would sacrifice N50, lengths, etc. for higher BUSCO scores.
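To make the point about stats concrete, here is a minimal Python sketch of how N50 is computed from a list of contig lengths. The function name `n50` and the contig lengths are made up for illustration; this is not taken from any assembler or QC tool.

```python
def n50(contig_lengths):
    """Return the N50: the contig length L such that contigs of
    length >= L together cover at least half of the assembly."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Two hypothetical assemblies with the same total length (100 kb)
fragmented = [10_000] * 10
contiguous = [60_000, 30_000, 10_000]

print(n50(fragmented))  # 10000
print(n50(contiguous))  # 60000
```

Both toy assemblies contain the same number of bases, yet the N50 differs sixfold. The number only describes contiguity; it says nothing about whether the genes are actually present and intact, which is why a completeness measure such as BUSCO is worth looking at alongside it.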
If, however, you get a significantly lower genome length than you expect, there is a good chance that you either have low coverage or high contamination (which reduces the effective coverage of your target genome even when the total number of reads is high).
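As a rough illustration of the coverage point, the sketch below estimates coverage as total sequenced bases divided by expected genome size, and shows how contaminant reads lower the effective coverage. All numbers (a 40 Mb genome, 4 Gb of reads, 400 Mb of contaminant bases) and the function name are assumptions for the example, not values from this thread.

```python
def estimated_coverage(total_bases, genome_size):
    """Approximate sequencing depth: sequenced bases per genome base."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Hypothetical numbers, for illustration only
genome_size = 40_000_000          # expected genome size in bp
total_read_bases = 4_000_000_000  # all sequenced bases
contaminant_bases = 400_000_000   # bases that belong to contaminants

raw = estimated_coverage(total_read_bases, genome_size)
effective = estimated_coverage(total_read_bases - contaminant_bases, genome_size)

print(raw)        # 100.0
print(effective)  # 90.0
```

The same ratio can be applied to the assembly itself: dividing the assembled length by the expected genome size gives a quick signal of whether the assembly is much shorter than expected.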
One caveat to the method I suggested is that you need higher coverage from long reads (over 30x) for it to yield good results. If you don't have that, you can stick to the hybrid assembly approach, where you co-assemble the long and short reads together.
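The rule of thumb above can be written down as a small decision helper. This is only a sketch of the 30x threshold mentioned in this post; the function name and the returned labels are hypothetical, and the threshold is a guideline rather than a hard cutoff.

```python
MIN_LONG_READ_COVERAGE = 30  # "over 30x", as suggested in the post above


def choose_strategy(long_read_coverage):
    """Pick an assembly strategy from the long-read coverage (None = no long reads)."""
    if long_read_coverage is None or long_read_coverage <= 0:
        return "short-read-only assembly"
    if long_read_coverage > MIN_LONG_READ_COVERAGE:
        return "suggested long-read approach"
    return "hybrid co-assembly of long and short reads"


print(choose_strategy(75))    # suggested long-read approach
print(choose_strategy(20))    # hybrid co-assembly of long and short reads
print(choose_strategy(None))  # short-read-only assembly
```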
Hope this helps and all the best.
Hi, thank you for the detailed response.
Let me add some more information. Long-read coverage is ~75X. Short-read coverage is ~100X.
For the short-read samples, I am planning to go with SPAdes, as I got good results: a reasonable number of contigs, good assembly stats, and BUSCO scores above 90%.
For the long-read samples, the initial plan was to do hybrid assembly from the start, but based on your reply, and as I have higher coverage, I will also test your approach. I just need some clarifications.
What do you think of this approach? Specifically, the polishing part with Pilon and Racon.
Thanks.
I'm a little confused about your experimental design. Are you making 45 different assemblies? Or are all the samples from the same individual? Or are you making a pangenome assembly?
What do you mean by "fail"? The tools will likely emit even a very fractured assembly given bad data.
Yes. We are sequencing 45 different samples with Illumina. 12 of these are also going to be sequenced with Nanopore (for hybrid assembly). A pangenome is also a future target.
The current target is to sequence the strains, check whether any of them have chromosomes that are not core chromosomes, check whether there are any pathogenicity-related genes, etc.
By “fail”, what I want to ask is
I'm not sure you'll be able to identify whole chromosomes that are not core from Illumina reads alone. That might be easier to do with a microscope. Illumina reads alone will not result in a chromosome-level assembly. You'll likely get an okay idea of unique contigs though.
You could also check for BUSCOs to give you an idea of genome completeness. But if you're expecting extra chromosomes in some strains, then surely your size estimate wouldn't really be that useful. Or am I missing something?
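If it helps when comparing many assemblies, the BUSCO scores can be pulled out of the one-line summary programmatically. The sketch below assumes the usual one-line notation `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` found in BUSCO short summary files; the example string and its values are made up, and the function name is hypothetical.

```python
import re

# Assumed one-line BUSCO notation: C:..%[S:..%,D:..%],F:..%,M:..%,n:..
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text):
    """Extract complete/single/duplicated/fragmented/missing percentages."""
    match = BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))  # number of BUSCO groups searched
    return scores


# Made-up summary line, for illustration only
example = "C:95.0%[S:94.0%,D:1.0%],F:2.0%,M:3.0%,n:758"
scores = parse_busco_summary(example)
print(scores["C"], scores["M"])  # 95.0 3.0
```

Unlike a comparison against an expected genome size, these percentages do not depend on knowing how large each strain's genome should be.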
I know Illumina will only give me high-quality contigs. The samples that are to be sequenced with both long and short reads will be used as references to further join the contigs, or at least to identify the accessory and core contigs.