Dear all, I am having a problem with my Fasciola gigantica assembly. My assembly file is 2.4 GB, which is larger than the usual assembly size in NCBI (at most 1.5 GB). I used MEGAHIT for the assembly because SPAdes failed with an error on my PC. When I submitted the assembly to NCBI, they told me it was oversized. How can I get an accurate result of acceptable size? Please help me. Thank you.
The BUSCO result shows 13.7% complete and duplicated BUSCOs (D:13.7%). Here is the full result: C:28.2%[S:14.5%,D:13.7%],F:34.5%,M:37.3%,n:954,E:19.7%. How can I improve the result using MEGAHIT? Please help me.
First, your assembly is very poor. Sorry, but a BUSCO completeness of 28.2% is very low. In 2025, large genomes are usually assembled from long reads (Nanopore or PacBio), which give significantly better assembly quality than short-read-only assemblies.
Second, there are indeed a lot of haplotypic duplications. See that "D" (13.7%) is almost equal to "S" (14.5%)? Haplotypic duplications can be removed with specialized tools like Purge_dups (https://github.com/dfguan/purge_dups).
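To put a number on the "D is almost equal to S" observation, here is a minimal Python sketch that parses the one-line BUSCO summary you posted and computes what share of the complete BUSCOs are duplicated. The function name `parse_busco_summary` is my own helper for illustration, not part of BUSCO.

```python
import re

def parse_busco_summary(summary: str) -> dict:
    """Parse a one-line BUSCO summary into {letter: number}.

    Hypothetical helper for illustration only; it simply picks up
    every "<letter>:<number>" pair found in the string.
    """
    pairs = re.findall(r"([CSDFMEn]):(\d+(?:\.\d+)?)", summary)
    return {key: float(number) for key, number in pairs}

# Summary line copied from the question above.
summary = "C:28.2%[S:14.5%,D:13.7%],F:34.5%,M:37.3%,n:954,E:19.7%"
scores = parse_busco_summary(summary)

# Share of complete BUSCOs that were found more than once.
dup_share = scores["D"] / scores["C"]
print(f"Duplicated share of complete BUSCOs: {dup_share:.1%}")
```

With your numbers, nearly half of the complete BUSCOs are duplicated, which is why I suspect haplotypic duplication.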
Third, I advise performing a contamination analysis. For example, take 100 or 1000 random contigs, align them with Megablast against NCBI nt, and examine the top few matches. It's quite possible that the assembly size is inflated by contamination too, not only by haplotypic duplications.
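The random sampling step could be done with a small Python script like the sketch below. It uses only the standard library and reservoir sampling, so the whole 2.4 GB file never has to be held in memory. The function names and the file names in the final comment are my assumptions (MEGAHIT normally writes its contigs to final.contigs.fa, but check your own output folder).

```python
import random
from typing import Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)

def read_fasta(path: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA file, one at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def sample_contigs(path: str, k: int, seed: int = 0) -> List[Record]:
    """Pick k contigs uniformly at random in one pass (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, record in enumerate(read_fasta(path)):
        if i < k:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir

def write_fasta(records: List[Record], path: str) -> None:
    """Write records as FASTA with one sequence line per contig."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n{seq}\n")

# Example call (file names are assumptions, adjust to your data):
# write_fasta(sample_contigs("final.contigs.fa", k=100), "sample_100.fa")
```

The resulting sample file can then be used as the query for a Megablast search against nt (for example with blastn and its megablast task, or through the NCBI web BLAST page). If many of the best hits are bacterial or host sequences rather than trematodes, contamination is a likely contributor to the assembly size.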