Question

Difficulties with assembly

0

12 weeks ago

Mokammel Hossain • 0

Dear all, I am having a problem with my Fasciola gigantica assembly. The assembly file I obtained is 2.4 GB, which is larger than the usual assembly size in NCBI (maximum 1.5 GB). I used MEGAHIT for the assembly because SPAdes failed with an error on my PC. When I submitted the assembly to NCBI, they told me it was oversized. How can I get an accurate result with an acceptable size? Please help me. Thank you.

MEGAHIT Eukaryote Assembly • 631 views


score 0 · Answer 1 · 2025-08-06

0

12 weeks ago

shelkmike ★ 1.8k

Probably there are haplotypic duplications or contamination in your assembly. Did you analyze the assembly with BUSCO? If you did, what was the percentage of multicopy orthogroups (marked as "D" in BUSCO results)?


0

In the BUSCO result, complete and duplicated BUSCOs are D:13.7%. Here is the full result: C:28.2%[S:14.5%,D:13.7%],F:34.5%,M:37.3%,n:954,E:19.7%. How can I improve the result using MEGAHIT? Please help me.
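
In case it helps to read the summary line numerically, here is a minimal Python sketch (not part of BUSCO itself) that splits it into fields and computes what share of the complete BUSCOs is duplicated. The function names `parse_busco_summary` and `duplicated_share` are made up for illustration, and the regular expression assumes the one-line format quoted above.

```python
import re


def parse_busco_summary(summary: str) -> dict:
    """Parse a one-line BUSCO summary into a dict.

    Assumed input format (taken from the post above):
    'C:28.2%[S:14.5%,D:13.7%],F:34.5%,M:37.3%,n:954,E:19.7%'
    Percent fields become floats, the count 'n' becomes an int.
    """
    fields = {}
    # Each field is a single letter, a colon, a number and an optional '%'.
    for key, value, pct in re.findall(r"([A-Za-z]):(\d+(?:\.\d+)?)(%?)", summary):
        fields[key] = float(value) if pct else int(value)
    return fields


def duplicated_share(fields: dict) -> float:
    """Fraction of complete BUSCOs (C) that are duplicated (D)."""
    if fields["C"] == 0:
        raise ValueError("no complete BUSCOs, the share is undefined")
    return fields["D"] / fields["C"]
```

Called on the line above, `duplicated_share(parse_busco_summary(...))` divides 13.7 by 28.2, so roughly half of the complete BUSCOs are present in more than one copy.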

Reply • 12 weeks ago by Mokammel Hossain • 0

1

First, your assembly is very poor. Sorry, but a completeness of 28.2% is very low. In 2025, large genomes are usually assembled with long reads (Nanopore or PacBio), which gives significantly better assembly quality than short-read-only assemblies.

Second, there are indeed a lot of haplotypic duplications: note that "D" is almost equal to "S". Haplotypic duplications can be removed with specialized tools such as Purge_dups (https://github.com/dfguan/purge_dups).

Third, I advise performing a contamination analysis. For example, take 100 or 1000 random contigs, align them with Megablast against NCBI nt and examine the top few matches for each. It is quite possible that the assembly size is inflated by contamination as well, not only by haplotypic duplications.
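
The random-contig step could be sketched in plain Python roughly as follows. This is only an illustration under some assumptions: the assembly is an uncompressed FASTA file, and the helper names `read_fasta`, `sample_contigs` and `write_fasta` are hypothetical. It uses reservoir sampling so that a multi-gigabyte assembly does not have to be held in memory.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(path: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from an uncompressed FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sample_contigs(path: str, k: int, seed: int = 0) -> List[Record]:
    """Pick k contigs uniformly at random in one pass (reservoir sampling).

    If the file has fewer than k contigs, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for i, record in enumerate(read_fasta(path)):
        if i < k:
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


def write_fasta(records: Iterable[Record], path: str, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    with open(path, "w") as out:
        for header, seq in records:
            out.write(f">{header}\n")
            for start in range(0, len(seq), width):
                out.write(seq[start:start + width] + "\n")
```

For example, `write_fasta(sample_contigs("final.contigs.fa", 100), "sample_100.fa")` would write 100 random contigs to a new file (the input name here is just MEGAHIT's usual output name and may differ in your run). The resulting file can then be searched against nt with Megablast, for instance through the NCBI BLAST web page.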

Reply • 12 weeks ago by shelkmike ★ 1.8k