Difficulties with assembly
5 weeks ago

Dear all, I am having a problem with my Fasciola gigantica assembly. My assembly file is 2.4 GB, which is larger than the usual assembly size in NCBI (1.5 GB at most). I used MEGAHIT for the assembly because SPAdes failed with an error on my PC. When I submitted the assembly to NCBI, they told me it was oversized. How can I get an accurate assembly of an acceptable size? Please help. Thank you.

MEGAHIT Eukaryote Assembly • 540 views
5 weeks ago
shelkmike ★ 1.7k

Your assembly probably contains haplotypic duplications or contamination. Did you analyze it with BUSCO? If so, what was the percentage of multicopy orthogroups (marked as "D" in the BUSCO results)?


The BUSCO result shows complete and duplicated BUSCOs at D:13.7%. Here is the full result: C:28.2%[S:14.5%,D:13.7%],F:34.5%,M:37.3%,n:954,E:19.7%. How can I improve the result using MEGAHIT? Please help me.


First, your assembly is of very poor quality. Sorry, but a completeness of 28.2% is very low. In 2025, large genomes are usually assembled from long reads (Nanopore or PacBio), which gives much better assembly quality than short-read-only assemblies.

Second, there are indeed a lot of haplotypic duplications: notice that "D" is almost equal to "S". Haplotypic duplications can be removed with specialized tools such as Purge_dups (https://github.com/dfguan/purge_dups).
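
To make the "D" versus "S" comparison concrete, here is a minimal sketch that parses the BUSCO summary string quoted above and computes what share of the complete BUSCOs are duplicated. The function name parse_busco_summary is hypothetical (it is not part of BUSCO), and the regular expression assumes the one-line summary format shown in this thread.

```python
import re

def parse_busco_summary(line):
    """Return BUSCO percentages keyed by category letter (C, S, D, F, M, ...).

    Hypothetical helper, not part of BUSCO. Only fields written as
    'X:number%' are captured, so the count 'n:954' is skipped.
    """
    pairs = re.findall(r"([A-Z]):(\d+(?:\.\d+)?)%", line)
    return {key: float(value) for key, value in pairs}

# Summary line copied from the post above.
summary = "C:28.2%[S:14.5%,D:13.7%],F:34.5%,M:37.3%,n:954,E:19.7%"
scores = parse_busco_summary(summary)

# Fraction of complete BUSCOs that are present in more than one copy.
duplicated_share = scores["D"] / scores["C"]
print(f"{duplicated_share:.2f}")  # prints 0.49
```

Roughly half of the complete BUSCOs being duplicated is what points to haplotypic duplication here, rather than the absolute value of "D" alone.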

Third, I advise performing a contamination analysis. For example, take 100 or 1000 random contigs, align them to NCBI nt with Megablast, and examine the several best matches for each. It is quite possible that the assembly size is inflated by contamination as well, not only by haplotypic duplications.
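
The random sampling step can be sketched roughly as follows. This is only an illustration: the function names and the file names in the usage note are hypothetical, and it uses reservoir sampling so that a multi-gigabase FASTA file does not have to be held in memory.

```python
import random

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, one contig at a time."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None and line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def sample_contigs(in_path, out_path, n=100, seed=0):
    """Write up to n uniformly sampled contigs to out_path; return the number written.

    Reservoir sampling (Algorithm R): only n contigs are kept in memory at once.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(read_fasta(in_path)):
        if i < n:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = record
    with open(out_path, "w") as out:
        for header, seq in reservoir:
            out.write(f">{header}\n{seq}\n")
    return len(reservoir)
```

For example, sample_contigs("assembly.fasta", "sample.fasta", n=100) would write 100 random contigs (file names assumed), which can then be aligned to nt, for instance with BLAST+ using blastn -task megablast, or through the NCBI BLAST web page.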
