Question: Troubleshooting assembly contigs of large genome
6 months ago, arunprasanna83 wrote:

Hello,

Here is my strange situation: I assembled the genome of the same sample (haploid source) with two different methods. The assembly sizes from the two methods are:

Method1 = 900 Mb (5400 contigs >10kb)

Method2 = 500 Mb (7000 contigs >10kb)

I suspected duplication in Method 1 and checked completeness with BUSCO. Surprisingly, both methods gave similar completeness values, with no duplication reported for Method 1. Hence, I am very curious to know where the extra 400 Mb is coming from. To find out, I am trying to align the two assemblies and visualize the alignments, but because of the large file sizes almost every method is failing. For instance, I tried:

  1. minidot: installation fails with an error, even after repeated attempts

  2. LASTZ alignment -> MAF -> AliTV: fails at the alignment step itself

  3. mummer/nucmer: the given length exceeds the allowed limit (I am using the 64-bit version; it still fails)

  4. LAST: generates around 300 GB of MAF output, which no downstream application can read

  5. Gepard: hangs!
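
For a MAF file too large for downstream tools, one possible workaround is to stream it line by line and keep only a per-sequence tally. The sketch below is an illustration, not part of LAST: it assumes standard MAF "s" lines (name, start, aligned size, strand, sequence length, text), and the function name and the file name in the comment are hypothetical.

```python
from collections import Counter


def aligned_bases_per_sequence(lines):
    """Sum the aligned size of every MAF 's' line, keyed by sequence name.

    `lines` can be any iterable of text lines, e.g. an open file handle,
    so the whole file never has to sit in memory.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        # MAF sequence lines look like: s name start size strand srcSize text
        if len(fields) >= 6 and fields[0] == "s":
            totals[fields[1]] += int(fields[3])
    return totals


if __name__ == "__main__":
    # Tiny in-memory example; for a real file one would instead use
    #   with open("alignments.maf") as handle:   # hypothetical file name
    #       totals = aligned_bases_per_sequence(handle)
    demo = [
        "a score=100",
        "s m1_ctgA 0 10 + 5000 ACGTACGTAC",
        "s m2_ctg1 20 10 + 3000 ACGTACGTAC",
        "",
        "a score=80",
        "s m1_ctgB 5 8 + 4000 ACGTACGT",
        "s m2_ctg1 20 8 + 3000 ACGTACGT",
    ]
    for name, bases in aligned_bases_per_sequence(demo).items():
        print(name, bases)
```

Overlapping alignments are counted more than once, so a total well above a contig's own length hints that the same region aligns to several places in the other assembly.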

I feel like I have hit a dead end. Kindly let me know how to handle this situation. I am very curious to know where these extra sequences come from!

Thanks in advance.

written 6 months ago by arunprasanna83

Your hunch is probably not that far off: the extra sequence is likely due to redundancy in Method 1.

BUSCO might not show this because it only looks at the genic part of the assembly; the redundancy might very well be in gene-poor (or even gene-less) regions.

Which version of mummer are you trying to run?

Why not give good old BLAST a try? If it's simply to get a first idea, you can get that from blast(n) output as well.

written 6 months ago by lieven.sterck

I am using mummer 3.2.3. As @gconcepcion mentioned, I will try version 4. How does blastn help?

written 6 months ago by arunprasanna83

Well, you could quickly "align" the sequences to each other (blast set 1 against set 2, and/or vice versa) and see whether a single query gets multiple (2?) hits in the other set.
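
As a rough sketch of that check, assuming the blastn results were saved in the default 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, ...), one could list the queries that hit more than one contig in the other assembly. The function names, the identity and length cutoffs, and the contig names in the demo are all hypothetical choices for illustration.

```python
from collections import defaultdict


def subjects_per_query(lines, min_identity=95.0, min_length=10_000):
    """Map each query contig to the set of subject contigs it hits.

    Only hits at or above the (arbitrary) identity and length cutoffs count.
    """
    hits = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        if identity >= min_identity and length >= min_length:
            hits[query].add(subject)
    return hits


def queries_with_multiple_subjects(hits, min_subjects=2):
    """Keep only queries that hit at least `min_subjects` different contigs."""
    return {q: sorted(s) for q, s in hits.items() if len(s) >= min_subjects}


if __name__ == "__main__":
    # Made-up rows: one Method 2 contig hitting two Method 1 contigs,
    # and another hitting only one.
    demo = [
        "m2_ctg1\tm1_ctgA\t99.2\t15000\t120\t0\t1\t15000\t1\t15000\t0.0\t27000",
        "m2_ctg1\tm1_ctgB\t98.7\t14800\t190\t2\t1\t14800\t501\t15300\t0.0\t26000",
        "m2_ctg2\tm1_ctgC\t99.5\t20000\t100\t0\t1\t20000\t1\t20000\t0.0\t36000",
    ]
    print(queries_with_multiple_subjects(subjects_per_query(demo)))
```

Counting distinct subject contigs is only a first approximation: it misses duplication within a single contig, and repeats will also produce multiple hits, so the cutoffs would need tuning.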

modified 6 months ago • written 6 months ago by lieven.sterck

If you want to use mummer, make sure you are using version 4.0.0: https://github.com/mummer4/mummer/releases

D-GENIES is another dotplot option that works well for large genomes: http://dgenies.toulouse.inra.fr/

Are these assemblies from long reads? What assemblers were used? FALCON & Canu?

written 6 months ago by gconcepcion

I used mummer 3.2.3; I will give version 4 a try! By the way, the D-GENIES web version failed, and a local installation is not user-friendly. The assemblies are from long reads and were assembled with Canu.

written 6 months ago by arunprasanna83