Question: finding contigs present in one assembly but missing from another
14 days ago, kallen83 wrote:

I have two pacbio assemblies for the same plant species, and I need to determine if there are regions represented in one assembly that are missing from the other. I have tried using progressiveMauve, and while I suspect that the information I'm looking for is somewhere in the output I'm having a hard time finding it.

Does anyone have a solution to this problem?

As an update, here are stats for one of the assemblies; the other is similar.

number of contigs: 18355

mean contig size: 27903.8

median contig size: 15781

total size: 512174223

Pretty much every contig is big enough to include repetitive elements of some sort, so blastn output is not of much value.

Tags: alignment, assembly, genome (question modified 21 hours ago)

genomax commented 13 days ago (modified 13 days ago):

It would be useful to add the size range of your contigs. Some of the solutions below may not be usable if you have large contigs. Using a program like LASTZ may be your best bet.

toralmanvar commented 14 days ago:

The simplest approach is to use a pairwise alignment tool such as blastall/blastn or BLAT, with one assembly (assembly1) as the query and the other (assembly2) as the database/subject. Any assembly2 contig that never appears as a subject hit for assembly1 is specific to assembly2.
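
A minimal sketch of the "no hit as subject" check described above, assuming the alignments were written in BLAST's standard 12-column tabular layout (qseqid, sseqid, pident, ...). The function name, contig IDs, and example rows are hypothetical and only illustrate the idea:

```python
import csv
import io


def contigs_without_hits(all_ids, blast_tab, column=1):
    """Return the IDs in all_ids that never appear in `column` of a BLAST table.

    blast_tab is any iterable of tab-separated lines in the standard 12-column
    layout (qseqid, sseqid, pident, ...); column=1 selects sseqid.
    """
    seen = set()
    for row in csv.reader(blast_tab, delimiter="\t"):
        # Skip comment lines and rows too short to hold the requested column.
        if len(row) <= column or row[0].startswith("#"):
            continue
        seen.add(row[column])
    return sorted(set(all_ids) - seen)


# Hypothetical contig names and hits, for illustration only.
assembly2_ids = ["a2_c1", "a2_c2", "a2_c3"]
example_hits = io.StringIO(
    "a1_c1\ta2_c1\t99.1\t5000\t40\t5\t1\t5000\t1\t5000\t0.0\t9000\n"
    "a1_c7\ta2_c3\t97.4\t1200\t30\t1\t10\t1209\t300\t1499\t0.0\t2000\n"
)
print(contigs_without_hits(assembly2_ids, example_hits))
```

Note that a single repeat-driven hit is enough to mark a contig as "seen" here, which is the limitation raised elsewhere in this thread; filtering rows by alignment length and identity before calling the function would be one way to reduce that effect.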

kallen83 replied 13 days ago:

The complicating factor with that approach is repetitive elements that are missed by DUST and by the relevant repetitive-element databases. At first glance, it appears I'll need to build a repetitive-element database for this species before I can proceed with something like that.
13 days ago, colindaven wrote:

I would use a bidirectional blastn program for this. You can predict ORFs and use blastx/blastp if you are looking at gene content (which might make it easier to set useful e-value cutoffs), or just use blastn. Long contigs will almost always have some blastn hit, though.

I have used proteinortho for this quite a lot in the past (at least at gene level). It creates a nice summary table.
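
Since long contigs will almost always pick up some blastn hit, one possible refinement is to ask what fraction of each contig is covered by sufficiently long, high-identity alignments rather than whether any hit exists. The sketch below is only an illustration of that idea, not a method proposed in this thread; the function name, the thresholds, and the example coordinates are assumptions:

```python
from collections import defaultdict


def covered_fraction(hits, lengths, min_identity=90.0, min_aln_len=1000):
    """Fraction of each contig covered by alignments that pass the filters.

    hits:    iterable of (contig_id, start, end, percent_identity), with
             1-based inclusive coordinates (start may be greater than end).
    lengths: dict mapping contig_id to contig length in bases.
    """
    intervals = defaultdict(list)
    for contig, start, end, pident in hits:
        lo, hi = sorted((start, end))
        if pident >= min_identity and hi - lo + 1 >= min_aln_len:
            intervals[contig].append((lo, hi))

    fractions = {}
    for contig, length in lengths.items():
        covered = 0
        cur_lo = cur_hi = None
        # Merge overlapping or adjacent intervals so bases are counted once.
        for lo, hi in sorted(intervals.get(contig, [])):
            if cur_hi is None or lo > cur_hi + 1:
                if cur_hi is not None:
                    covered += cur_hi - cur_lo + 1
                cur_lo, cur_hi = lo, hi
            else:
                cur_hi = max(cur_hi, hi)
        if cur_hi is not None:
            covered += cur_hi - cur_lo + 1
        fractions[contig] = covered / length
    return fractions


# Hypothetical alignments and contig lengths, for illustration only.
example_hits = [
    ("c1", 1, 1000, 95.0),
    ("c1", 501, 1500, 95.0),
    ("c1", 3000, 2001, 99.0),  # reverse-strand coordinates
    ("c2", 1, 500, 99.0),      # shorter than min_aln_len, so ignored
]
example_lengths = {"c1": 4000, "c2": 2000}
for contig, frac in covered_fraction(example_hits, example_lengths).items():
    print(contig, round(frac, 3))
```

Contigs with a low covered fraction would then be candidates for regions missing from the other assembly; the cutoff is a judgment call that depends on repeat content.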

13 days ago, Antonio R. Franco (Spain. Universidad de Córdoba) wrote:

Why don't you run a Dotlet-like program, such as those now being used to compare a genome with an optical map? I am referring to DAGChainer. Get the idea from this paper


kallen83 replied 21 hours ago:

Thanks, I'll have a look at that.
13 days ago, mbens (Germany) wrote:

CD-HIT should be worth a try. CD-HIT is able to cluster sequences quite fast based on user-defined similarity thresholds. Contigs not assigned to clusters should be unique to the respective assembly.
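
In CD-HIT's `.clstr` output, a sequence that matched nothing still appears, as a cluster with a single member. A rough sketch of how that file could be scanned for clusters whose members all come from one assembly follows. It assumes both assemblies were concatenated into one FASTA file before clustering and that contig IDs carry a hypothetical assembly prefix such as `asm1_` or `asm2_`; the function names are also made up for this example:

```python
import re

# Member lines look like: "0\t15781nt, >asm1_contig5... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Parse CD-HIT .clstr lines into a list of clusters (lists of sequence IDs)."""
    clusters = []
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            clusters.append([])
        elif line:
            match = MEMBER_RE.search(line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return clusters


def single_assembly_contigs(clusters, label_of=lambda cid: cid.split("_", 1)[0]):
    """Group members of clusters whose sequences all share one assembly label."""
    unique = {}
    for members in clusters:
        labels = {label_of(m) for m in members}
        if len(labels) == 1:
            unique.setdefault(labels.pop(), []).extend(members)
    return unique


# Hypothetical .clstr content, for illustration only.
example = """>Cluster 0
0\t15781nt, >asm1_contig5... *
1\t15000nt, >asm2_contig9... at +/98.50%
>Cluster 1
0\t9000nt, >asm2_contig77... *
>Cluster 2
0\t8000nt, >asm1_contig3... *
1\t7500nt, >asm1_contig8... at -/97.10%
"""
print(single_assembly_contigs(parse_clstr(example.splitlines())))
```

CD-HIT truncates sequence names in the `.clstr` file to a fixed description length unless told otherwise, so short, distinct contig IDs make this parsing more reliable.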

13 days ago, aindap (United States) wrote:

Have you tried Assemblytics? You can use MUMmer to align your two assemblies and then use Assemblytics to see the differences.
