Question: Comparing Assemblies
5
gravatar for Daniel Standage
9.0 years ago by
Daniel Standage3.9k
Davis, California, USA
Daniel Standage3.9k wrote:

I'm working with two transcriptome assemblies. They're essentially different versions of the same transcriptome, one of them just includes more data and is therefore (presumably) better quality. A paper was published a couple of years ago with the old assembly, and we're now preparing a manuscript with the new assembly. We want to know which contigs from the old assembly correspond to which contigs from the new assembly.

I have the consensus sequence of all of the contigs from each assembly. I used the formatdb utility to create a BLAST database of the sequences from the old assembly, and then I BLASTed the sequences from the new assembly against this database. So now I just need to process the BLAST results and determine the relationship between contigs from one version to another.

I guess another way of putting it is that we want to determine the synteny between the two versions of the assembly.

Any suggestions about how to proceed?

assembly blast • 3.4k views
ADD COMMENTlink written 9.0 years ago by Daniel Standage3.9k

Maybe you would receive a more specific answer if you could tell us why you want to link the old contigs to the new ones. Are you interested in every single one or in a subset?

ADD REPLYlink written 9.0 years ago by Michael Schubert6.9k

Sorry I've been out of it for over a week with a recent move. To be more specific, what I would really like is to know which contig from assembly A corresponds to which contig from assembly B. This would be easy if it was a 1-to-1 correspondence, but that is highly unlikely. The newer assembly has more sequence data and fewer contigs, so I'm assuming it is better quality. I guess I can hope for a 1-to-many relationship between the assemblies, but I don't know how I would handle a many-to-many relationship. Does this make sense?

ADD REPLYlink written 9.0 years ago by Daniel Standage3.9k
9
gravatar for Haibao Tang
9.0 years ago by
Haibao Tang3.0k
Mountain View, CA
Haibao Tang3.0k wrote:

First of all, your data is transcriptome assembly, not chromosomal or genome assembly, therefore synteny is not applicable here. Synteny can be used to check the differences between contiguous sequences - however, transcripts are assembled separately into non-contiguous 'islands'.

I would collect a few statistics to quantify the difference, for example:

  • How much of assembly version 1 is present in 2 (coverage)?
  • Are there nucleotide differences between contig sequences of the two assemblies?
  • Are any of the changes due to different genotypes used, sequencing technologies or assembly software?

I think any sequence alignment software will do, including BLAST.

ADD COMMENTlink modified 10 months ago by RamRS23k • written 9.0 years ago by Haibao Tang3.0k
4
gravatar for Neilfws
9.0 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

The key part of your question is:

So now I just need to process the BLAST results

There's an easy option and a more involved option. The easy option is to run BLAST so that it generates tab-delimited output. In the latest version of BLAST, use "-outfmt 6"; older versions (where you run "blastall") use "-m 8". This makes the output easy to open and process using the tools of your choice (spreadsheets, database import, shell tools...). It should be easy to extract query name, hit name and the start/end coordinates for both.

The more involved option is to write a BLAST parser in the language of your choice - but don't reinvent the wheel. All of the Bio* projects (e.g. Bioperl, Biopython and Bioruby) include libraries to do this. Bioperl, for example, used the Bio::SearchIO module and here is an introduction to its usage.

ADD COMMENTlink modified 10 months ago by RamRS23k • written 9.0 years ago by Neilfws48k

I agree that this would easily produce a list of 'most similar' contigs between the new and old assembly. For unambiguous mapping, one could define a contig pair as (1) new one most similar to old one and (2) vice versa. The question is how to resolve ambiguities when there is no 1:1 mapping.

ADD REPLYlink written 9.0 years ago by Michael Schubert6.9k

Sorry, I've been out of it for over a week with a recent move. @neilfws, before writing the post, I used BLAST with the -m 8 option. So I guess when I say "I need to process the BLAST results" I mean I need to know what to do with them. I feel very comfortable with parsing and text processing, but once I've loaded the data into memory, what do I do then? That's the question I was trying to ask, sorry for the confusion.

ADD REPLYlink modified 10 months ago by RamRS23k • written 9.0 years ago by Daniel Standage3.9k
0
gravatar for Käpt'N Silico
7.3 years ago by
Käpt'N Silico20 wrote:

Go for compressed XML output with BLAST. Tabular format used to change and is not supported or recommended by some parsers any more. XML contains all information in a well-defined format and can still be converted to any other format that BLAST can produce.

ADD COMMENTlink written 7.3 years ago by Käpt'N Silico20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1264 users visited in the last hour