Question: What Improvements Would You Recommend For This Genome Scaffolding Software?
19
6.3 years ago by
Michael Barton1.8k
Akron, Ohio, United States
Michael Barton1.8k wrote:

I've written a software tool that allows genome scaffolds to be reliably reproduced by writing the set of instructions to build the scaffold as a domain specific language. The software, "Scaffolder," parses this instruction file, fetches the corresponding contig sequences, and joins them together into a continuous super-sequence. Separating the contig-joining process into a separate file decouples the data from the steps required to build the scaffold.
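To make the idea concrete, here is a minimal Python sketch of building a scaffold from a separate list of instructions. The instruction layout (a list of dicts with `source`, `reverse` and `unresolved` keys) and the function names are assumptions made up for illustration. They are not Scaffolder's actual file syntax or API.

```python
# Hypothetical instruction format for illustration only,
# NOT Scaffolder's real syntax.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_scaffold(instructions, contigs):
    """Join contigs and gaps in the order given by `instructions`."""
    parts = []
    for step in instructions:
        if "unresolved" in step:
            # A region of unknown sequence is written as a run of Ns.
            parts.append("N" * step["unresolved"])
        else:
            seq = contigs[step["source"]]
            if step.get("reverse", False):
                seq = reverse_complement(seq)
            parts.append(seq)
    return "".join(parts)


contigs = {"contig1": "ATGC", "contig2": "GGAA"}
instructions = [
    {"source": "contig1"},
    {"unresolved": 3},
    {"source": "contig2", "reverse": True},
]
scaffold = build_scaffold(instructions, contigs)
print(scaffold)
```

Because the instructions live apart from the contig data, the same instruction list always rebuilds the same sequence from the same contigs.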

I'm writing on BioStar because I hope this software will be useful to the bioinformatics and genomics community. Any patches, comments or constructive criticism will help improve the software and, ideally, make it a useful resource.

Finally, this software has been submitted to the journal Open Research Computation, so any comments made on this question feed directly into the peer-review process for the article. I believe this could be an interesting approach to peer review, and the comments will add to the suggestions made by the two reviewers.

Please separate suggestions into individual answers so they can be voted on individually. Multiple answers and votes are very welcome.

ADD COMMENTlink modified 6.3 years ago by Maximilian Haeussler1.3k • written 6.3 years ago by Michael Barton1.8k
2

You should probably include a discussion of the pros/cons of your YAML file format vis-a-vis the standard AGP file format in the manuscript.

ADD REPLYlink written 6.3 years ago by Casey Bergman17k
1

The name Scaffolder has already been used for scaffolding software in the original Celera WGS assembler written by Gene Myers: http://www.sciencemag.org/content/287/5461/2196.abstract :(

ADD REPLYlink written 6.3 years ago by Casey Bergman17k

can I vote twice ? :-)

ADD REPLYlink written 6.3 years ago by Pierre Lindenbaum95k

Vote as many times as you like? :) I feel I'm in unexplored territory.

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k

A different name would be useful then to distinguish the software. I spent a while originally trying to think of different names but Scaffolder was the best I could come up with.

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k

Thanks for the suggestion on AGP. I'll look into this format in more detail. Is there a tool that converts AGP into the corresponding scaffold sequence?

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k

How about "contigs2scaffolds" or "Scaffixer", in honor of the first patented scaffolding technology: http://www.scaffoldersforum.com/scaffolders-forum/2089-history-scaffolding.html. Other potential scaffolding related terminology can be found here: http://www.builderbill-diy-help.com/formwork-glossary.html

ADD REPLYlink written 6.3 years ago by Casey Bergman17k

Thanks Casey. Scaffolding related terms are an excellent idea. :)

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k
6
6.3 years ago by
Nick Loman110
Nick Loman110 wrote:

I would like it if Scaffolder could create a starter YAML file from an AGP file, which is a format produced by Newbler amongst others.

Description of the AGP file is here: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml

ADD COMMENTlink written 6.3 years ago by Nick Loman110

Thanks Nick. It should be relatively straightforward to write a conversion script for AGP to YAML. Are there any other common formats in addition to AGP?
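As a rough sketch of what such a conversion might look like, the Python below reads AGP rows and turns them into ordered entries per scaffold. It relies on the tab-delimited AGP column layout (object, start, end, part number, component type, then either component ID, start, end and orientation, or gap length for `N`/`U` rows). The output dictionary keys (`source`, `start`, `stop`, `reverse`, `unresolved`) are hypothetical and are not Scaffolder's actual YAML schema. Serialising the result to YAML is left out.

```python
def agp_to_entries(agp_text):
    """Group AGP rows by scaffold (column 1) into ordered entry dicts."""
    scaffolds = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and AGP comment/header lines
        cols = line.split("\t")
        entries = scaffolds.setdefault(cols[0], [])
        if cols[4] in ("N", "U"):
            # Gap row: column 6 holds the gap length.
            entries.append({"unresolved": int(cols[5])})
        else:
            # Component row: ID, start, end and orientation.
            entries.append({
                "source": cols[5],
                "start": int(cols[6]),
                "stop": int(cols[7]),
                "reverse": cols[8] == "-",
            })
    return scaffolds


# A made-up three-row AGP example: contig, gap, reversed contig.
example = "\n".join([
    "\t".join(["scaffold1", "1", "100", "1", "W", "contig1", "1", "100", "+"]),
    "\t".join(["scaffold1", "101", "150", "2", "N", "50", "scaffold", "yes", "paired-ends"]),
    "\t".join(["scaffold1", "151", "230", "3", "W", "contig2", "1", "80", "-"]),
])
entries = agp_to_entries(example)["scaffold1"]
```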

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k

Not that I'm aware of. But one thing to bear in mind is that short-read assemblers like Velvet produce "scaffolded contigs", which are contigs separated by runs of Ns of known length. It would be quite nice to turn that into a YAML description too.
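A minimal Python sketch of that idea: split a scaffolded sequence on runs of N to recover the contig pieces and the gap lengths between them. The function name and the `min_gap` parameter are assumptions for illustration. Note that this recovers the pieces only, not the original contig identifiers.

```python
import re


def split_scaffold(seq, min_gap=1):
    """Split a scaffold into ("contig", sequence) and ("gap", length) parts."""
    parts = []
    position = 0
    # Any run of at least `min_gap` Ns is treated as a gap.
    for match in re.finditer(r"[Nn]{%d,}" % min_gap, seq):
        if match.start() > position:
            parts.append(("contig", seq[position:match.start()]))
        parts.append(("gap", match.end() - match.start()))
        position = match.end()
    if position < len(seq):
        parts.append(("contig", seq[position:]))
    return parts


parts = split_scaffold("ATGCNNNNGGAANNCC")
```

Raising `min_gap` would keep short runs of ambiguous bases inside a contig rather than treating them as gaps.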

ADD REPLYlink written 6.3 years ago by Nick Loman110

Do these assemblers produce AGP output along with the scaffolded contigs? Otherwise I think it would require using sequence alignment to determine which contigs are in which scaffold. Not impossible, but more room for error.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k
5
6.3 years ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

It would be cool if there were a tool to convert contig-relative coordinates (like those found in GFF3 files) to scaffold-relative coordinates and back using just the Scaffolder file.
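A rough Python sketch of the forward conversion, assuming a hypothetical layout list in which each contig entry carries its `length` and an optional `reverse` flag, and gaps are given as `unresolved` lengths. None of these names come from Scaffolder itself. Coordinates are 1-based and inclusive, as in GFF3, and the sketch assumes contigs are used whole, with no trimming or inserts.

```python
def contig_offsets(layout):
    """Map each contig name to (scaffold offset, length, is_reversed)."""
    offsets = {}
    position = 0
    for entry in layout:
        if "unresolved" in entry:
            position += entry["unresolved"]
        else:
            offsets[entry["source"]] = (
                position, entry["length"], entry.get("reverse", False))
            position += entry["length"]
    return offsets


def to_scaffold(offsets, contig, start, stop):
    """Convert a 1-based inclusive contig interval to scaffold coordinates."""
    offset, length, is_reversed = offsets[contig]
    if is_reversed:
        # On a reverse-complemented contig the interval is mirrored.
        start, stop = length - stop + 1, length - start + 1
    return offset + start, offset + stop


layout = [
    {"source": "contig1", "length": 100},
    {"unresolved": 50},
    {"source": "contig2", "length": 80, "reverse": True},
]
offsets = contig_offsets(layout)
feature = to_scaffold(offsets, "contig2", 1, 10)
```

For a reversed contig the feature's strand would also need flipping, which this sketch does not do.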

ADD COMMENTlink written 6.3 years ago by Jeremy Leipzig17k

Thanks Jeremy. This is very important and something I have been thinking about. If you had a set of contigs that were already annotated, then joining them into a scaffold should produce the corresponding set of combined gene annotation locations. This would make it much simpler to rebuild and update an annotated genome. There is one hurdle to this though. The original annotated contig sizes may be edited in the draft scaffold, which would require changing all the gene coordinates downstream of this point. This is by no means impossible though and is something I have already tried hacking toge

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k

Started working on this http://bit.ly/fNQiaf . Any suggestions are welcome http://bit.ly/i7LlmA.

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k
4
6.3 years ago by
Nick Loman110
Nick Loman110 wrote:

A common problem I have is that I will make an assembly - join contigs together and then leave it in draft form.

Later on, there may be an update to the assembler I use (usually Newbler) - or I may get some new data - and I will re-run the assembly. Often the assembly is materially the same - perhaps a bit improved - but the contig names will have changed.

It would be great if Scaffolder let me port my joins from the original assembly identifiers to the new assembly, carrying over unambiguous joins and flagging any potential discrepancies.

ADD COMMENTlink written 6.3 years ago by Nick Loman110

It should be possible to find identical contigs just by hashing their sequences. Very similar contigs might be identified using an alignment algorithm. Based on this it should be possible to contrast the sequences between builds and highlight differences.
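A minimal Python sketch of the hashing step, assuming each assembly is available as a dict of contig name to sequence. The function names are made up for illustration. Only exact matches are handled here. Reverse-complemented or slightly changed contigs end up in the unmatched list and would need the alignment step.

```python
import hashlib


def sequence_digest(seq):
    """Return a case-insensitive SHA-1 digest of a sequence."""
    return hashlib.sha1(seq.upper().encode("ascii")).hexdigest()


def match_contigs(old, new):
    """Map old contig names to new names where the sequences are identical."""
    # If two new contigs share a sequence, the later one wins in this sketch.
    new_by_digest = {sequence_digest(s): name for name, s in new.items()}
    matched, unmatched = {}, []
    for name, seq in old.items():
        digest = sequence_digest(seq)
        if digest in new_by_digest:
            matched[name] = new_by_digest[digest]
        else:
            unmatched.append(name)  # flag for manual review or alignment
    return matched, unmatched


old = {"contig00001": "ATGCATGC", "contig00002": "GGGTTT"}
new = {"contig00007": "atgcatgc", "contig00009": "GGGTTA"}
matched, unmatched = match_contigs(old, new)
```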

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k
4
6.3 years ago by
Casey Bergman17k
Manchester, UK
Casey Bergman17k wrote:

It would be nice to include provision for describing circular genomes in your YAML file.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Casey Bergman17k

Thanks Casey. That's a good suggestion. So far I have been writing circular genomes in Scaffolder by splitting the first contig so that the origin of replication appears first in the file.

I could add an 'origin:coordinate' attribute which would define where the genome should start in the fasta file. Would that match your suggestion?
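As a sketch of what such an attribute might do, assuming the assembled sequence is circular and `origin` is a 1-based coordinate (both the function and the parameter name are hypothetical):

```python
def rotate_to_origin(seq, origin):
    """Rotate a circular sequence so that 1-based position `origin` comes first."""
    if not 1 <= origin <= len(seq):
        raise ValueError("origin must lie within the sequence")
    index = origin - 1
    return seq[index:] + seq[:index]


rotated = rotate_to_origin("ATGCGT", 3)
```

The rotation keeps the sequence length and content unchanged and only moves the start point.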

ADD REPLYlink written 6.3 years ago by Michael Barton1.8k

This would be good to add, but I was thinking more of a global attribute that somehow describes that the scaffold is circular and that the last contig connects to the first.

ADD REPLYlink written 6.3 years ago by Casey Bergman17k
4
6.3 years ago by
UCSC
Maximilian Haeussler1.3k wrote:

In the manuscript, I cannot find a discussion of other, similar software. What do the sequencing centers use to get their scaffolds? In which way is scaffolder different from the existing software?

ADD COMMENTlink written 6.3 years ago by Maximilian Haeussler1.3k

I'd definitely say this is an important aspect of scene setting for the paper (Full disclosure: I'm editor in chief of ORC)

ADD REPLYlink written 6.3 years ago by Cameron Neylon60

Thanks Max. AFAIK the only other option for generating a scaffold from manually written configuration files is BAMBUS - http://bit.ly/gplWvH. If you know of any other software that does this, suggestions are very welcome.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k

In short, scaffolder aims to take the manual process of producing a larger sequence from individual contigs and make it versionable and reproducible. Write the scaffold file, run scaffolder and you will always get the same output sequence.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k

Googled for "scaffolding software bioinformatics" and found this one: SSPACE http://www.ncbi.nlm.nih.gov/pubmed/21149342, the paper also states that SOAP and Abyss have their own "built-in" scaffolders.

In which way is your Scaffolder different from Bambus?

Are these different from SOPRA (PMID 20576136)?

I remember that the stone-age tool consed must have had a textfile format to do the scaffolding. Is this different from SSPACE or Scaffolder?

Is there any good reason to deviate from the well-established AGP format?

ADD REPLYlink written 6.2 years ago by Maximilian Haeussler1.3k

I can't read the SSPACE article as it's behind a paywall. From the abstract it appears that SSPACE algorithmically joins separate contigs using paired read data. Similarly SOPRA also uses paired read data to join unassembled contigs into a larger sequence.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k

Scaffolder provides no algorithms for scaffolding as there are many tools for this already available. I've tried to review most of these in the introduction of the manuscript. The aim of scaffolder is instead to allow manual editing of genome scaffolds using the readable YAML syntax. The scaffold fasta sequence can then be reliably reproduced from this scaffold syntax file.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k

Compared with Bambus, Scaffolder focuses solely on allowing the manual editing and joining of contigs to produce a genome scaffold. I believe Scaffolder may also be easier to install since it only requires one command line call to the rubygems package management system.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k

Consed requires signing an academic user agreement and providing your IP address before you can download the software. For commercial use the software has to be paid for. In comparison, Scaffolder is open source and MIT licensed.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k

Comparison with the AGP format requires a longer description. Essentially, though, AGP describes how scaffolds are composed of their constituent contigs but, as far as I know, there are no tools that can take AGP as input and produce the described scaffold as output. Therefore you can't edit and build scaffolds using the AGP format as a base. Building a script to convert between Scaffolder and AGP files would, however, allow this. I would also argue that YAML-based formats are easier to read and edit than tab-delimited formats.

ADD REPLYlink written 6.2 years ago by Michael Barton1.8k