Question: What improvements would you recommend for this genome scaffolding software?
 
16
 
 

I've written a software tool that allows genome scaffolds to be reliably reproduced by writing the set of instructions to build the scaffold as a domain-specific language. The software, "Scaffolder," parses this instruction file, fetches the corresponding contig sequences, and joins them together into a continuous super-sequence. Keeping the contig-joining instructions in their own file decouples the data from the steps required to build the scaffold.

I'm writing on BioStar because I hope this software will be useful to the bioinformatics and genomics community. Any patches, comments or constructive criticism will improve the software and, ideally, make it a useful resource.

In addition, this software has been submitted to the journal Open Research Computation, so any comments made on this question feed directly into the peer-review process for the article. I believe this could be an interesting approach to peer review and will add to the suggestions made by the two reviewers.

Please separate suggestions into individual answers so they can be voted on individually. Multiple answers and votes are very welcome.

 
 
 

can I vote twice ? :-)

log in to reply • written 14 months ago by Pierre Lindenbaum ♦♦ 351432768
 

Vote as many times as you like? :) I feel like I'm in unexplored territory.

log in to reply • written 14 months ago by Michael Barton  163314
 

1

The name Scaffolder has already been used for scaffolding software in the original Celera WGS assembler written by Gene Myers: http://www.sciencemag.org/content/287/5461/2196.abstract :(

log in to reply • written 14 months ago by Casey Bergman  123921131
 
1

You should probably include a discussion of the pros/cons of your YAML file format vis-a-vis the standard AGP file format in the manuscript.

log in to reply • written 14 months ago by Casey Bergman  123921131
 

A different name would be useful then to distinguish the software. I spent a while originally trying to think of different names but Scaffolder was the best I could come up with.

log in to reply • written 14 months ago by Michael Barton  163314
 

Thanks for the suggestion on AGP. I'll look into this format in more detail. Is there a tool that converts AGP into the corresponding scaffold sequence?

log in to reply • written 14 months ago by Michael Barton  163314
 

How about "contigs2scaffolds" or "Scaffixer", in honor of the first patented scaffolding technology: http://www.scaffoldersforum.com/scaffolders-forum/2089-history-scaffolding.html. Other potential scaffolding related terminology can be found here: http://www.builderbill-diy-help.com/formwork-glossary.html

log in to reply • written 14 months ago by Casey Bergman  123921131
 

Thanks Casey. Scaffolding related terms are an excellent idea. :)

log in to reply • written 13 months ago by Michael Barton  163314

5 answers

 
6
 
 

I would like it if Scaffolder could create a starter YAML file from an AGP file, which is a format produced by Newbler amongst others.

Description of the AGP file is here: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml

 
 
 

Thanks Nick. It should be relatively straightforward to write a conversion script for AGP to YAML. Are there any other common formats in addition to AGP?

log in to reply • written 14 months ago by Michael Barton  163314
 

Not that I'm aware of. But one format to be aware of is the output of short-read assemblers like Velvet, which produce "scaffolded contigs": contigs separated by runs of Ns of known length. It would be quite nice to turn that into a YAML description too.

log in to reply • written 14 months ago by Nick Loman  102
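The "contigs separated by Ns" layout described above can be taken apart with a regular expression. The following is only a minimal Python sketch: the function name `split_scaffold` and the `min_gap` parameter are hypothetical and are not part of Scaffolder or Velvet.

```python
import re


def split_scaffold(sequence, min_gap=1):
    """Split a scaffolded contig into ("contig", seq) and ("gap", length) parts.

    Hypothetical helper: any run of at least `min_gap` N characters is
    treated as a gap of known length between two contigs.
    """
    parts = []
    position = 0
    for match in re.finditer(r"[Nn]{%d,}" % min_gap, sequence):
        if match.start() > position:
            parts.append(("contig", sequence[position:match.start()]))
        parts.append(("gap", match.end() - match.start()))
        position = match.end()
    if position < len(sequence):
        parts.append(("contig", sequence[position:]))
    return parts


if __name__ == "__main__":
    for kind, value in split_scaffold("ACGTNNNNNGGCC"):
        print(kind, value)
```

Each part could then be written out as one entry in a scaffold description. Raising `min_gap` would avoid splitting on isolated ambiguous bases inside a contig.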
 

Do these assemblers produce AGP output along with the scaffolded contigs? Otherwise I think it would require using sequence alignment to determine which contigs are in which scaffold. Not impossible but more room for error.

log in to reply • written 13 months ago by Michael Barton  163314
 
 
5
 
 

It would be cool if there was a tool to convert contig-relative coordinates (like those found in gff3 files) to scaffold-relative coordinates and back using just the Scaffolder file.

 
 
 

Thanks Jeremy. This is very important and something I have been thinking about. If you had a set of contigs that were already annotated then joining them into a scaffold should produce the corresponding set of combined gene annotation locations. This would make it much simpler to rebuild and update an annotated genome. There is one hurdle to this though. The original annotated contig sizes may be edited in the draft scaffold, which would require changing all the gene coordinates downstream of this point. This is by no means impossible though and is something I have already tried hacking toge

log in to reply • written 14 months ago by Michael Barton  163314
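The coordinate conversion discussed in this answer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Scaffolder's implementation: the `layout` structure and all function names are hypothetical, coordinates are 1-based and inclusive as in gff3, and contigs are assumed to be placed whole, so the trimming and editing hurdle mentioned above is not handled.

```python
def contig_offsets(layout):
    """Map each contig name to (offset, length, reverse) within the scaffold.

    `layout` is a hypothetical ordered list of dicts: {"name", "length"} for
    a contig (optionally "reverse": True) or {"gap": n} for unresolved bases.
    """
    offsets = {}
    cursor = 0
    for entry in layout:
        if "gap" in entry:
            cursor += entry["gap"]
        else:
            offsets[entry["name"]] = (
                cursor, entry["length"], entry.get("reverse", False))
            cursor += entry["length"]
    return offsets


def to_scaffold(offsets, contig, start, end):
    """Convert a 1-based inclusive contig interval to scaffold coordinates."""
    offset, length, reverse = offsets[contig]
    if reverse:
        # A reversed contig mirrors the interval around the contig length.
        start, end = length - end + 1, length - start + 1
    return offset + start, offset + end


def to_contig(offsets, position):
    """Convert a 1-based scaffold position to (contig, position), or None in a gap."""
    for name, (offset, length, reverse) in offsets.items():
        if offset < position <= offset + length:
            local = position - offset
            return name, (length - local + 1 if reverse else local)
    return None


if __name__ == "__main__":
    layout = [
        {"name": "c1", "length": 100},
        {"gap": 10},
        {"name": "c2", "length": 50, "reverse": True},
    ]
    offsets = contig_offsets(layout)
    print(to_scaffold(offsets, "c2", 1, 10))
    print(to_contig(offsets, 105))
```

A feature on a reversed contig also changes strand, so a full gff3 conversion would need to flip the strand column as well as the coordinates.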
 

Started working on this http://bit.ly/fNQiaf . Any suggestions are welcome http://bit.ly/i7LlmA.

log in to reply • written 13 months ago by Michael Barton  163314
 
 
4
 
 

It would be nice to include provision for describing circular genomes in your YAML file.

 
 
 

Thanks Casey. That's a good suggestion. So far I have been writing circular genomes in Scaffolder by splitting the first contig so that the origin of replication appears first in the file.

I could add an 'origin:coordinate' attribute which would define where the genome should start in the fasta file. Would that match your suggestion?

log in to reply • written 14 months ago by Michael Barton  163314
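The effect of the proposed 'origin:coordinate' attribute can be shown with a small Python sketch. It assumes a 1-based coordinate and a hypothetical function name, and it only rotates the finished sequence; it says nothing about how Scaffolder would actually implement the attribute.

```python
def rotate_to_origin(sequence, origin):
    """Rotate a circular sequence so the 1-based `origin` coordinate comes first.

    Hypothetical helper: the bases before the origin are moved to the end,
    which is what splitting the first contig by hand achieves.
    """
    if not 1 <= origin <= len(sequence):
        raise ValueError("origin must lie within the sequence")
    index = origin - 1
    return sequence[index:] + sequence[:index]


if __name__ == "__main__":
    print(rotate_to_origin("AACCGGTT", 3))
```

Rotation does not change the sequence content of a circular molecule, so the same scaffold file could produce output starting at any chosen origin.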
 

This would be good to add, but I was thinking more of a global attribute that somehow describes that the scaffold is circular and that the last contig connects to the first.

log in to reply • written 14 months ago by Casey Bergman  123921131
 
 
4
 
 

In the manuscript, I cannot find a discussion of other, similar software. What do the sequencing centers use to get their scaffolds? In which way is scaffolder different from the existing software?

 
 
 

I'd definitely say this is an important aspect of scene setting for the paper (Full disclosure: I'm editor in chief of ORC)

log in to reply • written 14 months ago by Cameron Neylon  52
 

Thanks Max. AFAIK the only other option for generating a scaffold from manually written configuration files is BAMBUS - http://bit.ly/gplWvH. If you know of any other software that does this, suggestions are very welcome.

log in to reply • written 13 months ago by Michael Barton  163314
 

In short, scaffolder aims to take the manual process of producing a larger sequence from individual contigs and make it versionable and reproducible. Write the scaffold file, run scaffolder and you will always get the same output sequence.

log in to reply • written 13 months ago by Michael Barton  163314
 

Googled for "scaffolding software bioinformatics" and found this one: SSPACE http://www.ncbi.nlm.nih.gov/pubmed/21149342, the paper also states that SOAP and Abyss have their own "built-in" scaffolders.

In which way is your Scaffolder different from Bambus?

Are these different from Sopra (PMID20576136) ?

I remember that the stone-age tool consed must have had a textfile format to do the scaffolding. Is this different from SSPACE or Scaffolder?

Is there any good reason to deviate from the well-established AGP format?

log in to reply • written 13 months ago by Maximilianh  6426
 

I can't read the SSPACE article as it's behind a paywall. From the abstract it appears that SSPACE algorithmically joins separate contigs using paired read data. Similarly SOPRA also uses paired read data to join unassembled contigs into a larger sequence.

log in to reply • written 13 months ago by Michael Barton  163314
 

Scaffolder provides no algorithms for scaffolding as there are many tools for this already available. I've tried to review most of these in the introduction of the manuscript. The aim of scaffolder is instead to allow manual editing of genome scaffolds using the readable YAML syntax. The scaffold fasta sequence can then be reliably reproduced from this scaffold syntax file.

log in to reply • written 13 months ago by Michael Barton  163314
 

Compared with Bambus, Scaffolder focuses solely on allowing the manual editing and joining of contigs to produce a genome scaffold. I believe Scaffolder may also be easier to install since it only requires one command line call to the rubygems package management system.

log in to reply • written 13 months ago by Michael Barton  163314
 

Consed requires signing an academic user agreement and providing your IP address so that you can download the software. For commercial use the software has to be paid for. In comparison Scaffolder is open-source and MIT Licensed.

log in to reply • written 13 months ago by Michael Barton  163314
 

Comparison with the AGP format requires a longer description. Essentially, though, AGP describes how scaffolds are composed of their constituent contigs but, as far as I know, there are no tools that can take AGP as input and produce the described scaffold as output. Therefore you can't edit and build scaffolds using the AGP format as a base. Building a script to convert between Scaffolder and AGP files would however allow this. I would also argue that YAML-based formats are easier to read and edit than tab-delimited formats.

log in to reply • written 13 months ago by Michael Barton  163314
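For illustration, turning AGP into sequence could look roughly like the Python sketch below. It is a sketch under assumptions rather than an existing tool: the function names are hypothetical, the column positions follow the AGP specification linked earlier in this thread, lines are assumed to be in part order, and contigs are assumed to be already loaded into a dictionary keyed by component identifier.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (other symbols unchanged)."""
    return seq.translate(COMPLEMENT)[::-1]


def build_from_agp(agp_lines, contigs):
    """Build {object_name: sequence} from AGP lines and a contig dictionary.

    Hypothetical helper. Gap lines (component type N or U) become runs of N;
    all other lines copy the given 1-based component span, reverse
    complemented when the orientation column is "-".
    """
    scaffolds = {}
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        obj, component_type = fields[0], fields[4]
        if component_type in ("N", "U"):
            piece = "N" * int(fields[5])
        else:
            start, end = int(fields[6]), int(fields[7])
            piece = contigs[fields[5]][start - 1:end]
            if fields[8] == "-":
                piece = reverse_complement(piece)
        scaffolds[obj] = scaffolds.get(obj, "") + piece
    return scaffolds


if __name__ == "__main__":
    agp = [
        "scaffold1\t1\t4\t1\tW\tc1\t1\t4\t+",
        "scaffold1\t5\t7\t2\tN\t3\tscaffold\tyes\tpaired-ends",
        "scaffold1\t8\t10\t3\tW\tc2\t1\t3\t-",
    ]
    print(build_from_agp(agp, {"c1": "ACGT", "c2": "AAC"}))
```

The same parsed fields (component identifier, span, orientation, gap length) are what a converter from AGP to a scaffold description file would need to carry over.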
 
 
3
 
 

A common problem I have is that I will make an assembly - join contigs together and then leave it in draft form.

Later on, there may be an update to the assembler I use (usually Newbler) - or I may get some new data - and I will re-run the assembly. Often the assembly is materially the same - perhaps a bit improved - but the contig names will have changed.

It would be great if Scaffolder had a way to port my joins from the original assembly identifiers to the new assembly, handling unambiguous joins automatically but flagging any potential discrepancies.

 
 
 

It should be possible to find identical contigs just by hashing each contig's sequence. Very similar contigs might be identified using an alignment algorithm. Based on this it should be possible to contrast the sequences between builds and highlight differences.

log in to reply • written 14 months ago by Michael Barton  163314
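The hashing idea can be sketched as follows in Python. This is a hypothetical illustration: the function names are made up, SHA-1 is an arbitrary choice of digest, and only exact (case-insensitive) matches are found, so reverse-complemented or slightly changed contigs would still need the alignment step mentioned above.

```python
import hashlib


def sequence_digest(sequence):
    """Return a digest of the upper-cased sequence (SHA-1 chosen arbitrarily)."""
    return hashlib.sha1(sequence.upper().encode("ascii")).hexdigest()


def match_contigs(old, new):
    """Pair old contig names with new ones that have an identical sequence.

    `old` and `new` map contig name to sequence. Returns (renames, unmatched):
    `renames` maps old names to new names for unambiguous matches, and
    `unmatched` lists old names with zero or several candidates so that
    they can be flagged for manual review.
    """
    new_by_digest = {}
    for name, seq in new.items():
        new_by_digest.setdefault(sequence_digest(seq), []).append(name)
    renames, unmatched = {}, []
    for name, seq in old.items():
        candidates = new_by_digest.get(sequence_digest(seq), [])
        if len(candidates) == 1:
            renames[name] = candidates[0]
        else:
            unmatched.append(name)
    return renames, unmatched


if __name__ == "__main__":
    old = {"contig00001": "ACGT", "contig00002": "GGGG"}
    new = {"contig00007": "acgt", "contig00008": "GGGA"}
    print(match_contigs(old, new))
```

The `renames` mapping could be applied to the identifiers in an existing scaffold file, while everything in `unmatched` is left for inspection.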
 