Tool: Keats: A Scala (And Scalable) Port Of The Picard Genomics Library Of The Broad Institute
8.2 years ago
William ★ 5.0k

I started a port of the Picard Genomics library of the Broad Institute (Harvard/MIT) to Scala.

• Keats is a Scala port of the Picard library of the Broad Institute of Harvard and MIT. Picard is one of the most advanced and widely used libraries for representation and processing of Genomic data from BAM, VCF and BCF files. http://picard.sourceforge.net/

• The goal of the port is to use the advanced language features of Scala to make the library easier to understand and use, and more scalable for Genomics research. A personal side goal is to increase my knowledge of Scala and of advanced data structures for representing Genomic variation.

• See Coursera Functional Programming Principles in Scala for the advanced Scala features. https://www.coursera.org/course/progfun

• And of course Akka http://akka.io/

• I have already ported the core genomic data representation classes: VariantContext, Allele(Context) and Genotype(Context). Porting can be done class by class and the existing unit tests are reused. This is possible because Scala and Java both compile to JVM bytecode and can call each other's classes. I am now working on making the VCF/BCF codec, reader and writer thread safe.

• Keats is a work in progress and not yet ready for use.

• The code is released under the MIT license.

• Feel free to fork (parts of) the project or help with porting the code to Scala.

• At the moment this is more of a research project, aimed at people who are interested in both Scala development and the functionality of Picard. For this target group it is easiest to clone the GitHub project (https://github.com/WimS83/Keats.git) into their IDE. I highly recommend IntelliJ as the IDE, but NetBeans also works.

• You also need to install the Scala binaries ( http://www.scala-lang.org/download/ ) and install the Scala plugin in your IDE of choice.

• All the unit tests and test data of Picard are included in the GitHub project. From inside the IDE you can run the unit tests and, based on the examples of the functionality in those tests, create your own small programs.

• As the project matures a bit more I will look at packaging and distribution of the compiled binary version of the software.

• At that time the software can hopefully be used as a drop-in replacement for (a subset of) the functionality of Picard.

• To sum up the last few points: Keats is not yet end-user-ready software, but is aimed at developers interested in Scala and the functionality of Picard.
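To illustrate why data classes such as VariantContext, Allele and Genotype can become shorter and easier to share between threads in Scala, here is a minimal sketch using immutable case classes. The fields and helper methods below are simplified assumptions made for this example; they are not the actual Keats or Picard class definitions.

```scala
// Hypothetical, heavily simplified model; NOT the real Keats or Picard classes.
final case class Allele(bases: String, isReference: Boolean) {
  require(bases.nonEmpty, "an allele needs at least one base")
}

final case class Genotype(sampleName: String, alleles: Vector[Allele]) {
  // Homozygous: all called alleles are the same; heterozygous: at least two differ.
  def isHomozygous: Boolean = alleles.nonEmpty && alleles.distinct.size == 1
  def isHeterozygous: Boolean = alleles.distinct.size > 1
}

final case class VariantContext(
    contig: String,
    start: Int,
    reference: Allele,
    alternates: Vector[Allele],
    genotypes: Vector[Genotype]
) {
  // End position (1-based, inclusive), derived from the reference allele length.
  def end: Int = start + reference.bases.length - 1

  // A SNP here means: single-base reference and only single-base alternates.
  def isSnp: Boolean =
    reference.bases.length == 1 &&
      alternates.nonEmpty &&
      alternates.forall(_.bases.length == 1)
}

object VariantDemo {
  val ref: Allele = Allele("A", isReference = true)
  val alt: Allele = Allele("G", isReference = false)

  val variant: VariantContext = VariantContext(
    contig = "chr1",
    start = 12345,
    reference = ref,
    alternates = Vector(alt),
    genotypes = Vector(
      Genotype("sample1", Vector(ref, alt)),
      Genotype("sample2", Vector(alt, alt))
    )
  )

  // copy builds a new value and leaves the original untouched,
  // so instances can be shared between threads without locking.
  val moved: VariantContext = variant.copy(start = 20000)
}
```

Case classes provide equals, hashCode, toString and copy for free, and because every field is immutable there is no state that concurrent readers could corrupt.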


I would suggest adding a few usage examples so that people get a sense of how it is all supposed to work.


This is not in any way meant to discourage your effort. I think creating new projects is great, but I don't really understand your strategy and goals. I do think Scala is a good tool for bioinformatics.

1. strategy:

a. The main benefit of Scala is that you can use Java libraries (mostly) as they are. What you are doing is a huge effort to rewrite something that you can already use from Scala (I do this a lot, and I think GATK does it too). What do you gain from this? The only gain I see is thread safety.

b. You seem to use a non-standard build for Keats (at least I could not find sbt files in the usual places, and your Scala files are in src/java/org, not in src/scala/...).

c. You would constantly be playing catch-up with Picard (and I think rewriting things is quite boring).

2. goals:

a. I see one improvement which is thread safety for some classes, but do you really plan to use them with different threads?

b. What I would be interested in: small type-safe wrappers around Picard (instead of the current methods that return Object or null, or throw exceptions), such as:

 getStringAttribute(name: tag): Option[String]


or

getStringAttribute(name: tag): \/[Error,String]

c. A way to process SAM/BAM or other bio-files with e.g. scalaz-stream or machine or similar.


I'm too lazy for that, but if anybody has something like this, please make it known.
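The type-safe wrappers suggested in point 2b could look roughly like the sketch below. To keep it self-contained, `LegacyRecord` is a hypothetical stand-in for a Java-style class that returns an untyped object or null; it is not the real Picard API. The standard library `Either` is used in place of the scalaz `\/` disjunction, which serves the same purpose.

```scala
// Hypothetical stand-in for a Java-style record class; NOT the real Picard API.
// It returns null for a missing tag and an untyped object otherwise.
final class LegacyRecord(attributes: Map[String, AnyRef]) {
  def getAttribute(tag: String): AnyRef = attributes.getOrElse(tag, null)
}

object SafeRecord {
  // Option variant: a missing tag or a non-String value both become None.
  def stringAttribute(record: LegacyRecord, tag: String): Option[String] =
    Option(record.getAttribute(tag)).collect { case s: String => s }

  // Either variant: keeps the reason for the failure on the Left side.
  def stringAttributeOrError(record: LegacyRecord, tag: String): Either[String, String] =
    record.getAttribute(tag) match {
      case null      => Left(s"tag '$tag' is not present")
      case s: String => Right(s)
      case other     => Left(s"tag '$tag' holds a ${other.getClass.getSimpleName}, not a String")
    }
}

object SafeRecordDemo {
  val record = new LegacyRecord(
    Map[String, AnyRef]("RG" -> "group1", "NM" -> Integer.valueOf(2))
  )

  val present: Option[String] = SafeRecord.stringAttribute(record, "RG")
  val missing: Option[String] = SafeRecord.stringAttribute(record, "XX")
  val wrongType: Either[String, String] = SafeRecord.stringAttributeOrError(record, "NM")
}
```

With wrappers like these the caller has to handle the missing or mistyped case explicitly, instead of discovering it through a NullPointerException or ClassCastException at runtime.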


I am not planning to port the whole of Picard. I ported the VariantContext classes because I wanted to learn Scala and I wanted to learn how genomic variation is represented within Picard and GATK. The ported VariantContext classes are now more concise. I want to port the codec and reader and writers to see how they work and to see if I can improve the performance by using parallel collections and actors. These two parts can then also be used in new applications. I don't think Keats will replace Picard, but maybe some of the ported code will find its way into, or inspire parts of, Picard or other software.
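As a rough idea of how decoding could be spread over several threads, the sketch below parses VCF-style lines in chunks. It uses plain `Future`s from the Scala standard library rather than parallel collections or Akka actors, purely to stay dependency-free. `SiteRecord`, `decodeLine` and `decodeAll` are hypothetical names for this example and only the first five VCF columns (CHROM, POS, ID, REF, ALT) are read; this is not how the Keats or Picard codec is implemented.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

// Hypothetical minimal record holding only the first five VCF columns.
final case class SiteRecord(
    contig: String,
    position: Int,
    id: String,
    ref: String,
    alts: Vector[String]
)

object ParallelVcfSketch {
  // Pure function with no shared mutable state, so it is safe to call from many threads.
  def decodeLine(line: String): Either[String, SiteRecord] =
    line.split('\t') match {
      case Array(contig, pos, id, ref, alt, _*) =>
        Try(pos.toInt).toOption match {
          case Some(p) => Right(SiteRecord(contig, p, id, ref, alt.split(',').toVector))
          case None    => Left(s"invalid position '$pos'")
        }
      case _ =>
        Left(s"expected at least 5 tab-separated columns: $line")
    }

  // Skips header lines, decodes chunks concurrently, and returns results in input order.
  def decodeAll(lines: Vector[String], chunkSize: Int = 1000): Vector[Either[String, SiteRecord]] = {
    val chunks  = lines.filterNot(_.startsWith("#")).grouped(chunkSize).toVector
    val decoded = Future.traverse(chunks)(chunk => Future(chunk.map(decodeLine)))
    Await.result(decoded, 30.seconds).flatten
  }
}
```

Because `decodeLine` touches no shared state, the only coordination needed is collecting the chunk results, which `Future.traverse` does while preserving the original order. Whether this actually beats a single-threaded reader depends on how much of the time is spent on I/O and decompression rather than parsing.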