Question

Tool: Reading/Writing FASTA files in Java

0

10.6 years ago

alex • 0

FASTA is not the most complicated file format ;) but this library might nevertheless save you five or ten minutes. It supports sequence validation, header dialects, and stream-based reading and writing. http://sourceforge.net/projects/jfasta/

FASTA Java • 5.3k views

ADD COMMENT • link updated 2.0 years ago by Ram 45k • written 10.6 years ago by alex • 0

1

I think parsing FASTA files correctly is a task of under-appreciated complexity. There are a great many pitfalls along the way. For example, what if the entire chromosome 1 of the human genome were on a single FASTA line? Most tools will fail in various spectacular fashions.

That being said, your package lacks the most important aspect of creating something that is useful for others: documentation. Your package seems to lack it altogether. Many people, including myself, would never make use of undocumented software, as I have learned to correlate documentation quality with software quality.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.6 years ago by Istvan Albert 102k

0

Hello Istvan,

thanks for your reply. Concerning documentation I would like to draw your attention to the JavaDoc http://jfasta.sourceforge.net/apidocs/index.html and the examples http://jfasta.sourceforge.net/example1.html.

Reading a whole chromosome from a single line will indeed be problematic, and not only for this library. On the other hand, writing a whole chromosome to a single line would be problematic in the first place for most machines out there. In any case, I would not hesitate to implement a fix for this rare use case as soon as I run into the problem myself, or as soon as a user reports it and can provide suitable input files for testing and reproduction.

There is also issue tracking available http://jfasta.sourceforge.net/issue-tracking.html

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 10.6 years ago by alex • 0

2

Writing out very long lines is actually trivially easy: strip off the end-of-line terminator while continuing to write one line at a time. It is true that files like that are rare (because they do crash tools), but they are valid FASTA files nonetheless. PyFasta does that, for example: https://pypi.python.org/pypi/pyfasta/#flattening
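To illustrate the point above, here is a minimal Java sketch of that flattening idea. It is not taken from jfasta or PyFasta; the class name `FastaFlatten` and the method `flatten` are made up for this example. It assumes the input is a conventionally wrapped FASTA file, so reading it line by line is safe.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper for illustration only; not part of any library in this thread.
public class FastaFlatten {

    /** Copies FASTA records from in to out, joining each record's sequence lines into one line. */
    public static void flatten(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean openSequenceLine = false; // true while the current output line has no terminator yet
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith(">")) {
                // Close the previous record's single sequence line before the next header.
                if (openSequenceLine) {
                    out.write('\n');
                    openSequenceLine = false;
                }
                out.write(line);
                out.write('\n');
            } else if (!line.isEmpty()) {
                // Sequence line: write it without the end-of-line terminator.
                out.write(line);
                openSequenceLine = true;
            }
        }
        if (openSequenceLine) {
            out.write('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        flatten(new StringReader(">seq1\nACGT\nACGT\n>seq2\nTT\n"), out);
        System.out.print(out);
    }
}
```

Only one input line is held in memory at a time, so the writer never needs the whole sequence, even though the output line can grow arbitrarily long.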

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.6 years ago by Istvan Albert 102k

0

Hmm. Reading and writing arbitrarily long lines is not difficult unless you are reading an entire line at a time, which most tools do. If you read and work with the file in chunks and check each chunk for a line terminator, then you can work in under 64 kB of RAM.
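A rough Java sketch of the chunked approach described above: it reads fixed-size chunks and scans each one for line terminators, so memory use does not depend on line length. The class `ChunkedFastaLengths`, the method `sequenceLengths`, and the choice to report sequence lengths are all assumptions for this example, not anyone's actual implementation. It also assumes header lines are short enough to buffer and counts every non-terminator character on a sequence line as a residue.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example class; names are illustrative only.
public class ChunkedFastaLengths {

    /** Returns record name -> sequence length, reading the input in chunks of bufferSize chars. */
    public static Map<String, Long> sequenceLengths(Reader in, int bufferSize) throws IOException {
        Map<String, Long> lengths = new LinkedHashMap<>();
        char[] buffer = new char[bufferSize];      // the only sequence data ever held in memory
        StringBuilder header = new StringBuilder(); // headers are assumed to be short
        String current = null;                      // name of the record being counted
        boolean atLineStart = true;
        boolean inHeader = false;
        long count = 0;
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = buffer[i];
                if (c == '\n' || c == '\r') {
                    if (inHeader) {
                        current = header.toString();
                        inHeader = false;
                    }
                    atLineStart = true;
                } else if (atLineStart && c == '>') {
                    // A new record starts: store the finished one first.
                    if (current != null) {
                        lengths.put(current, count);
                    }
                    header.setLength(0);
                    count = 0;
                    inHeader = true;
                    atLineStart = false;
                } else {
                    atLineStart = false;
                    if (inHeader) {
                        header.append(c);
                    } else if (current != null) {
                        count++;
                    }
                }
            }
        }
        if (inHeader) {
            current = header.toString(); // file ended inside a header line
        }
        if (current != null) {
            lengths.put(current, count);
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        Reader in = new StringReader(">chr1 test\nACGT\nAC\n>chr2\nTTT");
        System.out.println(sequenceLengths(in, 4));
    }
}
```

Because the state (`atLineStart`, `inHeader`, `count`) is carried across chunks, a line that spans many chunks is handled the same way as a short one.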

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.6 years ago by Matt Shirley 10k

0

What your documentation is missing are examples of how the library is used in practice. It also does not explain what the terms "sequence validation" or "header dialect parsing" actually mean. A library is useful when it saves one the time of digging through other people's code.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.6 years ago by Istvan Albert 102k

Ram · Answer 1 · 2014-12-15

3

10.6 years ago

robert.davey ▴ 280

The Picard library, specifically the samtools portion of HTSJDK, also provides very well-documented FASTA/Q parsing.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.6 years ago by robert.davey ▴ 280