Tool: Reading/Writing FASTA files in Java
4.3 years ago by
alex0
Germany
alex0 wrote:

FASTA is not the most complicated file format ;) but this library might nevertheless save you five or ten minutes.
It supports sequence validation, header dialects, and stream-based reading and writing.
http://sourceforge.net/projects/jfasta/

parsing library tool java fasta • 3.0k views
modified 4.3 years ago by robert.davey260 • written 4.3 years ago by alex0

I think parsing FASTA files correctly is a task of under-appreciated complexity. There are a great many pitfalls along the way. For example, what if the entire chromosome 1 of the human genome were on a single FASTA line? Most tools will fail in various spectacular ways.

That being said, your package lacks the most important aspect of creating something that is useful for others: documentation. Your package seems to lack it altogether. Many people, myself included, would never use undocumented software, as I have learned to correlate documentation quality with software quality.

 

written 4.3 years ago by Istvan Albert ♦♦ 79k

Hello Istvan,

thanks for your reply. Concerning documentation I would like to draw your attention to the JavaDoc http://jfasta.sourceforge.net/apidocs/index.html and the examples http://jfasta.sourceforge.net/example1.html.

Reading a whole chromosome from a single line will indeed be problematic, and not only for this library. On the other hand, writing a whole chromosome to a single line will be problematic in the first place on most machines out there. Anyway, I would not hesitate to implement a fix for this rare use case as soon as I run into the problem myself, or as soon as a user reports it to me and can provide suitable input files for testing and reproducing it.

There is also issue tracking available http://jfasta.sourceforge.net/issue-tracking.html

 

modified 4.3 years ago • written 4.3 years ago by alex0

Writing out very long lines is actually trivially easy: strip off the end-of-line terminator while writing one line at a time. It is true that such files are rare (because they do crash tools), but they are valid FASTA files nonetheless. PyFasta does that, for example: https://pypi.python.org/pypi/pyfasta/#flattening
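
For illustration, here is a minimal Java sketch of the flattening idea described above. It is not taken from PyFasta or jfasta; the class name `FastaFlattener` and the method `flatten` are made up for this example. It assumes the input is ordinary wrapped FASTA (short lines), so reading it with `readLine()` is safe; only the output ends up with one long line per record.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper, named for this example only.
class FastaFlattener {

    /** Copies FASTA records from in to out, joining each record's sequence lines into one line. */
    static void flatten(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        boolean openSequenceLine = false; // true while sequence text has been written without a terminator
        String line;
        while ((line = reader.readLine()) != null) { // readLine() already strips the line terminator
            if (line.startsWith(">")) {
                if (openSequenceLine) {
                    out.write('\n'); // close the previous record's single sequence line
                    openSequenceLine = false;
                }
                out.write(line);
                out.write('\n');
            } else if (!line.isEmpty()) {
                out.write(line); // no terminator: the next sequence line continues on the same output line
                openSequenceLine = true;
            }
        }
        if (openSequenceLine) {
            out.write('\n');
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        flatten(new StringReader(">seq1\nACGT\nTTGA\n>seq2\nGG\nCC\n"), out);
        System.out.print(out);
    }
}
```

Because each input line is written out as soon as it is read, memory use stays bounded by the longest input line, no matter how long the flattened output line becomes.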

written 4.3 years ago by Istvan Albert ♦♦ 79k

Hmm. Reading and writing arbitrarily long lines is not difficult unless you read an entire line at a time, which most tools do. If you read and process the file in chunks and check each chunk for a line terminator, you can work in under 64 kB of RAM.
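
As a rough Java sketch of the chunked approach described above (the class `ChunkedFastaScanner` and the method `sequenceLengths` are hypothetical names, not part of any library mentioned in this thread): the reader fills a fixed-size buffer and checks every character for a line terminator, so a sequence line is never held in memory as a whole. The sketch assumes header lines are short and header names are unique.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, named for this example only.
class ChunkedFastaScanner {

    /** Returns the sequence length of each record, keyed by header text, in input order. */
    static Map<String, Long> sequenceLengths(Reader in, int chunkSize) throws IOException {
        Map<String, Long> lengths = new LinkedHashMap<>();
        char[] chunk = new char[chunkSize];         // the only buffer; its size is independent of line length
        StringBuilder header = new StringBuilder(); // assumption: header lines are short
        String current = null;                      // header of the record currently being counted
        long count = 0;                             // residues seen so far in the current record
        boolean atLineStart = true;
        boolean inHeader = false;
        int n;
        while ((n = in.read(chunk, 0, chunk.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = chunk[i];
                if (c == '\n' || c == '\r') {
                    if (inHeader) {                 // header line complete
                        current = header.toString();
                        header.setLength(0);
                        inHeader = false;
                    }
                    atLineStart = true;
                } else if (atLineStart && c == '>') {
                    if (current != null) {
                        lengths.put(current, count); // store the finished record
                    }
                    current = null;
                    count = 0;
                    inHeader = true;
                    atLineStart = false;
                } else {
                    atLineStart = false;
                    if (inHeader) {
                        header.append(c);
                    } else if (current != null) {
                        count++;                    // sequence character: count it, do not store it
                    }
                }
            }
        }
        if (inHeader) {
            current = header.toString();            // input ended inside a header line
        }
        if (current != null) {
            lengths.put(current, count);
        }
        return lengths;
    }

    public static void main(String[] args) throws IOException {
        String fasta = ">chr1\nACGT\nAC\n>chr2\nGGG";
        System.out.println(sequenceLengths(new StringReader(fasta), 4)); // prints {chr1=6, chr2=3}
    }
}
```

The same loop works whether a record's sequence is wrapped at 60 characters or sits on a single line of hundreds of megabytes, because the state carried between chunks is only a counter and two flags.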

written 4.3 years ago by Matt Shirley8.9k

What your documentation is missing are examples of how the library is used in practice, and an explanation of what terms such as "sequence validation" or "header dialect parsing" actually mean. A library is useful when it saves one the time of digging through other people's code.

written 4.3 years ago by Istvan Albert ♦♦ 79k
4.3 years ago by
robert.davey260
European Union
robert.davey260 wrote:

The Picard library, specifically the samtools portion of the HTSJDK, also provides very well-documented FASTA/Q parsing.

 

written 4.3 years ago by robert.davey260