Question: Is There A Precise Specification For Fasta Files?
7.6 years ago by
Johnny Brown130
United States
Johnny Brown130 wrote:

I'm wondering if anyone who has written a parser, or had to check the validity of these files, has worked out something more specific than the Wikipedia entry.

fasta parsing • 4.0k views
ADD COMMENTlink written 7.6 years ago by Johnny Brown130

Tim Yates' spec:

ADD REPLYlink written 3.6 years ago by Pierre Lindenbaum118k
7.6 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

So far as I know, there is no single authoritative source for FASTA format specification. I normally use the guidelines in section 1 of this BLAST help document from the NCBI.

FASTA is not an especially complicated format:

  • The first line begins with ">"
  • After ">", with no spaces, comes the sequence ID (also containing no spaces)
  • Anything after the ID + whitespace is the sequence Description
  • The sequence itself begins on the next line; must be in a valid alphabet and lines should not exceed 80 characters (but most parsers will read sequence on a single line)

That's more or less it.
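
The rules above are simple enough to sketch in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `parse_fasta` is made up for this example, it assumes the input is an iterable of text lines, it does not check the alphabet, and it is lenient about a space directly after ">".

```python
def parse_fasta(lines):
    """Yield (seq_id, description, sequence) tuples from an iterable of lines."""
    seq_id, description, chunks = None, "", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # The ID runs up to the first whitespace; the rest is the description.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif line.strip():
            if seq_id is None:
                raise ValueError("sequence data before the first '>' header")
            # Sequence may be wrapped over many lines or sit on a single line.
            chunks.append(line.strip())
    # Emit the final record once the input is exhausted.
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)
```

Because it accepts any iterable of lines, it can be fed an open file handle directly, for example `for seq_id, desc, seq in parse_fasta(open("myfastafile.fna")): ...`.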

ADD COMMENTlink written 7.6 years ago by Neilfws48k

Thanks, that's clear and concise.

You could definitely argue that this is more discussion than necessary for something so simple, but I had to write a validator/parser and was frustrated by the ambiguity in the Wikipedia entry.

ADD REPLYlink written 7.6 years ago by Johnny Brown130
7.6 years ago by
Copenhagen, Denmark
Lars Juhl Jensen11k wrote:

I would argue that the FASTA format is originally defined as the input format of the program FASTA. So anything that can be parsed by FASTA is valid FASTA format, whereas anything that cannot be parsed by FASTA is not. In other words, the original parser of the format should be viewed as the reference implementation.

ADD COMMENTlink written 7.6 years ago by Lars Juhl Jensen11k
7.6 years ago by
Johnny Brown130
United States
Johnny Brown130 wrote:

This is the tentative grammar I've worked out. It's community wiki so anyone who knows better than me can fix it.

<file>     ::= <token> | <token> <file>
<token>    ::= <ignore> | <seq>
<ignore>   ::= <whitespace> | <comment> <newline>
<seq>      ::= <header> <molecule> <newline>
<header>   ::= ">" <arbitrary text> <newline>
<molecule> ::= <mol-line> | <mol-line> <molecule>
<mol-line> ::= <nucl-line> | <prot-line>
<nucl-line>::= "^[ACGTURYKMSWBDHVNX-]+$"
<prot-line>::= "^[ABCDEFGHIKLMNOPQRSTUVWYZX*-]+$"

The sequence alphabet and associated punctuation are the one letter codes defined by IUPAC and IUBMB. For nucleotide sequences see:

For amino-acid sequences see:

In addition, 'J' is used for the mass-spec ambiguity between 'I' and 'L', and '*' for a translation stop in translations from nucleotide sequences.

Note: the use of lowercase is recommended for nucleotide sequences and uppercase for amino-acid sequences. However mixed-case is used for a number of purposes, including as a result of filtering for low complexity regions or sequence repeats, to indicate variations (insertions/deletions) or as an indicator of lower sequencing quality.
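
As a rough illustration of the grammar and notes above, individual sequence lines can be checked with two regular expressions. This sketch makes two assumptions that go beyond the grammar as written: matching is case-insensitive (since mixed case is in common use), and 'J' is added to the protein alphabet as described. The names `NUCL_LINE`, `PROT_LINE` and `classify_line` are invented for the example.

```python
import re

# Alphabets taken from the grammar; IGNORECASE allows lowercase and mixed case.
NUCL_LINE = re.compile(r"[ACGTURYKMSWBDHVNX-]+", re.IGNORECASE)
# Same as <prot-line>, plus 'J' for the I/L ambiguity.
PROT_LINE = re.compile(r"[ABCDEFGHIJKLMNOPQRSTUVWYZX*-]+", re.IGNORECASE)

def classify_line(line):
    """Return 'nucleotide', 'protein', or None for one sequence line."""
    text = line.rstrip("\r\n")
    # fullmatch requires the whole line to consist of alphabet characters.
    if NUCL_LINE.fullmatch(text):
        return "nucleotide"
    if PROT_LINE.fullmatch(text):
        return "protein"
    return None
```

Note that every character of the nucleotide alphabet is also a valid amino-acid code, so a line reported as "nucleotide" could equally be a short protein; the check can reject invalid lines but cannot decide the molecule type on its own.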

ADD COMMENTlink modified 7.0 years ago by Hamish3.1k • written 7.6 years ago by Johnny Brown130

Taking neilfws's suggestion, we might define the header like this: <header> ::= ">" <seqid> " " <arbitrary text> <newline>; <seqid> ::= "^[^[:space:]]+$". Also, naturally, the "arbitrary" text cannot include a newline.
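
A stricter header check along these lines might look like the following sketch. It assumes the description is optional (so a bare ">seqid" line is accepted) and allows tabs as well as spaces as the separator; both are choices made for this example rather than part of the suggested rule. `HEADER` and `split_header` are hypothetical names.

```python
import re

# ">" immediately followed by a whitespace-free ID, then optionally
# a run of spaces/tabs and free text (which cannot contain a newline).
HEADER = re.compile(r">(\S+)(?:[ \t]+(.*))?")

def split_header(line):
    """Return (seq_id, description), or None if the line is not a valid header."""
    match = HEADER.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group(1), match.group(2) or ""
```

With this pattern, a header that has whitespace directly after ">" is rejected rather than silently repaired, which is usually what a validator wants.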

ADD REPLYlink written 7.6 years ago by Ryan Thompson3.4k

What's a <comment>?

ADD REPLYlink written 7.0 years ago by Chris Maloney330
7.4 years ago by
Johnny Brown130
United States
Johnny Brown130 wrote:

This may also be useful: NCBI's C++ Toolkit includes a class, CFastaReader, if you are using C++.

ADD COMMENTlink written 7.4 years ago by Johnny Brown130

7.1 years ago by
Johnny Brown130
United States
Johnny Brown130 wrote:

Also, PyCogent has a parser for FASTA files in Python. After installing cogent, use it like so:

from cogent.parse.fasta import MinimalFastaParser 
fna = open("myfastafile.fna") 
parsed = MinimalFastaParser(fna)

parsed will now be an iterable of (head, body) tuples, one for each individual read.

ADD COMMENTlink written 7.1 years ago by Johnny Brown130
7.1 years ago by
Mawe90
Germany
Mawe90 wrote:

When working with Python you could use Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11. It is simple to use. Here they note that FASTA does not specify the sequence alphabet at all.

ADD COMMENTlink written 7.1 years ago by Mawe90
7.1 years ago by
Woa2.7k
United States
Woa2.7k wrote:

Somewhere I read, most probably in the bioinformatics book by David Mount, that the sequence string in a FASTA file can end with an optional '*' (asterisk).

ADD COMMENTlink written 7.1 years ago by Woa2.7k