So far as I know, there is no single authoritative source for the FASTA format specification. I normally use the guidelines in section 1 of this BLAST help document from the NCBI.
FASTA is not an especially complicated format:
- The first line begins with ">"
- After ">", with no spaces, comes the sequence ID (also containing no spaces)
- Anything after the ID + whitespace is the sequence Description
- The sequence itself begins on the next line. It must use a valid alphabet, and lines should not exceed 80 characters (although most parsers will also read a sequence written on a single long line)
That's more or less it.
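The rules above can be sketched as a small reader. This is only a minimal illustration using the Python standard library, not a reference parser: the names `FastaRecord`, `parse_fasta` and `_make_record` are made up for this example, and skipping blank lines is an assumption rather than something the guidelines require. It does not check the alphabet or the line length.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastaRecord(NamedTuple):
    seq_id: str        # text between ">" and the first whitespace
    description: str   # everything after the ID, possibly empty
    sequence: str      # all sequence lines joined together


def _make_record(header: str, chunks: list) -> FastaRecord:
    # Split the header once on whitespace: ID first, description after.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return FastaRecord(seq_id, description, "".join(chunks))


def parse_fasta(handle: TextIO) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per ">" header line found in handle."""
    header = None
    chunks = []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


# Two records: one wrapped over two lines with a description, one without.
example = io.StringIO(">seq1 example protein\nMKV\nLAA\n>seq2\nACGT\n")
records = list(parse_fasta(example))
```

Here `records[0]` has ID `seq1`, description `example protein` and sequence `MKVLAA`; the wrapped lines are joined because line breaks inside a sequence carry no meaning.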
I would argue that the FASTA format was originally defined as the input format of the program FASTA. So anything that can be parsed by FASTA is valid FASTA format, whereas anything that cannot be parsed by FASTA is not. In other words, the original parser of the format should be viewed as the reference implementation.
This is the tentative grammar I've worked out. It's community wiki so anyone who knows better than me can fix it.
    <file>      ::= <token> | <token> <file>
    <token>     ::= <ignore> | <seq>
    <ignore>    ::= <whitespace> | <comment> <newline>
    <seq>       ::= <header> <molecule> <newline>
    <header>    ::= ">" <arbitrary text> <newline>
    <molecule>  ::= <mol-line> | <mol-line> <molecule>
    <mol-line>  ::= <nucl-line> | <prot-line>
    <nucl-line> ::= "^[ACGTURYKMSWBDHVNX-]+$"
    <prot-line> ::= "^[ABCDEFGHIKLMNOPQRSTUVWYZX*-]+$"
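As a rough check of the two line-level rules, the regular expressions can be dropped straight into Python's `re` module. This is a sketch, and `classify_line` is a hypothetical helper name. Two things become visible when you try it: every letter in the nucleotide class is also in the protein class, so a line such as `ACGT` matches both rules (the sketch arbitrarily reports it as nucleotide), and both patterns are uppercase-only.

```python
import re

# Character classes copied verbatim from the tentative grammar.
NUCL_LINE = re.compile(r"^[ACGTURYKMSWBDHVNX-]+$")
PROT_LINE = re.compile(r"^[ABCDEFGHIKLMNOPQRSTUVWYZX*-]+$")


def classify_line(line: str) -> str:
    """Return 'nucleotide', 'protein' or 'invalid' for one sequence line."""
    # Nucleotide is tested first because its alphabet is a subset of the
    # protein alphabet; testing protein first would hide every nucleotide line.
    if NUCL_LINE.match(line):
        return "nucleotide"
    if PROT_LINE.match(line):
        return "protein"
    return "invalid"


results = {
    line: classify_line(line)
    for line in ["ACGTN", "MKVLAA", "MK*", "acgt", "MKJ"]
}
```

With these patterns, `acgt` and `MKJ` both come out as invalid: the first because of its case, the second because `J` is not in the protein class as written.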
The sequence alphabet and associated punctuation are the one letter codes defined by IUPAC and IUBMB. For nucleotide sequences see:
For amino-acid sequences see:
In addition, 'J' is used for the mass-spec ambiguity between 'I' and 'L', and '*' for a translation stop in translations from nucleotide sequences.
Note: the use of lowercase is recommended for nucleotide sequences and uppercase for amino-acid sequences. However, mixed case is used for a number of purposes, including as a result of filtering for low-complexity regions or sequence repeats, to indicate variations (insertions/deletions), or as an indicator of lower sequencing quality.
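Because case can carry information, a tool that simply upper-cases everything on input throws that information away. As a small illustration (the helper name `lowercase_regions` is made up for this example, and treating lowercase runs as "masked" is just one of the conventions mentioned above), the lowercase stretches can be located before the sequence is normalized:

```python
import re


def lowercase_regions(sequence: str) -> list:
    """Return (start, end) half-open spans of consecutive lowercase letters."""
    return [m.span() for m in re.finditer(r"[a-z]+", sequence)]


sequence = "ACGTacgtnnACGTaa"
spans = lowercase_regions(sequence)   # positions of the lowercase runs
normalized = sequence.upper()         # case-insensitive form for comparison
```

For this input the lowercase runs are at positions 4 to 10 and 14 to 16 (0-based, end-exclusive).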
Also, PyCogent has a parser for FASTA files in Python. After installing cogent, use it like so:
    from cogent.parse.fasta import MinimalFastaParser
    fna = open("myfastafile.fna")
    parsed = MinimalFastaParser(fna)
parsed will now be an iterable of (head, body) tuples, one for each individual read in the file.
When working with Python you could use Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11. It is simple to use. The tutorial notes that FASTA does not specify the sequence alphabet at all.