Question

Is There A Precise Specification For Fasta Files?

12.8 years ago

Johnny Brown ▴ 140

I'm wondering if anyone who has written a parser, or had to check these files for validity, has worked out something more specific than the Wikipedia entry.

fasta parsing • 6.4k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 12.8 years ago by Johnny Brown ▴ 140

Tim Yates' spec:

@pathogenomenick haha, isn't the standard:

> ANYTHING
ANYTHINGANYTHINGANYTHING
(repeat as much as you like)
— т̶̱̫̻̩͖͈̜͑̄ϊ̎̿ͧͪͬͤ̽͒҉̶̞̲̮͎̦м̻ͥ̔ͫ (@tim_yates) January 24, 2015

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.8 years ago by Pierre Lindenbaum 161k

score 14 · Answer 1 · 2011-08-18

12.8 years ago

Neilfws 49k

So far as I know, there is no single authoritative source for FASTA format specification. I normally use the guidelines in section 1 of this BLAST help document from the NCBI.

FASTA is not an especially complicated format:

The first line begins with ">"
After ">", with no spaces, comes the sequence ID (also containing no spaces)
Anything after the ID + whitespace is the sequence Description
The sequence itself begins on the next line; it must use a valid alphabet, and lines should not exceed 80 characters (though most parsers will also read a sequence written on a single line)

That's more or less it.
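The rules above can be sketched as a small parser. This is only an illustration, not a reference implementation: the function name parse_fasta and the sample records are made up for the example, and it is more lenient than the rules (it tolerates whitespace right after ">" and does not check the alphabet or the 80-character limit).

```python
from io import StringIO


def parse_fasta(handle):
    """Yield (seq_id, description, sequence) for each record in a FASTA stream."""
    seq_id, description, chunks = None, "", []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            # The ID runs up to the first whitespace; the rest is the description.
            parts = line[1:].split(None, 1)
            seq_id = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif line.strip():
            if seq_id is None:
                raise ValueError("sequence data before the first '>' header")
            # Sequence lines are concatenated, so wrapped and single-line
            # sequences give the same result.
            chunks.append(line.strip())
    # Emit the last record once the input is exhausted.
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


# Hypothetical input, just to exercise the function.
example = StringIO(">seq1 first test record\nACGT\nACGT\n>seq2\nMKV\n")
records = list(parse_fasta(example))
```

Joining the sequence lines means the parser does not care how the sequence was wrapped, which matches the remark that most parsers accept a sequence on a single line.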

ADD COMMENT • link 12.8 years ago by Neilfws 49k

Thanks, that's clear and concise.

You could definitely argue that this is more discussion than necessary for something so simple. My motivation was that I had to write a validator/parser and was frustrated by ambiguity in the Wikipedia entry.

ADD REPLY • link 12.7 years ago by Johnny Brown ▴ 140

score 9 · Answer 2 · 2011-08-18

I would argue that the FASTA format is originally defined as the input format of the program FASTA. So anything that can be parsed by FASTA is valid FASTA format, whereas anything that cannot be parsed by FASTA is not. In other words, the original parser of the format should be viewed as the reference implementation.

Hamish · Answer 3 · 2011-08-18

This is the tentative grammar I've worked out. It's community wiki so anyone who knows better than me can fix it.

<file>     ::= <token> | <token> <file>
<token>    ::= <ignore> | <seq>
<ignore>   ::= <whitespace> | <comment> <newline>
<seq>      ::= <header> <molecule> <newline>
<header>   ::= ">" <arbitrary text> <newline>
<molecule> ::= <mol-line> | <mol-line> <molecule>
<mol-line> ::= <nucl-line> | <prot-line>
<nucl-line>::= "^[ACGTURYKMSWBDHVNX-]+$"
<prot-line>::= "^[ABCDEFGHIKLMNOPQRSTUVWYZX*-]+$"

The sequence alphabet and associated punctuation are the one letter codes defined by IUPAC and IUBMB. For nucleotide sequences see:

For amino-acid sequences see:

http://www.chem.qmul.ac.uk/iupac/AminoAcid/

In addition, 'J' is used for the mass-spec ambiguity between 'I' and 'L', and '*' for a translation stop in translations from nucleotide sequences.

Note: the use of lowercase is recommended for nucleotide sequences and uppercase for amino-acid sequences. However, mixed case is used for a number of purposes: as a result of filtering for low-complexity regions or sequence repeats, to indicate variations (insertions/deletions), or as an indicator of lower sequencing quality.
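A rough sketch of how the two line patterns from the grammar could be checked in Python. The names NUCL_LINE, PROT_LINE and classify_line are made up for this example. Two assumptions go beyond the grammar as written: 'J' is added to the protein class because the text mentions it, and matching is case-insensitive because of the note on lowercase and mixed case.

```python
import re

# Character classes taken from the tentative grammar.
# Assumption: "J" is added to the protein class, as mentioned in the text.
# Assumption: case is ignored, since lowercase and mixed case occur in practice.
NUCL_LINE = re.compile(r"^[ACGTURYKMSWBDHVNX-]+$", re.IGNORECASE)
PROT_LINE = re.compile(r"^[ABCDEFGHIJKLMNOPQRSTUVWYZX*-]+$", re.IGNORECASE)


def classify_line(line):
    """Return "nucleotide", "protein", or None for a single sequence line."""
    line = line.strip()
    # Nucleotide is tested first: every nucleotide code is also a valid
    # protein code, so the order decides how ambiguous lines are labelled.
    if NUCL_LINE.match(line):
        return "nucleotide"
    if PROT_LINE.match(line):
        return "protein"
    return None
```

Because the nucleotide alphabet is a subset of the protein alphabet, a line such as ACGT cannot be classified with certainty from its characters alone; the sketch simply prefers the nucleotide reading.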

score 1 · Answer 4 · 2011-11-04

12.5 years ago

Johnny Brown ▴ 140

This may also be useful: NCBI's C++ Toolkit includes a class, CFastaReader, if you are using C++.

ADD COMMENT • link 12.5 years ago by Johnny Brown ▴ 140

score 0 · Answer 5 · 2012-01-30

12.3 years ago

Johnny Brown ▴ 140

Also, PyCogent has a parser for FASTA files in Python. After installing cogent, use it like so:

from cogent.parse.fasta import MinimalFastaParser

# Open the FASTA file and hand the file object to the parser.
fna = open("myfastafile.fna")
parsed = MinimalFastaParser(fna)

parsed will now be an iterable of (head, body) tuples, one for each of the individual reads.

ADD COMMENT • link 12.3 years ago by Johnny Brown ▴ 140

Ram · Answer 6 · 2012-01-30

12.3 years ago

Mawe ▴ 90

When working with Python you could use Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11. It is simple to use. Here they note that FASTA does not specify the sequence alphabet at all.

ADD COMMENT • link updated 4.7 years ago by Ram 43k • written 12.3 years ago by Mawe ▴ 90

score 0 · Answer 7 · 2012-01-30

12.3 years ago

Woa ★ 2.9k

Somewhere I read, most probably in the bioinformatics book by David Mount, that the sequence string in a FASTA file can end with an optional '*' (asterisk).
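If a parser wants to tolerate that optional terminator, one possible approach is to drop a single trailing '*' after the sequence lines have been joined. The helper name strip_terminator is made up for this sketch, and whether an internal '*' should be kept is a design choice that is not settled here.

```python
def strip_terminator(sequence):
    """Return the sequence without a single trailing '*', if present."""
    # Only the final character is examined; an internal '*' (for example a
    # stop codon inside a translation) is left untouched.
    return sequence[:-1] if sequence.endswith("*") else sequence
```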

ADD COMMENT • link 12.3 years ago by Woa ★ 2.9k