How to distinguish nucleic acid and amino acid representation in FASTA files?
2
1
13 months ago

The FASTA format is used to represent both nucleic acid and amino acid sequences. Since the one-letter code for nucleic acids largely overlaps with that for amino acids, considerable ambiguity exists when reading FASTA files. As far as I could find out, the header line carries no standardised information about the type of sequence.

Is there a recommended way for automatically distinguishing nucleic acid and amino acid sequences?

FASTA data formats file formats • 576 views
0

An interesting anecdote about the FASTA format by Bill Pearson (its inventor) is here: C: Was FASTA ever popular?

5
13 months ago
Mensur Dlakic ★ 15k

Leucine is the most abundant amino acid in natural proteins at ~9.5%. If you go over a couple of hundred residues and there are no Ls, chances are solid that it is not a protein. A (alanine) and C (cysteine) are present at ~7.9% and ~1.5% respectively in natural proteins, while at least one of them will typically be >20% in DNA. The sum of the A+C+G+T frequencies in DNA will be 100% or close to it, but it is <50% in most proteins.

If you think about it, I am sure you can come up with some rules of your own.
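
As a rough illustration, the frequency rules above could be sketched in Python like this. The function name `guess_by_composition` and the 0.9 cutoff are my own placeholders, not calibrated values, and RNA would need U added to the counted letters.

```python
from collections import Counter


def guess_by_composition(seq, acgt_cutoff=0.9):
    """Guess 'nucleotide' or 'protein' from letter frequencies.

    acgt_cutoff is an assumed, uncalibrated threshold.
    """
    # Keep letters only, so gaps, digits and whitespace are ignored.
    letters = [c for c in seq.upper() if c.isalpha()]
    if not letters:
        raise ValueError("no sequence letters found")

    counts = Counter(letters)
    total = len(letters)

    # Rule 1: A+C+G+T is close to 100% in DNA, well below 50% in most proteins.
    acgt_fraction = sum(counts[base] for base in "ACGT") / total
    # Rule 2: leucine (L) is common in proteins and absent from DNA.
    has_leucine = counts["L"] > 0

    if acgt_fraction >= acgt_cutoff and not has_leucine:
        return "nucleotide"
    return "protein"


print(guess_by_composition("ACGTTTGACAGGCTA" * 10))
print(guess_by_composition("MKVLAAGIVGLLLAQPSFAEDK"))
```

Very short sequences, or DNA with many ambiguity codes, can fall on the wrong side of the cutoff, so treat the result as a guess rather than a verdict.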

1
13 months ago
ATpoint 57k

I was thinking along the same lines as Mensur Dlakic, but from the nucleotide perspective. Even though IUPAC allows, to the best of my knowledge, 16 characters for nucleotide FASTA, the majority will always be A/T/C/G/N, even in noisy Sanger sequencing data. I include N explicitly because in genome FASTA files repetitive regions, especially the telomeres, often consist of large N stretches. Therefore, if a selection of the FASTA, say the first 100 characters (maybe of each of the first 10 entries if it is a multi-FASTA), consists of more than x% A/T/C/G/N characters, then call it nucleotide, else call it amino acid. One can probably calibrate this by randomly pulling some FASTA files from NCBI, maybe from the nucleotide collection, plus the same from an amino acid collection, and then deriving the expected composition of sum(A/T/C/G/N) versus all other characters. I guess this will be rather bimodal, so finding a good cutoff is probably not too difficult, as Mensur Dlakic suggested in his second-to-last sentence.
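
A minimal sketch of that sampling idea in Python, assuming plain-text FASTA input. The names `sample_fasta` and `classify_fasta` are hypothetical, and the 0.9 cutoff merely stands in for the x% that would still need calibrating against real nucleotide and protein collections.

```python
import io

NUCLEOTIDE_CHARS = set("ACGTN")


def sample_fasta(handle, max_records=10, max_chars=100):
    """Return up to max_chars sequence characters from each of the
    first max_records entries of a FASTA file handle."""
    samples = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # New entry: stop once enough entries have been collected.
            if len(samples) == max_records:
                break
            samples.append("")
        elif samples and len(samples[-1]) < max_chars:
            room = max_chars - len(samples[-1])
            samples[-1] += line[:room].upper()
    return samples


def classify_fasta(handle, cutoff=0.9):
    """Return (label, fraction of A/T/C/G/N) for the sampled characters.

    cutoff is an assumed placeholder, not a calibrated value.
    """
    sample = "".join(sample_fasta(handle))
    if not sample:
        raise ValueError("no sequence data found")
    fraction = sum(c in NUCLEOTIDE_CHARS for c in sample) / len(sample)
    label = "nucleotide" if fraction > cutoff else "amino acid"
    return label, fraction


# Usage with in-memory data; a real file would be passed via open(path).
dna = io.StringIO(">s1\nACGTNNNNACGT\nGGCC\n>s2\nTTTTAAAA\n")
protein = io.StringIO(">p1\nMKVLAAGIVGLLLAQPSFAEDK\n")
print(classify_fasta(dna))
print(classify_fasta(protein))
```

Returning the fraction alongside the label makes it easy to collect values from many files and check whether the distribution really is bimodal before fixing the cutoff.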