Question

How to distinguish nucleic acid and amino acid representation in FASTA files?

1

4.5 years ago

Johannes W. Dietrich • 0

The FASTA format is used to represent both nucleic acid and amino acid sequences. Since the one-letter code for nucleic acids largely overlaps with that for amino acids, considerable ambiguity exists when reading FASTA files. As far as I could find out, the header line carries no standardised information on the type of sequence.

Is there a recommended way for automatically distinguishing nucleic acid and amino acid sequences?

FASTA • 2.6k views

updated 2.3 years ago by Ram 45k • written 4.5 years ago by Johannes W. Dietrich • 0

0

There is an interesting anecdote about the FASTA format by Bill Pearson (its inventor) here: C: Was FASTA ever popular?

4.5 years ago by GenoMax 152k

score 5 · Answer 1 · 2020-12-30

Leucine is the most abundant amino acid in natural proteins, at ~9.5%. If you go over a couple of hundred residues and there are no Ls among them, chances are solid that it is not a protein. A (alanine) and C (cysteine) are present at ~7.9% and ~1.5%, respectively, in natural proteins, while at least one of them must be >20% in DNA. The sum of the A+C+G+T frequencies in DNA will be 100% or close to it, but it is <50% in most proteins.

If you think about it, I am sure you can come up with some rules of your own.
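As a rough illustration of these composition rules, here is a minimal Python sketch. The function names and the 0.5 cutoff are my own assumptions for the example (the cutoff is taken loosely from the "<50% in most proteins" figure above), and the leucine check only makes sense for sequences of a couple of hundred residues or more.

```python
from collections import Counter


def composition(seq):
    """Return the fraction of each letter in seq (case-insensitive, letters only)."""
    letters = [c for c in seq.upper() if c.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {c: n / total for c, n in Counter(letters).items()}


def looks_like_protein(seq, acgt_cutoff=0.5):
    """Heuristic guess: True if seq looks more like protein than DNA.

    acgt_cutoff is an assumed threshold, not a calibrated value.
    """
    freqs = composition(seq)
    acgt = sum(freqs.get(c, 0.0) for c in "ACGT")
    has_leucine = freqs.get("L", 0.0) > 0
    # DNA: A+C+G+T is at or near 100%. Protein: well below that, and
    # a sequence of reasonable length almost always contains leucine.
    return acgt < acgt_cutoff and has_leucine


dna = "ATGCGTACGTTAGC" * 20
protein = "MKTLLVAGLLSEQWERTYIPDFHKNMC" * 10
print(looks_like_protein(dna))
print(looks_like_protein(protein))
```

Short peptides, or sequences with many ambiguity codes, can easily fool a rule this simple, so treat the result as a guess rather than a classification.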

score 1 · Answer 2 · 2020-12-30

I was thinking along similar lines to Mensur Dlakic, but from the nucleotide perspective. Even though IUPAC allows, to the best of my knowledge, 16 characters for nucleotide FASTA, the majority will always be A/T/C/G/N, even in noisy Sanger sequencing data. I include N explicitly because in genome FASTA files repetitive regions, especially the telomeres, often consist of large N stretches. Therefore, if a sample of the file, say the first 100 characters of each of the first 10 entries if it is a multi-FASTA, consists of more than x% A/T/C/G/N characters, call it nucleotide; otherwise call it amino acid.

One can probably calibrate x by randomly pulling some FASTA files from NCBI, maybe from the nucleotide collection, plus the same number from an amino acid collection, and then deriving the expected proportion of A/T/C/G/N versus all other characters. I guess the distribution will be rather bimodal, so finding a good cutoff is probably not too difficult, as Mensur Dlakic suggested in his second-to-last sentence.
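The sampling idea could be sketched in Python roughly like this. The function names are hypothetical, and the 0.9 cutoff is only a placeholder for x; it has not been calibrated against real data as described above.

```python
import io
from itertools import islice


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def guess_fasta_type(handle, n_records=10, n_chars=100, cutoff=0.9,
                     alphabet="ATCGN"):
    """Guess 'nucleotide' or 'protein' from the start of the first records.

    cutoff is an assumed placeholder for x, not a calibrated threshold.
    """
    hits = total = 0
    for _, seq in islice(read_fasta(handle), n_records):
        sample = seq[:n_chars].upper()
        hits += sum(c in alphabet for c in sample)
        total += len(sample)
    if total == 0:
        return "unknown"
    return "nucleotide" if hits / total > cutoff else "protein"


nuc = io.StringIO(">r1\nATGCNNNNATGC\n>r2\nGGCCTTAA\n")
prot = io.StringIO(">p1\nMKTLLVAGLLSEQWERTYIPDFHKNMC\n")
print(guess_fasta_type(nuc))   # nucleotide
print(guess_fasta_type(prot))  # protein
```

For a file on disk you would pass `open(path)` instead of the `io.StringIO` objects. Note that this simple reader loads each sampled record in full before slicing it, which may be slow for whole-chromosome entries.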