Question

Tokenize FASTA Header To Separate Protein Annotation and Organism Name

1

3.5 years ago

taojincs ▴ 50

Given this FASTA header as an example, ">WP_024130427.1 [50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]", my goal is to extract the accession number, the protein annotation, and the organism name separately.

Right now, I am able to extract the accession number >WP_024130427.1 and "[50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]".

I have trouble separating the remaining two parts: the protein annotation ("[50S ribosomal protein L16]-arginine 3-hydroxylase") and the organism name ("Citrobacter koseri").

The main issue is the square brackets. This particular example is easy to tokenize, but square brackets are used in many different styles. The headers below show a few variations (the list is not comprehensive). I can parse these, but my tokenization still doesn't work for all sequences:

>WP_011200935.1 cysteine synthase A [[Mannheimia] succiniciproducens]
>WP_024130427.1 [50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]
>WP_011742684.1 [FeFe] hydrogenase H-cluster radical SAM maturase HydG [Caldanaerobacter subterraneus]

Is there a better way to extract the annotation and the organism name separately, given the unpredictable use of square brackets?

Right now, the only approach I can think of is to scan backward from the end of the string instead of matching a pattern, counting brackets until the numbers of ] and [ balance. This method will work, but I am wondering whether there is a better way.

fasta protein tokenization • 724 views

score 1 · Answer 1 · 2020-10-08

As the examples above show, the organism name is in the last set of [ ... ] (the outermost pair when brackets are nested). Treat everything between the accession number and that last bracketed group as the annotation. Note also that you are working with WP_* accessions, each of which can represent multiple organisms.

Because a non-redundant protein sequence may be found in RefSeq genomes from multiple species, the organism information provided on the protein record reflects the lowest-common taxonomic node ranging from the genus species level to super-kingdom.
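
The rule above (the organism is the last bracketed group, and everything between the accession and that group is the annotation) can be sketched in Python by scanning from the end of the header and balancing brackets, so that nested cases such as [[Mannheimia] succiniciproducens] stay intact. This is only a minimal sketch: the function name split_header is hypothetical, it drops the leading ">" from the accession, and it assumes the header ends with a bracketed organism name (otherwise it returns None for the organism).

```python
def split_header(header):
    """Split a FASTA header into (accession, annotation, organism).

    Assumption: the organism is the last bracketed group at the end of
    the header; its outer brackets are removed, inner ones are kept.
    """
    line = header.strip().lstrip(">")
    # The accession is everything up to the first space.
    accession, _, rest = line.partition(" ")
    rest = rest.strip()
    if not rest.endswith("]"):
        # No trailing bracketed group: treat it all as annotation.
        return accession, rest, None

    depth = 0
    # Walk backward, counting "]" up and "[" down until they balance.
    for i in range(len(rest) - 1, -1, -1):
        ch = rest[i]
        if ch == "]":
            depth += 1
        elif ch == "[":
            depth -= 1
            if depth == 0:
                annotation = rest[:i].strip()
                organism = rest[i + 1:-1]
                return accession, annotation, organism

    # Unbalanced brackets: fall back to returning everything as annotation.
    return accession, rest, None


headers = [
    ">WP_011200935.1 cysteine synthase A [[Mannheimia] succiniciproducens]",
    ">WP_024130427.1 [50S ribosomal protein L16]-arginine 3-hydroxylase [Citrobacter koseri]",
    ">WP_011742684.1 [FeFe] hydrogenase H-cluster radical SAM maturase HydG [Caldanaerobacter subterraneus]",
]
for h in headers:
    print(split_header(h))
```

Brackets inside the annotation, such as [FeFe] or [50S ribosomal protein L16], are never reached by the scan, because it stops as soon as the trailing group is balanced.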