I'm trying to create a pair of bash commands (or single command) to:
(1) Extract $ACCESSION from a FASTA header in the format
>$ACCESSION Genus species strain
The accession is always followed by a space and ends with a decimal point and a version number, e.g. NC123456.7
(2) Add $GI to the same FASTA header in the format
>gi|$GI|$ACCESSION Genus species strain
...essentially adding the GI, prefix, and pipes to the header from (1).
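To make the two steps concrete, here is a minimal sketch for a single header line using only bash parameter expansion. The function names `extract_accession` and `add_gi` are made up for illustration, and the GI value is a placeholder, not a real identifier.

```bash
#!/usr/bin/env bash

# (1) Print the accession: the first whitespace-delimited token of the
#     header, without the leading '>'.
extract_accession() {
    local header=$1
    header=${header#>}                        # drop the leading '>'
    printf '%s\n' "${header%%[[:space:]]*}"   # cut at the first whitespace
}

# (2) Rewrite ">ACC rest of header" as ">gi|GI|ACC rest of header".
add_gi() {
    local header=$1 gi=$2
    printf '>gi|%s|%s\n' "$gi" "${header#>}"
}

header='>NC123456.7 Genus species strain'
acc=$(extract_accession "$header")   # acc is now NC123456.7
gi=000000                            # placeholder: the real GI comes from the lookup step
add_gi "$header" "$gi"               # prints >gi|000000|NC123456.7 Genus species strain
```

Parameter expansion avoids spawning sed or awk per header, which may matter when there are many contigs.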
In between these two commands I have already figured out how to query the GI from ACCESSION using:
Can you please give me an example of the most efficient way to complete this task? Much appreciated in advance!
EDIT: I should mention that I need to keep this task within the confines of a single shell script. I am downloading genome assemblies as multi-sequence FASTA files and splitting them (already done), but I still need to add GIs to the headers for taxon mapping. There are hundreds of assemblies with many contigs each.
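At that scale, one possible approach is to look up each unique accession once per file and then rewrite all headers in a single awk pass. The sketch below assumes a hypothetical `lookup_gi` function standing in for the existing accession-to-GI query (here it only returns a dummy string), and a hypothetical `assemblies/` directory in the usage comment; neither comes from the actual pipeline.

```bash
#!/usr/bin/env bash

# HYPOTHETICAL stand-in for the existing accession-to-GI query.
# Replace the body with the real lookup; this one returns a dummy value.
lookup_gi() {
    printf 'GI_FOR_%s\n' "$1"
}

# Rewrite every ">ACC ..." header in a FASTA file as ">gi|GI|ACC ...".
add_gi_to_fasta() {
    local fasta=$1 map acc
    local tmp="$fasta.tmp"
    map=$(mktemp)

    # Build an accession<TAB>GI table, one lookup per unique accession.
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$fasta" | sort -u |
    while IFS= read -r acc; do
        printf '%s\t%s\n' "$acc" "$(lookup_gi "$acc")"
    done > "$map"

    # Single pass over the FASTA: load the table, then patch header lines.
    awk -v mapfile="$map" '
        BEGIN {
            while ((getline line < mapfile) > 0) {
                split(line, f, "\t")
                gi[f[1]] = f[2]
            }
        }
        /^>/ {
            acc = substr($0, 2)
            sub(/[ \t].*/, "", acc)          # keep text before first space/tab
            if (acc in gi) $0 = ">gi|" gi[acc] "|" substr($0, 2)
        }
        { print }
    ' "$fasta" > "$tmp" && mv "$tmp" "$fasta"

    rm -f "$map"
}

# Example usage (directory name is an assumption):
# for f in assemblies/*.fasta; do add_gi_to_fasta "$f"; done
```

Headers whose accession has no entry in the table are left unchanged, and sequence lines pass through untouched.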