Bash extract accession from FASTA header and add GI
4
0
Entering edit mode
9.9 years ago
anon • 0

Hi all,

I'm trying to create a pair of bash commands (or single command) to:

(1) Extract $ACCESSION from a FASTA header from the format

>$ACCESSION Genus species strain

It is always followed by space and contains decimal and number at end. EX: NC123456.7

(2) Add $GI to the same FASTA header in the format

>gi|$GI|ACCESSION Genus species strain

...essentially adding the GI, prefix, and pipes to the header from (1).

In between these two commands I have already figured out how to query the GI from ACCESSION using:

GI=$(curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$ACCESSION&rettype=gi)

Can you please give me an example of the most efficient way to complete this task? Much appreciated in advance!

EDIT: I should mention that I need to keep this task within the confines of a single shell script. I am also downloading genome assemblies as a multi-seq FASTA, splitting them (already done), but need to add GIs to the headers for taxon mapping. There are hundreds of assemblies with many contigs each.

bash gi accession FASTA • 6.0k views
ADD COMMENT
1
Entering edit mode

It would be much easier to use biopython or bioperl.

ADD REPLY
2
Entering edit mode
9.9 years ago

The answers suggesting using bio{awk,perl,python} are correct; doing even simple things like this yourself as a one-off is surprisingly fraught with danger (what's the right regex for an accession number? Well, it depends.)

Still, as a learning exercise this can be done pretty simply using bash+sed:

#!/bin/bash
APIBASE="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gi"
NOMATCH="ID list is empty"
while read line
do
    acc=$( echo $line | sed  -E -e '/^[^>]/d' -e 's/>([[:alnum:]_]+\.[[:digit:]]+).*/\1/' )
    if [ -z "$acc" ]
    then
        echo $line
    else
        gi=$( curl "${APIBASE}&id=${acc}" 2> /dev/null )
        if [[ $gi =~ $NOMATCH ]]
        then
            echo $line
        else
            restofline=$( echo $line | sed -e 's/^>\(.*\)/\1/' )
            echo ">gi|${gi}|${restofline}"
        fi
    fi
done < "${1:-/dev/stdin}" > "${2:-/dev/stdout}"​

This will work with either BSD or GNU sed (e.g. mac or linux). Playing with a made-up fasta file:

$ cat sample.fasta
>NC_001477.1 foo bar baz
ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAAACGCGCGAGAAACCGCGTGT
CAACTGTTTCACAGTTGGCGAAGAGATTCTCAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGT
GATGGCTTTTATAGCATTCCTAAGATTTCTAGCCATACCTCCAACAGCAGGAATTTTGGCTAGATGGGGC
TCATTCAAGAAGAATGGAGCGATCAAAGTGTTACGGGGTTTCAAGAAAGAAATCTCAAACATGTTGAACA
>U49845.1 something somethingelse quux
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG
>EASYAS123.45 not valid at all
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG​

You can run this as

$ ./gi-from-accession.sh sample.fasta new.fasta

or

$ cat sample.fasta | ./gi-from-accession.sh > newer.fasta

and get

$ cat newer.fasta
>gi|9626685|NC_001477.1 foo bar baz
ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAAACGCGCGAGAAACCGCGTGT
CAACTGTTTCACAGTTGGCGAAGAGATTCTCAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGT
GATGGCTTTTATAGCATTCCTAAGATTTCTAGCCATACCTCCAACAGCAGGAATTTTGGCTAGATGGGGC
TCATTCAAGAAGAATGGAGCGATCAAAGTGTTACGGGGTTTCAAGAAAGAAATCTCAAACATGTTGAACA
>gi|1293613|U49845.1 something somethingelse quux
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG
>EASYAS123.45 not valid at all
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG
ADD COMMENT
1
Entering edit mode
9.9 years ago
aliceb ▴ 10

Hello,

I just had a similar issue with a large multi fasta file, and I was able to solve it using FaBox online.

Not a very good exercise in coding, but great for those of us that are perhaps a bit more lazy!

http://users-birc.au.dk/biopv/php/fabox/

-Alice

ADD COMMENT
0
Entering edit mode

The tools looks simple, Alice. Thank you for sharing it here. I wish they had a command line version of it, though. While uploading one huge multi-fasta file is simple enough, processing a batch of multi-fasta files is a huge pain with online tools such as these..

ADD REPLY
0
Entering edit mode
9.9 years ago
Ram 44k

As Devon Ryan said, use BioPython or BioPerl.

Workflow:

For each sequence in input file,

  1. Pick up the header
  2. Query the URL with the accession num from the header and add GI to create new header string.
  3. Write out above created header string
  4. Write out sequence
ADD COMMENT
0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

I'd prefer to use bash, but if there's a way to call on perl within a shell script, that would work. However, I don't have any experience with perl. Any ideas?

ADD REPLY
0
Entering edit mode

BASH makes this task quite complicated. You'd have to curl the URL and set a variable to hold the result then awk or sed to replace the old header with a newly crafted header.

BioPerl is quite straightforward, and so is BioPython. You might wanna learn them, as you're gonna absolutely need them at some point soon. Is it just Perl you're unfamiliar with, or are you new to programming?

ADD REPLY
0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

Not new to programming, but new to python, perl, and bash. BioPython seems like the way to go, but my superiors don't agree that learning and tinkering with new languages is time well spent. To them, learning new bench experiments is more worthwhile.

ADD REPLY
0
Entering edit mode

If you're not new to programming, then you're just familiarizing yourself with syntax. Implementation logic should come naturally to you. And no time spent on Python or Perl is wasted. These languages are the natural next step from BASH.

ADD REPLY
0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

Thanks. You and I are on the same page.

ADD REPLY
0
Entering edit mode
9.9 years ago
onuralp ▴ 190

As mentioned, there are easier ways to do this using Bioperl / Biopython - take a look at them, it is definitely worth your time.

Using bash alone, this should work in principle (i.e., not tested):

#!/bin/sh
FASTA=~/test/sequences.fa # original file containing sequences or accession ids
ACCESSION=$(grep -o -E "^>\w+" $FASTA | tr -d ">") # extract accession id
echo "accession is $ACCESSION" # for visual inspection (optional)
curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$ACCESSION&rettype=fasta > ~/test/$ACCESSION.fasta
ADD COMMENT
0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

Thank you for the help with grep. There are thousands of FASTAs, so redownloading each entry with FASTA return type would be inefficient in this case. My script already has to download gigabytes of gzips, unpack them, and split into single FASTAs in order to get the acccessions and sequences. Querying the GI from the accession followed by adding the GI to the header seems to be the most efficient means.

ADD REPLY
1
Entering edit mode

Absolutely - efficiency is precisely the reason why we have referred you to other tools that are more adequate for tasks like sequence manipulation. :) In any case, Jonathan's answer must be what you're looking for.

ADD REPLY

Login before adding your answer.

Traffic: 1007 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6