Question: Bash extract accession from FASTA header and add GI
0
gravatar for anon
4.5 years ago by
anon10
United States
anon10 wrote:

Hi all,

I'm trying to create a pair of bash commands (or single command) to:

(1) Extract $ACCESSION from a FASTA header from the format

>$ACCESSION Genus species strain

It is always followed by space and contains decimal and number at end. EX: NC123456.7

(2) Add $GI to the same FASTA header in the format

>gi|$GI|ACCESSION Genus species strain

...essentially adding the GI, prefix, and pipes to the header from (1).

In between these two commands I have already figured out how to query the GI from ACCESSION using:

GI=$(curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$ACCESSION&rettype=gi)

Can you please give me an example of the most efficient way to complete this task? Much appreciated in advance!

EDIT: I should mention that I need to keep this task within the confines of a single shell script. I am also downloading genome assemblies as a multi-seq FASTA, splitting them (already done), but need to add GIs to the headers for taxon mapping. There are hundreds of assemblies with many contigs each.

accession bash gi fasta • 2.9k views
ADD COMMENTlink modified 4.5 years ago by aliceb10 • written 4.5 years ago by anon10
1

It would be much easier to use biopython or bioperl.

ADD REPLYlink written 4.5 years ago by Devon Ryan89k
2
gravatar for Jonathan Dursi
4.5 years ago by
Toronto
Jonathan Dursi270 wrote:

The answers suggesting using bio{awk,perl,python} are correct; doing even simple things like this yourself as a one-off is surprisingly fraught with danger (what's the right regex for an accession number?  Well, it depends.)

Still, as a learning exercise this can be done pretty simply using bash+sed:

#!/bin/bash
APIBASE="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gi"
NOMATCH="ID list is empty"
while read line
do
    acc=$( echo $line | sed  -E -e '/^[^>]/d' -e 's/>([[:alnum:]_]+\.[[:digit:]]+).*/\1/' )
    if [ -z "$acc" ]
    then
        echo $line
    else
        gi=$( curl "${APIBASE}&id=${acc}" 2> /dev/null )
        if [[ $gi =~ $NOMATCH ]]
        then
            echo $line
        else
            restofline=$( echo $line | sed -e 's/^>\(.*\)/\1/' )
            echo ">gi|${gi}|${restofline}"
        fi
    fi
done < "${1:-/dev/stdin}" > "${2:-/dev/stdout}"​

This will work with either BSD or GNU sed (e.g. mac or linux).  Playing with a made-up fasta file:

$ cat sample.fasta
>NC_001477.1 foo bar baz
ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAAACGCGCGAGAAACCGCGTGT
CAACTGTTTCACAGTTGGCGAAGAGATTCTCAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGT
GATGGCTTTTATAGCATTCCTAAGATTTCTAGCCATACCTCCAACAGCAGGAATTTTGGCTAGATGGGGC
TCATTCAAGAAGAATGGAGCGATCAAAGTGTTACGGGGTTTCAAGAAAGAAATCTCAAACATGTTGAACA
>U49845.1 something somethingelse quux
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG
>EASYAS123.45 not valid at all
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG​

You can run this as

$ ./gi-from-accession.sh sample.fasta new.fasta

or

$ cat sample.fasta | ./gi-from-accession.sh > newer.fasta

and get

$ cat newer.fasta
>gi|9626685|NC_001477.1 foo bar baz
ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAAACGCGCGAGAAACCGCGTGT
CAACTGTTTCACAGTTGGCGAAGAGATTCTCAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGT
GATGGCTTTTATAGCATTCCTAAGATTTCTAGCCATACCTCCAACAGCAGGAATTTTGGCTAGATGGGGC
TCATTCAAGAAGAATGGAGCGATCAAAGTGTTACGGGGTTTCAAGAAAGAAATCTCAAACATGTTGAACA
>gi|1293613|U49845.1 something somethingelse quux
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG
>EASYAS123.45 not valid at all
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Jonathan Dursi270
1
gravatar for aliceb
4.5 years ago by
aliceb10
Switzerland
aliceb10 wrote:

Hello,

I just had a similar issue with a large multi fasta file, and I was able to solve it using FaBox online.

Not a very good exercise in coding, but great for those of us that are perhaps a bit more lazy!

http://users-birc.au.dk/biopv/php/fabox/

-Alice

 

ADD COMMENTlink written 4.5 years ago by aliceb10

The tools looks simple, Alice. Thank you for sharing it here. I wish they had a command line version of it, though. While uploading one huge multi-fasta file is simple enough, processing a batch of multi-fasta files is a huge pain with online tools such as these..

ADD REPLYlink written 4.5 years ago by RamRS21k
0
gravatar for RamRS
4.5 years ago by
RamRS21k
Houston, TX
RamRS21k wrote:

As Devon Ryan said, use BioPython or BioPerl.

Workflow:

For each sequence in input file,

a. Pick up the header

b. Query the URL with the accession num from the header and add GI to create new header string.

c. Write out above created  header string

d. Write out sequence

 

ADD COMMENTlink written 4.5 years ago by RamRS21k

redacted redacted redacted

ADD REPLYlink modified 12 months ago • written 4.5 years ago by anon10

BASH makes this task quite complicated. You'd have to curl the URL and set a variable to hold the result then awk or sed to replace the old header with a newly crafted header.

BioPerl is quite straightforward, and so is BioPython. You might wanna learn them, as you're gonna absolutely need them at some point soon. Is it just Perl you're unfamiliar with, or are you new to programming?

ADD REPLYlink written 4.5 years ago by RamRS21k

redacted redacted redacted

ADD REPLYlink modified 12 months ago • written 4.5 years ago by anon10

If you're not new to programming, then you're just familiarizing yourself with syntax. Implementation logic should come naturally to you. And no time spent on Python or Perl is wasted. These languages are the natural next step from BASH.

ADD REPLYlink written 4.5 years ago by RamRS21k
1

redacted redacted redacted

ADD REPLYlink modified 12 months ago • written 4.5 years ago by anon10
0
gravatar for onuralp
4.5 years ago by
onuralp180
Turkey
onuralp180 wrote:

As mentioned, there are easier ways to do this using Bioperl / Biopython - take a look at them, it is definitely worth your time. 

Using bash alone, this should work in principle (i.e., not tested): 

#!/bin/sh

FASTA=~/test/sequences.fa # original file containing sequences or accession ids

ACCESSION=$(grep -o -E "^>\w+" $FASTA | tr -d ">") # extract accession id

echo "accession is $ACCESSION" # for visual inspection (optional)

curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$ACCESSION&rettype=fasta > ~/test/$ACCESSION.fasta

 

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by onuralp180

redacted redacted redacted

ADD REPLYlink modified 12 months ago • written 4.5 years ago by anon10
1

Absolutely - efficiency is precisely the reason why we have referred you to other tools that are more adequate for tasks like sequence manipulation. :) In any case, Jonathan's answer must be what you're looking for. 

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by onuralp180
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1858 users visited in the last hour