4
0
Entering edit mode
7.5 years ago
anon • 0

Hi all,

I'm trying to create a pair of bash commands (or single command) to:

(1) Extract $ACCESSION from a FASTA header from the format >$ACCESSION Genus species strain


It is always followed by space and contains decimal and number at end. EX: NC123456.7

(2) Add $GI to the same FASTA header in the format >gi|$GI|ACCESSION Genus species strain


...essentially adding the GI, prefix, and pipes to the header from (1).

In between these two commands I have already figured out how to query the GI from ACCESSION using:

GI=$(curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$ACCESSION&rettype=gi)


Can you please give me an example of the most efficient way to complete this task? Much appreciated in advance!

EDIT: I should mention that I need to keep this task within the confines of a single shell script. I am also downloading genome assemblies as a multi-seq FASTA, splitting them (already done), but need to add GIs to the headers for taxon mapping. There are hundreds of assemblies with many contigs each.

bash gi accession FASTA • 4.5k views
1
Entering edit mode

It would be much easier to use biopython or bioperl.

2
Entering edit mode
7.5 years ago

The answers suggesting using bio{awk,perl,python} are correct; doing even simple things like this yourself as a one-off is surprisingly fraught with danger (what's the right regex for an accession number? Well, it depends.)

Still, as a learning exercise this can be done pretty simply using bash+sed:

#!/bin/bash
APIBASE="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=gi"
NOMATCH="ID list is empty"
do
acc=$( echo$line | sed  -E -e '/^[^>]/d' -e 's/>([[:alnum:]_]+\.[[:digit:]]+).*/\1/' )
if [ -z "$acc" ] then echo$line
else
gi=$( curl "${APIBASE}&id=${acc}" 2> /dev/null ) if [[$gi =~ $NOMATCH ]] then echo$line
else
restofline=$( echo$line | sed -e 's/^>$$.*$$/\1/' )
echo ">gi|${gi}|${restofline}"
fi
fi
done < "${1:-/dev/stdin}" > "${2:-/dev/stdout}"​


This will work with either BSD or GNU sed (e.g. mac or linux). Playing with a made-up fasta file:

$cat sample.fasta >NC_001477.1 foo bar baz ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAAACGCGCGAGAAACCGCGTGT CAACTGTTTCACAGTTGGCGAAGAGATTCTCAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGT GATGGCTTTTATAGCATTCCTAAGATTTCTAGCCATACCTCCAACAGCAGGAATTTTGGCTAGATGGGGC TCATTCAAGAAGAATGGAGCGATCAAAGTGTTACGGGGTTTCAAGAAAGAAATCTCAAACATGTTGAACA >U49845.1 something somethingelse quux GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG >EASYAS123.45 not valid at all GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG​  You can run this as $ ./gi-from-accession.sh sample.fasta new.fasta


or

$cat sample.fasta | ./gi-from-accession.sh > newer.fasta  and get $ cat newer.fasta
>gi|9626685|NC_001477.1 foo bar baz
ATGAACAACCAACGGAAAAAGACGGGTCGACCGTCTTTCAATATGCTGAAACGCGCGAGAAACCGCGTGT
CAACTGTTTCACAGTTGGCGAAGAGATTCTCAAAAGGATTGCTTTCAGGCCAAGGACCCATGAAATTGGT
GATGGCTTTTATAGCATTCCTAAGATTTCTAGCCATACCTCCAACAGCAGGAATTTTGGCTAGATGGGGC
TCATTCAAGAAGAATGGAGCGATCAAAGTGTTACGGGGTTTCAAGAAAGAAATCTCAAACATGTTGAACA
>gi|1293613|U49845.1 something somethingelse quux
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG
>EASYAS123.45 not valid at all
GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG
ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA
GTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACAACATATGTAACATA
TTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATTAGAAACAGAACGCAAAAA
TTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCGTCATAGAACTTTTGGCAATTCGCG

1
Entering edit mode
7.5 years ago
aliceb ▴ 10

Hello,

I just had a similar issue with a large multi fasta file, and I was able to solve it using FaBox online.

Not a very good exercise in coding, but great for those of us that are perhaps a bit more lazy!

http://users-birc.au.dk/biopv/php/fabox/

-Alice

0
Entering edit mode

The tools looks simple, Alice. Thank you for sharing it here. I wish they had a command line version of it, though. While uploading one huge multi-fasta file is simple enough, processing a batch of multi-fasta files is a huge pain with online tools such as these..

0
Entering edit mode
7.5 years ago
Ram 36k

As Devon Ryan said, use BioPython or BioPerl.

Workflow:

For each sequence in input file,

2. Query the URL with the accession num from the header and add GI to create new header string.
3. Write out above created header string
4. Write out sequence
0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

I'd prefer to use bash, but if there's a way to call on perl within a shell script, that would work. However, I don't have any experience with perl. Any ideas?

0
Entering edit mode

BASH makes this task quite complicated. You'd have to curl the URL and set a variable to hold the result then awk or sed to replace the old header with a newly crafted header.

BioPerl is quite straightforward, and so is BioPython. You might wanna learn them, as you're gonna absolutely need them at some point soon. Is it just Perl you're unfamiliar with, or are you new to programming?

0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

Not new to programming, but new to python, perl, and bash. BioPython seems like the way to go, but my superiors don't agree that learning and tinkering with new languages is time well spent. To them, learning new bench experiments is more worthwhile.

0
Entering edit mode

If you're not new to programming, then you're just familiarizing yourself with syntax. Implementation logic should come naturally to you. And no time spent on Python or Perl is wasted. These languages are the natural next step from BASH.

0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

Thanks. You and I are on the same page.

0
Entering edit mode
7.5 years ago
onuralp ▴ 190

As mentioned, there are easier ways to do this using Bioperl / Biopython - take a look at them, it is definitely worth your time.

Using bash alone, this should work in principle (i.e., not tested):

#!/bin/sh
FASTA=~/test/sequences.fa # original file containing sequences or accession ids
ACCESSION=$(grep -o -E "^>\w+"$FASTA | tr -d ">") # extract accession id
echo "accession is $ACCESSION" # for visual inspection (optional) curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=$ACCESSION&rettype=fasta > ~/test/\$ACCESSION.fasta

0
Entering edit mode

redacted redacted redacted

Original content (from https://web.archive.org/web/20150929073744/https://www.biostars.org/p/117734/):

Thank you for the help with grep. There are thousands of FASTAs, so redownloading each entry with FASTA return type would be inefficient in this case. My script already has to download gigabytes of gzips, unpack them, and split into single FASTAs in order to get the acccessions and sequences. Querying the GI from the accession followed by adding the GI to the header seems to be the most efficient means.

1
Entering edit mode

Absolutely - efficiency is precisely the reason why we have referred you to other tools that are more adequate for tasks like sequence manipulation. :) In any case, Jonathan's answer must be what you're looking for.