Question

How Do I Retrieve The Protein Gi Numbers Given The Taxids?

12.1 years ago

Tawny ▴ 180

I have a list of taxon IDs and I need to get a list of all of the protein GI numbers for those taxa. The taxon ID list contains over 1700 entries. I have tried to use the solution provided here: http://biostar.stackexchange.com/questions/17761/refseq-proteins-for-a-given-taxid but it uses Python, which is not available on the HPC I use. I can use Perl/BioPerl and have tried to understand the EUtilities package, but I am not sure exactly how to get the data I need.
Has anyone done this before?

protein • 6.2k views

score 5 · Answer 1 · 2012-03-17

You can download and uncompress gi_taxid_prot.dmp.gz (or the zip file) from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

Then you can obtain all the GIs for a given taxon (say 9606) with a simple:

$ gawk '$2==9606{print $1}' gi_taxid_prot.dmp

or if you prefer to use Perl:

$ perl -lane 'print $F[0] if ($F[1] == 9606)' gi_taxid_prot.dmp

You can even create a shell script:

$ cat > taxid2gi.sh
#!/bin/sh
gawk -v taxid="$1" '$2==taxid{print $1}' "$2"
^D

Make it executable

$ chmod +x taxid2gi.sh

And run it with any taxid:

$ ./taxid2gi.sh 9606 gi_taxid_prot.dmp > 9606_gis.txt
$ ./taxid2gi.sh 10090 gi_taxid_prot.dmp > 10090_gis.txt

If you have a lot of taxids to check, create a simple Perl script like:

use strict;
use warnings;

# First argument: a file with your taxids (1 per line)
# Second argument: the uncompressed gi_taxid_prot.dmp file
my ($taxids, $gitaxid_file) = @ARGV;

# Load the wanted taxids into a hash for fast lookup
open my $taxids_fh, "<", $taxids or die $!;
my %taxids = map {chomp; $_ => 1} (<$taxids_fh>);
close($taxids_fh);

# Scan the GI/taxid table once and print every line whose taxid is wanted
open my $gitaxid_fh, "<", $gitaxid_file or die $!;
while (<$gitaxid_fh>) {
    chomp;
    my ($gi, $taxid) = split;
    if (exists $taxids{$taxid}) {
        print "$_\n";
    }
}
close($gitaxid_fh);
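
The same one-pass lookup can also be sketched in plain awk, which may be handy if you would rather not keep a separate Perl script around. This is only a minimal sketch: the file name taxids.txt and the GI/taxid values below are made-up sample data standing in for your real taxid list and the real gi_taxid_prot.dmp.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the real files
tmp=$(mktemp -d)
printf '9606\n10090\n' > "$tmp/taxids.txt"
printf '101\t9606\n102\t562\n103\t10090\n104\t9606\n' > "$tmp/gi_taxid_prot.dmp"

# While reading the first file (NR == FNR), remember each wanted taxid.
# While reading the second file, print GI and taxid if the taxid was wanted.
matches=$(awk 'NR == FNR { want[$1]; next } $2 in want { print $1 "\t" $2 }' \
    "$tmp/taxids.txt" "$tmp/gi_taxid_prot.dmp")

printf '%s\n' "$matches"
rm -rf "$tmp"
```

Note that the NR == FNR idiom assumes the taxid file is not empty; with an empty first file, awk would treat the table itself as the taxid list.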

score 2 · Answer 2 · 2012-03-16

An easy way is to go to Entrez Protein and use:

txid9606 AND srcdb_refseq[properties]

This gets you all human proteins from RefSeq. The srcdb_refseq[properties] term is what restricts the results to a sensible set; without it, you would get almost 600,000 entries.

Once you have done this, you can download the results to a file and keep only the GI numbers.

Alternatively, for eutils, try some Perl code like the following, which is designed to get GI numbers using a tab-separated input file with lines of the form:

taxon_id <tab> organism_name

use strict;
use warnings;
use LWP::Simple;    # provides get()

my $db   = 'protein';
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

while (my $tx_line = <>) {
    chomp($tx_line);
    next unless ($tx_line);

    my ($taxon, $descr) = split("\t", $tx_line);
    my ($sname) = ($descr =~ m/^(\w+)/);
    $sname = lc($sname);

    # one output file per taxon: <taxon_id>_<first word of name>.gi
    my $out_file = $taxon . "_" . $sname . ".gi";
    open(my $fout, ">", $out_file) || die "cannot open $out_file";

    my $query = "srcdb_refseq[prop]+AND+$taxon" . "[orgn]";
    my $url   = $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

    # post the esearch URL
    my $esearch_result = get($url);
    my ($count, $querykey, $webenv) = ($esearch_result =~
        m|<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>|s);

    # nothing found: move on to the next taxon
    if (!$count) { close($fout); next; }

    my $retmax = 1000;
    my $efetch = "";

    # page through the stored result set, $retmax ids at a time
    for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
        $url = $base . "esearch.fcgi?"
            . "retstart=$retstart&retmax=$retmax&"
            . "db=$db&query_key=$querykey&WebEnv=$webenv";
        $efetch = get($url);

        # now extract the gi numbers
        my @new_gis = ($efetch =~ m/<Id>(\d+)<\/Id>/g);
        print $fout join("\n", @new_gis) . "\n";
    }
    close($fout);
}
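
To see what the paging loop in the code above does, here is a small sketch of just the retstart arithmetic, with no network access involved. The count of 2500 is a made-up value standing in for whatever <Count> esearch reports; retmax is 1000 as in the Perl code.

```shell
#!/bin/sh
count=2500      # hypothetical <Count> value returned by esearch
retmax=1000     # page size, as in the Perl code
retstart=0
offsets=""

# Collect every retstart value the loop would request
while [ "$retstart" -lt "$count" ]; do
    offsets="$offsets $retstart"
    retstart=$((retstart + retmax))
done
offsets=${offsets# }    # drop the leading space

printf '%s\n' "$offsets"
```

With these values the loop makes three requests, the last of which returns only the remaining 500 ids.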

score 1 · Answer 3 · 2012-07-24

11.7 years ago

Fallino ▴ 20

GNU grep is faster than gawk (quote the patterns so the shell does not try to expand the brackets):

grep '[[:space:]]9606$' gi_taxid_prot.dmp > gis_from_taxid9606.txt
grep '^12345[[:space:]]' gi_taxid_prot.dmp > taxid_from_gi12345.txt
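
Note that grep prints whole lines (GI and taxid), not just the GI column. If only the GIs are wanted, the output can be piped through cut, as in this minimal sketch. The three-line table is made-up sample data in the same tab-separated layout as gi_taxid_prot.dmp; the 19606 row is there to show that the anchored pattern does not match taxids that merely end in 9606.

```shell
#!/bin/sh
# Hypothetical sample data standing in for gi_taxid_prot.dmp
tmp=$(mktemp -d)
printf '101\t9606\n102\t19606\n103\t9606\n' > "$tmp/gi_taxid_prot.dmp"

# Keep lines whose taxid column is exactly 9606, then keep only the GI column
gis=$(grep '[[:space:]]9606$' "$tmp/gi_taxid_prot.dmp" | cut -f1)

printf '%s\n' "$gis"
rm -rf "$tmp"
```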

score 0 · Answer 4 · 2012-03-16

I have simply translated the Python script from the question you cite, leaving out the last part where the sequence is obtained, because you want the GIs only. You have to install Bio::DB::SoapEUtilities. I am not 100% sure about the best query string, but the original one seems to work:

#!/usr/bin/env perl 
use strict;
use warnings; 
use Bio::DB::SoapEUtilities;
# factory construction
my $fac = Bio::DB::SoapEUtilities->new();
# get a Bio::DB::SoapEUtilities::Result object
my $entrezDbName = 'protein';
my $ncbiTaxId = '1001533'; # Bovine papillomavirus 7
# Find entries matching the query
my $entrezQuery = "refseq[filter] AND txid$ncbiTaxId";
my $result = $fac->esearch(
               -email => 'bla@blub.org',
               -db => $entrezDbName,
               -term => $entrezQuery)->run;
print join("\n", @{$result->ids()}), "\n";