Question

Downloading Gene Xml From Entrez

1

Entering edit mode

12.3 years ago

Daniel Standage 4.1k

I recently downloaded gene annotations from NCBI Entrez in XML format. I'm trying to figure out an alternative process I can use for retrieving these same data that:

can be done from the command line; and
will return the same data each time, even if NCBI's annotations are updated or improved

Here is the process I used initially to grab the data.

Followed a link to an NCBI genome page
Clicked on Gene under Related information
Clicked "Send to: File", selected XML, and hit save

Is there any alternative way I can do this that meets the criteria I listed above?

ncbi entrez • 7.8k views

ADD COMMENT • link updated 10.7 years ago by Biostar 20 • written 12.3 years ago by Daniel Standage 4.1k

Ram · Answer 1 · 2012-01-05

5

Entering edit mode

12.3 years ago

Pierre Lindenbaum 161k

Use NCBI efetch

curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=xml"

Update: to get all the records:

get gene2xml (which is part of the NCBI toolkit)
get the ASN file for the NCBI genes under ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ . sync this file daily if needed.
transform the ASN to XML with gene2xml
if you want to restrict your data to one or more gene, use a SAX or Stax parser to process the XML (for example see: this post and the schema: http://www.ncbi.nlm.nih.gov/data_specs/dtd/NCBI_Entrezgene.dtd ) or put each XML in a database geneid2xml .

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Is there a way to request 42,876 genes with a single query? As the question suggests, I'm looking for a single file, not one for each gene.

ADD REPLY • link 12.3 years ago by Daniel Standage 4.1k

0

Entering edit mode

I've updated my answer.

ADD REPLY • link 12.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Can you explain what this means: "put each XML in a database" geneid2xml

which db would you suggest. ? What is the "geneid2xml"? It does not seem to be a database vendor.

ADD REPLY • link 12.0 years ago by user56 ▴ 300

0

Entering edit mode

that could be any SQL database with primary-key=gene-id and value=xml, or a XML-oriented database like eXist

ADD REPLY • link 12.0 years ago by Pierre Lindenbaum 161k

score 3 · Answer 2 · 2012-01-07

This is a lot of genes! The results XML file is about 1.8G. I found that if you pass all of the IDS to efetch at once, you get a server error.

So here's the strategy I came up with. You can use a shell script that uses elink to get the list of IDs, in an XML file; a perl script to extract just the result IDs and put them into a set of text files. The text files, each with a limited number (1000) of gene ids can then be sent to efetch, one at a time, to get the gene xml files. This results in about 42 separate xml files, each with info about 1000 genes. It may be that efetch will accept more than 1000 ids at once -- I didn't experiment to see what the practical limit was.

Here's the shell script:

#!/usr/bin/bash

# Fetch the list of uids:
curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=genome\&linkname=genome_gene\&id=5 \
    > ids.xml

# Mung with perl to extract the list of gene IDs.
# This produces a bunch of ids-nnn.txt text files.
perl getids.pl ids.xml

# For each of these text files, invoke efetch.
for file in ids-*.txt ; do
    outfile=${file%%.*}.xml
    echo "Getting gene info for $file"
    curl -X POST -d @$file \
        http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene\&retmode=xml \
        > $outfile
done

Here's the Perl script that munges the ID results:

    #!/usr/bin/perl

    # The output will be split into separate files with this number of genes IDs in
    # each
    my $genesPerFile = 1000;

    # We'll discard the first <Id> in the input file; this flag keeps track of that.
    my $first = 1;

    # Number the output files starting with 0, and keep track of the number of
    # gene ids written so far.
    my $outFileNum = 0;
    my $geneNum = 0;

    open OUT, ">ids-$outFileNum.txt";

    # First print out the query parameter name and the equals sign
    print OUT "id=";

    # For each line of input
    while (<>) {
        # Get rid of the newline at the end of every line
        chomp;
        # We're only interested in lines that have <Id>
        if (/\<Id\>/) {
            # Discard the first one
            if ($first) {
                $first = 0;
            }

            else {
                # Extract just the numeric value
                s/^.*\<Id\>([0-9]+)\<\/Id\>.*$/$1/;

                # Print it out with a trailing comma
                print OUT $_ . ",";

                # If we've reached the limit for this output file, close it
                # and open the next one.
                $geneNum++;
                if ($geneNum == $genesPerFile) {
                    $geneNum = 0;
                    print OUT "\n";
                    close OUT;
                    $outFileNum++;
                    open OUT, ">ids-$outFileNum.txt";
                    print OUT "id=";
                }
            }
        }
    }

    print OUT "\n";
    close OUT;

score 3 · Answer 3 · 2012-01-09

EUtils is definitely your friend for this task. Here's some code using BioRuby's EUtils library. Note that the library does not implement Elink, so we have to do that ourselves. Note also that I used a much smaller genome (Nanoarchaeum equitans, ID = 1122) in this example, just for testing quickly.

#!/usr/bin/ruby

require "rubygems"  # ruby 1.8
require "bio"
require "crack"
require "open-uri"

Bio::NCBI.default_email = "me@me.com"

ncbi = Bio::NCBI::REST.new
base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
suff = "elink.fcgi?dbfrom=genome&db=gene&id=1122"

# get an array of Gene IDs using Elink
xml = open("#{base + suff}").read
xml = Crack::XML.parse(xml)  # parse XML to a hash
ids = xml['eLinkResult']['LinkSet']['LinkSetDb']['Link'].map {|x| x['Id']}

# fetch Gene records as XML
result = ncbi.efetch(ids.join(","), {"db" => "gene", "retmode" => "xml"})

# write to file
File.open("result.xml", "w") do |f|
  f.write(result)
end

A couple of notes:

I don't believe there's a limit to the number of IDs that Elink can return, but it may take a while for _Glycine max_ (42876 IDs)
The EUtils documentation states: "There is no set maximum for the number of UIDs that can be passed to EFetch, but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method." I don't know whether BioRuby efetch() handles this automatically (it had no problems with the 540 IDs in my example).

It remains to be seen how well this code would cope with the ~ 1.8 GB XML download for your genome. It might be better to split the output using retstart as described in the NCBI EUtils documentation.