Question: Downloading Gene Xml From Entrez
1
gravatar for Daniel Standage
7.5 years ago by
Daniel Standage3.9k
Davis, California, USA
Daniel Standage3.9k wrote:

I recently downloaded gene annotations from NCBI Entrez in XML format. I'm trying to figure out an alternative process I can use for retrieving these same data that:

  1. can be done from the command line; and
  2. will return the same data each time, even if NCBI's annotations are updated or improved

Here is the process I used initially to grab the data.

  1. Followed a link to an NCBI genome page
  2. Clicked on Gene under Related information
  3. Clicked "Send to: File", selected XML, and hit save

Is there any alternative way I can do this that meets the criteria I listed above?

ncbi entrez • 5.7k views
ADD COMMENTlink modified 5.9 years ago by Biostar ♦♦ 20 • written 7.5 years ago by Daniel Standage3.9k
5
gravatar for Pierre Lindenbaum
7.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

Use NCBI efetch

curl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=2&retmode=xml"

Update: to get all the records:

ADD COMMENTlink modified 7.5 years ago • written 7.5 years ago by Pierre Lindenbaum120k

Is there a way to request 42,876 genes with a single query? As the question suggests, I'm looking for a single file, not one for each gene.

ADD REPLYlink written 7.5 years ago by Daniel Standage3.9k

I've updated my answer.

ADD REPLYlink written 7.5 years ago by Pierre Lindenbaum120k

Can you explain what this means: "put each XML in a database" geneid2xml

which db would you suggest. ? What is the "geneid2xml"? It does not seem to be a database vendor.

ADD REPLYlink written 7.1 years ago by user56290

that could be any SQL database with primary-key=gene-id and value=xml, or a XML-oriented database like eXist

ADD REPLYlink written 7.1 years ago by Pierre Lindenbaum120k
3
gravatar for Chris Maloney
7.4 years ago by
Chris Maloney330
Bethesda, MD
Chris Maloney330 wrote:

This is a lot of genes! The results XML file is about 1.8G. I found that if you pass all of the IDS to efetch at once, you get a server error.

So here's the strategy I came up with. You can use a shell script that uses elink to get the list of IDs, in an XML file; a perl script to extract just the result IDs and put them into a set of text files. The text files, each with a limited number (1000) of gene ids can then be sent to efetch, one at a time, to get the gene xml files. This results in about 42 separate xml files, each with info about 1000 genes. It may be that efetch will accept more than 1000 ids at once -- I didn't experiment to see what the practical limit was.

Here's the shell script:

#!/usr/bin/bash

# Fetch the list of uids:
curl http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=genome\&linkname=genome_gene\&id=5 \
    > ids.xml

# Mung with perl to extract the list of gene IDs.
# This produces a bunch of ids-nnn.txt text files.
perl getids.pl ids.xml

# For each of these text files, invoke efetch.
for file in ids-*.txt ; do
    outfile=${file%%.*}.xml
    echo "Getting gene info for $file"
    curl -X POST -d @$file \
        http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene\&retmode=xml \
        > $outfile
done

Here's the Perl script that munges the ID results:

    #!/usr/bin/perl

    # The output will be split into separate files with this number of genes IDs in
    # each
    my $genesPerFile = 1000;

    # We'll discard the first <Id> in the input file; this flag keeps track of that.
    my $first = 1;

    # Number the output files starting with 0, and keep track of the number of
    # gene ids written so far.
    my $outFileNum = 0;
    my $geneNum = 0;

    open OUT, ">ids-$outFileNum.txt";

    # First print out the query parameter name and the equals sign
    print OUT "id=";

    # For each line of input
    while (<>) {
        # Get rid of the newline at the end of every line
        chomp;
        # We're only interested in lines that have <Id>
        if (/\<Id\>/) {
            # Discard the first one
            if ($first) {
                $first = 0;
            }

            else {
                # Extract just the numeric value
                s/^.*\<Id\>([0-9]+)\<\/Id\>.*$/$1/;

                # Print it out with a trailing comma
                print OUT $_ . ",";

                # If we've reached the limit for this output file, close it
                # and open the next one.
                $geneNum++;
                if ($geneNum == $genesPerFile) {
                    $geneNum = 0;
                    print OUT "\n";
                    close OUT;
                    $outFileNum++;
                    open OUT, ">ids-$outFileNum.txt";
                    print OUT "id=";
                }
            }
        }
    }

    print OUT "\n";
    close OUT;
ADD COMMENTlink written 7.4 years ago by Chris Maloney330
3
gravatar for Neilfws
7.4 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

EUtils is definitely your friend for this task. Here's some code using BioRuby's EUtils library. Note that the library does not implement Elink, so we have to do that ourselves. Note also that I used a much smaller genome (Nanoarchaeum equitans, ID = 1122) in this example, just for testing quickly.

#!/usr/bin/ruby

require "rubygems"  # ruby 1.8
require "bio"
require "crack"
require "open-uri"

Bio::NCBI.default_email = "me@me.com"

ncbi = Bio::NCBI::REST.new
base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
suff = "elink.fcgi?dbfrom=genome&db=gene&id=1122"

# get an array of Gene IDs using Elink
xml = open("#{base + suff}").read
xml = Crack::XML.parse(xml)  # parse XML to a hash
ids = xml['eLinkResult']['LinkSet']['LinkSetDb']['Link'].map {|x| x['Id']}

# fetch Gene records as XML
result = ncbi.efetch(ids.join(","), {"db" => "gene", "retmode" => "xml"})

# write to file
File.open("result.xml", "w") do |f|
  f.write(result)
end

A couple of notes:

  • I don't believe there's a limit to the number of IDs that Elink can return, but it may take a while for _Glycine max_ (42876 IDs)
  • The EUtils documentation states: "There is no set maximum for the number of UIDs that can be passed to EFetch, but if more than about 200 UIDs are to be provided, the request should be made using the HTTP POST method." I don't know whether BioRuby efetch() handles this automatically (it had no problems with the 540 IDs in my example).

It remains to be seen how well this code would cope with the ~ 1.8 GB XML download for your genome. It might be better to split the output using retstart as described in the NCBI EUtils documentation.

ADD COMMENTlink written 7.4 years ago by Neilfws48k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 951 users visited in the last hour