Question: Bulk Download Of Ncbi Gene "Summary" Field
David Quigley (San Francisco) wrote, 8.5 years ago:

I would like to download or manufacture a mapping of entrez gene IDs to the text that appears in the "Summary" field on an Entrez Gene query for the H. sapiens record for that gene. These short paragraphs are often useful for getting a first idea about what an unfamiliar gene does. The obvious approach of scouring ftp.ncbi.nih.gov/gene/ and ftp://ftp.ncbi.nih.gov/refseq/ for the appropriate record (e.g. gene_info.gz) didn't turn anything up. Thanks for any suggestions.

gene ncbi • 9.2k views
modified 9 weeks ago by hsiaoyi0504 • written 8.5 years ago by David Quigley
David Quigley (San Francisco) wrote, 8.5 years ago:

Thanks for the link, Pierre. This 2.6 GB file is very verbose and structured for human consumption rather than ease of retrieval. I wrote a quick-and-dirty Python parser to pull out the summaries and am posting it so someone else doesn't have to do it too. Note that the accessions are not Entrez gene IDs; you have to map those separately.

f = open('refseqgene.genomic.gbff')
locus2comment = {}
in_comment = False
for line in f:
    if line[0:5] == "LOCUS":
        locus = line.split()[1]  # RefSeqGene accession, e.g. NG_...
        comment = ""
    elif line[0:7] == "COMMENT":
        in_comment = True
        comment += line.split("    ")[1].replace("\n", " ")
    elif line[0:7] == "PRIMARY":
        in_comment = False
        try:
            # keep only the text after the "Summary:" marker, if present
            locus2comment[locus] = comment.split("Summary:")[1]
        except IndexError:
            locus2comment[locus] = comment
    elif in_comment:
        # continuation lines of the COMMENT block are indented 12 spaces
        comment += line.split("            ")[1].replace("\n", " ")
for locus in sorted(locus2comment):
    print(locus + '\t' + locus2comment[locus])
written 8.5 years ago by David Quigley

Thanks David. For future reference, mapping NG_ IDs can be done using this file: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/gene_RefSeqGene

written 7.8 years ago by sa9800
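The mapping file mentioned above is tab-delimited with a commented header line. A minimal parsing sketch (the column names `GeneID` and `RSG` below are an assumption based on the RefSeqGene mapping-file convention, so check the header line of your copy; the CFTR values in the usage example are purely illustrative):

```python
def load_ng_to_geneid(path):
    """Map NG_ accessions to Entrez GeneIDs from a gene_RefSeqGene-style file."""
    mapping = {}
    with open(path) as f:
        # header looks like: #tax_id<TAB>GeneID<TAB>Symbol<TAB>RSG<TAB>...
        header = f.readline().lstrip('#').rstrip('\n').split('\t')
        gene_col = header.index('GeneID')  # assumed column name
        rsg_col = header.index('RSG')      # assumed column name
        for line in f:
            fields = line.rstrip('\n').split('\t')
            mapping[fields[rsg_col]] = fields[gene_col]
    return mapping
```

Usage, on a file with a row like `9606  1080  CFTR  NG_016465.4`: `load_ng_to_geneid('gene_RefSeqGene')['NG_016465.4']` would return `'1080'`.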

Thanks David. In case anyone needs it, mapping NG_ IDs can be done using this file: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene

written 7.8 years ago by sa9800
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 8.5 years ago:

This info is in ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene.genomic.gbff.gz
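For orientation, the Summary text in that file lives inside the GenBank COMMENT block of each record, which runs until the PRIMARY section. A sketch with a made-up record (the sample text below is invented for illustration, not real NCBI content):

```shell
# Build a tiny fake GenBank flat-file record to show where the Summary sits
cat > sample.gbff <<'EOF'
LOCUS       NG_000001            1000 bp    DNA     linear   PRI
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.
            Summary: This gene encodes an example protein used here
            purely for illustration.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER
EOF
# Print the COMMENT block (up to and including the PRIMARY line)
sed -n '/^COMMENT/,/^PRIMARY/p' sample.gbff
```

This is the structure the parsers further down this thread rely on: `LOCUS` starts a record, `COMMENT` opens the summary text, and `PRIMARY` terminates it.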

written 8.5 years ago by Pierre Lindenbaum

It looks like this file has been split into two now, so the URLs are:

ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene1.genomic.gbff.gz
ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene2.genomic.gbff.gz

written 8.0 years ago by Richard Smith

UPDATE: as of 30th Apr 2011, there are three files at ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/

written 7.8 years ago by sa9800

As of 2017 you would be better off using: wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene.*.genomic.gbff.gz ;)

written 19 months ago by krassowski.michal
alirezahkb (United States) wrote, 4.0 years ago:

The best way is to use NCBI's E-utilities: the esummary endpoint returns the summary for a given gene ID.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=604,1&retmode=json

Here is a Python script that I wrote to get summaries for the gene IDs specified in an input file.

import sys
import json
import urllib.request

import pandas as pd

if __name__ == '__main__':
    gene_info_file = sys.argv[1]
    output_file = sys.argv[2]
    open(output_file, 'w').close()  # truncate any previous output
    gene_ids = pd.unique(pd.read_csv(gene_info_file)['1'])
    chunk_size = 100  # esummary accepts many comma-separated IDs per request
    cn = len(gene_ids) // chunk_size + 1
    for i in range(cn):
        chunk_genes = gene_ids[chunk_size * i:min(chunk_size * (i + 1), len(gene_ids))]
        gids = ','.join(str(s) for s in chunk_genes)
        print(i + 1, '/', cn)
        url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'
               '?db=gene&id=' + gids + '&retmode=json')
        print(url)
        data = json.load(urllib.request.urlopen(url))
        result = []
        for g in chunk_genes:
            result.append([g, data['result'][str(g)]['summary']
                           if str(g) in data['result'] else ''])
        pd.DataFrame(result, columns=['gene_id', 'summary']).to_csv(
            output_file, index=False, mode='a', header=(i == 0))

Usage:

python eutilsGetSummary.py gene_ids.csv gene_summary.csv

gene_ids.csv is a CSV file with the first column holding the gene IDs you want to get the summaries for.

gene_summary.csv is the output. 
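For a quick one-off lookup without the CSV machinery, a minimal sketch of the same esummary call (the helper names `esummary_url` and `fetch_summaries` are my own, not part of any library; fetching requires network access):

```python
import json
import urllib.request

EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'

def esummary_url(gene_ids):
    """Build an esummary request URL for a batch of Entrez gene IDs."""
    ids = ','.join(str(g) for g in gene_ids)
    return EUTILS + '?db=gene&id=' + ids + '&retmode=json'

def fetch_summaries(gene_ids):
    """Return {gene_id: summary} for the given IDs; '' when absent."""
    data = json.load(urllib.request.urlopen(esummary_url(gene_ids)))
    return {g: data['result'].get(str(g), {}).get('summary', '')
            for g in gene_ids}
```

For example, `fetch_summaries([604, 1])` would return the summaries for BCL6 and A1BG.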

modified 4.0 years ago • written 4.0 years ago by alirezahkb
Khader Shameer (Manhattan, NY) wrote, 8.5 years ago:

You may also take a look at GeneRIF from NCBI. You can download GeneRIFs from the NCBI FTP site.

modified 8.5 years ago • written 8.5 years ago by Khader Shameer
govardhanks wrote, 2.8 years ago:

(Not an option for bulk downloads, though.) I would like to suggest the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables): select RefSeqGene under "group", input your genes' NM_ IDs or standard HGNC symbols, and set the output format to "selected fields". Try it with the given examples and you can figure out how to get the desired output. Hope this helps.

written 2.8 years ago by govardhanks
krassowski.michal wrote, 19 months ago:

I was pointed to this answer some time ago, so, inspired by govardhanks' answer, I used UCSC hgTables to download gene summaries, indexed by RefSeq mRNA. There is a special table with summaries: hgFixed.refSeqSummary.

The gbff files (for Homo sapiens) parsed with the script from weslfield's comment gave 6,661 unique summaries. The UCSC table returned 26,140 unique summaries, although these included mouse genes too (and possibly others)*. After mapping the summaries to the subset of human mRNAs I am currently working with, I got 12,574 unique summaries, which doubles the gbff parsing coverage.

Also, UCSC returns data for QPCT, REV1 and NEB (not sure about NICK10, I couldn't find such a gene), which Dave Curtis mentioned as missing from the gbff files.

Feel free to use my gist for UCSC table retrieval: ucsc_download.sh. Here is how to use it:

source ucsc_download.sh
get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip
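Under the hood, hgTables is driven by `hgta_*` form fields. A sketch of the query such a helper presumably submits (the field names here are an assumption inferred from hgTables' form, and this only builds and prints the string; a real run would POST it to http://genome.ucsc.edu/cgi-bin/hgTables):

```shell
# Assemble an hgTables query for the hgFixed.refSeqSummary table
TAB_QUERY="db=hg19&hgta_group=genes&hgta_track=refGene\
&hgta_table=hgFixed.refSeqSummary&hgta_regionType=genome\
&hgta_outputType=primaryTable&hgta_compressType=gzip\
&hgta_doTopSubmit=get+output"
echo "$TAB_QUERY"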

Summary: as of 2017, please use the UCSC tables; they are more complete and easier to fetch and parse. They did a really good job at making those tables.

(*) I am not sure why, but I know it is not script-specific - I got the same result when using the web interface; any advice on how to avoid this would be appreciated.

written 19 months ago by krassowski.michal

Hi Michal, thank you for your awesome work. I've tried your script and ran it as you described, but something went wrong. Can you do me a favor?

Below is how I ran it and the output:

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll

total 4.0K -rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % source ucsc_download.sh

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip

--2017-08-21 20:54:51-- http://genome.ucsc.edu/cgi-bin/hgTables Resolving genome.ucsc.edu... 128.114.119.134, 128.114.119.135, 128.114.119.133, ... Connecting to genome.ucsc.edu|128.114.119.134|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: 'STDOUT'

  • [ <=> ] 40.56K 109KB/s in 0.4s

2017-08-21 20:54:52 (109 KB/s) - written to stdout [41537]

hgsid=604153833_Jsj2HlaA2zgpZxEPKxAxKQ0XgqWL&jsh_pageVerPos=0&posiion:chr21=33031597-33041570&clade=mammal&org=Human&db=hg19&hga_group=genes&hga_rack=refGene&hga_able=hgFixed.refSeqSummary&hga_regionType=genome&hga_oupuType=primaryTable&boolshad.sendToGalaxy=0&boolshad.sendToGrea=0&boolshad.sendToGenomeSpace=0&hga_ouFileName=oupu&hga_compressType=gzip&hga_doTopSubmi=ge+oupu --2017-08-21 20:54:52-- http://genome.ucsc.edu/cgi-bin/hgTables Resolving genome.ucsc.edu... 128.114.119.135, 128.114.119.133, 128.114.119.136, ... Connecting to genome.ucsc.edu|128.114.119.135|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: 'summary.tsv.gz'

summary.tsv.gz [ <=> ] 41.50K 85.1KB/s in 0.5s

2017-08-21 20:54:54 (85.1 KB/s) - 'summary.tsv.gz' saved [42497]

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll

total 48K

-rw-r--r-- 1 gongjing staff 42K Aug 21 20:54 summary.tsv.gz

-rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

modified 18 months ago • written 18 months ago by gongjing.rss

The output looks good. Where is the problem? Are you able to open summary.tsv.gz?

written 18 months ago by krassowski.michal

Hi, Michal,

I cannot unzip the file normally, and the content seems to be in HTML format. Also, the file size seems too small, so I am not sure I got the result correctly.

Here is the file information:

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll

total 48K -rw-r--r-- 1 gongjing staff 42K Aug 21 20:54 summary.tsv.gz

-rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % gunzip summary.tsv.gz

gunzip: summary.tsv.gz: not in gzip format

1 gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % head summary.tsv.gz
<html> <head>

<meta http-equiv="Content-Security-Policy" content="default-src *; script-src 'self' 'unsafe-inline' 'nonce-BccRJqnamHSWfPUAF1g3aUeqFp0u' code.jquery.com www.google-analytics.com www.samsarin.com/project/dagre-d3/latest/dagre-d3.js cdnjs.cloudflare.com/ajax/libs/d3/3.4.4/d3.min.js cdnjs.cloudflare.com/ajax/libs/jquery/1.12.1/jquery.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.2.1/jstree.min.js cdnjs.cloudflare.com/ajax/libs/bowser/1.6.1/bowser.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.3.4/jstree.min.js login.persona.org/include.js ajax.googleapis.com/ajax maxcdn.bootstrapcdn.com/bootstrap d3js.org/d3.v3.min.js cdn.datatables.net; style-src * 'unsafe-inline'; font-src * data:; img-src * data:;">

<meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1"> <meta http-equiv="Content-Script-Type" content="text/javascript"> <meta http-equiv="Pragma" content="no-cache"> <meta http-equiv="Expires" content="-1">

modified 18 months ago • written 18 months ago by gongjing.rss

That is strange, indeed. For me it works and retrieves a gzipped summary of about 5.4 MB. I tested it just now on Ubuntu 17.04 with GNU Wget 1.18 and GNU sed 4.4.

Looking at your logs I found "oupu" instead of "output" and "posiion" instead of "position". I guess either your shell or your version of sed ignores the '\t' tab substitution and instead cuts out some 't' letters.

Wild guess: https://stackoverflow.com/questions/2610115/sed-not-recognizing-t-instead-it-is-treating-it-as-t-why There are many solutions, depending on your environment (OS, sed version). Please try some and let me know if it helped. I would start with a literal tab or double escaping '\\t'. Remember to source ucsc_download.sh again afterwards!
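To illustrate the issue: GNU sed understands '\t' in the replacement text, while BSD/macOS sed treats the backslash escape as a literal 't' (which produces exactly the mangled "oupu"-style strings above). Injecting a real tab character via printf works on both:

```shell
# Portable tab substitution: use a real tab instead of the '\t' escape
TAB="$(printf '\t')"
echo "col1 col2" | sed "s/ /${TAB}/"
```

This prints `col1` and `col2` separated by an actual tab on either sed flavor.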

Here are my logs:

$ get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip
--2017-08-24 20:00:10--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu genome.ucsc.edu)... 128.114.119.136, 128.114.119.135, 128.114.119.133, ...
Connecting to genome.ucsc.edu genome.ucsc.edu)|128.114.119.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

-                                                        [  <=>                                                                                                                 ]  40,56K   110KB/s    in 0,4s    

2017-08-24 20:00:11 (110 KB/s) - written to stdout [41537]

hgsid=604746209_UcaSA7yJymRhoxJhcNr5WhGvCavS&jsh_pageVertPos=0&position:chr21=33031597-33041570&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=hgFixed.refSeqSummary&hgta_regionType=genome&hgta_outputType=primaryTable&boolshad.sendToGalaxy=0&boolshad.sendToGreat=0&boolshad.sendToGenomeSpace=0&hgta_outFileName=output&hgta_compressType=gzip&hgta_doTopSubmit=get+output
--2017-08-24 20:00:11--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu genome.ucsc.edu)... 128.114.119.136, 128.114.119.135, 128.114.119.133, ...
Connecting to genome.ucsc.edu genome.ucsc.edu)|128.114.119.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘summary.tsv.gz’

summary.tsv.gz                                           [                              <=>                                                                                     ]   5,37M  1,14MB/s    in 6,2s    

2017-08-24 20:00:18 (882 KB/s) - ‘summary.tsv.gz’ saved [5635263]
modified 18 months ago • written 18 months ago by krassowski.michal
Dave Curtis wrote, 6.9 years ago:

This seems to work fine, but many genes seem to be missing. This applies to many of the list which can be downloaded from the Human Genome Browser at http://genome.ucsc.edu/cgi-bin/hgGateway. Off the top of my head, examples are QPCT, REV1, NICK1 and NEB. I don't see anything particularly wrong with these genes, but they seem to be absent from the RefSeqGene files. Can anybody explain why this is, or better still, point to a more comprehensive source for this information? Thanks.

written 6.9 years ago by Dave Curtis

We at SolveBio have also run into this problem while parsing these flat files. There are some odd exceptions and omissions: sometimes records appear with no summary, there are duplicate records, and there are even instances where a gene is simply absent. Here is an expansion of the code written by David Quigley that tries to account for some of this.

import sys
import gzip

def run(filepath):
    with gzip.open(filepath, 'rt') as f:
        locus2comment = {}
        in_comment = False
        first_time_symbol = False
        first_time_entrez = False
        real_gene = False
        for line in f:
            if line[0:5] == "LOCUS":
                real_gene = False
                first_time_symbol = True
                first_time_entrez = True
                locus = line.split()[1]
                comment = ""
            elif line[0:7] == "COMMENT":
                in_comment = True
                comment += line.split("    ")[1].replace("\n", " ")
            elif line[0:7] == "PRIMARY":
                in_comment = False
                try:
                    locus2comment[locus] = comment.split("Summary:")[1]
                except IndexError:
                    # no "Summary:" paragraph; mark the record for removal
                    locus2comment[locus] = "Remove Me"
            elif line[0:9] == '     gene' and 'complement' not in line:
                real_gene = True
            elif line[0:27] == '                     /gene=':
                if first_time_symbol and real_gene:
                    locus2comment[locus] += \
                        '\t' + line.strip().split('/gene="')[1][:-1]
                    first_time_symbol = False
            elif '/db_xref="GeneID:' in line:
                if first_time_entrez and real_gene:
                    locus2comment[locus] += \
                        '\t' + line.strip().split('GeneID:')[1][:-1]
                    first_time_entrez = False
            elif in_comment:
                comment += line.split("            ")[1].replace("\n", " ")

    # drop records that never had a "Summary:" paragraph
    for locus in sorted(locus2comment):
        if "Remove Me" in locus2comment[locus]:
            del locus2comment[locus]

    with gzip.open(filepath[:-8] + '.tsv.gz', 'wt') as outfile:
        for locus in sorted(locus2comment):
            outfile.write(locus + '\t' + locus2comment[locus] + '\n')

if __name__ == '__main__':
    run(sys.argv[1])

SolveBio parses and versions this dataset, along with many other datasets popular in bioinformatics, with a full API for easy access. Check it out; you may find it saves you a lot of time. https://www.solvebio.com/library/RefSeqGene

written 3.1 years ago by weslfield
J_G wrote, 6.9 years ago:

I agree with Dave: in the three refseqgene.genomic.gbff files there are only about 4,700 genes, a handful without an Entrez ID. Entrez itself and GeneCards show a RefSeq summary for all the missing genes I tried, so something is wrong.

written 6.9 years ago by J_G
kukumayas wrote, 5.7 years ago:

Go to ftp://ftp.ncbi.nlm.nih.gov/refseq/release for the whole release. Good luck!

written 5.7 years ago by kukumayas
aliz0611 wrote, 5.4 years ago:

I tried the above link and downloaded the entire vertebrate-mammalian refseq data set from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_mammalian/, which represents data from 500+ species. However, it appears that the same human genes are missing as from the three refseqgene.genomic.gbff files. Has anyone found out where exactly the more complete dataset is?

written 5.4 years ago by aliz0611
hsiaoyi0504 (Taiwan) wrote, 9 weeks ago:

I would like to provide another solution: you can process the raw data yourself. Take a look at my repo: https://github.com/hsiaoyi0504/gene_dictionary.

written 9 weeks ago by hsiaoyi0504
Powered by Biostar version 2.3.0