Question: Bulk Download Of Ncbi Gene "Summary" Field
David Quigley (San Francisco) wrote, 8.5 years ago:

I would like to download or manufacture a mapping of entrez gene IDs to the text that appears in the "Summary" field on an Entrez Gene query for the H. sapiens record for that gene. These short paragraphs are often useful for getting a first idea about what an unfamiliar gene does. The obvious approach of scouring ftp.ncbi.nih.gov/gene/ and ftp://ftp.ncbi.nih.gov/refseq/ for the appropriate record (e.g. gene_info.gz) didn't turn anything up. Thanks for any suggestions.

gene ncbi • 9.2k views
modified 9 weeks ago by hsiaoyi0504 • written 8.5 years ago by David Quigley
David Quigley (San Francisco) wrote, 8.5 years ago:

Thanks for the link, Pierre. This 2.6 GB file is very verbose and structured for human consumption rather than ease of retrieval. I wrote a quick-and-dirty Python parser to pull out the summaries and am posting it so someone else doesn't have to do it too. Note that the accessions are not Entrez gene IDs; you have to map those separately.

f = open('refseqgene.genomic.gbff')
locus2comment = {}
in_comment = False
for line in f:
    if line[0:5] == "LOCUS":
        locus = line.split()[1]  # RefSeqGene accession, e.g. NG_...
        comment = ""
    elif line[0:7] == "COMMENT":
        in_comment = True
        comment += line.split("    ")[1].replace("\n", " ")
    elif line[0:7] == "PRIMARY":
        in_comment = False
        try:
            # keep only the text after the "Summary:" marker, if present
            locus2comment[locus] = comment.split("Summary:")[1]
        except IndexError:
            locus2comment[locus] = comment
    elif in_comment:
        # continuation lines of the COMMENT block are indented 12 spaces
        comment += line.split("            ")[1].replace("\n", " ")
for locus in sorted(locus2comment):
    print(locus + '\t' + locus2comment[locus])
written 8.5 years ago by David Quigley

Thanks David. For future reference, mapping NG_ IDs can be done using this file: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/gene_RefSeqGene

written 7.8 years ago by sa9800
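The mapping file mentioned above is tab-delimited with a commented header line. A minimal parsing sketch (the column names `GeneID` and `RSG` below are an assumption based on the RefSeqGene mapping-file convention, so check the header line of your copy; the CFTR values in the usage example are purely illustrative):

```python
def load_ng_to_geneid(path):
    """Map NG_ accessions to Entrez GeneIDs from a gene_RefSeqGene-style file."""
    mapping = {}
    with open(path) as f:
        # header looks like: #tax_id<TAB>GeneID<TAB>Symbol<TAB>RSG<TAB>...
        header = f.readline().lstrip('#').rstrip('\n').split('\t')
        gene_col = header.index('GeneID')  # assumed column name
        rsg_col = header.index('RSG')      # assumed column name
        for line in f:
            fields = line.rstrip('\n').split('\t')
            mapping[fields[rsg_col]] = fields[gene_col]
    return mapping
```

Usage, on a file with a row like `9606  1080  CFTR  NG_016465.4`: `load_ng_to_geneid('gene_RefSeqGene')['NG_016465.4']` would return `'1080'`.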

Thanks David. In case anyone needs it, mapping NG_ IDs can be done using this file: ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/LRG_RefSeqGene

written 7.8 years ago by sa9800
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 8.5 years ago:

This info is in ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene.genomic.gbff.gz
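For orientation, the Summary text in that file lives inside the GenBank COMMENT block of each record, which runs until the PRIMARY section. A sketch with a made-up record (the sample text below is invented for illustration, not real NCBI content):

```shell
# Build a tiny fake GenBank flat-file record to show where the Summary sits
cat > sample.gbff <<'EOF'
LOCUS       NG_000001            1000 bp    DNA     linear   PRI
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.
            Summary: This gene encodes an example protein used here
            purely for illustration.
PRIMARY     REFSEQ_SPAN         PRIMARY_IDENTIFIER
EOF
# Print the COMMENT block (up to and including the PRIMARY line)
sed -n '/^COMMENT/,/^PRIMARY/p' sample.gbff
```

This is the structure the parsers further down this thread rely on: `LOCUS` starts a record, `COMMENT` opens the summary text, and `PRIMARY` terminates it.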

written 8.5 years ago by Pierre Lindenbaum

It looks like this file has been split into two now, so the URLs are:

ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene1.genomic.gbff.gz
ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene2.genomic.gbff.gz

written 8.0 years ago by Richard Smith

UPDATE: as of 30th Apr 2011, there are three files at ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/

written 7.8 years ago by sa9800

As of 2017 you would be better off using: wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/refseqgene.*.genomic.gbff.gz ;)

written 19 months ago by krassowski.michal
alirezahkb (United States) wrote, 4.0 years ago:

The best way is to use NCBI's E-utilities: the esummary endpoint returns the summary for a given gene ID.

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=604,1&retmode=json

Here is a Python script that I wrote to get summaries for the gene IDs specified in an input file.

import sys
import json
import urllib.request

import pandas as pd

if __name__ == '__main__':
    gene_info_file = sys.argv[1]
    output_file = sys.argv[2]
    open(output_file, 'w').close()  # truncate any previous output
    gene_ids = pd.unique(pd.read_csv(gene_info_file)['1'])
    chunk_size = 100  # esummary accepts many comma-separated IDs per request
    cn = len(gene_ids) // chunk_size + 1
    for i in range(cn):
        chunk_genes = gene_ids[chunk_size * i:min(chunk_size * (i + 1), len(gene_ids))]
        gids = ','.join(str(s) for s in chunk_genes)
        print(i + 1, '/', cn)
        url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'
               '?db=gene&id=' + gids + '&retmode=json')
        print(url)
        data = json.load(urllib.request.urlopen(url))
        result = []
        for g in chunk_genes:
            result.append([g, data['result'][str(g)]['summary']
                           if str(g) in data['result'] else ''])
        pd.DataFrame(result, columns=['gene_id', 'summary']).to_csv(
            output_file, index=False, mode='a', header=(i == 0))

Usage:

python eutilsGetSummary.py gene_ids.csv gene_summary.csv

gene_ids.csv is a CSV file with the first column holding the gene IDs you want to get the summaries for.

gene_summary.csv is the output. 
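For a quick one-off lookup without the CSV machinery, a minimal sketch of the same esummary call (the helper names `esummary_url` and `fetch_summaries` are my own, not part of any library; fetching requires network access):

```python
import json
import urllib.request

EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'

def esummary_url(gene_ids):
    """Build an esummary request URL for a batch of Entrez gene IDs."""
    ids = ','.join(str(g) for g in gene_ids)
    return EUTILS + '?db=gene&id=' + ids + '&retmode=json'

def fetch_summaries(gene_ids):
    """Return {gene_id: summary} for the given IDs; '' when absent."""
    data = json.load(urllib.request.urlopen(esummary_url(gene_ids)))
    return {g: data['result'].get(str(g), {}).get('summary', '')
            for g in gene_ids}
```

For example, `fetch_summaries([604, 1])` would return the summaries for BCL6 and A1BG.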

modified 4.0 years ago • written 4.0 years ago by alirezahkb
Khader Shameer (Manhattan, NY) wrote, 8.5 years ago:

You may also take a look at GeneRIF from NCBI. You can download GeneRIFs from the NCBI FTP site.

modified 8.5 years ago • written 8.5 years ago by Khader Shameer
govardhanks wrote, 2.8 years ago:

(Not an option for bulk downloads, though.) I would like to suggest the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables): select RefSeqGene under "group", input your genes' NM_ IDs or standard HGNC symbols, and set the output format to "selected fields". Try it with the given examples and you can figure out how to get the desired output. Hope this helps.

written 2.8 years ago by govardhanks
krassowski.michal wrote, 19 months ago:

I was pointed to this answer some time ago, so, inspired by govardhanks' answer, I used UCSC hgTables to download gene summaries, indexed by RefSeq mRNA. There is a special table with summaries: hgFixed.refSeqSummary.

The gbff files (for Homo sapiens) parsed with the script from weslfield's comment gave 6,661 unique summaries. The UCSC table returned 26,140 unique summaries, although these included mouse genes too (and possibly others)*. After mapping the summaries to the subset of human mRNAs I am currently working with, I got 12,574 unique summaries, which doubles the gbff parsing coverage.

Also, UCSC returns data for QPCT, REV1 and NEB (not sure about NICK10, I couldn't find such a gene), which Dave Curtis mentioned as missing from the gbff files.

Feel free to use my gist for UCSC table retrieval: ucsc_download.sh. Here is how to use it:

source ucsc_download.sh
get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip
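Under the hood, hgTables is driven by `hgta_*` form fields. A sketch of the query such a helper presumably submits (the field names here are an assumption inferred from hgTables' form, and this only builds and prints the string; a real run would POST it to http://genome.ucsc.edu/cgi-bin/hgTables):

```shell
# Assemble an hgTables query for the hgFixed.refSeqSummary table
TAB_QUERY="db=hg19&hgta_group=genes&hgta_track=refGene\
&hgta_table=hgFixed.refSeqSummary&hgta_regionType=genome\
&hgta_outputType=primaryTable&hgta_compressType=gzip\
&hgta_doTopSubmit=get+output"
echo "$TAB_QUERY"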

Summary: as of 2017, please use the UCSC tables; they are more complete and easier to fetch and parse. They did a really good job at making those tables.

(*) I am not sure why, but I know it is not script-specific - I got the same result when using the web interface; any advice on how to avoid this would be appreciated.

written 19 months ago by krassowski.michal

Hi Michal, thank you for your awesome work. I've tried your script and ran it as you described, but something went wrong. Can you do me a favor?

Below is how I ran it and the output:

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll

total 4.0K -rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % source ucsc_download.sh

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip

--2017-08-21 20:54:51-- http://genome.ucsc.edu/cgi-bin/hgTables Resolving genome.ucsc.edu... 128.114.119.134, 128.114.119.135, 128.114.119.133, ... Connecting to genome.ucsc.edu|128.114.119.134|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: 'STDOUT'

  • [ <=> ] 40.56K 109KB/s in 0.4s

2017-08-21 20:54:52 (109 KB/s) - written to stdout [41537]

hgsid=604153833_Jsj2HlaA2zgpZxEPKxAxKQ0XgqWL&jsh_pageVerPos=0&posiion:chr21=33031597-33041570&clade=mammal&org=Human&db=hg19&hga_group=genes&hga_rack=refGene&hga_able=hgFixed.refSeqSummary&hga_regionType=genome&hga_oupuType=primaryTable&boolshad.sendToGalaxy=0&boolshad.sendToGrea=0&boolshad.sendToGenomeSpace=0&hga_ouFileName=oupu&hga_compressType=gzip&hga_doTopSubmi=ge+oupu --2017-08-21 20:54:52-- http://genome.ucsc.edu/cgi-bin/hgTables Resolving genome.ucsc.edu... 128.114.119.135, 128.114.119.133, 128.114.119.136, ... Connecting to genome.ucsc.edu|128.114.119.135|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: 'summary.tsv.gz'

summary.tsv.gz [ <=> ] 41.50K 85.1KB/s in 0.5s

2017-08-21 20:54:54 (85.1 KB/s) - 'summary.tsv.gz' saved [42497]

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll

total 48K

-rw-r--r-- 1 gongjing staff 42K Aug 21 20:54 summary.tsv.gz

-rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

modified 18 months ago • written 18 months ago by gongjing.rss

The output looks good. Where is the problem? Are you able to open summary.tsv.gz?

written 18 months ago by krassowski.michal

Hi, Michal,

I cannot unzip the file normally, and the content seems to be in HTML format. Also, the file size seems too small, so I am not sure I got the result correctly.

Here is the file information:

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % ll

total 48K -rw-r--r-- 1 gongjing staff 42K Aug 21 20:54 summary.tsv.gz

-rwxr--r-- 1 gongjing staff 2.0K Aug 21 16:02 ucsc_download.sh

gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % gunzip summary.tsv.gz

gunzip: summary.tsv.gz: not in gzip format

1 gongjing@hekekedeiMac ..tion-corrrelation/results/biomart/ucsc % head summary.tsv.gz
<html> <head>

<meta http-equiv="Content-Security-Policy" content="default-src *; script-src 'self' 'unsafe-inline' 'nonce-BccRJqnamHSWfPUAF1g3aUeqFp0u' code.jquery.com www.google-analytics.com www.samsarin.com/project/dagre-d3/latest/dagre-d3.js cdnjs.cloudflare.com/ajax/libs/d3/3.4.4/d3.min.js cdnjs.cloudflare.com/ajax/libs/jquery/1.12.1/jquery.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.2.1/jstree.min.js cdnjs.cloudflare.com/ajax/libs/bowser/1.6.1/bowser.min.js cdnjs.cloudflare.com/ajax/libs/jstree/3.3.4/jstree.min.js login.persona.org/include.js ajax.googleapis.com/ajax maxcdn.bootstrapcdn.com/bootstrap d3js.org/d3.v3.min.js cdn.datatables.net; style-src * 'unsafe-inline'; font-src * data:; img-src * data:;">

<meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1"> <meta http-equiv="Content-Script-Type" content="text/javascript"> <meta http-equiv="Pragma" content="no-cache"> <meta http-equiv="Expires" content="-1">

modified 18 months ago • written 18 months ago by gongjing.rss

That is strange, indeed. For me it works and retrieves a gzipped summary of about 5.4 MB. I tested it just now on Ubuntu 17.04 with GNU Wget 1.18 and GNU sed 4.4.

Looking at your logs I found "oupu" instead of "output" and "posiion" instead of "position". I guess either your shell or your version of sed ignores the '\t' tab substitution and instead cuts out some 't' letters.

Wild guess: https://stackoverflow.com/questions/2610115/sed-not-recognizing-t-instead-it-is-treating-it-as-t-why There are many solutions, depending on your environment (OS, sed version). Please try some and let me know if it helped. I would start with a literal tab or double escaping '\\t'. Remember to source ucsc_download.sh again afterwards!
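To illustrate the issue: GNU sed understands '\t' in the replacement text, while BSD/macOS sed treats the backslash escape as a literal 't' (which produces exactly the mangled "oupu"-style strings above). Injecting a real tab character via printf works on both:

```shell
# Portable tab substitution: use a real tab instead of the '\t' escape
TAB="$(printf '\t')"
echo "col1 col2" | sed "s/ /${TAB}/"
```

This prints `col1` and `col2` separated by an actual tab on either sed flavor.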

Here are my logs:

$ get_whole_genome_table summary.tsv.gz genes refGene hgFixed.refSeqSummary gzip
--2017-08-24 20:00:10--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu genome.ucsc.edu)... 128.114.119.136, 128.114.119.135, 128.114.119.133, ...
Connecting to genome.ucsc.edu genome.ucsc.edu)|128.114.119.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘STDOUT’

-                                                        [  <=>                                                                                                                 ]  40,56K   110KB/s    in 0,4s    

2017-08-24 20:00:11 (110 KB/s) - written to stdout [41537]

hgsid=604746209_UcaSA7yJymRhoxJhcNr5WhGvCavS&jsh_pageVertPos=0&position:chr21=33031597-33041570&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=refGene&hgta_table=hgFixed.refSeqSummary&hgta_regionType=genome&hgta_outputType=primaryTable&boolshad.sendToGalaxy=0&boolshad.sendToGreat=0&boolshad.sendToGenomeSpace=0&hgta_outFileName=output&hgta_compressType=gzip&hgta_doTopSubmit=get+output
--2017-08-24 20:00:11--  http://genome.ucsc.edu/cgi-bin/hgTables
Resolving genome.ucsc.edu genome.ucsc.edu)... 128.114.119.136, 128.114.119.135, 128.114.119.133, ...
Connecting to genome.ucsc.edu genome.ucsc.edu)|128.114.119.136|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/x-gzip]
Saving to: ‘summary.tsv.gz’

summary.tsv.gz                                           [                              <=>                                                                                     ]   5,37M  1,14MB/s    in 6,2s    

2017-08-24 20:00:18 (882 KB/s) - ‘summary.tsv.gz’ saved [5635263]
modified 18 months ago • written 18 months ago by krassowski.michal
Dave Curtis wrote, 6.9 years ago:

This seems to work fine, but many genes seem to be missing. This applies to many of the list which can be downloaded from the Human Genome Browser at http://genome.ucsc.edu/cgi-bin/hgGateway. Off the top of my head, examples are QPCT, REV1, NICK1 and NEB. I don't see anything particularly wrong with these genes, but they seem to be absent from the RefSeqGene files. Can anybody explain why this is, or better still, point to a more comprehensive source for this information? Thanks.

written 6.9 years ago by Dave Curtis

We at SolveBio have also run into this problem while parsing these flat files. There are some odd exceptions and omissions: sometimes records appear with no summary, there are duplicate records, and there are even instances where a gene is simply absent. Here is an expansion of the code written by David Quigley that tries to account for some of this.

import sys
import gzip

def run(filepath):
    with gzip.open(filepath, 'rt') as f:
        locus2comment = {}
        in_comment = False
        first_time_symbol = False
        first_time_entrez = False
        real_gene = False
        for line in f:
            if line[0:5] == "LOCUS":
                real_gene = False
                first_time_symbol = True
                first_time_entrez = True
                locus = line.split()[1]
                comment = ""
            elif line[0:7] == "COMMENT":
                in_comment = True
                comment += line.split("    ")[1].replace("\n", " ")
            elif line[0:7] == "PRIMARY":
                in_comment = False
                try:
                    locus2comment[locus] = comment.split("Summary:")[1]
                except IndexError:
                    # no "Summary:" paragraph; mark the record for removal
                    locus2comment[locus] = "Remove Me"
            elif line[0:9] == '     gene' and 'complement' not in line:
                real_gene = True
            elif line[0:27] == '                     /gene=':
                if first_time_symbol and real_gene:
                    locus2comment[locus] += \
                        '\t' + line.strip().split('/gene="')[1][:-1]
                    first_time_symbol = False
            elif '/db_xref="GeneID:' in line:
                if first_time_entrez and real_gene:
                    locus2comment[locus] += \
                        '\t' + line.strip().split('GeneID:')[1][:-1]
                    first_time_entrez = False
            elif in_comment:
                comment += line.split("            ")[1].replace("\n", " ")

    # drop records that never had a "Summary:" paragraph
    for locus in sorted(locus2comment):
        if "Remove Me" in locus2comment[locus]:
            del locus2comment[locus]

    with gzip.open(filepath[:-8] + '.tsv.gz', 'wt') as outfile:
        for locus in sorted(locus2comment):
            outfile.write(locus + '\t' + locus2comment[locus] + '\n')

if __name__ == '__main__':
    run(sys.argv[1])

SolveBio parses and versions this dataset, along with many other datasets popular in bioinformatics, with a full API for easy access. Check it out; you may find it saves you a lot of time. https://www.solvebio.com/library/RefSeqGene

written 3.1 years ago by weslfield
J_G wrote, 6.9 years ago:

I agree with Dave: in the three refseqgene.genomic.gbff files there are only about 4,700 genes, a handful without an Entrez ID. Entrez itself and GeneCards show a RefSeq summary for all the missing genes I tried, so something is wrong.

written 6.9 years ago by J_G
kukumayas wrote, 5.7 years ago:

Go to ftp://ftp.ncbi.nlm.nih.gov/refseq/release for the whole release. Good luck!

written 5.7 years ago by kukumayas
aliz0611 wrote, 5.4 years ago:

I tried the above link and downloaded the entire vertebrate-mammalian refseq data set from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/vertebrate_mammalian/, which represents data from 500+ species. However, it appears that the same human genes are missing as from the three refseqgene.genomic.gbff files. Has anyone found out where exactly the more complete dataset is?

written 5.4 years ago by aliz0611
hsiaoyi0504 (Taiwan) wrote, 9 weeks ago:

I would like to provide another solution: you can process the raw data yourself. Take a look at my repo: https://github.com/hsiaoyi0504/gene_dictionary.

written 9 weeks ago by hsiaoyi0504
Powered by Biostar version 2.3.0