Question: Pubmed gene database
7 months ago, eulianova89 wrote:

Hi all,

I needed a list of all human genes (coding and non-coding), so I went here and downloaded gene2pubmed.gz.
The list provides the following:

tax_id: the unique identifier provided by NCBI Taxonomy for the species or strain/isolate

GeneID: unique identifier for a gene

PubMed ID (PMID): unique identifier in PubMed for a citation

Opening the file in Notepad++, I see over 12 million lines, one for each unique GeneID. Now, as far as I've learned, there are just under 40,000 genes in humans (coding + non-coding); one source says "...21,306 protein-coding genes and 21,856 non-coding genes, many more than are included in the two most widely used human-gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes."

So why are there so many records in NCBI's file? Are regulatory elements (promoters, enhancers, etc.) considered "non-coding genes"? I'm so confused...

Thank you, Jane


Probably they also include alternatively spliced genes.

written 7 months ago by Morris_Chair220

That's a great suggestion, thank you!

written 7 months ago by eulianova89
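
One way to check the line count against the expected gene count is to tally distinct GeneID values for human rows only. Each row pairs a GeneID with a PMID, so a gene cited in many papers would appear on many lines, and the tax_id column suggests the file is not limited to one species. The following is only a minimal sketch under assumptions: that the file is tab-delimited with the three columns listed in the question (tax_id, GeneID, PubMed ID, in that order), that header lines start with "#", and that human rows carry tax_id 9606. The function names and the sample rows are made up for illustration.

```python
import gzip

HUMAN_TAX_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def summarize_gene2pubmed(lines, tax_id=HUMAN_TAX_ID):
    """Return (row_count, distinct_gene_count) for one tax_id.

    `lines` is any iterable of text lines. Assumed layout:
    tab-separated tax_id, GeneID, PubMed ID; '#' marks a header line.
    """
    rows = 0
    gene_ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != tax_id:
            continue  # skip malformed rows and other species
        rows += 1
        gene_ids.add(fields[1])
    return rows, len(gene_ids)


def summarize_file(path, tax_id=HUMAN_TAX_ID):
    """Stream a gzip-compressed file without loading it all into memory."""
    with gzip.open(path, "rt") as handle:
        return summarize_gene2pubmed(handle, tax_id)


# Made-up sample rows, only to show the difference between rows and genes.
sample = [
    "#tax_id\tGeneID\tPubMed_ID\n",
    "9606\t1\t100\n",
    "9606\t1\t101\n",   # same gene, second citation
    "9606\t2\t100\n",
    "10090\t3\t102\n",  # a non-human row (mouse)
]
print(summarize_gene2pubmed(sample))  # (3, 2): three human rows, two genes
```

On the real download, `summarize_file("gene2pubmed.gz")` would give the number of human gene-citation rows and the number of distinct human GeneIDs, which is the figure to compare against the gene counts quoted in the question.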