Question: Pubmed gene database
7 months ago, eulianova89 wrote:

Hi all,

I needed a list of all human genes (coding and non-coding), so I went to https://ftp.ncbi.nih.gov/gene/DATA/ and downloaded gene2pubmed.gz.
The file provides the following fields:

tax_id: the unique identifier provided by NCBI Taxonomy for the species or strain/isolate

GeneID: unique identifier for a gene

PubMed ID (PMID): unique identifier in PubMed for a citation

Opening the file in Notepad++, I see over 12 million lines, one for each unique GeneID. As far as I've learned, there are just under 40,000 genes in humans (coding + non-coding). Nature.com says "...21,306 protein-coding genes and 21,856 non-coding genes — many more than are included in the two most widely used human-gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes." (https://www.nature.com/articles/d41586-018-05462-w)

So why are there so many records in NCBI's file? Are regulatory elements (promoters, enhancers, etc.) considered "non-coding genes"? I'm so confused...
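
One way to look into this is to count rows and distinct GeneIDs for human only. The sketch below is a rough check, not a definitive method. It assumes the file is tab-separated with the three columns listed above (tax_id, GeneID, PubMed ID), that any header line starts with "#", and that human is tax_id 9606. The function names and the toy rows are made up for illustration.

```python
import gzip

HUMAN_TAX_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def summarize_gene2pubmed(lines, tax_id=HUMAN_TAX_ID):
    """Return (row_count, unique_gene_count) for a single tax_id.

    `lines` is any iterable of text lines in the assumed
    tax_id <TAB> GeneID <TAB> PubMed ID layout.
    """
    rows = 0
    genes = set()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not a complete record
        if fields[0] != tax_id:
            continue  # record belongs to another species
        rows += 1
        genes.add(fields[1])
    return rows, len(genes)


def summarize_file(path, tax_id=HUMAN_TAX_ID):
    """Same summary, read from a gzip-compressed file such as gene2pubmed.gz."""
    with gzip.open(path, "rt") as handle:
        return summarize_gene2pubmed(handle, tax_id)


if __name__ == "__main__":
    # Toy rows with made-up IDs, only to show the counting logic.
    sample = [
        "#tax_id\tGeneID\tPubMed_ID\n",
        "9606\t1\t111\n",
        "9606\t1\t222\n",    # same gene, second citation
        "9606\t2\t111\n",
        "10090\t3\t333\n",   # different tax_id, ignored
    ]
    print(summarize_gene2pubmed(sample))
```

On the real download, `summarize_file("gene2pubmed.gz")` would give the two numbers to compare: if the row count is much larger than the number of distinct GeneIDs, the extra lines are repeated genes and other species rather than extra human genes.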

Thank you, Jane

gene genome • 865 views

They probably also include alternatively spliced genes.

written 7 months ago by Morris_Chair

That's a great suggestion, thank you!

written 7 months ago by eulianova89