PubMed gene database
3.8 years ago
eva_u ▴ 10

Hi all,

I needed a list of all human genes (coding and non-coding), so I went to https://ftp.ncbi.nih.gov/gene/DATA/ and downloaded gene2pubmed.gz.
The file provides the following columns:

tax_id: the unique identifier provided by NCBI Taxonomy for the species or strain/isolate

GeneID: unique identifier for a gene

PubMed ID (PMID): unique identifier in PubMed for a citation
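
A minimal sketch for checking what those three columns contain, rather than a definitive recipe. It assumes the file is tab-separated, that header lines start with `#`, and that human records carry tax_id 9606 (the NCBI Taxonomy ID for Homo sapiens). The helper names `summarize` and `summarize_file` and the sample rows are made up for illustration. It counts how many rows belong to one tax_id and how many distinct GeneIDs those rows contain.

```python
import gzip
import io

HUMAN_TAX_ID = "9606"  # NCBI Taxonomy ID for Homo sapiens


def summarize(handle, tax_id=HUMAN_TAX_ID):
    """Return (row_count, distinct_gene_count) for one tax_id.

    `handle` is any iterable of text lines laid out as
    tax_id <TAB> GeneID <TAB> PubMed ID (assumed layout).
    """
    rows = 0
    gene_ids = set()
    for line in handle:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # ignore malformed lines
        if fields[0] == tax_id:
            rows += 1
            gene_ids.add(fields[1])
    return rows, len(gene_ids)


def summarize_file(path, tax_id=HUMAN_TAX_ID):
    """Same as summarize(), reading a gzip-compressed file from disk."""
    with gzip.open(path, "rt") as handle:
        return summarize(handle, tax_id)


# Tiny in-memory sample with made-up values, to show the behaviour.
SAMPLE = (
    "#tax_id\tGeneID\tPubMed_ID\n"
    "9606\t1\t100\n"
    "9606\t1\t101\n"
    "9606\t2\t100\n"
    "10090\t5\t200\n"
)
sample_rows, sample_genes = summarize(io.StringIO(SAMPLE))
print(sample_rows, sample_genes)  # 3 rows, 2 distinct GeneIDs for tax_id 9606
```

On the downloaded file this would be called as `summarize_file("gene2pubmed.gz")`. Comparing the two numbers shows whether a GeneID appears on more than one line and how much of the file belongs to other species.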

Opening the file in Notepad++, I see over 12 million lines, one for each unique GeneID. Now, as far as I've learned, there are just under 40,000 genes in humans (coding + non-coding); Nature.com says "...21,306 protein-coding genes and 21,856 non-coding genes — many more than are included in the two most widely used human-gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes." (https://www.nature.com/articles/d41586-018-05462-w)

So why are there so many records in NCBI's file? Are regulatory elements (promoters, enhancers, etc.) considered "non-coding genes"? I'm so confused...

Thank you, Jane

genome gene • 5.8k views

Probably they also include alternatively spliced genes.


That's a great suggestion, thank you!
