More than 60000 records with unique ENSG tag in ensGene.txt.gz downloaded from UCSC
8.7 years ago
Emma ▴ 10

Hi all,

I downloaded ensGene.txt.gz from UCSC and found 60234 records with a unique ENSG ID. Does this mean there are more than 60000 genes? However, there were only ~20000 unique gene names in refGene.txt.gz, which I also downloaded from UCSC. Where does this discrepancy come from?

Thank you!

Emma

Ensembl gene genome • 3.0k views

In UCSC I found 54,210 RefSeq genes for hg19. Where did you download your refGene.txt.gz file?

8.7 years ago

Those are Ensembl genes that are not part of refGene. Let's look at some records (Ensembl genes with no overlap with refGene):

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu hg19  -A
mysql> select distinct G.name2,R.name,R.name2 from ensGene as G left join refGene as R on R.chrom=G.chrom and NOT(R.txStart>=G.txEnd OR R.txEnd<G.txStart) where R.name is NULL limit 10;
+-----------------+------+-------+
| name2           | name | name2 |
+-----------------+------+-------+
| ENSG00000268020 | NULL | NULL  |
| ENSG00000240361 | NULL | NULL  |
| ENSG00000238009 | NULL | NULL  |
| ENSG00000239945 | NULL | NULL  |
| ENSG00000241860 | NULL | NULL  |
| ENSG00000222623 | NULL | NULL  |
| ENSG00000241599 | NULL | NULL  |
| ENSG00000228463 | NULL | NULL  |
| ENSG00000241670 | NULL | NULL  |
| ENSG00000237094 | NULL | NULL  |
+-----------------+------+-------+
10 rows in set (0.80 sec)
ENSG00000268020 : Havana Gene
ENSG00000240361 : Havana Gene
ENSG00000238009 : Havana Gene
....
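
To make the join condition in the query above easier to follow, here is a minimal Python sketch of the same overlap test on toy data. The `Tx` tuple, the function names, and the coordinates are hypothetical and only illustrate the logic; they are not taken from the UCSC tables.

```python
from collections import namedtuple

# Hypothetical record type: gene ID plus the chrom/txStart/txEnd columns used in the query.
Tx = namedtuple("Tx", "gene chrom tx_start tx_end")


def overlaps(g, r):
    """Mirror the SQL join condition: same chromosome, and NOT
    (r starts at or after g ends, OR r ends before g starts)."""
    return r.chrom == g.chrom and not (r.tx_start >= g.tx_end or r.tx_end < g.tx_start)


def genes_without_overlap(ens_rows, ref_rows):
    """Return sorted gene IDs that have at least one ens row with no overlapping
    ref row, like the LEFT JOIN ... WHERE R.name IS NULL with SELECT DISTINCT."""
    found = set()
    for g in ens_rows:
        if not any(overlaps(g, r) for r in ref_rows):
            found.add(g.gene)
    return sorted(found)


# Toy data (made up for illustration only).
demo_ens = [
    Tx("ENSG_A", "chr1", 100, 200),
    Tx("ENSG_B", "chr1", 500, 600),
    Tx("ENSG_C", "chr2", 100, 200),
]
demo_ref = [
    Tx("REF1", "chr1", 150, 300),
    Tx("REF2", "chr2", 300, 400),
]

print(genes_without_overlap(demo_ens, demo_ref))  # prints ['ENSG_B', 'ENSG_C']
```

Note that each ensGene row is a transcript, so a gene ID is reported as soon as one of its transcripts has no overlapping refGene record.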
8.7 years ago
Emily 23k

There's a breakdown of gene types here: >20,000 coding genes, >25,000 non-coding genes and >14,000 pseudogenes. That comes out at <60,000, but that's for the current release; I'm not sure which release you downloaded.


I am sorry but I do not know where to find out the release version. But here is the link:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ensGene.txt.gz

There is a date "06-Apr-2014" with it.

I counted the unique ENSG IDs with the following command line:

less ensGene.txt.gz | awk '{print $13}' | sort | uniq |wc -l
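
For anyone who prefers a script, the same count could be sketched in Python roughly as follows. This assumes the file is tab-separated with the gene ID in the 13th column, as in the awk command above; the function name `count_unique` is hypothetical. Because each row of ensGene is a transcript, counting the second column (assumed here to hold the transcript ID) gives a different, larger number.

```python
import gzip


def count_unique(path, column_index=12):
    """Count distinct values in one tab-separated column of a gzipped file.
    column_index is 0-based, so 12 corresponds to awk's $13."""
    seen = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Skip short or blank lines instead of failing on them.
            if len(fields) > column_index:
                seen.add(fields[column_index])
    return len(seen)


if __name__ == "__main__":
    import sys

    # Usage: python count_ids.py ensGene.txt.gz
    if len(sys.argv) > 1:
        print("unique gene IDs:", count_unique(sys.argv[1], 12))
        print("unique transcript IDs:", count_unique(sys.argv[1], 1))
```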

Hope this information helps.

Thank you!
