Looking for clean list of gene names for UCSC's HG19
5.9 years ago
kynnjo ▴ 40

I'm trying to make sense of some data and analysis I've inherited.  It's a big mess.  The results include some gene names that are clearly corrupted ("Excel genes", etc.).

Therefore, for starters, I want to determine a "clean" list of the "gene universe" that was used for the analysis.  All I have been able to ascertain is that the reference gene data used came from UCSC's HG19 build.

By "clean" I mean that the list I'm looking for should

• be complete
• contain no synonyms
• consist of identifiers belonging to one and only one system of nomenclature

Of course, the ideal would be a simple file, published by UCSC, consisting of one gene name per line, no duplicates, but after much searching I have not been able to find such a file.

Does such a file exist?  If not, what would be the closest thing I could find?

5.9 years ago
Ram 34k

You can use the method suggested by Gian, or query UCSC's public MySQL server.

The query, run against the hg19 database:

SELECT DISTINCT geneSymbol FROM kgXref;


or, directly from the command line:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'SELECT DISTINCT geneSymbol FROM kgXref'

5.9 years ago
Lemire ▴ 900

Download the refFlat table for hg19:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz


then extract the unique gene names from its first column:

zcat refFlat.txt.gz | cut -f 1 | sort -u
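
Once you have that list, one way to spot the corrupted names in the inherited results is to compare them against it. Below is a minimal sketch, not a tested pipeline: the function name missing_from_universe and the results file (one gene name per line) are my own assumptions, and it presumes the analysis really did use refFlat gene names.

```shell
# Hypothetical helper: print the names in a results list that do not occur
# in the refFlat gene universe.
#   $1  gzipped refFlat file (e.g. the refFlat.txt.gz downloaded above)
#   $2  plain-text file, one gene name per line (assumed format of the results)
missing_from_universe() {
    _universe_tmp=$(mktemp) || return 1
    # Column 1 of refFlat is the gene name; deduplicate and sort it.
    gzip -dc "$1" | cut -f 1 | LC_ALL=C sort -u > "$_universe_tmp"
    # comm needs both inputs sorted the same way, hence LC_ALL=C everywhere.
    # -13 hides lines unique to the universe and lines common to both,
    # leaving only names that appear in the results but not in the universe.
    LC_ALL=C sort -u "$2" | LC_ALL=C comm -13 "$_universe_tmp" -
    rm -f "$_universe_tmp"
}
```

Calling it as missing_from_universe refFlat.txt.gz results.txt (results.txt being a placeholder name) should list the names that need attention, which is where date-like "Excel genes" would be expected to show up. Names that come from a different nomenclature will be listed too, so an entry in the output is not proof of corruption.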