I have a pipeline in which one of the BED files contains a bunch of RefSeq ID's. I obtained the original BED file from the UCSC Table Browser under the RefSeq table, so I assume that all of the RefSeq ID's will be located somewhere in UCSC's mysql database.
I ran this command:
mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e "select name,name2 from refGene"
However, when I compare the results of this output against what I have in my ID list, it appears that only ID's in the form of 'NR_' or 'NM_' have matches, whereas the ID's with 'XR_' or 'XM_' don't.
I know that there must be some relation between all of the RefSeq ID's and their gene symbols. For instance, XR_001755761 takes me to this NCBI page, which shows me that the corresponding gene symbol is 'LOC101928055'. I've been able to obtain gene symbols for all of the NR_ or NM_ identifiers using the mysql database, but don't know how to convert all the others.
Is there an easy way that I can programatically get all of these? I don't want to use a manual copy-paste service because this needs to be run in a pipeline. Currently, I run the SQL command and save it as a TSV, which I then serialize as a hashmap to let me quickly convert all the different RefSeq ID's to their respective gene symbol. The end goal is to simply count the number of unique genes in the file. What is the easiest way to get this done? I'm open to using R or some other script, as long as I only have to run it once so I can generate a python pickled dict for quick use.