Question: UCSC MySQL query to find gene symbols for all RefSeq IDs?
0
gravatar for edsouza
4 months ago by
edsouza30
edsouza30 wrote:

I have a pipeline in which one of the BED files contains a bunch of RefSeq ID's. I obtained the original BED file from the UCSC Table Browser under the RefSeq table, so I assume that all of the RefSeq ID's will be located somewhere in UCSC's mysql database.

I ran this command:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e "select name,name2 from refGene"

However, when I compare the results of this output against what I have in my ID list, it appears that only ID's in the form of 'NR_' or 'NM_' have matches, whereas the ID's with 'XR_' or 'XM_' don't.

I know that there must be some relation between all of the RefSeq ID's and their gene symbols. For instance, XR_001755761 takes me to this NCBI page, which shows me that the corresponding gene symbol is 'LOC101928055'. I've been able to obtain gene symbols for all of the NR_ or NM_ identifiers using the mysql database, but don't know how to convert all the others.

Is there an easy way that I can programatically get all of these? I don't want to use a manual copy-paste service because this needs to be run in a pipeline. Currently, I run the SQL command and save it as a TSV, which I then serialize as a hashmap to let me quickly convert all the different RefSeq ID's to their respective gene symbol. The end goal is to simply count the number of unique genes in the file. What is the easiest way to get this done? I'm open to using R or some other script, as long as I only have to run it once so I can generate a python pickled dict for quick use.

assembly ucsc gene genome • 482 views
ADD COMMENTlink modified 4 months ago by genecats.ucsc560 • written 4 months ago by edsouza30
1

. For instance, XR_001755761 take

don't spoil your time, those accession numbers are not present in the UCSC database.

http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg19&position=XR_001755761

http://genome.ucsc.edu/cgi-bin/hgTracks?org=Human&db=hg19&position=LOC101928055

you'd better use the resources from NCBI using eUtils.

ADD REPLYlink written 4 months ago by Pierre Lindenbaum115k

As @Pierre said you can try the following using NCBI unix utils. Loop through your ID's.

$ efetch -db nuccore -id "XR_001755761" -format xml | xtract -pattern Gene-ref -element Gene-ref_locus
LOC101928055
ADD REPLYlink modified 4 months ago • written 4 months ago by genomax59k

This looks great, thank you! If I have several thousand ID's, is there an option to supply them at once in a single query so I don't get rate-limited?

ADD REPLYlink modified 4 months ago • written 4 months ago by edsouza30

I think you should follow the answer provided by UCSC support below in that case.

ADD REPLYlink written 4 months ago by genomax59k

The refGene table only contains NM and NR sequences from NCBI, which are then mapped onto the genome with BLAT. If you want all sequences, including the XM, XR, etc, you want to query the "ncbiRefSeq" table:

mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq" hg19

This blog post contains more information about the new ncbiRefSeq tables, which feature transcripts from RefSeq and use the coordinates provided by RefSeq instead of using coordinates determined by BLAT.

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

General questions: genome@soe.ucsc.edu
Questions involving private data: genome-www@soe.ucsc.edu
Questions involving mirror sites: genome-mirror@ose.ucsc.edu

ChrisL from the UCSC Genome Browser

ADD REPLYlink written 4 months ago by genecats.ucsc560
1

I'm not completely sure; check these examples:

mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'NR%' limit 1" hg19
mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'NM%' limit 1" hg19
mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'XR%' limit 1" hg19
mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ne "select name, name2 from ncbiRefSeq where name like 'XM%' limit 1" hg19

The first 2 will give results but the last 2 won't. If there truly are XR and XM sequences, that would be amazing, but I can't seem to find any.

ADD REPLYlink written 4 months ago by edsouza30
1

NCBI did not include predicted transcripts with their latest hg19 annotation, as stated here:
ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/README_addendum.txt

Which is why they don't show up in our tables for hg19. I should have remembered this and not advised you to search these tables, my apologies.

ADD REPLYlink written 4 months ago by genecats.ucsc560

genecats.ucsc : So your answer is no longer applicable? Consider moving it to a comment in that case.

ADD REPLYlink written 4 months ago by genomax59k
2
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

To do something programmatic, you could use mygene.info with curl to request (query) and jq to parse the JSON result.

Given a text file with each gene name on its own line called my_genes_of_interest.txt, you could pass it to this script, e.g. called map_ids.sh:

#!/bin/bash -e

inFn=$1
apiURL="http://mygene.info/v3/query"

while IFS='' read -r line || [[ -n "$line" ]]; do
    geneId=`echo $line | tr -d $'\n'`
    queryResult=`curl -s "${apiURL}?q=${geneId}" | jq -r '.hits | .[0].symbol' | tr -d $'\n'`
    echo "${geneId}\t${queryResult}"
done < "${inFn}"

Then:

$ ./map_ids.sh my_genes_of_interest.txt > my_goi_mapped_to_symbols.txt

I didn't think to ask how many IDs you are querying, but sending a bunch of single requests may get your IP blocked. As your comment notes, there are Python and R libraries which offer the ability to submit batch queries to the mygene API.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds26k

I'm trying this but I get rate-limited after a few hundred. Is there an easy way to supply a whole list of IDs for conversion at once?

ADD REPLYlink written 4 months ago by edsouza30
2

Replying for posterity: mygene has a python API that lets you batch-submit queries. If your list is over 1000, it'll chunk your list into groups of 1000, perform the searches by group, and then return you a final result. The end-user doesn't have to worry about scaling, as the output format is exactly the same even if you'd only queried a single term. http://docs.mygene.info/projects/mygene-py/en/latest/

Here's the gist of my python code:

#!/usr/bin/env python3
import mygene

idlist = set()
with open('hg38.bed', 'r') as f:
        for line in f:
                id = line.strip().split('\t')[3]
                idlist.add(id)

mg = mygene.MyGeneInfo()
resdf = mg.querymany(list(idlist),
        scopes='refseq',
        fields='symbol',
        species='human',
        as_dataframe=True)

print(resdf) # Do stuff with the dataframe, like serialize it to a file.
ADD REPLYlink modified 4 months ago • written 4 months ago by edsouza30

Nice work — mygene is useful for mapping lots of kinds of IDs, as well, so this is generically useful.

ADD REPLYlink written 4 months ago by Alex Reynolds26k

Did this work for hg19 as well?

ADD REPLYlink written 4 months ago by genecats.ucsc560

Do gene symbol mappings change between assemblies? I would expect that gene locations do, and that would definitely be something to consider when using mygene, if locations are retrieved (cf. http://myvariant.info/blog/mygene-info-moved-to-grch38hg38-for-human-genes-with-hg19-still-supported).

ADD REPLYlink modified 4 months ago • written 4 months ago by Alex Reynolds26k

If you have RefSeq IDs to map to something else, I would use the NCBI efetch tools, as Pierre Lindenbaum suggests above. I would never use a third party webservice for a pipeline you want to run again.

ADD REPLYlink written 4 months ago by Maximilian Haeussler1.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1047 users visited in the last hour