Get rs number based on position
6
17
Entering edit mode
10.0 years ago
Zach Stednick ▴ 660

Similar to this question, I have a list of SNPs by in chr:pos format that I want to check and see if they have an rs number or not.

On the UCSC Table Browser I have been pasting my data in as chr:pos-chr:pos + 1 to query for SNPs. I am assuming that I can make a similar query with mysql but I am having a hard time figuring out how to load in that data into mysql.

Am I approaching this properly or am I totally off the mark?

ucsc mysql snp • 27k views
ADD COMMENT
14
Entering edit mode
10.0 years ago
Neilfws 49k

If you want to loop through a list of SNPs, some scripting will be necessary as outlined by Chris.

Here's a partial, quick-and-dirty solution using your data. Assuming that by "chr:pos format", you have a file called snps.txt which looks like this:

1:10327
1:10434
1:10440
1:10469
...

You could use this awk one-liner to rewrite the file in the form of SQL queries:

awk 'BEGIN {FS=":"; OFS=""} {print "select chrom,chromStart,name from snp132
where chrom = \"chr",$1,"\" and chromStart + 1 = ",$2,";"}' snps.txt > snps.sql

Which will generate a file snps.sql looking like this:

select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10327;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10434;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10440;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10469;
...

Then query the MySQL server using:

mysql -h genome-mysql.cse.ucsc.edu -u genome -A -D hg19 --skip-column-names < snps.sql

Result:

chr1    10326    rs112750067
chr1    10433    rs56289060
chr1    10439    rs112155239
chr1    10439    rs112766696
chr1    10468    rs117577454

Note: This assumes that the SNP positions in your file are 1-based and you want to match them only with chromStart in the UCSC MySQL table, which is 0-based. You may want to match with chromEnd, or use both chromStart and chromEnd (they are not always equal), in which case the query will be more complex.

Firing off individual queries to the MySQL server is rather slow and inefficient. Try not to hammer their server, or they will probably block you. However, it's not difficult to set up a local MySQL server with only the tables that you need.

ADD COMMENT
3
Entering edit mode

I don't really see how that invalidates my solution; however, feel free to provide an answer using a bin query :-)

ADD REPLY
1
Entering edit mode

+1: the performance is not an issue here, and Neil's solution is elegant.

ADD REPLY
0
Entering edit mode

Sorry for keeping downvoting for the same problem: chromStart in the snp132 table is NOT indexed. Understand and use the "bin" field please (see also http://bit.ly/ucscsql).

ADD REPLY
0
Entering edit mode

Your answer gives the right results, but we should discourage inefficient SQL, for you and for others connecting to UCSC MySQL.

ADD REPLY
0
Entering edit mode

@Neil you're not Australian, you're using the UCSC and not biomart: it's the end of my world for good :-)

ADD REPLY
0
Entering edit mode

For a couple SNPs, not using bin is fine. But Zach has 700 SNPs. Querying 700 SNPs this way is going to be very inefficient. There are simply much better solutions.

ADD REPLY
9
Entering edit mode
10.0 years ago
lh3 32k

Another post advertising my batchUCSC.pl script (see also http://bit.ly/ucscsql). With this script you can:

echo 'chr1 10325 10326' | ./batchUCSC.pl -d hg19 -p snp132:::

or

cat regions.bed | ./batchUCSC.pl -d hg19 -p snp132:::

Note that for most UCSC tables, chromStart and chromEnd are not indexed. For efficient queries, you have to use the mystic "bin" field. You need to write a script to do that. There are a few questions about this.

EDIT: You can print the SQL with "-e". For example:

echo 'chr1 10325 10326' | ./batchUCSC.pl -d hg19 -ep snp132:::

gives

SELECT * FROM snp132 WHERE chrom="chr1" AND chromEnd>=10325 AND chromStart<=10326 AND (bin=1 OR bin=1 OR bin=9 OR bin=73 OR bin=585)
ADD COMMENT
0
Entering edit mode

Hi, could you link to the 'few questions' about using the 'mystic "bin" field' or show us what the script using the bin would look like? Cheers

ADD REPLY
6
Entering edit mode
10.0 years ago

To complete Neil's answer. The following C code will add the bin column:

compilation:

gcc bin.c

execution:

cat input.txt
1:10327
1:10434
1:10440
1:10469

awk -F '\:' '{printf("chr%s\t%d\t%d\n",$1,int($2)-1,int($2));}' < input.txt |\
./a.out |\
awk '{printf("select name,chromStart+1 from snp132 where chrom=\"%s\" and bin in(%s) and NOT(chromEnd<=%s or chromStart>%s);\n",$1,$4,$2,$2);}' |\
mysql -h genome-mysql.cse.ucsc.edu -u genome -A -D hg19 -N 
rs112750067    10327
rs56289060    10434
rs112155239    10440
rs112766696    10440
rs117577454    10469
ADD COMMENT
0
Entering edit mode

note that there will be a problem for the the few snps overlapping two bins. They might not be found, that is why lh3's script would run better as it prints all the parental bins.

ADD REPLY
0
Entering edit mode

Hey,

Is there a good way to exclude rsIDs that have been merged?

Thanks!

ADD REPLY
6
Entering edit mode
10.0 years ago

The Ensembl Variant Effect Predictor reveals rsIDs for input genomic positions. It takes standard or VCF formats. You can either go through the Online tool, or use the API script with the Variation API:

http://www.ensembl.org/tools.html

More about the input format is here:

http://www.ensembl.org/info/website/upload/var.html

Installation of the Perl API is here:

http://www.ensembl.org/info/docs/api/variation/index.html

HTH.

ADD COMMENT
5
Entering edit mode
10.0 years ago

Do you have a lot of SNPs to query? If so, the easiest way is probably to download the whole snp track from UCSC (under Table Browser > Variation and Repeats), then write a little script to compare your snps to the table.

I'd probably hash my snps by chr:pos, then step through the snp file line by line, outputting only those rs ids that matched my snps.

ADD COMMENT
0
Entering edit mode

Thats a good idea, I only have about 700 and this seems pretty tractable

ADD REPLY
4
Entering edit mode
10.0 years ago

If you have Ruby-1.9 environment, try BioRuby-UCSC-API. The library uses the 'bin' index system internally. Please see https://github.com/misshie/bioruby-ucsc-api for details.

$ sudo gem install bio-ucsc-api --no-ri --no-rdoc

$ cat input.txt
chr1:10327-10327
chr1:10434-10434
chr1:10440-10440
chr1:10469-10469

$ cat query.rb
#!/usr/local/bin/ruby-1.9
require 'bio-ucsc'
include Bio::Ucsc::Hg19

DBConnection.connect
ARGF.each_line do |row|
  row.chomp!
  iv = Bio::GenomicInterval.parse(row)
  results = Snp132.find_all_by_interval(iv)
  print row
  unless results.empty?
    puts " => #{results.map{|e|e.name}.join(", ")}"
  else
    puts " => none"
  end
end

$ ruby-1.9 query.rb input.txt
chr1:10327-10327 => rs112750067
chr1:10434-10434 => rs56289060
chr1:10440-10440 => rs112155239, rs112766696
chr1:10469-10469 => rs117577454
ADD COMMENT

Login before adding your answer.

Traffic: 1142 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6