Get rs number based on position
6
17
Entering edit mode
10.0 years ago
Zach Stednick ▴ 660

Similar to this question, I have a list of SNPs by in chr:pos format that I want to check and see if they have an rs number or not.

On the UCSC Table Browser I have been pasting my data in as chr:pos-chr:pos + 1 to query for SNPs. I am assuming that I can make a similar query with mysql but I am having a hard time figuring out how to load in that data into mysql.

Am I approaching this properly or am I totally off the mark?

ucsc mysql snp • 27k views
14
Entering edit mode
10.0 years ago
Neilfws 49k

If you want to loop through a list of SNPs, some scripting will be necessary as outlined by Chris.

Here's a partial, quick-and-dirty solution using your data. Assuming that by "chr:pos format", you have a file called snps.txt which looks like this:

1:10327
1:10434
1:10440
1:10469
...


You could use this awk one-liner to rewrite the file in the form of SQL queries:

awk 'BEGIN {FS=":"; OFS=""} {print "select chrom,chromStart,name from snp132
where chrom = \"chr",$1,"\" and chromStart + 1 = ",$2,";"}' snps.txt > snps.sql


Which will generate a file snps.sql looking like this:

select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10327;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10434;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10440;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10469;
...


Then query the MySQL server using:

mysql -h genome-mysql.cse.ucsc.edu -u genome -A -D hg19 --skip-column-names < snps.sql


Result:

chr1    10326    rs112750067
chr1    10433    rs56289060
chr1    10439    rs112155239
chr1    10439    rs112766696
chr1    10468    rs117577454


Note: This assumes that the SNP positions in your file are 1-based and you want to match them only with chromStart in the UCSC MySQL table, which is 0-based. You may want to match with chromEnd, or use both chromStart and chromEnd (they are not always equal), in which case the query will be more complex.

Firing off individual queries to the MySQL server is rather slow and inefficient. Try not to hammer their server, or they will probably block you. However, it's not difficult to set up a local MySQL server with only the tables that you need.

3
Entering edit mode

I don't really see how that invalidates my solution; however, feel free to provide an answer using a bin query :-)

1
Entering edit mode

+1: the performance is not an issue here, and Neil's solution is elegant.

0
Entering edit mode

Sorry for keeping downvoting for the same problem: chromStart in the snp132 table is NOT indexed. Understand and use the "bin" field please (see also http://bit.ly/ucscsql).

0
Entering edit mode

Your answer gives the right results, but we should discourage inefficient SQL, for you and for others connecting to UCSC MySQL.

0
Entering edit mode

@Neil you're not Australian, you're using the UCSC and not biomart: it's the end of my world for good :-)

0
Entering edit mode

For a couple SNPs, not using bin is fine. But Zach has 700 SNPs. Querying 700 SNPs this way is going to be very inefficient. There are simply much better solutions.

9
Entering edit mode
10.0 years ago
lh3 32k

echo 'chr1 10325 10326' | ./batchUCSC.pl -d hg19 -p snp132:::


or

cat regions.bed | ./batchUCSC.pl -d hg19 -p snp132:::


Note that for most UCSC tables, chromStart and chromEnd are not indexed. For efficient queries, you have to use the mystic "bin" field. You need to write a script to do that. There are a few questions about this.

EDIT: You can print the SQL with "-e". For example:

echo 'chr1 10325 10326' | ./batchUCSC.pl -d hg19 -ep snp132:::


gives

SELECT * FROM snp132 WHERE chrom="chr1" AND chromEnd>=10325 AND chromStart<=10326 AND (bin=1 OR bin=1 OR bin=9 OR bin=73 OR bin=585)

0
Entering edit mode

Hi, could you link to the 'few questions' about using the 'mystic "bin" field' or show us what the script using the bin would look like? Cheers

6
Entering edit mode
10.0 years ago

To complete Neil's answer. The following C code will add the bin column:

compilation:

gcc bin.c


execution:

cat input.txt
1:10327
1:10434
1:10440
1:10469

awk -F '\:' '{printf("chr%s\t%d\t%d\n",$1,int($2)-1,int($2));}' < input.txt |\ ./a.out |\ awk '{printf("select name,chromStart+1 from snp132 where chrom=\"%s\" and bin in(%s) and NOT(chromEnd<=%s or chromStart>%s);\n",$1,$4,$2,$2);}' |\ mysql -h genome-mysql.cse.ucsc.edu -u genome -A -D hg19 -N rs112750067 10327 rs56289060 10434 rs112155239 10440 rs112766696 10440 rs117577454 10469  ADD COMMENT 0 Entering edit mode note that there will be a problem for the the few snps overlapping two bins. They might not be found, that is why lh3's script would run better as it prints all the parental bins. ADD REPLY 0 Entering edit mode Hey, Is there a good way to exclude rsIDs that have been merged? Thanks! ADD REPLY 6 Entering edit mode 10.0 years ago The Ensembl Variant Effect Predictor reveals rsIDs for input genomic positions. It takes standard or VCF formats. You can either go through the Online tool, or use the API script with the Variation API: http://www.ensembl.org/tools.html More about the input format is here: http://www.ensembl.org/info/website/upload/var.html Installation of the Perl API is here: http://www.ensembl.org/info/docs/api/variation/index.html HTH. ADD COMMENT 5 Entering edit mode 10.0 years ago Do you have a lot of SNPs to query? If so, the easiest way is probably to download the whole snp track from UCSC (under Table Browser > Variation and Repeats), then write a little script to compare your snps to the table. I'd probably hash my snps by chr:pos, then step through the snp file line by line, outputting only those rs ids that matched my snps. ADD COMMENT 0 Entering edit mode Thats a good idea, I only have about 700 and this seems pretty tractable ADD REPLY 4 Entering edit mode 10.0 years ago If you have Ruby-1.9 environment, try BioRuby-UCSC-API. The library uses the 'bin' index system internally. Please see https://github.com/misshie/bioruby-ucsc-api for details. $ sudo gem install bio-ucsc-api --no-ri --no-rdoc

$cat input.txt chr1:10327-10327 chr1:10434-10434 chr1:10440-10440 chr1:10469-10469$ cat query.rb
#!/usr/local/bin/ruby-1.9
require 'bio-ucsc'
include Bio::Ucsc::Hg19

DBConnection.connect
ARGF.each_line do |row|
row.chomp!
iv = Bio::GenomicInterval.parse(row)
results = Snp132.find_all_by_interval(iv)
print row
unless results.empty?
puts " => #{results.map{|e|e.name}.join(", ")}"
else
puts " => none"
end
end

\$ ruby-1.9 query.rb input.txt
chr1:10327-10327 => rs112750067
chr1:10434-10434 => rs56289060
chr1:10440-10440 => rs112155239, rs112766696
chr1:10469-10469 => rs117577454