Get rs number based on position

Entering edit mode

14.1 years ago

Zach Stednick ▴ 660

Similar to this question, I have a list of SNPs by in chr:pos format that I want to check and see if they have an rs number or not.

On the UCSC Table Browser I have been pasting my data in as chr:pos-chr:pos + 1 to query for SNPs. I am assuming that I can make a similar query with mysql but I am having a hard time figuring out how to load in that data into mysql.

Am I approaching this properly or am I totally off the mark?

ucsc mysql snp • 42k views

ADD COMMENT • link updated 6.6 years ago by zx8754 12k • written 14.1 years ago by Zach Stednick ▴ 660

Entering edit mode

14.1 years ago

Neilfws 49k

If you want to loop through a list of SNPs, some scripting will be necessary as outlined by Chris.

Here's a partial, quick-and-dirty solution using your data. Assuming that by "chr:pos format", you have a file called snps.txt which looks like this:

You could use this awk one-liner to rewrite the file in the form of SQL queries:

awk 'BEGIN {FS=":"; OFS=""} {print "select chrom,chromStart,name from snp132
where chrom = \"chr",$1,"\" and chromStart + 1 = ",$2,";"}' snps.txt > snps.sql

Which will generate a file snps.sql looking like this:

select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10327;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10434;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10440;
select chrom,chromStart,name from snp132 where chrom = "chr1" and chromStart + 1 = 10469;
...

Then query the MySQL server using:

mysql -h genome-mysql.cse.ucsc.edu -u genome -A -D hg19 --skip-column-names < snps.sql

Result:

chr1    10326    rs112750067
chr1    10433    rs56289060
chr1    10439    rs112155239
chr1    10439    rs112766696
chr1    10468    rs117577454

Note: This assumes that the SNP positions in your file are 1-based and you want to match them only with chromStart in the UCSC MySQL table, which is 0-based. You may want to match with chromEnd, or use both chromStart and chromEnd (they are not always equal), in which case the query will be more complex.

Firing off individual queries to the MySQL server is rather slow and inefficient. Try not to hammer their server, or they will probably block you. However, it's not difficult to set up a local MySQL server with only the tables that you need.

ADD COMMENT • link 14.1 years ago by Neilfws 49k

Entering edit mode

I don't really see how that invalidates my solution; however, feel free to provide an answer using a bin query :-)

ADD REPLY • link 14.1 years ago by Neilfws 49k

Entering edit mode

+1: the performance is not an issue here, and Neil's solution is elegant.

ADD REPLY • link 14.1 years ago by Pierre Lindenbaum 166k

Entering edit mode

@Neil you're not Australian, you're using the UCSC and not biomart: it's the end of my world for good :-)

ADD REPLY • link 14.1 years ago by Pierre Lindenbaum 166k

Entering edit mode

Sorry for keeping downvoting for the same problem: chromStart in the snp132 table is NOT indexed. Understand and use the "bin" field please (see also http://bit.ly/ucscsql).

ADD REPLY • link 14.1 years ago by lh3 33k

Entering edit mode

Your answer gives the right results, but we should discourage inefficient SQL, for you and for others connecting to UCSC MySQL.

ADD REPLY • link 14.1 years ago by lh3 33k

Entering edit mode

For a couple SNPs, not using bin is fine. But Zach has 700 SNPs. Querying 700 SNPs this way is going to be very inefficient. There are simply much better solutions.

ADD REPLY • link 14.1 years ago by lh3 33k

Entering edit mode

14.1 years ago

lh3 33k

Another post advertising my batchUCSC.pl script (see also http://bit.ly/ucscsql). With this script you can:

echo 'chr1 10325 10326' | ./batchUCSC.pl -d hg19 -p snp132:::

cat regions.bed | ./batchUCSC.pl -d hg19 -p snp132:::

Note that for most UCSC tables, chromStart and chromEnd are not indexed. For efficient queries, you have to use the mystic "bin" field. You need to write a script to do that. There are a few questions about this.

EDIT: You can print the SQL with "-e". For example:

echo 'chr1 10325 10326' | ./batchUCSC.pl -d hg19 -ep snp132:::

gives

SELECT * FROM snp132 WHERE chrom="chr1" AND chromEnd>=10325 AND chromStart<=10326 AND (bin=1 OR bin=1 OR bin=9 OR bin=73 OR bin=585)

ADD COMMENT • link 14.1 years ago by lh3 33k

Entering edit mode

Hi, could you link to the 'few questions' about using the 'mystic "bin" field' or show us what the script using the bin would look like? Cheers

ADD REPLY • link 14.1 years ago by Eric Normandeau 11k

Entering edit mode

14.1 years ago

Pierre Lindenbaum 166k

To complete Neil's answer. The following C code will add the bin column:

	/**
	* Author: Pierre Lindenbaum PhD (original source is from Jim Kent: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system
	* Motivation: Adds a UCSC 'bin' column see http://biostar.stackexchange.com/questions/8943/get-rs-number-based-on-position
	* Compilation: gcc bin.c
	* Execute: echo -e "chr1\t10326\t10327\trs112750067" \| ./a.out
	* History: updated so it prints all the bins in a at any depth
	*/
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>


	static void binsInRange(
	int chromStart,
	int chromEnd,
	int binId,
	int level,
	int binRowStart,
	int rowIndex,
	int binRowCount,
	int genomicPos,
	int genomicLength,
	int first
	)
	{
	if(first!=1) fputc(',',stdout);
	printf("%d",binId);
	if(level<4)
	{
	int i;
	int childLength=genomicLength/8;
	int childBinRowCount=binRowCount*8;
	int childRowBinStart=binRowStart+binRowCount;
	int firstChildIndex=rowIndex*8;
	int firstChildBin=childRowBinStart+firstChildIndex;
	for(i=0;i< 8;++i)
	{
	int childStart=genomicPos+i*childLength;

	if( chromStart>(childStart+childLength) \|\|
	chromEnd<childStart )
	{
	continue;
	}
	binsInRange(
	chromStart,
	chromEnd,
	firstChildBin+i,
	level+1,
	childRowBinStart,
	firstChildIndex+i,
	childBinRowCount,
	childStart,
	childLength,
	0
	);
	}
	}
	}



	static void bins(int chromStart,int chromEnd)
	{
	int genomicLength=536870912;
	binsInRange(chromStart,chromEnd,0,0,0,0,1,0,genomicLength,1);
	}

	static char* readline(FILE* in)
	{
	char* p=NULL;
	int length=0;
	int c;
	while((c=fgetc(in))!=EOF)
	{
	p=realloc(p,(length+2)*sizeof(char));
	if(p==NULL)
	{
	fputs("Out of memory\n",stderr);
	exit(EXIT_FAILURE);
	}
	if(c=='\n')
	{
	p[length]='\0';
	return p;
	}
	p[length]=c;
	p[length+1]='\0';
	length++;
	}
	return p;
	}

	int main(int argc,char** argv)
	{
	char *line;
	while((line=readline(stdin))!=NULL)
	{
	int chromStart=0;
	int chromEnd=0;
	int col=0;
	char *t = strtok(line,"\t");
	while (t != NULL)
	{
	switch(col)
	{
	case 0: fputs(t,stdout);break;
	case 1: chromStart=atoi(t);printf("\t%s",t);break;
	case 2: chromEnd=atoi(t);printf("\t%s\t",t);bins(chromStart,chromEnd);break;
	default:printf("\t%s",t);break;
	}
	t = strtok (NULL, "\t");
	++col;
	}
	fputs("\n",stdout);
	free(line);
	}
	return 0;
	}

view raw bin.c hosted with ❤ by GitHub

compilation:

gcc bin.c

execution:

cat input.txt
1:10327
1:10434
1:10440
1:10469

awk -F '\:' '{printf("chr%s\t%d\t%d\n",$1,int($2)-1,int($2));}' < input.txt |\
./a.out |\
awk '{printf("select name,chromStart+1 from snp132 where chrom=\"%s\" and bin in(%s) and NOT(chromEnd<=%s or chromStart>%s);\n",$1,$4,$2,$2);}' |\
mysql -h genome-mysql.cse.ucsc.edu -u genome -A -D hg19 -N 
rs112750067    10327
rs56289060    10434
rs112155239    10440
rs112766696    10440
rs117577454    10469

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 14.1 years ago by Pierre Lindenbaum 166k

Entering edit mode

note that there will be a problem for the the few snps overlapping two bins. They might not be found, that is why lh3's script would run better as it prints all the parental bins.

ADD REPLY • link 14.1 years ago by Pierre Lindenbaum 166k

Entering edit mode

Hey,

Is there a good way to exclude rsIDs that have been merged?

Thanks!

ADD REPLY • link 10.2 years ago by Mark Ebbert ▴ 90

Entering edit mode

14.1 years ago

Giulietta - Ensembl Helpdesk ★ 1.2k

The Ensembl Variant Effect Predictor reveals rsIDs for input genomic positions. It takes standard or VCF formats. You can either go through the Online tool, or use the API script with the Variation API:

http://www.ensembl.org/tools.html

More about the input format is here:

http://www.ensembl.org/info/website/upload/var.html

Installation of the Perl API is here:

http://www.ensembl.org/info/docs/api/variation/index.html

HTH.

ADD COMMENT • link 14.1 years ago by Giulietta - Ensembl Helpdesk ★ 1.2k

Entering edit mode

14.1 years ago

Chris Miller 22k

Do you have a lot of SNPs to query? If so, the easiest way is probably to download the whole snp track from UCSC (under Table Browser > Variation and Repeats), then write a little script to compare your snps to the table.

I'd probably hash my snps by chr:pos, then step through the snp file line by line, outputting only those rs ids that matched my snps.

ADD COMMENT • link 14.1 years ago by Chris Miller 22k

Entering edit mode

Thats a good idea, I only have about 700 and this seems pretty tractable

ADD REPLY • link 14.1 years ago by Zach Stednick ▴ 660

Entering edit mode

14.1 years ago

Hiroyuki Mishima ▴ 160

If you have Ruby-1.9 environment, try BioRuby-UCSC-API. The library uses the 'bin' index system internally. Please see https://github.com/misshie/bioruby-ucsc-api for details.

$ sudo gem install bio-ucsc-api --no-ri --no-rdoc

$ cat input.txt
chr1:10327-10327
chr1:10434-10434
chr1:10440-10440
chr1:10469-10469

$ cat query.rb
#!/usr/local/bin/ruby-1.9
require 'bio-ucsc'
include Bio::Ucsc::Hg19

DBConnection.connect
ARGF.each_line do |row|
  row.chomp!
  iv = Bio::GenomicInterval.parse(row)
  results = Snp132.find_all_by_interval(iv)
  print row
  unless results.empty?
    puts " => #{results.map{|e|e.name}.join(", ")}"
  else
    puts " => none"
  end
end

$ ruby-1.9 query.rb input.txt
chr1:10327-10327 => rs112750067
chr1:10434-10434 => rs56289060
chr1:10440-10440 => rs112155239, rs112766696
chr1:10469-10469 => rs117577454

ADD COMMENT • link 14.1 years ago by Hiroyuki Mishima ▴ 160