Obtain chromosome, position, and alleles based on a list of SNP names
5
2
6.0 years ago
jiumeng66 ▴ 40

I have a list of 3660 SNPs that contains only their names (rsIDs), such as rs41457244.

I now need additional information for these SNPs: chromosome, position (hg19), and alleles. How can I get it?

The following is part of my SNP list:

 rs2088175
 rs2983855
 rs2821958
 rs41469446
 rs619987
 rs2877425
 rs41447048
 rs41497748
 rs503808
 rs386628
 rs6667995
 rs41405345
SNP • 5.3k views
ADD COMMENT
1
ADD REPLY
0

Please add an example/more detail or move your suggestion to a comment. Thank you!

ADD REPLY
0

Added link for the package tutorial.

ADD REPLY
0

IMO it still needs work to qualify as an answer. The package name and package manual link are effectively just a suggestion now. I'm moving this to a comment.

ADD REPLY
4
6.0 years ago
$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -e 'select chrom,chromStart,chromEnd,name,alleles,alleleFreqs from snp150 where name in ("rs2088175","rs2983855","rs2821958","rs41469446","rs619987","rs2877425","rs41447048","rs41497748","rs503808","rs386628","rs6667995","rs41405345")'
+---------------------+------------+-----------+------------+---------+--------------------+
| chrom               | chromStart | chromEnd  | name       | alleles | alleleFreqs        |
+---------------------+------------+-----------+------------+---------+--------------------+
| chr3                |   76216220 |  76216221 | rs2088175  | C,G,    | 0.875399,0.124601, |
| chr21               |   13597605 |  13597606 | rs2821958  | A,G,    | 0.884181,0.115819, |
| chr4                |   68990362 |  68990363 | rs2877425  | A,G,    | 0.085463,0.914537, |
| chr4_GL000257v2_alt |     566381 |    566382 | rs2877425  | A,G,    | 0.085463,0.914537, |
| chr10               |   38291051 |  38291052 | rs2983855  | C,T,    | 0.468850,0.531150, |
| chr6                |   28816618 |  28816619 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr6_GL000250v2_alt |      82197 |     82198 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr6_GL000251v2_alt |     307169 |    307170 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr6_GL000252v2_alt |      82220 |     82221 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr6_GL000253v2_alt |      82185 |     82186 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr6_GL000254v2_alt |      82211 |     82212 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr6_GL000255v2_alt |      82201 |     82202 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr6_GL000256v2_alt |     125864 |    125865 | rs386628   | C,T,    | 0.339257,0.660743, |
| chr17               |   22206798 |  22206799 | rs41447048 | C,T,    | 0.833067,0.166933, |
| chr18               |   14208563 |  14208564 | rs41469446 | A,G,    | 0.868011,0.131989, |
| chr1                |   83314732 |  83314733 | rs41497748 | A,T,    | 0.831669,0.168331, |
| chr2                |  126689276 | 126689277 | rs503808   | A,G,    | 0.647364,0.352636, |
| chr1                |  149015643 | 149015644 | rs619987   | A,T,    | 0.869010,0.130990, |
| chr1                |  205854053 | 205854054 | rs6667995  | A,C,    | 0.758586,0.241414, |
+---------------------+------------+-----------+------------+---------+--------------------+
ADD COMMENT
0

Thank you so much. However, in my terminal the command returns "-bash: mysql: command not found". Does this mean that some tool needs to be installed?

ADD REPLY
1

Does this mean that some tool needs to be installed?

Yes, you need the MySQL command-line client. On Ubuntu:

sudo apt-get install mysql-client
ADD REPLY
0

They have 3660 SNPs. Is there any way to automate this bit of SQL: ... in ("rs2088175","rs2983855","rs2821958", ... "rs_TheLastSNP")' ?

ADD REPLY
0

They can use sed. With a file containing one ID per line, wrap each line in double quotes (capture the whole line and replace it with "\1"), then use tr to replace the newlines with commas. An additional sed to delete the trailing comma might be required.

Note: Exact code not provided to encourage learning
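
For anyone who would still like to see one possible version, here is a minimal sketch. It assumes a file with one rsID per line and uses paste instead of tr to join the lines, which avoids the trailing comma. The function name build_in_list and the file name snplist.txt are made up for illustration.

```shell
# Build a quoted, comma-separated list ("rs1","rs2",...) from a file
# with one rsID per line. Surrounding whitespace and blank lines are dropped.
build_in_list() {
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' \
        -e '/^$/d' \
        -e 's/.*/"&"/' "$1" \
      | paste -sd, -
}

# Hypothetical usage (needs the mysql client and network access to UCSC):
# ids=$(build_in_list snplist.txt)
# mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg19 \
#   -e "select chrom,chromStart,chromEnd,name,alleles,alleleFreqs from snp150 where name in (${ids})"
```

The command in the answer queries hg38. Since the question asks for hg19, the usage comment switches to -D hg19, on the assumption that the same snp150 table is available in that database.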

ADD REPLY
2
6.0 years ago
Emily 23k

BioMart

Use the short variation database, filter by your list of IDs, get location and alleles as attributes.

Example query
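
The same kind of query can probably also be sent from the command line. The sketch below only builds the query XML from a file of rsIDs. The dataset name hsapiens_snp, the filter snp_filter, and the attribute names are what I believe the variation mart uses, so treat them as assumptions and check them in the BioMart web interface. The function name biomart_query_xml and the file name snplist.txt are made up for illustration.

```shell
# Print a BioMart query XML that filters the variation dataset by rsID
# and asks for ID, chromosome, position, and alleles as TSV.
biomart_query_xml() {
    # $1: file with one rsID per line, joined into a comma-separated filter value
    ids=$(sed -e 's/[[:space:]]//g' -e '/^$/d' "$1" | paste -sd, -)
    cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1" count="" datasetConfigVersion="0.6">
  <Dataset name="hsapiens_snp" interface="default">
    <Filter name="snp_filter" value="${ids}"/>
    <Attribute name="refsnp_id"/>
    <Attribute name="chr_name"/>
    <Attribute name="chrom_start"/>
    <Attribute name="allele"/>
  </Dataset>
</Query>
EOF
}

# Hypothetical usage (assumes the GRCh37 archive site for hg19 coordinates):
# curl -s --data-urlencode "query=$(biomart_query_xml snplist.txt)" \
#   https://grch37.ensembl.org/biomart/martservice > snps.tsv
```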

ADD COMMENT
1
6.0 years ago
igor 13k

There are a lot of excellent responses already, but I wanted to offer an alternate solution.

You can also download SNPs in a table format from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/snp150.txt.gz

Since these are in a table format, they are a little easier to read and process without specialized tools. For example, you can run:

zcat snp150.txt.gz | grep -w -f snplist.txt

Where snplist.txt is your list of SNPs.
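
If only a few columns are needed, a stricter variant could match on the name column alone and print just the location and alleles. This is a sketch that assumes the usual UCSC snp150 column order (chrom, chromStart, chromEnd, and name in columns 2 to 5, observed alleles in column 10); it is worth checking against the table schema before relying on it. The function name lookup_snps is made up for illustration.

```shell
# Keep only rows whose name column (assumed to be column 5) is in the ID list,
# and print chrom, chromStart, chromEnd, name, and observed alleles.
# Note: chromStart is 0-based in UCSC tables, so for a single-base SNP
# the 1-based position is chromEnd.
lookup_snps() {
    # $1: ID list, one rsID per line
    # $2: tab-separated snp150 table, or "-" to read it from stdin
    awk -F'\t' -v OFS='\t' '
        NR == FNR { gsub(/[ \t\r]/, "", $0); if ($0 != "") ids[$0] = 1; next }
        ($5 in ids) { print $2, $3, $4, $5, $10 }
    ' "$1" "$2"
}

# Hypothetical usage:
# zcat snp150.txt.gz | lookup_snps snplist.txt - > snps.hg19.tsv
```

Because the match is an exact lookup on one column, an ID such as rs2088175 cannot accidentally hit text elsewhere in the row.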

ADD COMMENT
0
6.0 years ago

If you download dbSNP for your genome build, you will have all the information you need. Just search the dbSNP file for the SNPs in your list using awk, Python, Perl, etc.

ADD COMMENT
0

Thank you, but where can I download the dbSNP?

ADD REPLY
0

For hg19, in your browser:

ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz

or in your terminal:

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/dbsnp_138.hg19.vcf.gz
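
Once the file is downloaded, the lookup could be sketched with awk roughly as follows. It relies on the standard VCF layout (CHROM, POS, ID, REF, ALT in the first five columns), prints the hits to stdout, and reports IDs that were not found on stderr, which can happen with merged or retired rsIDs. The function name vcf_lookup and the file name snplist.txt are made up for illustration.

```shell
# Look up rsIDs in a VCF and print CHROM, POS, ID, REF, ALT for each hit.
# IDs from the list that never appear in the VCF are reported on stderr.
vcf_lookup() {
    # $1: ID list, one rsID per line
    # $2: uncompressed VCF, or "-" to read it from stdin
    awk -F'\t' -v OFS='\t' '
        NR == FNR { gsub(/[ \t\r]/, "", $0); if ($0 != "") want[$0] = 1; next }
        /^#/ { next }                                   # skip VCF header lines
        ($3 in want) { print $1, $2, $3, $4, $5; found[$3] = 1 }
        END {
            for (id in want)
                if (!(id in found)) print "not found: " id > "/dev/stderr"
        }
    ' "$1" "$2"
}

# Hypothetical usage:
# gunzip -c dbsnp_138.hg19.vcf.gz | vcf_lookup snplist.txt - > snps.hg19.tsv
```

POS in a VCF is already 1-based, so no coordinate conversion is needed here.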
ADD REPLY
0

Sorry for the late reply. I have just downloaded it with your help. Thank you very much.

ADD REPLY
0
6.0 years ago

If you want to do things locally, you can make a searchable resource you can query as you like.

1) Get SNPs and write them into a text file sorted by SNP ID.

For hg19, for instance, using BEDOPS convert2bed to convert VCF to BED:

$ export LC_ALL=C
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz \
   | gunzip -c \
   | convert2bed --input=vcf --sort-tmpdir=${PWD} - \
   | awk -v OFS="\t" '{ print $4,"chr"$1,$2,$3,$6,$7; }' \
   | sort -k1,1 \
   > hg19.snp150.sortedByName.txt

This text file includes the SNP rsID, the genomic position, and the reference and alternate alleles. It is sorted by the SNP rsID, a property which we will use to enable fast searches.

2) Install pts-line-bisect, which does a binary search on lexicographically-sorted files, such as the one that was just made:

$ git clone https://github.com/pts/pts-line-bisect.git
$ cd pts-line-bisect && make && cd ..

Binary searches are pretty fast and great for write-once, read-many applications like this.

3) Run a query. The following command would return a six-column BED file:

$ rs_of_interest=rs41457244
$ ./pts-line-bisect/pts_lbsearch -p hg19.snp150.sortedByName.txt ${rs_of_interest} \
   | head -1 \
   | awk -v OFS="\t" '{ print $2,$3,$4,$1,$5,$6; }'

Step 3 can be put into a script so that you can re-run it with your SNP of interest on demand.

For instance:

#!/bin/bash
pts_lbsearch_bin=./pts-line-bisect/pts_lbsearch
sorted_snp_txt=hg19.snp150.sortedByName.txt
# Column 2 already includes the chr prefix added in step 1, so it is printed as-is.
"${pts_lbsearch_bin}" -p "${sorted_snp_txt}" "$1" | head -1 | awk -v OFS="\t" '{ print $2,$3,$4,$1,$5,$6; }'

Then:

$ ./search_snps.sh rs41457244
...

If you have a file of rs* IDs, you could loop through them via bash:

$ while read -r rsID; do ./search_snps.sh "$rsID"; done < snpIDs.txt | sort-bed - > snps.bed

Writing a sorted BED file can be useful for enabling set operations with BEDOPS and other toolkits.

ADD COMMENT
1

Thank you. It is a little complex, but I will try it. ^_^

ADD REPLY
0

No need, if the other solutions work for you. Once set up, however, it is just a way to query stuff very quickly, repeatedly, without going over a potentially slow network. Lots of great options here, regardless.

ADD REPLY
