How To Map A Snp To A Gene Around +/- 60Kb ?
5
22
Entering edit mode
11.5 years ago

I am looking at a bunch of SNPs. Some of them are part of genes, but other are not. I am interested to look up +60KB or -60KB of those SNPs to get details about some nearby genes. Please share your experience in dealing with such a situation or thoughts on any methods that can do this. Thanks in advance.

snp data human annotation • 21k views
0
Entering edit mode

If you just want gene names within 60kb of the SNP:

$bedmap --echo --echo-map-id-uniq --range 60000 <(vcf2bed snps.vcf) genes.bed  If you want their locations, also: $ bedmap --echo --echo-map --range 60000 <(vcf2bed snps.vcf) genes.bed


These are operations to retrieve genes within 60kb of a SNP.

If you want the very nearest gene to a SNP, you can use closest-features:

\$ closest-features --closest --dist <(vcf2bed snps.vcf) genes.bed > answer.bed


This gives you, for each SNP, its nearest gene and the distance value in bases.

These and other genomic set operation tools are part of the BEDOPS toolkit.

21
Entering edit mode
11.5 years ago

Using the UCSC mysql server:

~> mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e '
select
K.proteinID,
K.name,
S.name,
S.avHet,
S.chrom,
S.chromStart,
K.txStart,
K.txEnd
from snp130 as S
left join knownGene as K on
(S.chrom=K.chrom and not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart))
where
S.name in ("rs25","rs100","rs75","rs9876","rs101")'


result:

+-----------+------------+--------+----------+-------+------------+----------+----------+
| proteinID | name       | name   | avHet    | chrom | chromStart | txStart  | txEnd    |
+-----------+------------+--------+----------+-------+------------+----------+----------+
| NULL      | NULL       | rs100  |        0 | chr7  |   24438348 |     NULL |     NULL |
| NULL      | NULL       | rs101  |        0 | chr7  |   24438147 |     NULL |     NULL |
| NP_056019 | uc003ssf.3 | rs25   | 0.499586 | chr7  |   11584141 | 11414172 | 11871824 |
| NP_056019 | uc003ssf.3 | rs75   | 0.241967 | chr7  |   11613691 | 11414172 | 11871824 |
| B2RNV1    | uc003tnv.2 | rs9876 | 0.426096 | chr7  |   47315290 | 47314752 | 47579199 |
| B2RNV1    | uc003tnw.2 | rs9876 | 0.426096 | chr7  |   47315290 | 47314752 | 47621742 |
+-----------+------------+--------+----------+-------+------------+----------+----------+

4
Entering edit mode

Pierre, this is great solution. I completely forgot about mysql server. However, I'd recommend to use refFlat table instead of knownGene and K.geneName instead of K.proteinID. In this case you will get HUGO gene name. Anyway, it's up to OP.

1
Entering edit mode

Very nice solution - anything that takes us away from ad hoc Perl scripts is good!

1
Entering edit mode

Recently, a question similar to this question was asked in BioBB. URL : http://www.bioinformatics.org/pipermail/bbb/2010-April/004849.html

0
Entering edit mode

Excellent ! I was looking for something like this. Thanks a lot Pierre.

0
Entering edit mode

Ditto. Very nice solution, I didn't even know that server existed. looks like they may be getting a bit more traffic in the future.

0
Entering edit mode

I wanted to limit my results to items with a geneSymbol so I added an extra join:

select S.name as rsId, kgXref.geneSymbol from snp131 as S left join knownGene as K on (S.chrom=K.chrom and not(K.txEnd+60000

0
Entering edit mode

I found the UCSC server unusably slow. I used the UCSC Table browser to export tab delimited files. Bu picking 'selected fields from primary and related tables' I was able to just get the fields I wanted. I then used Toad for MySQL (Freeware on Windows) to import the flat file to a local db in a single step.

0
Entering edit mode

I found the UCSC server unusably slow. I used the UCSC Table browser to export tab delimited files for the snp131 and refFlat (per Yuri above) tables. By choosing 'selected fields from primary and related tables' from drop down menu I was able to get only the the fields I wanted, thereby saving time and disk space. I then used Toad for MySQL (Freeware on Windows) to import the flat files to a local db in a single step each.

0
Entering edit mode

quite old but still - doesn't "not(K.txEnd+60000<S.chromStart" also mean "all genes that end more than 60kb away from snp" to the right?

0
Entering edit mode

'not(K.txEnd+60000<S.chromStart or S.chromEnd+60000<K.txStart)' means if the gene end +60KB is before the SNP or the gene start - 60Kb if after the SNP the DO NOT accept the SNP.

0
Entering edit mode

Sorry, could you please tell me exactly what I have to type on the command line? If I copy this mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A in MySQL 5.7 Command Line Client I get an error in SQL syntax...thank you

0
Entering edit mode

2019-11-20: the sql server has changed to mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306

8
Entering edit mode
11.2 years ago

I imagine you have this sorted out by now, but I thought I'd add that assuming your SNPs are in BED/ROD or VCF format and your genes are in BED12 or GFF format, then you can easily do this with a single command in my BEDTools suite. The relevant command would be:

windowBed -a snps.bed -b genes.bed -w 60000 > snps.withGenesWithin60Kb.bed

0
Entering edit mode

0
Entering edit mode

Hi Khader, Just wondering if you managed to try it with BEDTools? I'm trying to do a similar thing but not very familiar with mySQL. Also, may I just ask on what basis/why you chose a +/- 60kb window size? Thanks!

5
Entering edit mode
11.5 years ago

No need to program this. This can be done as a BioMart query in MartView:

rs8 with 60000 bases up/donw-flanks

I just checked it works with that size flanks. The result is a FASTA file.

On the other hand, if you have a SNP position it could be easier to simply search for the genes directly in the genome annotation. Get a genome annotation as a tab separated file from e.g. BioMart and search for the genes with start/stop positons within this range.

0
Entering edit mode

Thanks Michael.

0
Entering edit mode

I am afraid this is not exactly I am looking for. I wanted to check if a particular SNP is not mapped to a gene, whether there will be any known genes in +/- 60KB of that SNP. This particular BioMart query is fetching the sequence not annotation. I have tried one with annotation and it works. : http://www.biomart.org/biomart/martview/c26efcafd362abccc4af83039c441762. But sometimes BioMart out put requires lot of post-processing. Also it will be nice if BioMart can have options to select fields multiple attributes.

0
Entering edit mode

Yes, sorry, I misunderstood your question. What Pierre described is much better here.

2
Entering edit mode
11.5 years ago
Yuri ★ 1.6k

I downloaded refFlat table from UCSC Genome Browser, which contains gene symbols and their genomic locations. Then mapped SNPs to genes. I did it in MATLAB, but can be done with any language.

You can also try a Perl script from here.