How To Find Nearest Gene To A Batch Of Snps
7
3
Entering edit mode
11.4 years ago
Hans ▴ 30

Hi, I'm looking for a tool that can help me find nearest gene to ~64K SNPs. It would be great if the tool allows me to define the number of kb to search up- and downstream of the SNP.

I appriciate any help, hints and tips!

snp gene gene • 17k views
1
Entering edit mode

You might like to look in the related questions box (right sidebar) for this; it's been answered several times before.

5
Entering edit mode
11.4 years ago
Sander Timmer ▴ 710

I'm not sure if any tools exist that can do this straight out of the box.... Though I know for sure that within a few lines of Perl you can write this using the Ensembl API. You could even think of just using the Ensembl Mysql and then select for transcripts with the start/stop position closest by your SNP location.

3
Entering edit mode

In fact, the Feature object in the Ensembl Core API, and consequently also the VariationFeature object in the Variation API, has a method called get_nearest_Gene. But even without this is should be easy, I have done it myself in the past and might have still have the script lying around somewhere ....

4
Entering edit mode
11.4 years ago

You can use the UCSC mysql database to find the SNPs at a given distance of a gene. See my previous answer for a related question: How To Map A Snp To A Gene Around +/- 60Kb ?

This won't return THE closest gene, but it's a part of your solution.

3
Entering edit mode
9.6 years ago

If your data are in UCSC BED format — or can be converted to UCSC BED format with, say, BEDOPS conversion scripts — then BEDOPS closest-features is very efficient at solving this problem. I'll describe below how you might approach this using BEDOPS and standard UNIX tools.

To demonstrate, let's say we start with a set of SNPs in a VCF file, and we have a set of GENCODE genes in GTF format. We'll convert these two files to sorted BED files and ask some questions with them.

First, we convert the VCF file:

$vcf2bed < mySNPs.vcf > mySNPs.bed  Next, let's assume we're working with human data and that we visit the GENCODE site to grab GENCODE v15 records in GTF format. We can use BEDOPS gtf2bed to convert this to a sorted UCSC BED-formatted file containing GENCODE genes: $ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene \
> gencodeGenes.bed


(This doesn't take long to do, but since releases only come out once in a while, you probably shouldn't have to do this last step too often.)

Now that both genes and SNPs are in sorted BED files, we can find the nearest gene to a SNP with closest-features:

$closest-features --closest --dist mySNPs.bed gencodeGenes.bed > answer.bed  The file answer.bed is a sorted BED file that contains each SNP from mySNPs.bed, the closest GENCODE gene to it from gencodeGenes.bed, and the distance between the SNP and the gene in bases. The abstract format of each line of the result in this file is as follows, delimited with pipe characters ("|"): [ SNP-1 ] | [ gene-nearest-to-SNP-1 ] | [ distance-between-SNP-1-and-gene ] [ SNP-2 ] | [ gene-nearest-to-SNP-2 ] | [ distance-between-SNP-2-and-gene ] ... [ SNP-x ] | [ gene-nearest-to-SNP-x ] | [ distance-between-SNP-x-and-gene ]  We can use this distance value with awk to quickly filter elements on some criteria of interest (say, we want genes that are no further away from SNPs than n bases). As an example, let's say we want all SNPs where the nearest gene is no more than 1000 bases away from its SNP: $ awk -F "|" -v max_distance=1000 '{ \
if ($3 <= max_distance) { \ print$0; \
} \


As you can see in the examples above, BEDOPS is designed to work with standard input and output to make it easy to string commands together into a pipeline. So if you need to repeat experiments with different inputs, you might do everything just described in just a few lines of code and put it into a shell script:

$wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz | gunzip -c - \ | gtf2bed - \ | grep -w gene \ > gencodeGenes.bed$ vcf2bed < mySNPs.vcf \
| closest-features --closest --dist - gencodeGenes.bed \
| awk -F "|" -v max_distance=100 '{ \
if ($3 <= max_distance) { \ print$0; \
} \
}' - \


Eliminating two intermediate files (mySNPs.bed and answer.bed) reduces file I/O. Along with other efficiency advancements introduced by BEDOPS, this pipeline is fairly efficient.

1
Entering edit mode
11.4 years ago
Rhys ▴ 10

NeibG (http://wuxbl.scripts.mit.edu/neibg/neibg.php) might do the job but it won't take SNP ids as input, they need to be converted to a locus first.

1
Entering edit mode
9.6 years ago

Are you lookink only for SNP flanking regions or genes ?

If you are only willing to get flanking regions of a SNP there is this method that can help you.

Otherwise, I suggest you do this to get the nearest gene :

1- extract your SNP in a bed format with a chr, start, end. 2- Get all refGenes from UCSC in a bed format as well 3- Use bedtools "closestBed" to get sens of the nearest gene to your SNP region

That's a suggestion that might be improved of course

0
Entering edit mode
11.4 years ago
Leszek 4.2k

Are those SNP from particular species? And do you have SNP coordinates, or just flanking sequence?

If you have coordinates, get your genome annotation (GFF or GTF) and simply parse it looking for nearest coding sequence. I don't know tools for that, but writting your own script wouldn't be that difficult:)

If you don't have SNP coordinates, you will need to map those on the genome first (f.e. BLAT).

0
Entering edit mode
3.9 years ago

https://snp-nexus.org/index.html

I found this is a great webpage. You can query using text file or vcf file here.