Question: How To Find Nearest Gene To A Batch Of Snps
gravatar for Hans
8.1 years ago by
Hans10 wrote:

Hi, I'm looking for a tool that can help me find nearest gene to ~64K SNPs. It would be great if the tool allows me to define the number of kb to search up- and downstream of the SNP.

I appriciate any help, hints and tips!

gene snp • 11k views
ADD COMMENTlink modified 7 months ago by yupadhya0 • written 8.1 years ago by Hans10

You might like to look in the related questions box (right sidebar) for this; it's been answered several times before.

ADD REPLYlink written 8.1 years ago by Neilfws48k
gravatar for Sander Timmer
8.1 years ago by
Sander Timmer700
United Kingdom
Sander Timmer700 wrote:

I'm not sure if any tools exist that can do this straight out of the box.... Though I know for sure that within a few lines of Perl you can write this using the Ensembl API. You could even think of just using the Ensembl Mysql and then select for transcripts with the start/stop position closest by your SNP location.

ADD COMMENTlink written 8.1 years ago by Sander Timmer700

In fact, the Feature object in the Ensembl Core API, and consequently also the VariationFeature object in the Variation API, has a method called get_nearest_Gene. But even without this is should be easy, I have done it myself in the past and might have still have the script lying around somewhere ....

ADD REPLYlink written 8.1 years ago by Bert Overduin3.6k
gravatar for Pierre Lindenbaum
8.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

You can use the UCSC mysql database to find the SNPs at a given distance of a gene. See my previous answer for a related question:

This won't return THE closest gene, but it's a part of your solution.

ADD COMMENTlink written 8.1 years ago by Pierre Lindenbaum122k
gravatar for Alex Reynolds
6.3 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

If your data are in UCSC BED format — or can be converted to UCSC BED format with, say, BEDOPS conversion scripts — then BEDOPS closest-features is very efficient at solving this problem. I'll describe below how you might approach this using BEDOPS and standard UNIX tools.

To demonstrate, let's say we start with a set of SNPs in a VCF file, and we have a set of GENCODE genes in GTF format. We'll convert these two files to sorted BED files and ask some questions with them.

First, we convert the VCF file:

$ vcf2bed < mySNPs.vcf > mySNPs.bed

Next, let's assume we're working with human data and that we visit the GENCODE site to grab GENCODE v15 records in GTF format. We can use BEDOPS gtf2bed to convert this to a sorted UCSC BED-formatted file containing GENCODE genes:

$ wget -O - \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene \
    > gencodeGenes.bed

(This doesn't take long to do, but since releases only come out once in a while, you probably shouldn't have to do this last step too often.)

Now that both genes and SNPs are in sorted BED files, we can find the nearest gene to a SNP with closest-features:

$ closest-features --closest --dist mySNPs.bed gencodeGenes.bed > answer.bed

The file answer.bed is a sorted BED file that contains each SNP from mySNPs.bed, the closest GENCODE gene to it from gencodeGenes.bed, and the distance between the SNP and the gene in bases. The abstract format of each line of the result in this file is as follows, delimited with pipe characters ("|"):

[ SNP-1 ] | [ gene-nearest-to-SNP-1 ] | [ distance-between-SNP-1-and-gene ]
[ SNP-2 ] | [ gene-nearest-to-SNP-2 ] | [ distance-between-SNP-2-and-gene ]
[ SNP-x ] | [ gene-nearest-to-SNP-x ] | [ distance-between-SNP-x-and-gene ]

We can use this distance value with awk to quickly filter elements on some criteria of interest (say, we want genes that are no further away from SNPs than n bases).

As an example, let's say we want all SNPs where the nearest gene is no more than 1000 bases away from its SNP:

$ awk -F "|" -v max_distance=1000 '{ \
    if ($3 <= max_distance) { \
        print $0; \
    } \
}' answer.bed > filteredAnswer.bed

As you can see in the examples above, BEDOPS is designed to work with standard input and output to make it easy to string commands together into a pipeline. So if you need to repeat experiments with different inputs, you might do everything just described in just a few lines of code and put it into a shell script:

$ wget -O -
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene \
    > gencodeGenes.bed

$ vcf2bed < mySNPs.vcf \
    | closest-features --closest --dist - gencodeGenes.bed \
    | awk -F "|" -v max_distance=100 '{ \
        if ($3 <= max_distance) { \
            print $0; \
        } \
    }' - \
    > filteredAnswer.bed

Eliminating two intermediate files (mySNPs.bed and answer.bed) reduces file I/O. Along with other efficiency advancements introduced by BEDOPS, this pipeline is fairly efficient.

ADD COMMENTlink modified 6.2 years ago • written 6.3 years ago by Alex Reynolds28k
gravatar for Rhys
8.1 years ago by
Rhys10 wrote:

NeibG ( might do the job but it won't take SNP ids as input, they need to be converted to a locus first.

ADD COMMENTlink written 8.1 years ago by Rhys10
gravatar for Rad
6.3 years ago by
Rad800 wrote:

Are you lookink only for SNP flanking regions or genes ?

If you are only willing to get flanking regions of a SNP there is this method that can help you.

Otherwise, I suggest you do this to get the nearest gene :

1- extract your SNP in a bed format with a chr, start, end. 2- Get all refGenes from UCSC in a bed format as well 3- Use bedtools "closestBed" to get sens of the nearest gene to your SNP region

That's a suggestion that might be improved of course

ADD COMMENTlink written 6.3 years ago by Rad800
gravatar for Leszek
8.1 years ago by
IIMCB, Poland
Leszek4.0k wrote:

Are those SNP from particular species? And do you have SNP coordinates, or just flanking sequence?

If you have coordinates, get your genome annotation (GFF or GTF) and simply parse it looking for nearest coding sequence. I don't know tools for that, but writting your own script wouldn't be that difficult:)

If you don't have SNP coordinates, you will need to map those on the genome first (f.e. BLAT).

ADD COMMENTlink written 8.1 years ago by Leszek4.0k
gravatar for yupadhya
7 months ago by
yupadhya0 wrote:

I found this is a great webpage. You can query using text file or vcf file here.

ADD COMMENTlink modified 7 months ago • written 7 months ago by yupadhya0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1540 users visited in the last hour