Question: Assigning SNPs to gene names based on genomic coordinates
1
gravatar for paolo002
10 weeks ago by
paolo002130
paolo002130 wrote:

Hi all, I am not sure if this was asked before but I would like to assign a list of SNPs to the corresponding gene names:

I have a df such as the following:

           SNP std.XPEHH CHR       BP     rank
1  rs376121383   5.86403   2 97868880       1
2    rs9324211   5.86122   2 97867870       2
3   rs12445490   5.85442  16  5570974       3
4  rs201474380   5.84314   2 97868313       4
5  rs190916690   5.79274   2 97868544       5
6  rs112920594   5.79176   2 97866321       6
7  rs372154564   5.79176   2 97866355       7
8   rs13409957   5.78445   2 97865894       8
9   2:97868257   5.78096   2 97868257       9
10 rs116237095   5.78096   2 97868240      10
11 rs181446057   5.77994   2 97868213      11
12 rs370914866   5.77985   2 97867072      12
13  2:97865943   5.77978   2 97865943      13
14 rs115004312   5.76058   2 97866683      14
15 rs192731212   5.76058   2 97866908      15
16 rs371563138   5.75741   2 97867703      16
17 rs111558429   5.75590   2 97867824      17
18   rs9636497   5.73166   2 97868926      18
19   rs3926003   5.71338   2 97870247      19
20  rs12327919   5.70304   2 97866928      20
21 rs116820074   5.64604  16  5570936      21
22  16:5570973   5.62932  16  5570973      22
23 rs115661023   5.61264   2 97865044      23
24  rs55879702   5.60655   2 97866370      24
25 rs377749588   5.60236   2 97865217      25
26 rs148207230   5.59137  16  5564330      26
27  rs10169898   5.57194   2 61103311      27
28   rs9636495   5.56266   2 97869056      28
29   rs8053291   5.53916  16  5566143      29
30  rs10168001   5.53839   2 97867470      30
....

BP is the coordinate of the corresponding SNP in base pair on the genome. Some of the SNPs are not annotated, so the ID is like number 9 in the df, with the chromosome number and base pair location. I would like to know in which gene the BP number is falling and so I can assign SNPs to their gene. I would like to do this in R and the gene name should be accordingly to hg19 UCSC and not ESEMBL, it would be better. Any help highly appreciated, thanks a lot Cheers

R • 210 views
ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by paolo002130
2

This has been asked before many times, search for "map SNP to gene":

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by zx87546.5k

ok thanks a lot for all the links. I am taking a look.

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by paolo002130
1
gravatar for paolo002
10 weeks ago by
paolo002130
paolo002130 wrote:

Hi, I have found a nice solution for doing that in R, I post it so maybe it might be useful to other people:

Let' say you have a data frame (df) with list of SNPs ID (rs IDs and not annotated IDs) with chromosome number and location in Base Pairs (BP) on the genome ( I show the first 10 rows but the df continues with other many rows):

           SNP std.XPEHH  CHR    BP      rank
1  rs376121383   6.48482 chr2 97868880         1
2  rs201474380   6.47901 chr2 97868313         2
3    rs3926003   6.46181 chr2 97870247         3
4  rs112920594   6.42510 chr2 97866321         4
5  rs372154564   6.42510 chr2 97866355         5
6  rs190916690   6.42453 chr2 97868544         6
7  rs370914866   6.41392 chr2 97867072         7
8   2:97865943   6.40923 chr2 97865943         8
9   2:97868257   6.40907 chr2 97868257         9
10 rs116237095   6.40907 chr2 97868240        10

In order to get the following output data frame:

   ENTREZID      rank         SNP std.XPEHH CHR    BP   GENE NAME
1     11066         1 rs376121383   6.48482 chr2 97868880 SNRNP35
2     11066         2 rs201474380   6.47901 chr2 97868313 SNRNP35
3     11066         3   rs3926003   6.46181 chr2 97870247 SNRNP35
4     11066         5 rs372154564   6.42510 chr2 97866355 SNRNP35
5     11066         4 rs112920594   6.42510 chr2 97866321 SNRNP35
6     11066         6 rs190916690   6.42453 chr2 97868544 SNRNP35
7     11066         7 rs370914866   6.41392 chr2 97867072 SNRNP35
8     11066         8  2:97865943   6.40923 chr2 97865943 SNRNP35
9     11066         9  2:97868257   6.40907 chr2 97868257 SNRNP35
10    11066        10 rs116237095   6.40907 chr2 97868240 SNRNP35

It is possible to add the corresponding gene name to the SNP location by converting the df to GRanges object and use the findOverlaps function to check which of the SNPs coordinates are intersecting in which genes (based on their coordinates in the TxDb.Hsapiens.UCSC.hg19.knownGene database as GRanges object):

#To find overlapping SNPs coordinates to gene coordinates  Load following libraries
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
head(df)
# add rank column to the df (this will be usuful later on when fining overlapping rows, already added in the previous df)
df <- df %>% mutate(rank = 1:nrow(df))
# add "chr" before chromosome number in the data frame df in the CHR column (already added in the previous df)
df$CHR <- sprintf('chr%s', df$CHR)
# convert the data frame to GRanges object,
# it takes the chromosome number and the SNP coordinate on Base Pair (BP)
SNPs.gr <- GRanges( seqnames=df$CHR,
                    IRanges(start=df$BP,
                            width=1))
headSNPs.gr)    
# add the info of the reference genome
genomeSNPs.gr) <- 'hg19'
# get the coordinates of all the genes from database
txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
head(txs)
# find which coordinates of the SNP list are folling into the coordinates of the gene list in the data base
df_overlap<-findOverlaps(df,txs)
#modify name of first column of the previous output to merge with original df
names(df_overlap)[1]<-"rank"
df_merged<-merge(df, df_overlap, by="rank", all=T)
head(df_merged)

# To convert to ENTREZ IDs to HUGO gene names:
library(EnsDb.Hsapiens.v86)
hsens=EnsDb.Hsapiens.v86
my.symbols <- as.character(df_merged$subjectHits)
df_HUGO<-select(hsens,  
       keys = my.symbols, 
       columns = c("ENTREZID", "SYMBOL"), 
       keytype = "ENTREZID")
names(df_merged)[6]<-"ENTREZID"
df_final<-merge(df_merged, df_HUGO, by="ENTREZID", all=T)
ADD COMMENTlink modified 10 weeks ago • written 10 weeks ago by paolo002130
1
gravatar for Emily_Ensembl
10 weeks ago by
Emily_Ensembl17k
EMBL-EBI
Emily_Ensembl17k wrote:

The Ensembl VEP will tell you what genes they map to and the effects they have.

ADD COMMENTlink written 10 weeks ago by Emily_Ensembl17k

4th link covers this already, why answer? Maybe update/edit your existing answer from that link?

ADD REPLYlink modified 10 weeks ago • written 10 weeks ago by zx87546.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1903 users visited in the last hour