Question

LINK SNPS TO GENES USING BIOMART, KEEPING YOUR INITIAL POSITIONS IN THE RESULTING FILE

4.7 years ago

SGMS ▴ 130

Dear all,

I have a question regarding SNPs to genes mapping using BiomaRt.

I have successfully mapped my chr/start/end SNP positions to Ensembl gene IDs and gene symbols. I have done this several times, but my question is: can I keep my original SNP positions in the resulting file, which has the chr-start-end gene positions and names? I struggle every time I need to go back to the original SNP positions to link them to the genes I extracted from BiomaRt. As a picture is worth a thousand words, please see below:

my file (examples of SNP positions):

chr start       end
1   194972207   194972207
6   41187262    41187262
7   43120222    43120222
7   43120878    43120878

BiomaRt's gene results (examples of gene results):

chromosome_name start_position  end_position    ensembl_gene_id hgnc_symbol
1               119853316       119853748       ENSG00000227205 PFN1P9
1               119886304       119886927       ENSG00000226446 NOTCH2P1
1               119893533       119896515       ENSG00000134249 ADAM30

Therefore, I do have the gene list I requested based on my file's positions, but I do not have those positions in the resulting file. That would be really useful to have.

Any help would be greatly appreciated.

Thank you in advance

biomart snps genes mapping • 1.4k views

score 1 · Answer 1 · 2020-11-02

4.7 years ago

Emily 24k

Don't use BioMart, use the VEP instead.

score 0 · Answer 2 · 2020-10-30

What about using GenomicRanges to find the overlaps between the ranges of the two files? For example, say you have these 2 files, in which 2 SNPs fall within the ranges of PFN1P9 and ADAM30 (in your example files there were no matches):

> SNP
  chr     start       end
1   1 119853317 119853317
2   6  41187262  41187262
3   1 119893536 119893536
4   7  43120878  43120878
> biomart
  chromosome_name start_position end_position ensembl_gene_id hgnc_symbol
1               1      119853316    119853748 ENSG00000227205      PFN1P9
2               1      119893533    119896515 ENSG00000134249      ADAM30
3               1      119886304    119886927 ENSG00000226446    NOTCH2P1

If they are data frames, convert them to GRanges:

library(GenomicRanges)
# chr/start/end are recognised automatically
SNP.gr <- makeGRangesFromDataFrame(SNP, keep.extra.columns = TRUE)
# biomaRt names its coordinate columns differently, so point to them explicitly
biomart.gr <- makeGRangesFromDataFrame(biomart,
                                       start.field = "start_position",
                                       end.field = "end_position",
                                       keep.extra.columns = TRUE)

Then, use the findOverlaps function to match coordinates between GRanges, like so:

hits <- findOverlaps(query = SNP.gr, subject = biomart.gr)

The hits object gives you the matching positions between the 2 GRanges. With that info, you can do anything. Such as (a bit sloppy; if several SNPs hit the same gene, only the last one is kept):

# add the SNP coordinates as metadata columns on the gene ranges (NA where no SNP overlaps)
mcols(biomart.gr)$chrSNP <- NA
mcols(biomart.gr)$chrSNP[subjectHits(hits)] <- as.character(seqnames(SNP.gr))[queryHits(hits)]
mcols(biomart.gr)$startSNP <- NA
mcols(biomart.gr)$startSNP[subjectHits(hits)] <- start(SNP.gr)[queryHits(hits)]
mcols(biomart.gr)$endSNP <- NA
mcols(biomart.gr)$endSNP[subjectHits(hits)] <- end(SNP.gr)[queryHits(hits)]

And there's the object:

GRanges object with 3 ranges and 5 metadata columns:
      seqnames              ranges strand | ensembl_gene_id hgnc_symbol      chrSNP  startSNP    endSNP
         <Rle>           <IRanges>  <Rle> |        <factor>    <factor> <character> <integer> <integer>
  [1]        1 119853316-119853748      * | ENSG00000227205      PFN1P9           1 119853317 119853317
  [2]        1 119893533-119896515      * | ENSG00000134249      ADAM30           1 119893536 119893536
  [3]        1 119886304-119886927      * | ENSG00000226446    NOTCH2P1        <NA>      <NA>      <NA>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
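
If you would rather have a plain table with one row per SNP-gene pair (which also copes with several SNPs falling in the same gene), a minimal sketch could look like the following. It rebuilds the two example data frames from above so that it runs on its own; the name `result` is just an illustration, and it assumes the chromosome names are written the same way in both tables.

```r
library(GenomicRanges)

# Example data, same values as the SNP and biomart tables above
SNP <- data.frame(chr   = c("1", "6", "1", "7"),
                  start = c(119853317, 41187262, 119893536, 43120878),
                  end   = c(119853317, 41187262, 119893536, 43120878),
                  stringsAsFactors = FALSE)
biomart <- data.frame(chromosome_name = c("1", "1", "1"),
                      start_position  = c(119853316, 119893533, 119886304),
                      end_position    = c(119853748, 119896515, 119886927),
                      ensembl_gene_id = c("ENSG00000227205", "ENSG00000134249",
                                          "ENSG00000226446"),
                      hgnc_symbol     = c("PFN1P9", "ADAM30", "NOTCH2P1"),
                      stringsAsFactors = FALSE)

SNP.gr <- makeGRangesFromDataFrame(SNP)
biomart.gr <- makeGRangesFromDataFrame(biomart,
                                       start.field = "start_position",
                                       end.field = "end_position",
                                       keep.extra.columns = TRUE)

hits <- findOverlaps(query = SNP.gr, subject = biomart.gr)

# One row per overlap: the original SNP columns next to the gene columns
result <- cbind(SNP[queryHits(hits), ], biomart[subjectHits(hits), ])
rownames(result) <- NULL
print(result)
```

SNPs that do not overlap any gene are dropped from `result`; you can recover them with `SNP[-queryHits(hits), ]` when at least one SNP has a hit.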