LINK SNPS TO GENES USING BIOMART, KEEPING YOUR INITIAL POSITIONS IN THE RESULTING FILE
7 months ago
SGMS

Dear all,

I have a question regarding SNP-to-gene mapping using biomaRt.

I have successfully mapped my chr/start/end SNP positions to Ensembl gene IDs and gene symbols. I have done this several times, but my question is: can I keep my original SNP positions in the resulting file, which has the chr/start/end gene positions and names? I struggle every time I need to go back to the original SNP positions to link them to the genes I extracted from biomaRt. As a picture is worth a thousand words, please see below:

my file (examples of SNP positions):

chr start       end
1   194972207   194972207
6   41187262    41187262
7   43120222    43120222
7   43120878    43120878


biomaRt's gene results (examples):

chromosome_name start_position  end_position    ensembl_gene_id hgnc_symbol
1               119853316       119853748       ENSG00000227205 PFN1P9
1               119886304       119886927       ENSG00000226446 NOTCH2P1


Therefore, I do have the gene list I requested based on my file's positions, but I do not have those positions in the resulting file. That would be really useful to have.

Any help would be greatly appreciated.

7 months ago

Don't use BioMart; use the Ensembl Variant Effect Predictor (VEP) instead.

7 months ago
Papyrus

What about using GenomicRanges to find the overlaps between the ranges of the two files? For example, say you have these two tables, in which 2 SNPs fall within the ranges of PFN1P9 and ADAM30 (in your example files there were no matches):

> SNP
  chr     start       end
1   1 119853317 119853317
2   6  41187262  41187262
3   1 119893536 119893536
4   7  43120878  43120878
> biomart
  chromosome_name start_position end_position ensembl_gene_id hgnc_symbol
1               1      119853316    119853748 ENSG00000227205      PFN1P9
2               1      119893533    119896515 ENSG00000134249      ADAM30
3               1      119886304    119886927 ENSG00000226446    NOTCH2P1


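If they are data frames, convert them to GRanges:

library(GenomicRanges)
# The chr/start/end columns of SNP are detected automatically
SNP.gr <- makeGRangesFromDataFrame(SNP, keep.extra.columns = TRUE)
# The biomart table uses different start/end column names, so name them explicitly
biomart.gr <- makeGRangesFromDataFrame(biomart,
                                       start.field = "start_position",
                                       end.field = "end_position",
                                       keep.extra.columns = TRUE)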


Then, use the findOverlaps function to match coordinates between GRanges, like so:

hits <- findOverlaps(query = SNP.gr, subject = biomart.gr)
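
To make the overlap rule concrete, here is a rough base-R sketch of the default test as I understand it: same chromosome, and the two intervals share at least one position. It is only an illustration on the example tables above, not how findOverlaps is implemented, and the names pairs, overlaps and manual_hits are made up for this example. It compares every SNP with every gene, so it will not scale to large inputs.

```r
# Example tables copied from the console output above
SNP <- data.frame(
  chr   = c("1", "6", "1", "7"),
  start = c(119853317L, 41187262L, 119893536L, 43120878L),
  end   = c(119853317L, 41187262L, 119893536L, 43120878L)
)
biomart <- data.frame(
  chromosome_name = c("1", "1", "1"),
  start_position  = c(119853316L, 119893533L, 119886304L),
  end_position    = c(119853748L, 119896515L, 119886927L),
  ensembl_gene_id = c("ENSG00000227205", "ENSG00000134249", "ENSG00000226446"),
  hgnc_symbol     = c("PFN1P9", "ADAM30", "NOTCH2P1")
)

# Every (SNP row, gene row) index combination (hypothetical helper table)
pairs <- expand.grid(snp = seq_len(nrow(SNP)), gene = seq_len(nrow(biomart)))

# Two intervals overlap if they are on the same chromosome and
# each one starts before the other one ends
overlaps <- SNP$chr[pairs$snp] == biomart$chromosome_name[pairs$gene] &
  SNP$start[pairs$snp] <= biomart$end_position[pairs$gene] &
  SNP$end[pairs$snp] >= biomart$start_position[pairs$gene]

# Index pairs playing the same role as queryHits(hits) / subjectHits(hits)
manual_hits <- pairs[overlaps, ]
print(manual_hits)
```

The snp and gene columns of manual_hits are row indices into SNP and biomart, which is the same kind of information the hits object carries.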


The hits object gives you the matching positions between the 2 GRanges. With that info, you can do anything. Such as (a bit sloppy):

# Add empty SNP columns to the gene ranges, then fill them in for the genes that were hit
mcols(biomart.gr)$chrSNP <- NA
mcols(biomart.gr)$chrSNP[subjectHits(hits)] <- as.character(seqnames(SNP.gr))[queryHits(hits)]
mcols(biomart.gr)$startSNP <- NA
mcols(biomart.gr)$startSNP[subjectHits(hits)] <- start(SNP.gr)[queryHits(hits)]
mcols(biomart.gr)$endSNP <- NA
mcols(biomart.gr)$endSNP[subjectHits(hits)] <- end(SNP.gr)[queryHits(hits)]


And there's the object:

GRanges object with 3 ranges and 5 metadata columns:
      seqnames              ranges strand | ensembl_gene_id hgnc_symbol      chrSNP  startSNP    endSNP
         <Rle>           <IRanges>  <Rle> |        <factor>    <factor> <character> <integer> <integer>
  [1]        1 119853316-119853748      * | ENSG00000227205      PFN1P9           1 119853317 119853317
  [2]        1 119893533-119896515      * | ENSG00000134249      ADAM30           1 119893536 119893536
  [3]        1 119886304-119886927      * | ENSG00000226446    NOTCH2P1        <NA>      <NA>      <NA>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
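
One caveat with filling columns on the gene ranges: if several SNPs fall in the same gene, only one of them can be stored per gene. A possible alternative, sketched below on the same example tables, is to build one row per overlapping SNP/gene pair directly from the hits indices, so the original SNP positions sit next to the gene they hit. The name snp_gene is made up for this example, and the sketch assumes the GRanges keep the row order of the data frames they were built from.

```r
library(GenomicRanges)

# Example tables copied from the console output above
SNP <- data.frame(
  chr   = c("1", "6", "1", "7"),
  start = c(119853317L, 41187262L, 119893536L, 43120878L),
  end   = c(119853317L, 41187262L, 119893536L, 43120878L)
)
biomart <- data.frame(
  chromosome_name = c("1", "1", "1"),
  start_position  = c(119853316L, 119893533L, 119886304L),
  end_position    = c(119853748L, 119896515L, 119886927L),
  ensembl_gene_id = c("ENSG00000227205", "ENSG00000134249", "ENSG00000226446"),
  hgnc_symbol     = c("PFN1P9", "ADAM30", "NOTCH2P1")
)

SNP.gr <- makeGRangesFromDataFrame(SNP, keep.extra.columns = TRUE)
biomart.gr <- makeGRangesFromDataFrame(biomart,
                                       start.field = "start_position",
                                       end.field = "end_position",
                                       keep.extra.columns = TRUE)
hits <- findOverlaps(query = SNP.gr, subject = biomart.gr)

# One row per overlapping pair: SNP columns first, then the gene columns
snp_gene <- cbind(SNP[queryHits(hits), ], biomart[subjectHits(hits), ])
rownames(snp_gene) <- NULL
print(snp_gene)
```

SNPs that hit no gene are dropped from snp_gene; their row numbers are the ones missing from queryHits(hits) if you need to list them separately.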

Thank you for this. I have used findOverlaps before for another purpose, but VEP is actually a much easier solution once your data is in the right format. Thanks again for your help :)