biomaRt search for a list of values [dplyr + column(s)]
3.8 years ago
deepue ▴ 160


I would like to query the biomaRt databases to retrieve Ensembl Gene IDs (ensembl_gene_stable_id) for a list of SNPs (snp_filter) taken from the user input testData$rsNum, in a tidyverse way.

# Fields are separated by explicit tabs (\t) so that read_tsv can split them.
testData <- readr::read_tsv("rs1467475747\t8\t148357
rs1378018226\t8\t148383
rs546813474\t8\t148402
rs1175049916\t8\t148522
rs1187272067\t8\t148523
rs1427441701\t8\t148553
rs201635470\t8\t148556
rs1483428031\t8\t148608
rs1251102826\t8\t148610",
                     col_names = c("rsNum", "chrNum", "pos"),
                     col_types = "cii")

I attempted to pass the filters as column names as below:

grch37.snp = useMart(biomart="ENSEMBL_MART_SNP", host="", dataset="hsapiens_snp")
testData %>% getBM(attributes=c("refsnp_id", "chr_name", "chrom_start", "chrom_end", 
                                "ensembl_gene_stable_id", "associated_gene"), 
                   filters=c("snp_filter", "chr_name", "start", "end"), 
                   values=list(rsNum, chrNum, pos, pos), 
                   mart=grch37.snp, uniqueRows=TRUE)

which resulted in the error:

Error in getBM(., attributes = c("refsnp_id", "chr_name", "chrom_start", : object 'rsNum' not found

Is there any error in this approach of querying the marts?

However, I have also found a workaround that achieves the same result another way (source):

getBM(attributes=c("refsnp_id", "chr_name", "chrom_start", "chrom_end", 
                                "ensembl_gene_stable_id", "associated_gene"), 
                   filters=c("snp_filter", "chr_name", "start", "end"), 
                   values=list(testData$rsNum, testData$chrNum, testData$pos, testData$pos), 
                   mart=grch37.snp, uniqueRows=TRUE)

Though the latter command produces the expected output, I am looking for a way to make the former approach work by passing only the column names (rsNum, chrNum, pos, pos). Are you aware of any possibilities?

Thanks for taking the time to answer the question.

BiomaRt Ensembl • 1.1k views
3.8 years ago

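Your workaround is actually the correct way to do this. When you read in the data, it is stored as a tibble (a fancy data.frame). To extract the values in a column, you need to reference the column with either a dollar sign (testData$pos) or double square brackets (testData[["pos"]]). The pipe only passes testData as the first argument of getBM; getBM does not look up bare names such as rsNum inside it, which is why you get "object 'rsNum' not found". If you wanted to literally pass pos to the getBM function, you would first need to define that variable with pos <- testData$pos. However, you can take a little shortcut, since the values argument takes a list of vectors and a tibble is itself a list of column vectors.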


testData <- testData %>%
  rename("start" = pos) %>%
  mutate("end" = start) %>%
  as.list()

This gives you the list of values you need as input.

> testData
$rsNum
[1] "rs1467475747" "rs1378018226" "rs546813474"  "rs1175049916" "rs1187272067"
[6] "rs1427441701" "rs201635470"  "rs1483428031" "rs1251102826"

$chrNum
[1] 8 8 8 8 8 8 8 8 8

$start
[1] 148357 148383 148402 148522 148523 148553 148556 148608 148610

$end
[1] 148357 148383 148402 148522 148523 148553 148556 148608 148610

You can now just pass testData to the values argument.

getBM(
  attributes=c("refsnp_id", "chr_name", "chrom_start", "chrom_end",
               "ensembl_gene_stable_id", "associated_gene"),
  filters=c("snp_filter", "chr_name", "start", "end"),
  values=testData, mart=grch37.snp, uniqueRows=TRUE)
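
If you would still rather write the bare column names, one option that may fit is base R's with(), which evaluates a call with the columns of a data frame in scope. The sketch below avoids contacting Ensembl: snps is a cut-down copy of the question's data, and describe_values is a hypothetical stand-in for getBM that only reports how many values it received per filter. Both names are assumptions made for illustration.

```r
# Small stand-in for testData, built with base R so no packages are needed.
snps <- data.frame(
  rsNum  = c("rs1467475747", "rs1378018226", "rs546813474"),
  chrNum = c(8L, 8L, 8L),
  pos    = c(148357L, 148383L, 148402L),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for getBM(): it checks that there is one vector
# of values per filter and returns the number of values for each filter.
describe_values <- function(filters, values) {
  stopifnot(is.list(values), length(filters) == length(values))
  setNames(lengths(values), filters)
}

# A data frame is a list of column vectors, so `$` and `[[` give the same vector.
stopifnot(is.list(snps), identical(snps$pos, snps[["pos"]]))

# with() evaluates the call with the columns of `snps` in scope,
# so the bare name rsNum resolves to snps$rsNum, and so on.
result <- with(snps, describe_values(
  filters = c("snp_filter", "chr_name", "start", "end"),
  values  = list(rsNum, chrNum, pos, pos)
))
print(result)
```

For the real query, the same pattern would presumably read with(testData, getBM(..., values=list(rsNum, chrNum, pos, pos), mart=grch37.snp)). I have not run that against the mart here, so treat it as a sketch.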
