Loop to retrieve rsID out of chrID list
1
0
Entering edit mode
3.8 years ago
aroso491 • 0

Hello, I have a GWAS summary statistics file where some of the rows are not rsID but chrID. I want to use bioMart package in R in order to find all the rsIDs for my 232 chrIDs in this format:

chr1_11105545 1 11105545 T C 0.965395 0.72017 -7.40832e-03 0.00460502 0.113547560

In bold the columns corresponding to chromosome (CHR) and position (POS). I know and have tested that it works if I use "values = list(chr_number, pos_number start, pos_number end)", but given that the size is 232, I am trying to write a loop to get all the list done.

I have done this but when I run it I get the error:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300001 milliseconds with 15732557 bytes received

Which I think it is due to the fact that I'm somehow making an infinite loop or something of the likes of it? I've been at it for three days now though and I am a bit confused so I'd appreciate if someone helped me find the error or hinted me in the right direction!

#Extract ch_# rows in R dataframe (yields 232 hits)
SSNP_M_Ch <- subset(SSNP_M, grepl("chr\\d_", SNP))

snp = useMart("ENSEMBL_MART_SNP", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")

results<-c()
for (i in 1:dim(SSNP_M_Ch)[1]){
  temp <- getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'),
                filters = c('chr_name','start','end'), 
                values = list(SSNP_M_Ch$CHR,SSNP_M_Ch$POS,SSNP_M_Ch$POS), mart = snp)  
  temp[] <- i
  results <- rbind(results,temp)
}

Thanks!

SNP R genome snp biomart • 2.8k views
ADD COMMENT
1
Entering edit mode

At least, you can easily check whether there is some progress adding 'print(head(temp))' after getBM query. Hovewer, in such a loop I would expect some subsetting like SSNP_M_Ch$CHR[i] aimed to feed getBM with a single current value. Presented code looks like whole CHR and POS columns of SSNP_M_Ch are passed to getBM.

Also, as far as I can understand, 'temp[] <- i' fills a dataframe after getBM query with a current index value - is it really something it meant to do?

How about this:

library(tidyverse)

results<- as.list(1:nrow(SSNP_M_Ch))

for (i in seq_along(results)){
  current_query = list(SSNP_M_Ch$CHR[i],
                       SSNP_M_Ch$POS[i],
                       SSNP_M_Ch$POS[i])
  print(current_query)

  results[[i]] <- getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'),
            filters = c('chr_name','start','end'), 
            values = current_query,
            mart = snp)
}
results = reduce(results, bind_rows)

By printing each current request, you'll be able to find if some id's cause troubles in getBM.

ADD REPLY
0
Entering edit mode

BTW, I do not know whether the list that you passed to getBW as 'values' argument is correct - repeating POS values twice looks counterintuitive to me, but I did not work with biomartr and can't advice on this matter.

ADD REPLY
0
Entering edit mode

Thanks! I got it sorted in the end, I didn't realize I was passing the whole column instead of going row by row... I ended up doing SSNP_M_Ch$CHR[i,2], SSNP_M_Ch$POS[i,3] and just added the index value as another column in temp and that worked smoothly.

ADD REPLY
2
Entering edit mode
3.8 years ago
Mike Smith ★ 2.0k

It looks like you got a working solution, but I just wanted to add that you don't need to do this in a loop at all. You can use the chromosomal_region filter and then give a vector of values where each element is the form chrom:start:end. This should be much quicker than looping over each location separately, and you'll get back a data.frame directly.

Here's an example with a couple of SNPs in GRCh38:

library(biomaRt)
## Use the default ENSEMBL Variation Mart & Human dataset
snp = useEnsembl(biomart = "snps", 
                 dataset = "hsapiens_snp")

## Create an example set of coordinates as a dataframe
SNP_M <- data.frame(CHR = c(1,1), START = c(10020, 10039), END = c(10020, 10039))

## Combine these into the format chr:start:end
## It's important to include the end even if it's a single base, 
## otherwise it searches to the end of the chromosome
coords <- apply(SNP_M, 1, paste, collapse = ":")
coords
#> [1] "1:10020:10020" "1:10039:10039"

## Submit the query
getBM(attributes = c('refsnp_id', 'chr_name', 'chrom_start', 'chrom_end', 'allele'),
      filters = c('chromosomal_region'), 
      values = coords, 
      mart = snp)  
#>     refsnp_id chr_name chrom_start chrom_end allele
#> 1 rs775809821        1       10020     10021   AA/A
#> 2 rs978760828        1       10039     10039    A/C
ADD COMMENT
0
Entering edit mode

This is much more elegant and simple, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 2934 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6