Question

Obtaining rsIDs using BiomaRt or linux

0

Entering edit mode

5.8 years ago

camerond ▴ 190

I have a list of 390 SNPs that I need to retrieve rsIDs for on the hg19 build.

My SNP file is in the following bed format (chromosome start stop).

chr1    207679307   207679307
chr1    207684192   207684192   
chr1    207685786   207685786   
chr1    207685965   207685965   
chr1    207692049   207692049

BiomaRt has been recommended for this and I have managed to retrieve rsIDs for individual loci using this code:

snp_mart = useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")

test <- getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'), filters = c('chr_name','start','end'), values = list(7,37844263,37844263), mart = snpmart)

However, I'm not sure of the best way to quickly run all 390 SNPs through this code. Is this possible?

I suspect this may not be the best way to retrieve rsIDs, I have searched for other posts on this i.e see here but ideally I'm looking for a solution in R or linux.

Many Thanks

R Biomart rsID SNPs bed • 8.9k views

ADD COMMENT • link updated 2.2 years ago by Nance • 0 • written 5.8 years ago by camerond ▴ 190

3

Entering edit mode

Your list is not in BED format. BED is 0-based, meaning that start and end coordinate cannot be identical. If, say on a genome browser you look at a single bp, e.g. chr14:14000, then the BED entry must be chr14-13999-14000. Be sure to reformat correctly in order to retrieve the correct data!

ADD REPLY • link 5.8 years ago by ATpoint 81k

1

Entering edit mode

@OP: I think you don't need to use biomart at all. Format your data to bed format as ATpoint above suggested and then intersect with bedtools and latest dbSNP vcf data.. It should give you rsIDs and additional information from dbSNP.

ADD REPLY • link 5.8 years ago by cpad0112 21k

1

Entering edit mode

Why don't you save all the 390 SNPs in an object (will a data frame work?), and direct values = to that?

ADD REPLY • link 5.8 years ago by Benn 8.3k

4

Entering edit mode

5.8 years ago

cpad0112 21k

with OP data (takes some time):

> coords=read.csv("coord.txt", stringsAsFactors = F, strip.white = T, header = F, sep="\t")
> library(biomaRt)
> mart=useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")
> coords$V1=gsub("chr","",coords$V1)
> do.call(rbind,(apply(coords,1, function (x) getBM(attributes = c('refsnp_id','chrom_start','chrom_end', 'chrom_strand','allele'), filters = c('chr_name','start','end'), values = as.list(x), mart = mart))))

output:

        refsnp_id chrom_start chrom_end chrom_strand        allele
1   rs4844600   207679307 207679307            1         A/C/G
2  rs12037841   207684192 207684192            1           T/G
3   rs4266886   207685786 207685786            1           T/C
4   rs4562624   207685965 207685965            1           A/C
5   rs6656401   207692049 207692049            1           A/G
6    CR120974   207692049 207692049            1 HGMD_MUTATION
7 rs386638846   207692049 207692051            1       ATC/GTT
>

you can also use a loop as follows in lines of Dan:

code:

> coords=read.csv("coord.txt", stringsAsFactors = F, strip.white = T, header = F, sep="\t")
> library(biomaRt)
> mart=useMart(biomart="ENSEMBL_MART_SNP", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")
> coords$V1=gsub("chr","",coords$V1)
> results=data.frame(matrix(ncol=5,nrow=0))
> for (i in seq (1,nrow(coords))){
    ens=getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'), filters = c('chr_name','start','end'), values = as.list(coords[i,]), mart = mart)
    results=rbind(results,ens)}

output:

> results
    refsnp_id        allele chrom_start chrom_strand
1   rs4844600         A/C/G   207679307            1
2  rs12037841           T/G   207684192            1
3   rs4266886           T/C   207685786            1
4   rs4562624           A/C   207685965            1
5   rs6656401           A/G   207692049            1
6    CR120974 HGMD_MUTATION   207692049            1
7 rs386638846       ATC/GTT   207692049            1

ADD COMMENT • link 5.8 years ago by cpad0112 21k

3

Entering edit mode

You can improve the speed of this approach by noting that you can use a vector of values if you use the chromosomal_region filter. This requires that you supply the SNP coordinates in the form chrom:start:end. You then don't need a loop or the do.call expression, just a single call to getBM(). This is much faster as there's only a single query transmitted, which is the primary bottleneck

To demonstrate, for me the do.call example takes:

system.time(do.call(rbind,(apply(input,1, function (x) getBM(attributes = c('refsnp_id','chrom_start','chrom_end', 'chrom_strand','allele'), filters = c('chr_name','start','end'), values = as.list(x), mart = snp_mart)))))

 user  system elapsed 
0.221   0.009   1.723

Now transform the data and submit again:

coords <- apply(coords, 1, paste, collapse = ":")
system.time(getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'), 
   filters = 'chromosomal_region', 
   values = coords, 
   mart = snp_mart))

 user  system elapsed 
0.037   0.000   0.175

ADD REPLY • link 5.8 years ago by Mike Smith ★ 2.0k

0

Entering edit mode

Thanks @Mike. Next time will use that filter. I think this is the best biomart solution for OP request.

ADD REPLY • link 5.8 years ago by cpad0112 21k

0

Entering edit mode

Mike's code worked like a charm for me :) Thanks, Mike!!

ADD REPLY • link 3.4 years ago by rodd ▴ 230

2

Entering edit mode

@OP: As ATpoint pointed, file above is not 0 based, probably 1 based. You do not need biomart package. With tabix and parallel (from ubuntu 18.04 CLI), you can get the information you need:

output:

$ awk '{gsub("chr","",$1)} {print $1":"$2"-"$3}' coord.txt | parallel  --delay 1 tabix ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz {} | awk -v OFS="\t" 'BEGIN {print "chromosome","Coordinate","rsID","Ref","Alt"} {print "chr"$1,$2,$3,$4,$5}' 
chromosome  Coordinate  rsID    Ref Alt

chr1    207685965   rs4562624   A   C,T
chr1    207684192   rs12037841  T   G
chr1    207685786   rs4266886   T   C
chr1    207679307   rs4844600   A   C,G
chr1    207692048   rs386638846 CATC    CGTT
chr1    207692049   rs6656401   A   G,T

input:

$ cat coord.txt 
chr1    207679307   207679307
chr1    207684192   207684192
chr1    207685786   207685786
chr1    207685965   207685965
chr1    207692049   207692049

Note:

For chr1:207692049-207692049, you would get two rsID as RSPOS for rs386638846 is 207692049.
Tabix and parallel are available linux repos (ubuntu), conda channels.

ADD REPLY • link 5.8 years ago by cpad0112 21k

2

Entering edit mode

5.8 years ago

Pierre Lindenbaum 161k

using mysql ucsc:

$  awk 'BEGIN{printf("select chrom,chromStart+1 as chromStart ,chromEnd,name from snp150 where ");} {printf("%s ( chrom=\"%s\" and chromStart=%d and chromEnd=%d) ",(NR==1?"":" OR "),$1,int($2)-1,$3);} END{printf(";\n");}'  input.tsv |\
 mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg19

chrom  chromStart  chromEnd   name
chr1   207679307   207679307  rs4844600
chr1   207684192   207684192  rs12037841
chr1   207685786   207685786  rs4266886
chr1   207685965   207685965  rs4562624
chr1   207692049   207692049  rs6656401
chr1   207692049   207692049  rs111181074

ADD COMMENT • link 5.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi all, would you recommend these top voted suggestions to be the best way of resolving this same query, but for ~ 8 million SNP names/rsids? My original post is how to query rsid's using dbsnp in r or linux. Many thanks!

ADD REPLY • link 2.2 years ago by Nance • 0

1

Entering edit mode

download the dbsnp as VCF from NCBI, sort on ID column, query using join

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you - do both files need to be in vcf format for this? I have one .txt.gz file and one vcf.gz file and I am struggling to do this.

The .txt file is very standard format:

Chr     Pos     rsID    A0      A1      Beta-A1 P            
chr1    11888   NA      T       C       -0.109  0.81     
chr1    11213   NA      T       C       0.215   0.46    
chr1    11234   NA      T       C       -0.135  0.8      
chr1    12567   NA      C       T       0.177   0.77   
chr1    13333   NA      G       A       -0.165  0.81

And the .vcf is from dbsnp latest release https://ftp.ncbi.nih.gov/snp/latest_release/VCF/

##fileformat=VCFv4.2
##fileDate=20210513
##source=dbSNP
##dbSNP_BUILD_ID=155
##reference=GRCh38.p13
##phasing=partial

And then more vcf lines until

#CHROM          POS     ID              REF          ALT     
NC_000001.11    10001   rs1570391677    T           A,C       
NC_000001.11    10002   rs1570391692    A           C          
NC_000001.11    10003   rs1570391694    A           C

So I would like to use columns chr / pos / ref allele / alt allele to 'fill in' or 'impute' the RSIDS in to the .txt file above.

Is join still the most suitable way?

Thank you!

ADD REPLY • link 2.2 years ago by Nance • 0

score 2 · Accepted Answer · 2018-06-27

2

Entering edit mode

5.8 years ago

dan_adams ▴ 20

This will do (rough and ready):

results <- c()
# Initialise storage vector

for (snp in 1:dim(df)[1]) { 
#df is the object name of the SNP file

chrom <- gsub(pattern = 'chr', replacement = '', x = df[snp, 1]) 
# Remove 'chr' for searching

temp <- getBM(attributes = c('refsnp_id', 'allele', 'chrom_start', 'chrom_strand'), filters = c('chr_name', 'start', 'end'), values = list(chrom, df[snp, 2] , df[snp, 3]), mart = snp_mart)

temp[, 5] <- snp 
# Store SNP file index row next to each answer

results <- rbind(results, temp)

}

Not the cleanest but should work!

EDIT: might make more sense to change temp[, 5] <- snp to store the actual SNP coordinate rather than index, so i.e. df[snp, 2], for example

ADD COMMENT • link 5.8 years ago by dan_adams ▴ 20

1

Entering edit mode

Hi @dan_adams

Thanks for your response. This suggestion ran for a bit (it takes a while) but after 28 SNPs it threw the following error:

Error in `[<-.data.frame`(`*tmp*`, , 5, value = 27L) : 
replacement has 1 row, data has 0

It's obviously choking as the SNP does not correspond to an rsID in the database so I've added an if statement to get round this:

if (dim(temp)[1] == 1) {

results <- rbind(results, temp)

} else {

print(paste("SNP", ADSNPs[snp, 2], "not found"))

}

ADD REPLY • link 5.8 years ago by camerond ▴ 190

1

Entering edit mode

Hellodan_adams,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

code_formatting

Thank you!

ADD REPLY • link 5.8 years ago by finswimmer 16k