Here is a fairly efficient way to do this; assuming hg38 and BEDOPS and standard Unix tools installed.
$ bedmap --echo --echo-map-id --delim '\t' <(awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt | sort-bed -) <(wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed -) > answer.bed
This gets around making and storing intermediate files. The output BED file will use the UCSC-favored chromosome naming scheme.
If you instead want to set up and investigate intermediate files to see how this works:
$ awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt | sort-bed - > sumstats.bed
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed - > snp150.bed
The sumstats.bed file will look something like:
$ echo "1:100000012_G_T" | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'
chr1 100000012 100000013 G/T
We add chr as a prefix to the chromosome name, so that mapping can be done on the UCSC dbSNP data.
Finally:
$ bedmap --echo --echo-map-id --delim '\t' sumstats.bed snp150.bed > answer.bed
The wget statement could take a while; it is 7GB compressed. Grab some coffee!
Update
If your reference genome is hg19, and if your sumstats.txt file has a header and is 1-based, you may need to make some changes.
sumstats header and 1-based indexing
If the coordinates in your sumstats.txt file use 1-based indexing, and if the sumstats.txt file contains a header, then you will need to make a change to the awk statement I provide above:
$ tail -n+2 sumstats.txt | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]-1"\t"a[2]"\t"a[3]"/"a[4];}' - | sort-bed - > sumstats.bed
dbSNP v150 for hg19
You could download the following VCF file of SNPs and convert it to BED via vcf2bed:
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz | vcf2bed --sort-tmpdir=${PWD} --max-mem=2G - > hg19.dbSNP150.bed
It may be preferable to download the dbSNP file directly from NCBI as a VCF file. This will eliminate uncertainty about indexing of coordinates in UCSC's table.
Mapping
Then run the bedmap command to map SNPs to sumstat elements:
$ bedmap --echo --echo-map-id --delim '\t' sumstats.bed hg19.dbSNP150.bed > answer.bed
Hello,
is Allel 1 always the REF or can this also be a variant?
Fin swimmer
Hi Fin swimmer,
Yes, Allele 1 is always the REF allele.
There are also separate columns for each allele in the sumstats file. The Allele1 column refers to the reference allele and this corresponds the last allele quoted in the MarkerName column (col 1):