Here is a fairly efficient way to do this; assuming hg38
and BEDOPS and standard Unix tools installed.
$ bedmap --echo --echo-map-id --delim '\t' <(awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt | sort-bed -) <(wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed -) > answer.bed
This gets around making and storing intermediate files. The output BED file will use the UCSC-favored chromosome naming scheme.
If you instead want to set up and investigate intermediate files to see how this works:
$ awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt | sort-bed - > sumstats.bed
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed - > snp150.bed
The sumstats.bed
file will look something like:
$ echo "1:100000012_G_T" | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'
chr1 100000012 100000013 G/T
We add chr
as a prefix to the chromosome name, so that mapping can be done on the UCSC dbSNP data.
Finally:
$ bedmap --echo --echo-map-id --delim '\t' sumstats.bed snp150.bed > answer.bed
The wget
statement could take a while; it is 7GB compressed. Grab some coffee!
Update
If your reference genome is hg19
, and if your sumstats.txt
file has a header and is 1-based, you may need to make some changes.
sumstats header and 1-based indexing
If the coordinates in your sumstats.txt
file use 1-based indexing, and if the sumstats.txt
file contains a header, then you will need to make a change to the awk
statement I provide above:
$ tail -n+2 sumstats.txt | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]-1"\t"a[2]"\t"a[3]"/"a[4];}' - | sort-bed - > sumstats.bed
dbSNP v150 for hg19
You could download the following VCF file of SNPs and convert it to BED via vcf2bed
:
$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz | vcf2bed --sort-tmpdir=${PWD} --max-mem=2G - > hg19.dbSNP150.bed
It may be preferable to download the dbSNP file directly from NCBI as a VCF file. This will eliminate uncertainty about indexing of coordinates in UCSC's table.
Mapping
Then run the bedmap
command to map SNPs to sumstat elements:
$ bedmap --echo --echo-map-id --delim '\t' sumstats.bed hg19.dbSNP150.bed > answer.bed
Hello,
is Allel 1 always the REF or can this also be a variant?
Fin swimmer
Hi Fin swimmer,
Yes, Allele 1 is always the REF allele.
There are also separate columns for each allele in the sumstats file. The Allele1 column refers to the reference allele and this corresponds the last allele quoted in the MarkerName column (col 1):