Get rsID for a list of SNPs in an entire GWAS sumstats file
3
2
Entering edit mode
6.6 years ago
camerond ▴ 190

I have downloaded a GWAS sumstats file and have a list of SNP locations.

The information for each SNP location is condensed in column 1 of the sumstats file in the following format:

    1:100000012_G_T
  • 1 = Chromosome
  • 100000012 = SNP location
  • T = Allele 1
  • G = Allele 2

I'm trying to run an LD Score analysis on this sumstats file, but I first need merge these SNPs with SNPs in the hapmap3 SNP list that has SNP IDs in the rs number format (obvs incompatible with the info in the sumstats file).

I'm looking for the most efficient way to obtain rsIDs for all the SNPs I have in the sumstats file (munging the file itself is not an issue).

Is there an R/bash method/package/command that can do this (I've checked out bed2vcf in R but don't think I have enough information for this to work)?

Or, alternatively, is there a way to generate a VCF file from the information I have here? I've looked into bcftools annotate to generate my rs IDs. I have already downloaded hg19 dbSNP VCF file that contains all the rsIDs and associated info, but not sure how to generate the second VCF file (from my sumstats SNP ID info) to run this command.

next-gen sequencing GWAS sumstats • 9.9k views
ADD COMMENT
0
Entering edit mode

Hello,

is Allel 1 always the REF or can this also be a variant?

Fin swimmer

ADD REPLY
0
Entering edit mode

Hi Fin swimmer,

Yes, Allele 1 is always the REF allele.

There are also separate columns for each allele in the sumstats file. The Allele1 column refers to the reference allele and this corresponds the last allele quoted in the MarkerName column (col 1):

MarkerName  Allele1 Allele2 Effect  StdErr  P.value Direction
1:100000012_G_T t   g   -0.0355 0.019   0.06185 +--
ADD REPLY
7
Entering edit mode
6.6 years ago

Hello,

here is a way which can work. First create a python3 script called sumstat2vcf.py with this code:

import re
import sys

for line in sys.stdin.readlines():
    data = re.split(':|_', line.strip())
    print(data[0], data[1], ".", data[2], data[3], ".", ".", ".", sep="\t")

It expect a list of the location in the format you give above. So we can use cut on your sumstat file to get the first column. If the first line of the file decscribes the columns we have to use also tail to get rid of it. The result can be streamed directly to the python script to create a vcf file:

cut -f1 sumstat.txt|tail -n+2|python sumstat2vcf.py|sort -k1,1V -k2,2g > sumstat.vcf

For annotating I prefer snpSift. To snpSift we can also pipe directly:

cut -f1 sumstat.txt|tail -n+2|python sumstat2vcf.py|sort -k1,1V -k2,2g|java -jar SnpSift.jar annotate -id dbSNP.vcf.gz > sumstat_annotated.vcf

If you now just wants a list of rs IDs we can go on using cut and grep:

cut -f1 sumstat.txt|tail -n+2|python sumstat2vcf.py|sort -k1,1V -k2,2g|java -jar SnpSift.jar annotate -id dbSNP.vcf.gz|cut -f3|grep -e "^rs" > rsId_list.txt

Good luck :)

fin swimmer

EDIT:

Based on Alex' answer you can use awk instead of the python script. Just replace python sumstat2vcf.py with

awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t.\t"a[3]"\t"a[4]"\t.\t.\t."}'

EDIT2:

Tidied up the python code a little bit.

ADD COMMENT
1
Entering edit mode

Hi finswimmer,

Finally got it working and accepted your answer. Thanks for your help. The final issue I was having was with the zipped dbSNP file. When trying to index with tabix it was throwing an error. Zipping had to be done with bgzip, I'd used something else to zip the file initially.

The final commands I used were:

dbSNP vcf file for hg19 downloaded from here.

Then, to prep the dbSNP vcf file for snpSift:

bgzip 00-All.vcf.gz

tabix -p vcf 00-All.vcf.gz

And finally:

cut -f1 sumstat.txt | tail -n+2 | awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t.\t"a[3]"\t"a[4]"\t.\t.\t."}'  | sort -k1,1V -k2,2g | java -jar SnpSift.jar annotate -id 00-All.vcf.gz > sumstat_annotated.vcf
ADD REPLY
1
Entering edit mode
6.6 years ago

Here is a fairly efficient way to do this; assuming hg38 and BEDOPS and standard Unix tools installed.

$ bedmap --echo --echo-map-id --delim '\t' <(awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt | sort-bed -) <(wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed -) > answer.bed

This gets around making and storing intermediate files. The output BED file will use the UCSC-favored chromosome naming scheme.

If you instead want to set up and investigate intermediate files to see how this works:

$ awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt | sort-bed - > sumstats.bed
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp150.txt.gz | gunzip -c | cut -f2-5 | sort-bed - > snp150.bed

The sumstats.bed file will look something like:

$ echo "1:100000012_G_T" | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'
chr1    100000012   100000013   G/T

We add chr as a prefix to the chromosome name, so that mapping can be done on the UCSC dbSNP data.

Finally:

$ bedmap --echo --echo-map-id --delim '\t' sumstats.bed snp150.bed > answer.bed

The wget statement could take a while; it is 7GB compressed. Grab some coffee!


Update

If your reference genome is hg19, and if your sumstats.txt file has a header and is 1-based, you may need to make some changes.

sumstats header and 1-based indexing

If the coordinates in your sumstats.txt file use 1-based indexing, and if the sumstats.txt file contains a header, then you will need to make a change to the awk statement I provide above:

$ tail -n+2 sumstats.txt | awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]-1"\t"a[2]"\t"a[3]"/"a[4];}' - | sort-bed - > sumstats.bed

dbSNP v150 for hg19

You could download the following VCF file of SNPs and convert it to BED via vcf2bed:

$ wget -qO- ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz | vcf2bed --sort-tmpdir=${PWD} --max-mem=2G - > hg19.dbSNP150.bed

It may be preferable to download the dbSNP file directly from NCBI as a VCF file. This will eliminate uncertainty about indexing of coordinates in UCSC's table.

Mapping

Then run the bedmap command to map SNPs to sumstat elements:

$ bedmap --echo --echo-map-id --delim '\t' sumstats.bed hg19.dbSNP150.bed > answer.bed
ADD COMMENT
0
Entering edit mode
awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt

I was sure there is a awk way to split a text :)

Thanks for showing.

fin swimmer

ADD REPLY
0
Entering edit mode

Hi Alex,

Thanks for the response. Just getting round to this now as I was off for Easter.

I had a couple of minor issues with the sort-bed command.

Firstly it didn't like the header:

Non-numeric start coordinate.  See line 1 in answer.bed.
(remember that chromosome names should not contain spaces.)

So I removed this: tail -n+2 answer.bed > test.bed

I'm actually using hg37, so when I downloaded this (it did take a while!) bedops threw the following error, Error on line 6 in -. Genomic end coordinate is less than (or equal to) start coordinate. which, perhaps unsurprisingly, was due to some end coordinates matching the start coordinates in the UCSC SNP file.

I removed these using awk '{if ($2 != $3) print $0}' snp150.bed > snp150.MatchesRemoved.bed

Don't suppose you know if its valid or not to remove these?? A typical entry from the downloaded SNP file is:

585 chr1    10055   10055   rs768019142 0   +   -   -   -/A genomic insertion   unknown 0   0   near-gene-5 betweenSSMP,    0

In the original file there were ~234M SNP entries and I removed 7M of them.

However, I have a more pressing concern as when I run the bedmap command it only returns rsIDs for 1.5M SNPs. My original sumstats file contained 9.5M SNPs, this seems extremely low to me. The sumstat file was definately generated from and alignment with hg37 (or hg19) so I'm finding it hard to understand why so few SNPs in the sumstat file have been mapped.

I downloaded the UCSC SNPs from here.

Thanks.

ADD REPLY
0
Entering edit mode

Hello camerond,

the SNP file you've downloaded is neither a valid bed file nor a valid vcf file. It seems to be a database dump from UCSC. Do you realy need this file? The current dbSNP file in vcf format for hg19 is located here.

The next thing is, are the positions in the GWAS sumstats 0-based or 1-based? If the later one, than you have to change the awk command if you want to have a bed file. Because bed files are 0-based.

 awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]-1"\t"a[2]"\t"a[3]"/"a[4];}'

fin swimmer

ADD REPLY
0
Entering edit mode

Please see the updated answer. Hopefully this addresses your questions.

ADD REPLY
0
Entering edit mode

If I'm not mistaken, SNP positions are usually reported using 1-based coordinates, while the ucsc bed files are 0-based, so this will create an off-by-one error. This could be corrected by modifying the awk command to subtract 1 from the start position rather than adding one to the end position:

awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]-1"\t"a[2]"\t"a[3]"/"a[4];}' sumstats.txt
ADD REPLY
0
Entering edit mode

Have a look at my post 2h before yours ;)

ADD REPLY
0
Entering edit mode

D'oh! I guess I should of read through all of the comments before posting

ADD REPLY
0
Entering edit mode

Also, is this approach able to handle cases where multiple variants map to the same coordinates?

ADD REPLY
0
Entering edit mode

Yes, this will map multiple records correctly, so long as the inputs are sorted per sort-bed. Use of vcf2bed uses sort-bed behind the scenes.

ADD REPLY
0
Entering edit mode

Hi Alex,

Thanks for the update. I still can't get this working unfortunately.

I altered my sumstats file using both awk commands, as I'm unsure if the sumstats file is 0 or one based, and ran them separately.

Full header of sumstats pre-awk:

MarkerName  Allele1 Allele2 Effect  StdErr  P.value Direction   HetISq  HetChiSq    HetDf   HetPVal Pval_IBDseq Pval_IIBDGC Pval_GWAS3  Min_single_cohort_pval
1:100000012_G_T t   g   -0.0355 0.019   0.06185 +-- 83.7    12.236  2   0.002202    0.216638    0.8872  0.000166406 0.000166406
1:10000006_G_A  a   g   -0.0385 0.1421  0.7864  +-- 54.9    4.436   2   0.1088  0.0947142   0.7692  0.201597    0.0947142
1:100000827_C_T t   c   -0.0456 0.0176  0.009574    +-- 76  8.338   2   0.01546 0.504605    0.4085  0.000190761 0.000190761

Header of bed file using 1-based awk command awk:

'{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]-1"\t"a[2]"\t"a[3]"/"a[4];}'

chr1    714276  714277  C/A a   c   -0.0542 0.1375  0.6936  --+ 35  3.076   2   0.2148  0.72696 0.08271 0.754039    0.08271
chr1    731717  731718  T/C t   c   -0.0583 0.0302  0.0541  --- 0   0.591   2   0.744   0.638835    0.08543 0.289984    0.08543
chr1    732808  732809  T/C t   c   0.0384  0.0386  0.3206  +-+ 45  3.637   2   0.1623  0.584316    0.04167 0.673273    0.04167

Header of bed file using 0-based awk command:

awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'

chr1    714277  714278  C/A a   c   -0.0542 0.1375  0.6936  --+ 35  3.076   2   0.2148  0.72696 0.08271 0.754039    0.08271
chr1    731718  731719  T/C t   c   -0.0583 0.0302  0.0541  --- 0   0.591   2   0.744   0.638835    0.08543 0.289984    0.08543
chr1    732809  732810  T/C t   c   0.0384  0.0386  0.3206  +-+ 45  3.637   2   0.1623  0.584316    0.04167 0.673273    0.04167

Header of snp.vcf file

1   10018   10019   rs775809821 .   TA  T   .   RS=775809821;RSPOS=10020;dbSNPBuildID=144;SSR=0;SAO=0;VP=0x050000020005000002000200;GENEINFO=DDX11L1:100287102;WGT=1;VC=DIV;R5;ASP
1   10038   10039   rs978760828 .   A   C   .   RS=978760828;RSPOS=10039;dbSNPBuildID=150;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP
1   10042   10043   rs1008829651    .   T   A   .   RS=1008829651;RSPOS=10043;dbSNPBuildID=150;SSR=0;SAO=0;VP=0x050000020005000002000100;GENEINFO=DDX11L1:100287102;WGT=1;VC=SNV;R5;ASP

The issue I have is that after running the bedmap command on either sumstats.bed file i.e.:

bedmap --echo --echo-map-id --delim '\t' sumstats_zeroBased.bed hg19.dbSNP150.bed > answer_zeroBased.bed

No rsIDs are included in the answer.bed file. I am returned a 9,570787 by 18 file that is identical to the sumstats.bed input file.

I used the following commands:

awk '{n=split($0,a,/[:_]/); print "chr"a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}' sumstats.txt | tail -n+2 | sort-bed - > sumstats_zeroBased.bed

bedmap --echo --echo-map-id --delim '\t' sumstats_zeroBased.bed hg19.dbSNP150.bed > answer_zeroBased.bed

The snp list vcf file I'm using is not the one you recommend:

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/All_20170710.vcf.gz

but the one @finswimmer recommended:

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/00-All.vcf.gz

I'm sure this is not causing the issue as they are from the same site, just different versions, right? These files are 56Gs so didn't want to download a second unnecessarily.

ADD REPLY
0
Entering edit mode

Hello camerond,

the link to dbSnp I gave you is just a symlink to the one Alex gave you. So it should be the same file.

As i don't use bedmap I can help you with Alex' solution. But have you already tried mine (the version with awk)? With the example you gave here it works for me assume the position given are 1-based. I edited my answer again a little bit, by adding a sort step.

fin swimmer

ADD REPLY
0
Entering edit mode

Hi finswimmer,

I have tried your suggestion (pre-update), Alex's and several other methods (including bcftools). The issue I was having with your suggestion is with java. I'd never used Java before, and had issues getting snpSift to work.

Tbh I didn't give it a huge amount of time, as I was more familiar with the packages that Alex suggested. I'll have another go with snpSift now.

Thanks.

P.S. How can you tell definitively if the positions in the snp-file are 1-based or 0-based?

ADD REPLY
0
Entering edit mode

P.S. How can you tell definitively if the positions in the snp-file are 1-based or 0-based?

Maybe it's not 'definitively' enough, but I tried your example. With 1-based all variants get a rs-ID with 0-based none get one. So I guess that's a strong hint.

fin swimmer

ADD REPLY
0
Entering edit mode

Hi Alex,

I have been trying to convert the vcf file All_20170710.vcf to bed format for build 37 using the following command:

./bin/vcf2bed --sort-tmpdir=${PWD} --max-mem=2G - < All_20170710.vcf > hg19.dbSNP150.bed

Everytime, I get a warning that the command "convert2bed not found". However, this binary is present in the bin.

Any suggestion, where am I going wrong?

ADD REPLY
0
Entering edit mode

Use the installation instructions and either put the binaries in /usr/local/bin or modify your shell environment's PATH variable: https://bedops.readthedocs.io/en/latest/content/installation.html#linux

Another option is to use your favorite package installer, if it has BEDOPS available.

ADD REPLY
1
Entering edit mode

Thank you. Adding to shell enviroment's PATH did work.

ADD REPLY
0
Entering edit mode
6.6 years ago
$ echo "20:62963170_C_T" | sed 's/^/chr/g;s/[:_]/\t/g' | awk -v OFS="\t" '{print $1,$2,$2}' | bedtools intersect -a stdin -b dbsnp.138.chr20.vcf -wa -wb

chr20   62963170    62963170    chr20   62963170    rs62218008  C   T   .   PASS    OTHERKG;RS=62218008;RSPOS=62963170;SAO=0;SSR=0;VC=SNV;VP=0x050000000001000002000100;WGT=1;dbSNPBuildID=129

extract columns of interest.

ADD COMMENT

Login before adding your answer.

Traffic: 851 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6