Question: How to get SNP identifiers from VCF file?
0
gravatar for O.rka
4 months ago by
O.rka40
O.rka40 wrote:

I grabbed the first couple of columns from my whole genome VCF file. It looks like this. Are there any tools either web-based, Python, or R that I could use to get SNP identifiers (e.g. rs429358 or rs7412) for all of my SNPs that are in a particular database? I'm very new to working with VCF files and I want to figure out my blood-type. I'm very comfortable coding in both Python and R if there are any packages available for these languages. I would like to avoid, if possible, depositing my sequences in a 3rd party that would potentially use my information for their own gains but I am not opposed to having them as references in case I can't figure out any other options.

#CHROM  POS ID  REF ALT QUAL
chrM    64  .   C   T   3070.00
chrM    73  .   A   G   3070.00
chrM    146 .   T   C   3070.00
chrM    153 .   A   G   3070.00
chrM    263 .   A   G   3070.00
chrM    310 .   T   C   3070.00
chrM    513 .   GCA G   3070.00
chrM    663 .   A   G   3070.00
chrM    750 .   A   G   3070.00
chrM    1438    .   A   G   3070.00
chrM    1598    .   G   A   3070.00
chrM    1736    .   A   G   3070.00
chrM    1888    .   G   A   3070.00
chrM    2706    .   A   G   3070.00
chrM    3106    .   CN  C   3070.00
chrM    4248    .   T   C   3070.00
chrM    4769    .   A   G   3070.00
chrM    4824    .   A   G   3070.00
chrM    7028    .   C   T   3070.00
chrM    8027    .   G   A   3070.00
chrM    8794    .   C   T   3070.00
chrM    8860    .   A   G   3070.00
chrM    11719   .   G   A   3070.00
clinical snp database genome vcf • 430 views
ADD COMMENTlink modified 3 months ago by Denise - Open Targets4.5k • written 4 months ago by O.rka40
4
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds24k
Seattle, WA USA
Alex Reynolds24k wrote:

Make a BED file of SNPs for your genome of interest, e.g. hg38 and dbSNP v147:

$ LC_ALL=C
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
   | gunzip -c \
   | awk -v OFS="\t" '{ print $2,$3,($3+1),$5 }' \
   | sort-bed - \
   > hg38.snp147.bed

Convert your VCF file to BED and map the ID field from the SNP file:

$ bedmap --echo --echo-map-id --delim '\t' <(vcf2bed --sort-tmpdir=$PWD < variants.vcf) hg38.snp147.bed > answer.bed

The rs* ID for the variant will be in the last column of answer.bed.

ADD COMMENTlink written 4 months ago by Alex Reynolds24k

I first ran this: LC_ALL=C wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \ | gunzip -c \ | awk -v OFS="\t" '{ print $2,$3,($3+1),$5 }' \ | sort-bed - \

hg38.snp147.bed

and then ran my non-trimmed (all the columns) vcf file genome.vcf:

bedmap --echo --echo-map-id --delim '\t' <(vcf2bed --sort-tmpdir=./jespinoz/ < genome.vcf) hg38.snp147.bed > genome.vcf.hg38.snp147.bed

but none of the columns contain the rs* IDs

-bash-4.1$ grep "rs" genome.vcf.hg38.snp147.bed
-bash-4.1$

Do you know what could be happening?

ADD REPLYlink modified 3 months ago • written 3 months ago by O.rka40

Can you show the top ten or twenty lines of your genome.vcf? I just want to verify that conversion will work. If your file isn't in VCF format, we could look at using other tools to convert it to BED so that you can map IDs to positions.

ADD REPLYlink written 3 months ago by Alex Reynolds24k

I ended up running vcf2bed and then: bedmap --echo --echo-map-id --delim '\t' hg38.snp147.bed genome.bed > genome.vcf.hg38.snp147.bed and it worked. Do you know of any way I can map the rs* IDs to their descriptive names?

ADD REPLYlink written 3 months ago by O.rka40

You could grep the rs IDs against a text file containing descriptive names.

ADD REPLYlink written 3 months ago by Alex Reynolds24k
1
gravatar for finswimmer
4 months ago by
finswimmer2.8k
Germany
finswimmer2.8k wrote:

Hello,

what you are looking for is called "Annotating". There a couple of programs that can use other vcf/bed,/tab-delimited files which contains information that you want to see in your vcf.

bcftools and snpEff/SnpSift are such programs .

So first download the dbnSnp database (hg19 / hg38)

Than run e.g. SnpSift:

java -jar SnpSift.jar annotate -id dbSnp.vcf your.vcf > annotated.vcf

fin swimmer

ADD COMMENTlink modified 4 months ago • written 4 months ago by finswimmer2.8k

I'm trying it out with SnpSift right now. Do you have any examples of its use with bcftools?

ADD REPLYlink written 3 months ago by O.rka40

I ran it on the grid and got the following error:

qsub -cwd -V -N sift "SnpSift annotate -id 00-All.vcf genome.vcf > genome.dbSNP.hg38.annotated.vcf"
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3284)
    at org.snpsift.annotate.VcfIndexDataChromo.grow(VcfIndexDataChromo.java:101)
    at org.snpsift.annotate.VcfIndexDataChromo.add(VcfIndexDataChromo.java:46)
    at org.snpsift.annotate.VcfIndex.add(VcfIndex.java:67)
    at org.snpsift.annotate.VcfIndex.loadIntervals(VcfIndex.java:245)
    at org.snpsift.annotate.VcfIndex.index(VcfIndex.java:183)
    at org.snpsift.annotate.DbVcfSorted.open(DbVcfSorted.java:55)
    at org.snpsift.annotate.AnnotateVcfDb.open(AnnotateVcfDb.java:395)
    at org.snpsift.SnpSiftCmdAnnotate.annotateInit(SnpSiftCmdAnnotate.java:190)
    at org.snpsift.SnpSiftCmdAnnotate.annotate(SnpSiftCmdAnnotate.java:70)
    at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:410)
    at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:397)
    at org.snpsift.SnpSift.run(SnpSift.java:588)
    at org.snpsift.SnpSift.main(SnpSift.java:76)
ADD REPLYlink written 3 months ago by O.rka40

It looks like you are running out of memory. Try to limit the memory java can use by setting the -Xmx paramater. E.g.

java -Xmx12G -jar SnpSift.jar annotate -id dbSnp.vcf your.vcf > annotated.vcf

if you want to allow java to take 12GB.

BTW: You should index the gzip'ed dbSnp file with tabix and use the .gz file in the command above. Or you can download the index file directly: hg19/hg38

fin swimmer

ADD REPLYlink written 3 months ago by finswimmer2.8k
0
gravatar for Denise - Open Targets
3 months ago by
UK, Hinxton, EMBL-EBI
Denise - Open Targets4.5k wrote:

Use the Ensembl Variant Effect Predictor. If your variant is known in the public database e.g. dbSNP you will get its rsID as VEP has an output option for co-located variants. VEP accepts VCF as an input file and you can run it through the Ensembl REST API.

ADD COMMENTlink written 3 months ago by Denise - Open Targets4.5k

Do you know if this takes the truncated VCF that I've previewed above? or full VCFs only?

ADD REPLYlink written 3 months ago by O.rka40

It errored when I tried both the full and trimmed VCFs

ADD REPLYlink written 3 months ago by O.rka40
0
gravatar for steve
3 months ago by
steve1.5k
United States
steve1.5k wrote:

I typically use ANNOVAR for annotating data like this, you can specify dbSNP output if you've installed the corresponding reference database

ADD COMMENTlink written 3 months ago by steve1.5k

Do you have any example commands you would use?

ADD REPLYlink written 3 months ago by O.rka40

yes I recently made a script for this here. The installation commands are here

ADD REPLYlink written 3 months ago by steve1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1156 users visited in the last hour