How to get SNP identifiers from VCF file?
4
2
Entering edit mode
3.2 years ago
O.rka ▴ 290

I grabbed the first couple of columns from my whole genome VCF file. It looks like this. Are there any tools either web-based, Python, or R that I could use to get SNP identifiers (e.g. rs429358 or rs7412) for all of my SNPs that are in a particular database? I'm very new to working with VCF files and I want to figure out my blood-type. I'm very comfortable coding in both Python and R if there are any packages available for these languages. I would like to avoid, if possible, depositing my sequences in a 3rd party that would potentially use my information for their own gains but I am not opposed to having them as references in case I can't figure out any other options.

#CHROM  POS ID  REF ALT QUAL
chrM    64  .   C   T   3070.00
chrM    73  .   A   G   3070.00
chrM    146 .   T   C   3070.00
chrM    153 .   A   G   3070.00
chrM    263 .   A   G   3070.00
chrM    310 .   T   C   3070.00
chrM    513 .   GCA G   3070.00
chrM    663 .   A   G   3070.00
chrM    750 .   A   G   3070.00
chrM    1438    .   A   G   3070.00
chrM    1598    .   G   A   3070.00
chrM    1736    .   A   G   3070.00
chrM    1888    .   G   A   3070.00
chrM    2706    .   A   G   3070.00
chrM    3106    .   CN  C   3070.00
chrM    4248    .   T   C   3070.00
chrM    4769    .   A   G   3070.00
chrM    4824    .   A   G   3070.00
chrM    7028    .   C   T   3070.00
chrM    8027    .   G   A   3070.00
chrM    8794    .   C   T   3070.00
chrM    8860    .   A   G   3070.00
chrM    11719   .   G   A   3070.00

SNP vcf genome database clinical • 4.9k views
5
Entering edit mode
3.2 years ago

Make a BED file of SNPs for your genome of interest, e.g. hg38 and dbSNP v147:

$LC_ALL=C$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \
| gunzip -c \
| awk -v OFS="\t" '{ print $2,$3,($3+1),$5 }' \
| sort-bed - \
> hg38.snp147.bed


Convert your VCF file to BED and map the ID field from the SNP file:

$bedmap --echo --echo-map-id --delim '\t' <(vcf2bed --sort-tmpdir=$PWD < variants.vcf) hg38.snp147.bed > answer.bed


The rs* ID for the variant will be in the last column of answer.bed.

0
Entering edit mode

I first ran this: LC_ALL=C wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/snp147.txt.gz \ | gunzip -c \ | awk -v OFS="\t" '{ print $2,$3,($3+1),$5 }' \ | sort-bed - \

hg38.snp147.bed

and then ran my non-trimmed (all the columns) vcf file genome.vcf:

bedmap --echo --echo-map-id --delim '\t' <(vcf2bed --sort-tmpdir=./jespinoz/ < genome.vcf) hg38.snp147.bed > genome.vcf.hg38.snp147.bed


but none of the columns contain the rs* IDs

-bash-4.1$grep "rs" genome.vcf.hg38.snp147.bed -bash-4.1$


Do you know what could be happening?

0
Entering edit mode

Can you show the top ten or twenty lines of your genome.vcf? I just want to verify that conversion will work. If your file isn't in VCF format, we could look at using other tools to convert it to BED so that you can map IDs to positions.

0
Entering edit mode

I ended up running vcf2bed and then: bedmap --echo --echo-map-id --delim '\t' hg38.snp147.bed genome.bed > genome.vcf.hg38.snp147.bed and it worked. Do you know of any way I can map the rs* IDs to their descriptive names?

0
Entering edit mode

You could grep the rs IDs against a text file containing descriptive names.

2
Entering edit mode
3.2 years ago

Hello,

what you are looking for is called "Annotating". There a couple of programs that can use other vcf/bed,/tab-delimited files which contains information that you want to see in your vcf.

bcftools and snpEff/SnpSift are such programs .

Than run e.g. SnpSift:

java -jar SnpSift.jar annotate -id dbSnp.vcf your.vcf > annotated.vcf


fin swimmer

0
Entering edit mode

I'm trying it out with SnpSift right now. Do you have any examples of its use with bcftools?

0
Entering edit mode

I ran it on the grid and got the following error:

qsub -cwd -V -N sift "SnpSift annotate -id 00-All.vcf genome.vcf > genome.dbSNP.hg38.annotated.vcf"
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3284)
at org.snpsift.annotate.VcfIndexDataChromo.grow(VcfIndexDataChromo.java:101)
at org.snpsift.annotate.VcfIndex.index(VcfIndex.java:183)
at org.snpsift.annotate.DbVcfSorted.open(DbVcfSorted.java:55)
at org.snpsift.annotate.AnnotateVcfDb.open(AnnotateVcfDb.java:395)
at org.snpsift.SnpSiftCmdAnnotate.annotateInit(SnpSiftCmdAnnotate.java:190)
at org.snpsift.SnpSiftCmdAnnotate.annotate(SnpSiftCmdAnnotate.java:70)
at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:410)
at org.snpsift.SnpSiftCmdAnnotate.run(SnpSiftCmdAnnotate.java:397)
at org.snpsift.SnpSift.run(SnpSift.java:588)
at org.snpsift.SnpSift.main(SnpSift.java:76)

0
Entering edit mode

It looks like you are running out of memory. Try to limit the memory java can use by setting the -Xmx paramater. E.g.

java -Xmx12G -jar SnpSift.jar annotate -id dbSnp.vcf your.vcf > annotated.vcf


if you want to allow java to take 12GB.

BTW: You should index the gzip'ed dbSnp file with tabix and use the .gz file in the command above. Or you can download the index file directly: hg19/hg38

fin swimmer

0
Entering edit mode
3.2 years ago
Denise CS ★ 5.1k

Use the Ensembl Variant Effect Predictor. If your variant is known in the public database e.g. dbSNP you will get its rsID as VEP has an output option for co-located variants. VEP accepts VCF as an input file and you can run it through the Ensembl REST API.

0
Entering edit mode

Do you know if this takes the truncated VCF that I've previewed above? or full VCFs only?

0
Entering edit mode

It errored when I tried both the full and trimmed VCFs

0
Entering edit mode
3.2 years ago
steve ★ 2.9k

I typically use ANNOVAR for annotating data like this, you can specify dbSNP output if you've installed the corresponding reference database

0
Entering edit mode

Do you have any example commands you would use?

0
Entering edit mode

yes I recently made a script for this here. The installation commands are here