HGVS format to VCF from portal.gdc.cancer.gov
5.0 years ago
Srw ▴ 60

I'm trying to convert hundreds of variant positions found here to VCF for downstream analyses and cannot find a good way to do this. I found Jannovar, but it only takes variants in c. (coding) and n. (non-coding) notation, whereas portal.gdc.cancer.gov produces g. (genomic) positions.

An example in HGVS format would be:

17:g.7674180C>A
17:g.7675997G>T
17:g.7676257G>A
17:g.7676088G>C
17:g.7676215G>A
17:g.7676152delC
17:g.7676381C>A
17:g.7670712delG
17:g.7670716C>G
17:g.7676264_7676265insA

Any help would be much appreciated.

5.0 years ago
Srw ▴ 60

Well, after a bunch of wasted time, VEP was the winner. All I had to do was paste my list of HGVS variants into http://uswest.ensembl.org/Homo_sapiens/Tools/VEP, and there is an option to download the results as VCF.

5.0 years ago

Those are essentially genomic coordinates, right? I think you could just parse out the extra crap then give it to VEP to create a valid VCF: https://m.ensembl.org/info/docs/tools/vep/vep_formats.html#default

I've done similar things to get wacky formats into VCF
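
A minimal sketch of that parsing step, in Python. It assumes VEP's default input layout (chromosome, start, end, allele string as REF/ALT, strand), in which an insertion is written with start = end + 1 and a missing allele is written as "-". The function name `hgvs_g_to_vep` is made up for illustration, and the pattern only covers the three simple forms in the question (substitution, deletion with the deleted bases spelled out, and insertion).

```python
import re

# Matches only the three simple forms shown in this thread:
# substitution (C>A), deletion (delC) and insertion (N_MinsA).
HGVS_G = re.compile(
    r"^(?P<chrom>[^:\s]+):g\.(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?:(?P<ref>[ACGT])>(?P<alt>[ACGT])"
    r"|del(?P<deleted>[ACGT]+)"
    r"|ins(?P<inserted>[ACGT]+))$"
)


def hgvs_g_to_vep(hgvs):
    """Convert one simple HGVS g. string to a VEP default-format line."""
    m = HGVS_G.match(hgvs.strip())
    if m is None:
        raise ValueError(f"unsupported HGVS string: {hgvs!r}")
    chrom, start = m["chrom"], int(m["start"])
    if m["ref"]:
        # Substitution: start and end are the same position.
        fields = (chrom, start, start, f"{m['ref']}/{m['alt']}", "+")
    elif m["deleted"]:
        # Deletion: the end follows from the number of deleted bases.
        end = start + len(m["deleted"]) - 1
        fields = (chrom, start, end, f"{m['deleted']}/-", "+")
    else:
        if m["end"] is None:
            raise ValueError(f"insertion needs two flanking positions: {hgvs!r}")
        # Insertion between the two flanking positions: start = end + 1.
        fields = (chrom, int(m["end"]), start, f"-/{m['inserted']}", "+")
    return "\t".join(str(f) for f in fields)


if __name__ == "__main__":
    for line in ["17:g.7674180C>A", "17:g.7676152delC", "17:g.7676264_7676265insA"]:
        print(hgvs_g_to_vep(line))
```

Anything the pattern does not recognise raises an error rather than being guessed at, so odd entries show up before the file reaches VEP.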

Thanks Chris. This is the road I'm heading down right now. The only wrinkle is that indels of various sizes complicate things. ~Stephen Williams

How did you deal with the indel positions? VEP seems to return right-normalized indels instead of left-normalized ones.
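
For anyone hitting the same issue, here is a rough Python sketch of left-aligning an indel that is already in VCF style (with an anchor base). It is not what VEP does internally; it is the usual "trim the shared trailing base, then extend to the left" loop, run against a toy reference string. The function name `left_align` and the toy sequence are assumptions for illustration. On real data, a tool such as bcftools norm with a reference FASTA is the more usual route.

```python
def left_align(pos, ref, alt, seq):
    """Left-align one variant; pos is 1-based, seq is the contig sequence."""
    changed = True
    while changed:
        changed = False
        # Drop a trailing base that REF and ALT share.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                break  # cannot extend past the start of the contig
            pos -= 1
            base = seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


if __name__ == "__main__":
    toy = "TGCACACAG"  # positions 1..9, with a CA repeat at 3..8
    # Deleting the last "CA" copy (right-aligned form) ...
    print(left_align(6, "ACA", "A", toy))  # ... moves to the first copy
```

In the toy sequence the deletion of one CA unit can be written at several positions; the loop slides it to the leftmost one, which is the representation most VCF tools expect.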

5.0 years ago

At a minimum, you can make a VCF file with the chromosome, start position, ref and alt, which you have. See the full specification to find out more. For a minimal working example, you can format it like so:

#CHROM POS ID REF ALT QUAL FILTER INFO
20 14370 . G A . . .

The other required fields are filled with a "." (the VCF placeholder for a missing value). Here's a way you could convert the input:

# Needs GNU sed (\t and \+ are GNU extensions).
$ sed -e 's|ins|\t.\t|' \
      -e 's|del\([ACGT]\+\)|\t\1\t.|' \
      -e 's|\([ACGT]\)>\([ACGT]\)|\t\1\t\2|' \
      -e 's|:g\.|\t|' \
      -e 's|_[0-9]\+||' \
      variants.txt \
  | awk 'BEGIN {OFS="\t"} {print $1, $2, ".", $3, $4, ".", ".", "."}'

17      7674180 .       C       A       .       .       .
17      7675997 .       G       T       .       .       .
17      7676257 .       G       A       .       .       .
17      7676088 .       G       C       .       .       .
17      7676215 .       G       A       .       .       .
17      7676152 .       C       .       .       .       .
17      7676381 .       C       A       .       .       .
17      7670712 .       G       .       .       .       .
17      7670716 .       C       G       .       .       .
17      7676264 .       .       A       .       .       .

Now redirect the output to a file (e.g. add > example.vcf at the end of the command) and you should have a VCF file. Some programs might require you to add the header information (the lines that start with # in the specification document), so you might have to tweak that a bit.
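
If the header is needed, a small Python sketch like the one below can prepend it. It assumes VCF version 4.2 and writes only the `##fileformat` line, optional `##contig` lines and the column header; the function name `to_vcf_text` and the record tuples are made up for illustration, and some tools may want more meta lines than this.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def to_vcf_text(records, contigs=()):
    """Build VCF text from (chrom, pos, ref, alt) tuples with a minimal header."""
    lines = ["##fileformat=VCFv4.2"]
    # Optional contig declarations; some tools want these present.
    lines += [f"##contig=<ID={name}>" for name in contigs]
    lines.append("#" + "\t".join(VCF_COLUMNS))
    for chrom, pos, ref, alt in records:
        # ID, QUAL, FILTER and INFO are left as the missing-value placeholder.
        lines.append("\t".join([str(chrom), str(pos), ".", ref, alt, ".", ".", "."]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    records = [("17", 7674180, "C", "A"), ("17", 7675997, "G", "T")]
    with open("example.vcf", "w") as handle:
        handle.write(to_vcf_text(records, contigs=["17"]))
```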

This works great! I can add the header later.

If this works for you, great, but IIRC, that's not valid VCF format, because of the missing "anchor bases" for ins and del. (e.g. G/- should be CG/C) A lot of tools will get angry about that. VEP will fill those in with the missing ref bases, if you do end up needing them.
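
To illustrate the anchor-base point, here is a small Python sketch that fills in the missing base for rows like the ones produced above ("." in REF for an insertion, "." in ALT for a deletion). The reference lookup is a toy dictionary and the names `base_at` and `add_anchor` are made up for illustration; on real data the base would come from the matching genome FASTA (for example via pysam's FastaFile).

```python
TOY_REFERENCE = {"toy": "ACGTACGT"}  # stand-in for a real genome FASTA


def base_at(chrom, pos):
    """Return the reference base at a 1-based position (toy lookup)."""
    return TOY_REFERENCE[chrom][pos - 1]


def add_anchor(chrom, pos, ref, alt, fetch=base_at):
    """Turn a '.'-style indel row into anchored VCF alleles."""
    if alt == ".":
        # Deletion: anchor on the base just before the deleted bases.
        anchor = fetch(chrom, pos - 1)
        return pos - 1, anchor + ref, anchor
    if ref == ".":
        # Insertion: pos is the base to the left of the inserted sequence.
        anchor = fetch(chrom, pos)
        return pos, anchor, anchor + alt
    return pos, ref, alt  # substitutions need no anchor


if __name__ == "__main__":
    print(add_anchor("toy", 3, "G", "."))  # deletion of G at position 3
    print(add_anchor("toy", 4, ".", "A"))  # insertion of A after position 4
```

With the toy sequence, deleting the G at position 3 becomes POS 2, REF CG, ALT C, which is the G/- to CG/C change described above.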
