HGVS format to VCF from portal.gdc.cancer.gov
3
1
3.7 years ago
Srw ▴ 60

I'm trying to convert hundreds of variant positions found here to VCF for downstream analyses and cannot find a good way to do this. I found jannovar, but that only takes variants in c. (coding) and n. (non-coding) notation, whereas portal.gdc.cancer.gov produces g. (genomic) positions.

An example in HGVS format would be:

17:g.7674180C>A
17:g.7675997G>T
17:g.7676257G>A
17:g.7676088G>C
17:g.7676215G>A
17:g.7676152delC
17:g.7676381C>A
17:g.7670712delG
17:g.7670716C>G
17:g.7676264_7676265insA


Any help would be much appreciated.

vcf hgvs genome snp • 1.7k views
2
3.7 years ago
Srw ▴ 60

Well, after a bunch of wasted time, VEP was the winner. All I had to do was paste my list of HGVS variants into http://uswest.ensembl.org/Homo_sapiens/Tools/VEP, and there is an option to download the results as VCF.

1
3.7 years ago

Those are essentially genomic coordinates, right? I think you could just parse out the extra crap then give it to VEP to create a valid VCF: https://m.ensembl.org/info/docs/tools/vep/vep_formats.html#default

I've done similar things to get wacky formats into VCF
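
As a rough sketch of the "parse it, then hand it to VEP" idea: the function below turns g. notation into VEP's default input format (chromosome, start, end, allele, strand). The function name hgvs_g_to_vep is made up for illustration, and it only handles the three shapes seen in the question (single-base substitutions, deletions with the deleted bases spelled out, and insertions). Per the VEP format documentation, an insertion is written with start = end + 1. Anything else is skipped with a message on stderr.

```shell
# Hypothetical helper: convert "CHROM:g.<description>" lines to VEP default format.
# Reads file arguments, or stdin if none are given.
hgvs_g_to_vep() {
  awk '
    # Split each line into chromosome and the part after ":g."
    {
      n = index($0, ":g.")
      if (n == 0) { print "skipping: " $0 > "/dev/stderr"; next }
      chrom = substr($0, 1, n - 1)
      desc  = substr($0, n + 3)
    }
    # Substitution, e.g. 7674180C>A  ->  start = end = position, allele REF/ALT
    desc ~ /^[0-9]+[ACGT]>[ACGT]$/ {
      pos = desc; sub(/[ACGT]>[ACGT]$/, "", pos)
      ref = substr(desc, length(pos) + 1, 1)
      alt = substr(desc, length(desc), 1)
      print chrom, pos, pos, ref "/" alt, "+"
      next
    }
    # Deletion, e.g. 7676152delC or 100_102delACG  ->  allele SEQ/-
    desc ~ /^[0-9]+(_[0-9]+)?del[ACGT]+$/ {
      coords = desc; sub(/del[ACGT]+$/, "", coords)
      seq = desc;    sub(/^.*del/, "", seq)
      m = split(coords, c, "_")
      start = c[1]
      stop  = (m == 2) ? c[2] : c[1]
      print chrom, start, stop, seq "/-", "+"
      next
    }
    # Insertion, e.g. 7676264_7676265insA  ->  start = end + 1, allele -/SEQ
    desc ~ /^[0-9]+_[0-9]+ins[ACGT]+$/ {
      coords = desc; sub(/ins[ACGT]+$/, "", coords)
      seq = desc;    sub(/^.*ins/, "", seq)
      split(coords, c, "_")
      print chrom, c[2], c[1], "-/" seq, "+"
      next
    }
    { print "skipping unsupported: " $0 > "/dev/stderr" }
  ' "$@"
}

# Example call with three of the variants from the question
printf '17:g.7674180C>A\n17:g.7676152delC\n17:g.7676264_7676265insA\n' | hgvs_g_to_vep
```

On a real list this would be something like hgvs_g_to_vep variants.txt > variants.vep.txt, with the result uploaded to VEP. HGVS forms not listed above (dup, delins, deletions without explicit bases) would need extra cases.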

0

Thanks Chris. This is the road I'm heading down right now. The only wrinkle is that indels of various sizes complicate things. ~Stephen Williams

0

How did you deal with the indel positions? VEP seems to return right-normalized indels instead of left-normalized ones.

1
3.7 years ago

At a minimum, you can make a VCF file with the chromosome, start position, ref and alt, which you have. See the full specification to find out more. For a minimal working example, you can format it like so:

#CHROM POS ID REF ALT QUAL FILTER INFO
20 14370 . G A . . .


The other required fields are each replaced with a single dot (.). Here's a way you could convert the input:

cat variants.txt \
| sed 's|ins|\t.\t|g' \
| sed -e 's|del\([ACGT]\)|\t\1\t.|g' \
| sed -e 's|\([ACGT]\)>\([ACGT]\)|\t\1\t\2|g' \
| sed 's|:g\.|\t|g' \
| sed 's|_[0-9]\+||g' \
| sed 's|$|\t.\t.\t.|g' \
| awk 'OFS="\t" {print $1,$2,".",$3,$4,$5,$6,$7}'

17      7674180 .       C       A       .       .       .
17      7675997 .       G       T       .       .       .
17      7676257 .       G       A       .       .       .
17      7676088 .       G       C       .       .       .
17      7676215 .       G       A       .       .       .
17      7676152 .       C       .       .       .       .
17      7676381 .       C       A       .       .       .
17      7670712 .       G       .       .       .       .
17      7670716 .       C       G       .       .       .
17      7676264 .       .       A       .       .       .


Now redirect the output to a file instead (e.g. add > example.vcf at the end of the command) and you should have a VCF file. Some programs might require you to add the header information (lines that start with # in that specification document), so you might have to tweak that a bit.
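
For the header part, a minimal sketch could look like the following. The function name add_vcf_header is made up for illustration, and VCFv4.2 is just an assumed version; some tools may also want ##contig or ##reference lines, which are not added here.

```shell
# Hypothetical helper: print a minimal VCF header, then the tab-separated
# records from the given files (or from stdin if no files are given).
add_vcf_header() {
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  cat "$@"
}

# Example call with a single record
printf '17\t7674180\t.\tC\tA\t.\t.\t.\n' | add_vcf_header
```

In practice the conversion pipeline's output would be piped into add_vcf_header and redirected to example.vcf.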

0

If this works for you, great, but IIRC, that's not valid VCF format, because of the missing "anchor bases" for ins and del. (e.g. G/- should be CG/C) A lot of tools will get angry about that. VEP will fill those in with the missing ref bases, if you do end up needing them.
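
To illustrate the anchor-base point, here is a rough sketch of how anchored indel records could be built by hand. All function names are hypothetical. lookup_base assumes samtools is installed and that REF_FASTA points to an indexed reference FASTA whose contig names match the input (e.g. "17" rather than "chr17"). For a deletion, the record moves one base to the left and REF becomes anchor + deleted bases. For an insertion, the record sits on the base to the left of the insertion point and ALT becomes anchor + inserted bases.

```shell
# Hypothetical helper: fetch a single reference base with samtools faidx.
# Assumes REF_FASTA is set to an indexed FASTA (assumption, not from the thread).
lookup_base() {
  samtools faidx "$REF_FASTA" "$1:$2-$2" | awk 'NR == 2 { print toupper($0) }'
}

# usage: anchored_deletion CHROM POS DELETED_SEQ
# POS is the first deleted base, as in 17:g.7676152delC
anchored_deletion() {
  local chrom=$1 pos=$2 seq=$3 anchor
  anchor=$(lookup_base "$chrom" $((pos - 1)))
  printf '%s\t%s\t.\t%s\t%s\t.\t.\t.\n' "$chrom" $((pos - 1)) "$anchor$seq" "$anchor"
}

# usage: anchored_insertion CHROM LEFT_POS INSERTED_SEQ
# LEFT_POS is the first coordinate in 17:g.7676264_7676265insA
anchored_insertion() {
  local chrom=$1 pos=$2 seq=$3 anchor
  anchor=$(lookup_base "$chrom" "$pos")
  printf '%s\t%s\t.\t%s\t%s\t.\t.\t.\n' "$chrom" "$pos" "$anchor" "$anchor$seq"
}
```

For example, anchored_deletion 17 7676152 C would emit a record at position 7676151 with REF set to the anchor base followed by C and ALT set to the anchor base alone. This only adds the anchor; it does not left-align indels in repeats. bcftools norm -f with the same reference is one commonly used way to do that afterwards.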