Question: HGVS format to VCF from portal.gdc.cancer.gov
Asked 7 months ago by Srw (Charlottesville, VA):

I'm trying to convert hundreds of variant positions from portal.gdc.cancer.gov to VCF for downstream analyses and cannot find a good way to do this. I found Jannovar, but it only takes variants in c. (coding) and n. (non-coding) notation, whereas portal.gdc.cancer.gov produces g. (genomic) positions.

An example in HGVS format would be

17:g.7674180C>A
17:g.7675997G>T
17:g.7676257G>A
17:g.7676088G>C
17:g.7676215G>A
17:g.7676152delC
17:g.7676381C>A
17:g.7670712delG
17:g.7670716C>G
17:g.7676264_7676265insA

Any help would be much appreciated.

Tags: genome, snp, hgvs, vcf
Answer by Srw (Charlottesville, VA), 7 months ago:

Well, after a bunch of wasted time, VEP was the winner. All I had to do was paste my list of HGVS variants into http://uswest.ensembl.org/Homo_sapiens/Tools/VEP and use the download-as-VCF option on the results.

Answer by Chris Miller (Washington University in St. Louis, MO), 7 months ago:

Those are essentially genomic coordinates, right? I think you could just parse out the extra crap then give it to VEP to create a valid VCF: https://m.ensembl.org/info/docs/tools/vep/vep_formats.html#default

I've done similar things to get wacky formats into VCF


Thanks Chris. This is the road I'm heading down right now. The only wrinkle is that indels of various sizes complicate things. ~Stephen Williams

(Reply by Srw, 7 months ago)
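
For anyone taking the parse-it-yourself route, here is a rough awk sketch of turning these g. strings into VEP's default input format (chrom, start, end, ref/alt, strand). It only handles the three shapes shown in the question (single-base substitution, single-base delN, and insN between two adjacent positions) and assumes everything is on the forward strand. hgvs_g_to_vep is a made-up helper name, not part of VEP. Insertions are written with start = end + 1, which is how the default format marks them, and deletions use - as the alternate allele, so no anchor base is needed at this stage.

```shell
# Sketch only: hgvs_g_to_vep is a hypothetical helper, not a VEP command.
# Reads HGVS g. strings on stdin, writes VEP default-format lines on stdout.
hgvs_g_to_vep() {
  awk '
    BEGIN { OFS = "\t" }
    {
      # Split "17:g.7674180C>A" into chromosome and description.
      i = index($0, ":g.")
      if (i == 0) { print "skipped: " $0 > "/dev/stderr"; next }
      chrom = substr($0, 1, i - 1)
      desc  = substr($0, i + 3)

      if (desc ~ /^[0-9]+[ACGT]>[ACGT]$/) {
        # Substitution: start and end are the same position.
        pos = desc; sub(/[ACGT]>[ACGT]$/, "", pos)
        ref = substr(desc, length(pos) + 1, 1)
        alt = substr(desc, length(desc), 1)
        print chrom, pos, pos, ref "/" alt, "+"
      } else if (desc ~ /^[0-9]+del[ACGT]$/) {
        # Single-base deletion: the alternate allele is "-".
        pos = desc; sub(/del[ACGT]$/, "", pos)
        ref = substr(desc, length(desc), 1)
        print chrom, pos, pos, ref "/-", "+"
      } else if (desc ~ /^[0-9]+_[0-9]+ins[ACGT]+$/) {
        # Insertion between two flanking positions: start = end + 1.
        split(desc, flank, "_")
        right = flank[2]; sub(/ins[ACGT]+$/, "", right)
        seq = desc; sub(/^.*ins/, "", seq)
        print chrom, right, flank[1], "-/" seq, "+"
      } else {
        # Anything else (multi-base del, dup, delins, ...) is not handled.
        print "skipped: " $0 > "/dev/stderr"
      }
    }'
}
```

Usage would be something like `hgvs_g_to_vep < variants.txt > variants.vep.txt`. Anything the sketch does not recognise is reported on stderr rather than guessed at.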
Answer by manuel.belmadani (Canada), 7 months ago:

At a minimum, you can make a vcf file with the chromosome, start position, ref and alt, which you have. See the full specification to find out more. For a minimally working example, you can format it like so:

#CHROM POS ID REF ALT QUAL FILTER INFO
20 14370 . G A . . .

The other required fields are filled with a period (.), the VCF placeholder for a missing value. Here's a way you could convert the input:

$ cat variants.txt \
          | sed 's|ins|\t.\t|g' \
          | sed -e 's|del\([ACGT]\)|\t\1\t.|g' \
          | sed -e 's|\([ACGT]\)>\([ACGT]\)|\t\1\t\2|g' \
          | sed 's|:g\.|\t|g' \
          | sed 's|_[0-9]\+||g' \
          | sed 's|$|\t.\t.\t.|g' \
          | awk 'BEGIN {OFS="\t"} {print $1,$2,".",$3,$4,$5,$6,$7}'

17      7674180 .       C       A       .       .       .
17      7675997 .       G       T       .       .       .
17      7676257 .       G       A       .       .       .
17      7676088 .       G       C       .       .       .
17      7676215 .       G       A       .       .       .
17      7676152 .       C       .       .       .       .
17      7676381 .       C       A       .       .       .
17      7670712 .       G       .       .       .       .
17      7670716 .       C       G       .       .       .
17      7676264 .       .       A       .       .       .

Now redirect the output to a file instead (e.g. add > example.vcf at the end of the command) and you should have a VCF file. Some programs might require you to add the header information (the lines that start with # in the specification document), so you might have to tweak that a bit.
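
As a minimal sketch of the header step: the function below prepends a ##fileformat line and the tab-separated #CHROM column line to whatever rows arrive on stdin. with_vcf_header is a hypothetical name, and VCFv4.2 is an assumed version; some tools may also want ##contig or ##reference lines, which are not added here.

```shell
# Sketch only: with_vcf_header is a hypothetical helper name.
# Prints a minimal VCF header, then passes the data rows through unchanged.
with_vcf_header() {
  printf '##fileformat=VCFv4.2\n'
  # The column line must be tab-separated, like the data rows.
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  cat
}
```

For example, `with_vcf_header < rows.tsv > example.vcf`, where rows.tsv holds the eight-column rows produced above.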


This works great! I can add the header later.

(Reply by Srw, 7 months ago)

If this works for you, great, but IIRC, that's not valid VCF format, because of the missing "anchor bases" for ins and del. (e.g. G/- should be CG/C) A lot of tools will get angry about that. VEP will fill those in with the missing ref bases, if you do end up needing them.

(Reply by Chris Miller, 7 months ago)
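
To make the anchor-base point concrete, here is a rough sketch of patching the rows by hand. It assumes you already have a lookup file of reference bases (tab-separated chrom, pos, base); that file and the name add_anchor_bases are hypothetical, and the bases would have to come from the matching reference genome, for instance via samtools faidx. A deletion at POS needs the base at POS-1 and moves POS back by one; an insertion after POS needs the base at POS itself. Rows that already have both REF and ALT pass through untouched.

```shell
# Sketch only: add_anchor_bases and the anchors file are hypothetical.
# Usage: add_anchor_bases anchors.tsv rows.tsv
#   anchors.tsv: chrom <TAB> pos <TAB> reference base at that position
#   rows.tsv:    8-column rows with "." as REF (insertion) or ALT (deletion)
add_anchor_bases() {
  awk '
    BEGIN { FS = OFS = "\t" }
    # First file: load the reference-base lookup.
    NR == FNR { base[$1 FS $2] = $3; next }
    {
      if ($4 != "." && $5 == ".") {
        # Deletion: anchor is the base just before the deleted sequence.
        key = $1 FS ($2 - 1)
        if (!(key in base)) { print "no anchor for " $1 ":" $2 > "/dev/stderr"; next }
        $2 = $2 - 1
        $4 = base[key] $4
        $5 = base[key]
      } else if ($4 == "." && $5 != ".") {
        # Insertion: anchor is the base at POS (the left flank).
        key = $1 FS $2
        if (!(key in base)) { print "no anchor for " $1 ":" $2 > "/dev/stderr"; next }
        $5 = base[key] $5
        $4 = base[key]
      }
      print
    }' "$1" "$2"
}
```

With a preceding C in the lookup, the row for the deletion of G at 7670712 would come out as POS 7670711, REF CG, ALT C, which is the G/- to CG/C change described above. Letting VEP fill in the bases avoids maintaining the lookup at all.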