Question: HGVS format to VCF from portal.gdc.cancer.gov
12 days ago by
Srw30
Charlottesville, VA
Srw30 wrote:

I'm trying to convert hundreds of variant positions found here to VCF for downstream analyses and cannot find a good way to do this. I found jannovar, but it only takes variants with c. (coding) and n. (non-coding) positions, whereas portal.gdc.cancer.gov produces g. (genomic) positions.

An example in hgvs format would be

17:g.7674180C>A
17:g.7675997G>T
17:g.7676257G>A
17:g.7676088G>C
17:g.7676215G>A
17:g.7676152delC
17:g.7676381C>A
17:g.7670712delG
17:g.7670716C>G
17:g.7676264_7676265insA

Any help would be much appreciated.

genome snp hgvs vcf • 174 views
modified 12 days ago • written 12 days ago by Srw30
12 days ago by
Srw30
Charlottesville, VA
Srw30 wrote:

Well, after a bunch of wasted time, VEP was the winner. All I had to do was paste my list of HGVS variants into http://uswest.ensembl.org/Homo_sapiens/Tools/VEP, and there is a "download as VCF" option.

written 12 days ago by Srw30
12 days ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

Those are essentially genomic coordinates, right? I think you could just parse out the extra crap then give it to VEP to create a valid VCF: https://m.ensembl.org/info/docs/tools/vep/vep_formats.html#default

I've done similar things to get wacky formats into VCF
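
As a rough sketch of that parsing step (not a tested pipeline), the function below turns g. lines like the ones in the question into VEP's default input format: chromosome, start, end, allele, strand. The name hgvs_g_to_vep is made up for illustration, everything is assumed to be on the forward strand, and only substitutions, deletions with the deleted bases spelled out, and insertions are handled. Anything else is reported on stderr and skipped.

```shell
# Hypothetical helper (not part of VEP): reads "CHR:g.CHANGE" lines on stdin
# and writes VEP default-format lines (chrom, start, end, allele, strand).
hgvs_g_to_vep() {
  awk -F':g[.]' '
    BEGIN { OFS = "\t" }
    {
      chrom = $1
      pos = $2;  sub(/[^0-9].*$/, "", pos)      # leading digits = start position
      rest = substr($2, length(pos) + 1)        # e.g. "C>A", "delC", "_7676265insA"
      if (match(rest, /^_[0-9]+/))              # drop the "_END" of a range
        rest = substr(rest, RLENGTH + 1)

      if (rest ~ /^[ACGT]>[ACGT]$/) {
        print chrom, pos, pos, substr(rest, 1, 1) "/" substr(rest, 3, 1), "+"
      } else if (rest ~ /^del[ACGT]+$/) {
        seq = substr(rest, 4)
        print chrom, pos, pos + length(seq) - 1, seq "/-", "+"
      } else if (rest ~ /^ins[ACGT]+$/) {
        # VEP default format marks an insertion with start = end + 1
        print chrom, pos + 1, pos, "-/" substr(rest, 4), "+"
      } else {
        print "skipped: " $0 > "/dev/stderr"
      }
    }'
}

# Usage (file names are placeholders):
#   hgvs_g_to_vep < variants.txt > vep_input.txt
```

Because this format carries start and end separately, indels of different sizes need no special casing beyond computing the end coordinate.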

written 12 days ago by Chris Miller20k

Thanks Chris. This is the road I'm heading down right now. The only wrinkle is that indels of various sizes complicate things. ~Stephen Williams

modified 12 days ago • written 12 days ago by Srw30
12 days ago by
Canada
manuel.belmadani730 wrote:

At a minimum, you can make a vcf file with the chromosome, start position, ref and alt, which you have. See the full specification to find out more. For a minimally working example, you can format it like so:

#CHROM POS ID REF ALT QUAL FILTER INFO
20 14370 . G A . . .

The other required fields are replaced with a "." (missing value). Here's a way you could convert the input:

$ cat variants.txt \
          | sed 's|ins|\t.\t|g' \
          | sed -e 's|del\([ACGT]\+\)|\t\1\t.|g' \
          | sed -e 's|\([ACGT]\)>\([ACGT]\)|\t\1\t\2|g' \
          | sed 's|:g\.|\t|g' \
          | sed 's|_[0-9]\+||g' \
          | sed 's|$|\t.\t.\t.|g' \
          | awk 'BEGIN {OFS="\t"} {print $1,$2,".",$3,$4,$5,$6,$7}'

17      7674180 .       C       A       .       .       .
17      7675997 .       G       T       .       .       .
17      7676257 .       G       A       .       .       .
17      7676088 .       G       C       .       .       .
17      7676215 .       G       A       .       .       .
17      7676152 .       C       .       .       .       .
17      7676381 .       C       A       .       .       .
17      7670712 .       G       .       .       .       .
17      7670716 .       C       G       .       .       .
17      7676264 .       .       A       .       .       .

Now redirect the output to a file instead (e.g. add > example.vcf at the end of the command) and you should have a VCF file. Some programs might require you to add the header information (lines that start with # in that specification document), so you might have to tweak that a bit.
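
If a tool does insist on a header, a minimal one can be prepended along these lines. This is only a sketch: add_vcf_header is a made-up name, VCFv4.2 is an assumed version, and some tools may also want ##contig or ##INFO lines that are not written here.

```shell
# Hypothetical helper: writes a minimal VCF header, then copies the
# headerless, tab-separated records from stdin unchanged.
add_vcf_header() {
  printf '##fileformat=VCFv4.2\n'                                # assumed version
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'       # column header line
  cat                                                            # pass records through
}

# Usage (file names are placeholders):
#   add_vcf_header < records.tsv > example.vcf
```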

modified 12 days ago • written 12 days ago by manuel.belmadani730

This works great! I can add the header later.

written 12 days ago by Srw30

If this works for you, great, but IIRC, that's not valid VCF format, because of the missing "anchor bases" for ins and del. (e.g. G/- should be CG/C) A lot of tools will get angry about that. VEP will fill those in with the missing ref bases, if you do end up needing them.
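
To make the anchor-base point concrete, here is a sketch of how the "." alleles from the sed output above could be padded. Both function names are made up, and the bases returned by ref_base are invented placeholders, not real reference bases. A real version would have to look them up in the matching reference FASTA, and would have to use the same chromosome naming as that FASTA.

```shell
# ref_base CHROM POS must print the reference base at that position.
# This stand-in returns made-up bases for illustration only. In practice it
# might wrap something like (ref.fa is a placeholder path):
#   samtools faidx ref.fa "$1:$2-$2" | tail -n 1
ref_base() {
  case "$1:$2" in
    17:7676151) echo G ;;   # invented base, not looked up
    17:7676264) echo T ;;   # invented base, not looked up
    *) echo N ;;
  esac
}

# Reads tab-separated records (CHROM POS ID REF ALT ...) on stdin and
# rewrites those with "." as REF or ALT into anchored VCF alleles.
anchor_indels() {
  while IFS="$(printf '\t')" read -r chrom pos id ref alt rest; do
    if [ "$alt" = "." ]; then        # deletion: anchor on the base before POS
      pos=$((pos - 1))
      anchor=$(ref_base "$chrom" "$pos")
      ref="$anchor$ref"; alt="$anchor"
    elif [ "$ref" = "." ]; then      # insertion after POS: anchor on the base at POS
      anchor=$(ref_base "$chrom" "$pos")
      ref="$anchor"; alt="$anchor$alt"
    fi
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' "$chrom" "$pos" "$id" "$ref" "$alt" "$rest"
  done
}
```

Note that a deletion moves POS back by one, since VCF reports the position of the anchor base, while an insertion keeps the POS of the base it follows. The sketch does not left-align or otherwise normalize the variants.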

written 12 days ago by Chris Miller20k