Question

VCF file manipulation for extracting the required information

0

Entering edit mode

6.0 years ago

seta ★ 1.9k

Hi all

Could you please kindly tell me how I can obtain information like a below format from the vcf fils?

chr9 136133530 136133530 A G het
chr9 136135364 136135364 A G hom
chr9 136133582 136133582 G - het
chr9 136134993 136134994 GC TT het
chr9 136135137 136135149 CTTCTGTCCACAC - het

The coordinates are zero-based and relative to hg19 (UCSC) reference genome. In particular, the first column shows the chromosome, the second and the third the chromosome coordinates, the fourth the reference genome, the fifth the mutation and the sixth the zygosity. Columns shall be separated by spaces.

Many thanks in advance

variant calling genome sequencing VCF • 2.4k views

ADD COMMENT • link updated 5.4 years ago by zx8754 11k • written 6.0 years ago by seta ★ 1.9k

0

Entering edit mode

you can also probably get this from annotating with ANNOVAR or similar.

ADD REPLY • link 6.0 years ago by steve ★ 3.5k

score 1 · Answer 1 · 2018-04-25

1

Entering edit mode

6.0 years ago

finswimmer 16k

Hello seta,

strange format. What are you trying to do with it?

Nevertheless here is a python3 solution:

vcfFile = "sample.vcf"

with open(vcfFile, "r") as vcf:
    for record in vcf:
        if record.startswith('#'):
            continue

        data = record.split()
        gt = data[-1].split(":")[0]
        allele = gt.split("/")

        if allele[0] == allele[1]:
            zygosity = "hom"
        else:
            zygosity = "het"

        print(
            data[0],
            int(data[1])-1,
            int(data[1])+len(data[3])-2,
            data[3],
            data[4],
            zygosity,
            sep=" "
        )

fin swimmer

ADD COMMENT • link 6.0 years ago by finswimmer 16k

0

Entering edit mode

Hi finswimmer, Thank you for your reply. I'll try it.

ADD REPLY • link 6.0 years ago by seta ★ 1.9k

0

Entering edit mode

I'm sure this solution works but, in the general case:

allele = gt.split("/")

doesn't work if the genotype is phased.

int(data[1])+len(data[3])-2,

doesn't work if there is a INFO/END

 if allele[0] == allele[1]:

doesn't work if genotype is nocall ./. or not diallelic

ADD REPLY • link 6.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hello Pierre,

of course you're right that there are cases where my code doesn't work. But due to the absence of more informationen of how the input file looks like and how the variant calling was done I startet with the simpliest case, meaning no phasing, no structural variants, no multisample.

If the things become more complicated I would rather use an already existing vcf parser like pyVCF to extract and manipulate the information I need.

fin swimmer

ADD REPLY • link 6.0 years ago by finswimmer 16k

score 0 · Answer 2 · 2018-04-25

For emperical biologist, I wrote this simple tool just for that purpose.

VCF-simplify: a VCF simplification tool. https://github.com/everestial/VCF-simplify

The only addition required is "zygosity" which I can add if need be. - Let me know.

If you are a programmer than @finswimmer tools might give you more edge. You can even hack into it to serve your purpose. Idk, but looks like @finswimmer script assumes that the vcf is single sample and has GT strictly delimited by "/".

Thanks,

score 0 · Answer 3 · 2018-04-25

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->println(V.getContig()+" "+V.getStart()+" "+V.getEnd()+" "+V.getGenotype(0).getAlleles().stream().map(A->A.isCalled()?A.getDisplayString():".").collect(Collectors.joining(" "))+" "+V.getGenotype(0).getType().name()));' src/test/resources/rotavirus_rf.vcf.gz  


RF01 970 970 A A HOM_REF
RF02 251 251 A A HOM_REF
RF02 578 578 G G HOM_REF
RF02 877 877 T A HET
RF02 1726 1726 T T HOM_REF
RF02 1962 1965 TACA TA HET
RF03 1221 1221 C C HOM_REF
RF03 1242 1242 C C HOM_REF
RF03 1688 1688 T T HOM_REF
RF03 1708 1708 G G HOM_REF