VCF file manipulation for extracting the required information
3
0
Entering edit mode
3.0 years ago
seta ★ 1.5k

Hi all

Could you please kindly tell me how I can obtain information like a below format from the vcf fils?

chr9 136133530 136133530 A G het
chr9 136135364 136135364 A G hom
chr9 136133582 136133582 G - het
chr9 136134993 136134994 GC TT het
chr9 136135137 136135149 CTTCTGTCCACAC - het

The coordinates are zero-based and relative to hg19 (UCSC) reference genome. In particular, the first column shows the chromosome, the second and the third the chromosome coordinates, the fourth the reference genome, the fifth the mutation and the sixth the zygosity. Columns shall be separated by spaces.

Many thanks in advance

variant calling genome sequencing VCF • 959 views
ADD COMMENT
0
Entering edit mode

you can also probably get this from annotating with ANNOVAR or similar.

ADD REPLY
1
Entering edit mode
3.0 years ago

Hello seta,

strange format. What are you trying to do with it?

Nevertheless here is a python3 solution:

vcfFile = "sample.vcf"

with open(vcfFile, "r") as vcf:
    for record in vcf:
        if record.startswith('#'):
            continue

        data = record.split()
        gt = data[-1].split(":")[0]
        allele = gt.split("/")

        if allele[0] == allele[1]:
            zygosity = "hom"
        else:
            zygosity = "het"

        print(
            data[0],
            int(data[1])-1,
            int(data[1])+len(data[3])-2,
            data[3],
            data[4],
            zygosity,
            sep=" "
        )

fin swimmer

ADD COMMENT
0
Entering edit mode

Hi finswimmer, Thank you for your reply. I'll try it.

ADD REPLY
0
Entering edit mode

I'm sure this solution works but, in the general case:

allele = gt.split("/")

doesn't work if the genotype is phased.

int(data[1])+len(data[3])-2,

doesn't work if there is a INFO/END

 if allele[0] == allele[1]:

doesn't work if genotype is nocall ./. or not diallelic

ADD REPLY
0
Entering edit mode

Hello Pierre,

of course you're right that there are cases where my code doesn't work. But due to the absence of more informationen of how the input file looks like and how the variant calling was done I startet with the simpliest case, meaning no phasing, no structural variants, no multisample.

If the things become more complicated I would rather use an already existing vcf parser like pyVCF to extract and manipulate the information I need.

fin swimmer

ADD REPLY
0
Entering edit mode
3.0 years ago
kirannbishwa01 ★ 1.3k

For emperical biologist, I wrote this simple tool just for that purpose.

VCF-simplify: a VCF simplification tool. https://github.com/everestial/VCF-simplify

The only addition required is "zygosity" which I can add if need be. - Let me know.

If you are a programmer than @finswimmer tools might give you more edge. You can even hack into it to serve your purpose. Idk, but looks like @finswimmer script assumes that the vcf is single sample and has GT strictly delimited by "/".

Thanks,

ADD COMMENT
0
Entering edit mode
3.0 years ago

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->println(V.getContig()+" "+V.getStart()+" "+V.getEnd()+" "+V.getGenotype(0).getAlleles().stream().map(A->A.isCalled()?A.getDisplayString():".").collect(Collectors.joining(" "))+" "+V.getGenotype(0).getType().name()));' src/test/resources/rotavirus_rf.vcf.gz  


RF01 970 970 A A HOM_REF
RF02 251 251 A A HOM_REF
RF02 578 578 G G HOM_REF
RF02 877 877 T A HET
RF02 1726 1726 T T HOM_REF
RF02 1962 1965 TACA TA HET
RF03 1221 1221 C C HOM_REF
RF03 1242 1242 C C HOM_REF
RF03 1688 1688 T T HOM_REF
RF03 1708 1708 G G HOM_REF
ADD COMMENT

Login before adding your answer.

Traffic: 2078 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6