VCF file manipulation for extracting the required information
3
0
Entering edit mode
3.0 years ago
seta ★ 1.5k

Hi all

Could you please kindly tell me how I can obtain information like a below format from the vcf fils?

chr9 136133530 136133530 A G het
chr9 136135364 136135364 A G hom
chr9 136133582 136133582 G - het
chr9 136134993 136134994 GC TT het
chr9 136135137 136135149 CTTCTGTCCACAC - het


The coordinates are zero-based and relative to hg19 (UCSC) reference genome. In particular, the first column shows the chromosome, the second and the third the chromosome coordinates, the fourth the reference genome, the fifth the mutation and the sixth the zygosity. Columns shall be separated by spaces.

variant calling genome sequencing VCF • 959 views
0
Entering edit mode

you can also probably get this from annotating with ANNOVAR or similar.

1
Entering edit mode
3.0 years ago

Hello seta,

strange format. What are you trying to do with it?

Nevertheless here is a python3 solution:

vcfFile = "sample.vcf"

with open(vcfFile, "r") as vcf:
for record in vcf:
if record.startswith('#'):
continue

data = record.split()
gt = data[-1].split(":")[0]
allele = gt.split("/")

if allele[0] == allele[1]:
zygosity = "hom"
else:
zygosity = "het"

print(
data[0],
int(data[1])-1,
int(data[1])+len(data[3])-2,
data[3],
data[4],
zygosity,
sep=" "
)


fin swimmer

0
Entering edit mode

0
Entering edit mode

I'm sure this solution works but, in the general case:

allele = gt.split("/")


doesn't work if the genotype is phased.

int(data[1])+len(data[3])-2,


doesn't work if there is a INFO/END

 if allele[0] == allele[1]:


doesn't work if genotype is nocall ./. or not diallelic

0
Entering edit mode

Hello Pierre,

of course you're right that there are cases where my code doesn't work. But due to the absence of more informationen of how the input file looks like and how the variant calling was done I startet with the simpliest case, meaning no phasing, no structural variants, no multisample.

If the things become more complicated I would rather use an already existing vcf parser like pyVCF to extract and manipulate the information I need.

fin swimmer

0
Entering edit mode
3.0 years ago
kirannbishwa01 ★ 1.3k

For emperical biologist, I wrote this simple tool just for that purpose.

The only addition required is "zygosity" which I can add if need be. - Let me know.

If you are a programmer than @finswimmer tools might give you more edge. You can even hack into it to serve your purpose. Idk, but looks like @finswimmer script assumes that the vcf is single sample and has GT strictly delimited by "/".

Thanks,

0
Entering edit mode
3.0 years ago

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

\$ java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->println(V.getContig()+" "+V.getStart()+" "+V.getEnd()+" "+V.getGenotype(0).getAlleles().stream().map(A->A.isCalled()?A.getDisplayString():".").collect(Collectors.joining(" "))+" "+V.getGenotype(0).getType().name()));' src/test/resources/rotavirus_rf.vcf.gz

RF01 970 970 A A HOM_REF
RF02 251 251 A A HOM_REF
RF02 578 578 G G HOM_REF
RF02 877 877 T A HET
RF02 1726 1726 T T HOM_REF
RF02 1962 1965 TACA TA HET
RF03 1221 1221 C C HOM_REF
RF03 1242 1242 C C HOM_REF
RF03 1688 1688 T T HOM_REF
RF03 1708 1708 G G HOM_REF