I'm interested in getting simple "heterozygous" or "homozygous" designations for all of the samples/SNPs in my multisample VCF file. In the past, I have used the -GF GT option in GATK's VariantsToTable tool and then annotated my base calls in Excel as either heterozygous or homozygous. This takes forever, since Excel isn't really built for data of this size. Is there a simple way to output all of the SNPs as 0/0, 0/1, or 1/1 instead of C/A, A/A, G/T, C/C? My ideal output would be a txt file in a grid similar to VariantsToTable's output: the top row lists the samples, and the first column holds the variant coordinates.
If you start from a valid VCF file, you can get your desired output with this command:
grep -v '^##' input.vcf | cut -f1,2,10- | sed 's/:\S*//g'
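Here grep removes all header lines except the column names, cut selects the chromosome, position, and sample columns, and sed removes everything but GT from the genotype columns (this relies on GT being the first FORMAT field, which the VCF specification requires whenever GT is present).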
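If you want the "heterozygous"/"homozygous" labels themselves rather than 0/1-style genotypes, the same pipeline can be extended with awk. This is only a rough sketch: the function name vcf_zygosity and the labels het, hom_ref, hom_alt, and NA are made up for illustration, and it assumes diploid calls with GT as the first FORMAT field. Anything that is missing or not diploid is reported as NA.

```shell
# Hypothetical helper (not part of any tool): print CHROM, POS and a
# zygosity label for every sample in a VCF file given as $1.
vcf_zygosity() {
    grep -v '^##' "$1" | cut -f1,2,10- | awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print; next }               # column-name row: keep as is
    {
        for (i = 3; i <= NF; i++) {
            sub(/:.*/, "", $i)            # keep only the GT subfield
            n = split($i, a, "[/|]")      # unphased (/) or phased (|) alleles
            if (n != 2 || a[1] == "." || a[2] == ".") $i = "NA"
            else if (a[1] != a[2])                    $i = "het"
            else if (a[1] == "0")                     $i = "hom_ref"
            else                                      $i = "hom_alt"
        }
        print
    }'
}
```

Something like vcf_zygosity input.vcf > zygosity.txt should then give a tab-separated grid with one row per variant and one column per sample. Note that a call such as 1/2 (two different ALT alleles) is labeled het here; adjust if you need to treat it separately.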
Thanks, this seems like the quickest method. I only worry because the Broad's documentation on generating tables from VCF files warns very sternly against parsing a VCF file without a dedicated tool:
No, really, don't write your own parser if you can avoid it. This is not a comment on how smart or how competent we think you are -- it's a comment on how annoyingly obtuse and convoluted the VCF format is.
However, my VCF file seems valid and I don't see any weird output, so thanks again.
You are looking for a conversion to PLINK format, which you can do with VCFtools --plink; see this page.
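A minimal sketch of that conversion, assuming vcftools is installed and on your PATH. The wrapper name vcf_to_plink is hypothetical; --vcf, --plink, and --out are the VCFtools options for the input file, PED/MAP output, and the output prefix.

```shell
# Hypothetical wrapper; requires vcftools to be installed.
# $1: input VCF file, $2: output prefix (vcftools writes $2.ped and $2.map)
vcf_to_plink() {
    vcftools --vcf "$1" --plink --out "$2"
}
```

For example, vcf_to_plink input.vcf mydata should produce mydata.ped and mydata.map.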
Isn't that pretty close to how a vcf file naturally looks?
That's a bit of an understatement. Good that you're trying to find an alternative!