Question

Get genotype GT from multi-sample VCF (per chromosome)

1

Entering edit mode

5.4 years ago

zx8754 11k

Example input: multi-sample VCF (adapted from www.internationalgenome.org):

Note: my actual file is bgzipped and tabixed with ~2Mln variants (rows) and ~1000 samples (columns).

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0/0:48:1:51,51 1/0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0/0:49:3:58,50 0/1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1/2:21:6:23,27 2/1:2:0:18,2   2/2:35:4
21     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0/0:54:7:56,60 0/0:48:4:51,51 0/0:61:2
21     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

Expected output:

20.txt

NA00001 NA00002 NA00003
0/0 1/0 1/1
0/0 0/1 0/0
1/2 1/2 2/2

21.txt

NA00001 NA00002 NA00003
0/0 0/0 0/0
0/1 0/2 1/1

I was thinking of using cut | sed combo with some regex, but thought there must be already some tool out there, maybe bcftools (couldn't get the right flags to work) ?

Any other ideas?

vcf subset genotype bcftools • 3.4k views

ADD COMMENT • link updated 5.4 years ago by Pierre Lindenbaum 161k • written 5.4 years ago by zx8754 11k

2

Entering edit mode

5.4 years ago

finswimmer 16k

To get file per chromosome use this:

$ tabix -l input.vcf.gz|parallel 'bcftools query -r {} -H -f "[%GT\t]\n" input.vcf.gz | sed "s/\t$//"> {}.txt'

fin swimmer

ADD COMMENT • link 5.4 years ago by finswimmer 16k

0

Entering edit mode

I recommend this as it automates chromosome names (regions)

ADD REPLY • link 5.4 years ago by Ram 43k

1

Entering edit mode

5.4 years ago

trausch ★ 1.9k

bcftools query should work.

bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  <input.vcf>

ADD COMMENT • link 5.4 years ago by trausch ★ 1.9k

0

Entering edit mode

Looks good, but '%CHROM\t%POS\t%REF\t%ALT' is not in the requested example output.

ADD REPLY • link 5.4 years ago by WouterDeCoster 47k

2

Entering edit mode

I guess, he could just split and then cut away the first column.

bcftools query -f  '%CHROM[\t%GT]\n' <input.bcf> | awk '{print $0 >> "chrom"$1".bed"}'

ADD REPLY • link 5.4 years ago by trausch ★ 1.9k

0

Entering edit mode

No need for the chrom in the output filename though, but yeah your solution is elegant :-)

ADD REPLY • link 5.4 years ago by Ram 43k

0

Entering edit mode

Plus, OP needs per-chromosome output files. This is definitely a major step in the solution, but it needs wrappers around it to get from one end to the other. It does address the gist of the problem though.

ADD REPLY • link 5.4 years ago by Ram 43k

1

Entering edit mode

5.4 years ago

WouterDeCoster 47k

This should work but is not pretty:

cat <(grep '#CHROM' | cut -f10-) <(grep -v '^#' variants.vcf | cut -f10- | sed -e 's|:[^\t]*\t|\t|g' | cut -f1 -d':') > genotypes.txt

cat two greps together
first grep gets the desired part of the header
second grep gets all variants and does some sed/cut processing to get the genotypes

I'm sure the sed could be improved to remove the need of the final cut (required to get the genotype of the last sample)

ADD COMMENT • link 5.4 years ago by WouterDeCoster 47k

score 3 · Accepted Answer · 2018-11-20

3

Entering edit mode

5.4 years ago

Ram 43k

From your expected output, it is evident that you wish to not split multi-allelic records. I strongly recommend splitting them as tools become more predictable with normalized variants.

You can use

bcftools query -r <chr> -Hf '[ %GT]\n' # to get genotypes

to get this output, but beware, the header will be a bit wonky, in the format [<col_index>]<sample_name>:GT. You can sed replace the \[\d+\]([^:]+):GT with \1 to clean up the header and get to your desired output.

for chr in 20 21
do
bcftools query -Ov -r ${chr} -Hf '[ %GT]\n' | sed -r '1s#\[\d+\]([^:]+):GT#\1#g' >${chr}.txt #command untested, use with caution
done

ADD COMMENT • link 5.4 years ago by Ram 43k

0

Entering edit mode

Yes, you are right, I have some multiallelic variants.

ADD REPLY • link 5.4 years ago by zx8754 11k

2

Entering edit mode

In my experience, I've encountered bugs when multi-allelic variants were not split, but I use bcftools heavily and bcftools has problems with multi-allelic variants so I might be biased. bcftools can do almost everything you need though, so should you choose to use it as a go-to tool, I'd recommend adding a bcftools norm -m- before any other bcftools command.

ADD REPLY • link 5.4 years ago by Ram 43k

score 3 · Accepted Answer · 2018-11-20

3

Entering edit mode

5.4 years ago

Pierre Lindenbaum 161k

bcftools annotate -x '^FORMAT/GT' input.vcf.gz | grep -v "^##" | cut -f 10-