Get genotype GT from multi-sample VCF (per chromosome)
5
1
Entering edit mode
5.4 years ago
zx8754 11k

Example input: multi-sample VCF (adapted from www.internationalgenome.org):

Note: my actual file is bgzipped and tabixed with ~2Mln variants (rows) and ~1000 samples (columns).

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0/0:48:1:51,51 1/0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0/0:49:3:58,50 0/1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1/2:21:6:23,27 2/1:2:0:18,2   2/2:35:4
21     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0/0:54:7:56,60 0/0:48:4:51,51 0/0:61:2
21     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

Expected output:

20.txt

NA00001 NA00002 NA00003
0/0 1/0 1/1
0/0 0/1 0/0
1/2 1/2 2/2

21.txt

NA00001 NA00002 NA00003
0/0 0/0 0/0
0/1 0/2 1/1

I was thinking of using cut | sed combo with some regex, but thought there must be already some tool out there, maybe bcftools (couldn't get the right flags to work) ?

Any other ideas?

vcf subset genotype bcftools • 3.4k views
ADD COMMENT
3
Entering edit mode
5.4 years ago
Ram 43k

From your expected output, it is evident that you wish to not split multi-allelic records. I strongly recommend splitting them as tools become more predictable with normalized variants.

You can use

bcftools query -r <chr> -Hf '[ %GT]\n' # to get genotypes

to get this output, but beware, the header will be a bit wonky, in the format [<col_index>]<sample_name>:GT. You can sed replace the \[\d+\]([^:]+):GT with \1 to clean up the header and get to your desired output.

for chr in 20 21
do
bcftools query -Ov -r ${chr} -Hf '[ %GT]\n' | sed -r '1s#\[\d+\]([^:]+):GT#\1#g' >${chr}.txt #command untested, use with caution
done
ADD COMMENT
0
Entering edit mode

Yes, you are right, I have some multiallelic variants.

ADD REPLY
2
Entering edit mode

In my experience, I've encountered bugs when multi-allelic variants were not split, but I use bcftools heavily and bcftools has problems with multi-allelic variants so I might be biased. bcftools can do almost everything you need though, so should you choose to use it as a go-to tool, I'd recommend adding a bcftools norm -m- before any other bcftools command.

ADD REPLY
3
Entering edit mode
5.4 years ago
bcftools annotate -x '^FORMAT/GT' input.vcf.gz | grep -v "^##" | cut -f 10-
ADD COMMENT
1
Entering edit mode

I'm surprised you did not recommend BioAlcidae :-)

ADD REPLY
2
Entering edit mode
5.4 years ago

To get file per chromosome use this:

$ tabix -l input.vcf.gz|parallel 'bcftools query -r {} -H -f "[%GT\t]\n" input.vcf.gz | sed "s/\t$//"> {}.txt'

fin swimmer

ADD COMMENT
0
Entering edit mode

I recommend this as it automates chromosome names (regions)

ADD REPLY
1
Entering edit mode
5.4 years ago
trausch ★ 1.9k

bcftools query should work.

bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  <input.vcf>
ADD COMMENT
0
Entering edit mode

Looks good, but '%CHROM\t%POS\t%REF\t%ALT' is not in the requested example output.

ADD REPLY
2
Entering edit mode

I guess, he could just split and then cut away the first column.

bcftools query -f  '%CHROM[\t%GT]\n' <input.bcf> | awk '{print $0 >> "chrom"$1".bed"}'
ADD REPLY
0
Entering edit mode

No need for the chrom in the output filename though, but yeah your solution is elegant :-)

ADD REPLY
0
Entering edit mode

Plus, OP needs per-chromosome output files. This is definitely a major step in the solution, but it needs wrappers around it to get from one end to the other. It does address the gist of the problem though.

ADD REPLY
1
Entering edit mode
5.4 years ago

This should work but is not pretty:

cat <(grep '#CHROM' | cut -f10-) <(grep -v '^#' variants.vcf | cut -f10- | sed -e 's|:[^\t]*\t|\t|g' | cut -f1 -d':') > genotypes.txt
  • cat two greps together
  • first grep gets the desired part of the header
  • second grep gets all variants and does some sed/cut processing to get the genotypes

I'm sure the sed could be improved to remove the need of the final cut (required to get the genotype of the last sample)

ADD COMMENT

Login before adding your answer.

Traffic: 2447 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6