Question: Get genotype GT from multi-sample VCF (per chromosome)
1
gravatar for zx8754
20 months ago by
zx87549.4k
London
zx87549.4k wrote:

Example input: multi-sample VCF (adapted from www.internationalgenome.org):

Note: my actual file is bgzipped and tabixed with ~2Mln variants (rows) and ~1000 samples (columns).

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0/0:48:1:51,51 1/0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0/0:49:3:58,50 0/1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1/2:21:6:23,27 2/1:2:0:18,2   2/2:35:4
21     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0/0:54:7:56,60 0/0:48:4:51,51 0/0:61:2
21     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

Expected output:

20.txt

NA00001 NA00002 NA00003
0/0 1/0 1/1
0/0 0/1 0/0
1/2 1/2 2/2

21.txt

NA00001 NA00002 NA00003
0/0 0/0 0/0
0/1 0/2 1/1

I was thinking of using cut | sed combo with some regex, but thought there must be already some tool out there, maybe bcftools (couldn't get the right flags to work) ?

Any other ideas?

subset genotype bcftools vcf • 681 views
ADD COMMENTlink modified 20 months ago by Pierre Lindenbaum129k • written 20 months ago by zx87549.4k
3
gravatar for RamRS
20 months ago by
RamRS27k
Houston, TX
RamRS27k wrote:

From your expected output, it is evident that you wish to not split multi-allelic records. I strongly recommend splitting them as tools become more predictable with normalized variants.

You can use

bcftools query -r <chr> -Hf '[ %GT]\n' # to get genotypes

to get this output, but beware, the header will be a bit wonky, in the format [<col_index>]<sample_name>:GT. You can sed replace the \[\d+\]([^:]+):GT with \1 to clean up the header and get to your desired output.

for chr in 20 21
do
bcftools query -Ov -r ${chr} -Hf '[ %GT]\n' | sed -r '1s#\[\d+\]([^:]+):GT#\1#g' >${chr}.txt #command untested, use with caution
done
ADD COMMENTlink modified 20 months ago • written 20 months ago by RamRS27k

Yes, you are right, I have some multiallelic variants.

ADD REPLYlink written 20 months ago by zx87549.4k
2

In my experience, I've encountered bugs when multi-allelic variants were not split, but I use bcftools heavily and bcftools has problems with multi-allelic variants so I might be biased. bcftools can do almost everything you need though, so should you choose to use it as a go-to tool, I'd recommend adding a bcftools norm -m- before any other bcftools command.

ADD REPLYlink written 20 months ago by RamRS27k
3
gravatar for Pierre Lindenbaum
20 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:
bcftools annotate -x '^FORMAT/GT' input.vcf.gz | grep -v "^##" | cut -f 10-
ADD COMMENTlink written 20 months ago by Pierre Lindenbaum129k
1

I'm surprised you did not recommend BioAlcidae :-)

ADD REPLYlink written 20 months ago by RamRS27k
2
gravatar for finswimmer
20 months ago by
finswimmer13k
Germany
finswimmer13k wrote:

To get file per chromosome use this:

$ tabix -l input.vcf.gz|parallel 'bcftools query -r {} -H -f "[%GT\t]\n" input.vcf.gz | sed "s/\t$//"> {}.txt'

fin swimmer

ADD COMMENTlink modified 20 months ago • written 20 months ago by finswimmer13k

I recommend this as it automates chromosome names (regions)

ADD REPLYlink written 20 months ago by RamRS27k
1
gravatar for trausch
20 months ago by
trausch1.5k
Germany
trausch1.5k wrote:

bcftools query should work.

bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT[\t%GT]\n'  <input.vcf>
ADD COMMENTlink written 20 months ago by trausch1.5k

Looks good, but '%CHROM\t%POS\t%REF\t%ALT' is not in the requested example output.

ADD REPLYlink written 20 months ago by WouterDeCoster44k
2

I guess, he could just split and then cut away the first column.

bcftools query -f  '%CHROM[\t%GT]\n' <input.bcf> | awk '{print $0 >> "chrom"$1".bed"}'
ADD REPLYlink written 20 months ago by trausch1.5k

No need for the chrom in the output filename though, but yeah your solution is elegant :-)

ADD REPLYlink written 20 months ago by RamRS27k

Plus, OP needs per-chromosome output files. This is definitely a major step in the solution, but it needs wrappers around it to get from one end to the other. It does address the gist of the problem though.

ADD REPLYlink written 20 months ago by RamRS27k
1
gravatar for WouterDeCoster
20 months ago by
Belgium
WouterDeCoster44k wrote:

This should work but is not pretty:

cat <(grep '#CHROM' | cut -f10-) <(grep -v '^#' variants.vcf | cut -f10- | sed -e 's|:[^\t]*\t|\t|g' | cut -f1 -d':') > genotypes.txt
  • cat two greps together
  • first grep gets the desired part of the header
  • second grep gets all variants and does some sed/cut processing to get the genotypes

I'm sure the sed could be improved to remove the need of the final cut (required to get the genotype of the last sample)

ADD COMMENTlink written 20 months ago by WouterDeCoster44k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1203 users visited in the last hour