Last update: May 9, 2021
You must normalise your VCF / BCF first; otherwise, this script will not work as expected. You can do this with:
bcftools norm -m-any MyVariants.vcf -Ov > MyVariants.Norm.vcf
I probably should explain what's going on here, too: It is divided into 4 parts (each part indicated by the starting <(bcftools
on each line):
First awk
command: outputs the first few columns from the VCF/BCF
Three BCFtools query
commands: tabulate the number of samples having each variant type (as you
requested). This will work for phased and/or un-phased variants. The output is the 3 columns named nHet
, nHomAlt
, nHomRef
Three BCFtools view
commands: look through the file again, saving sample names into an
array called 'header', and then printing the indices of the array
where a particular field (i.e. sample) has a particular type of
variant. This is repeated for: a, heterozygous (HetSamples
), b, homozygous (alt) (HomSamplesAlt
), and
c, homozygouse (ref) (HomSamplesRef
).
I've tested it and verified results on a handful of 1000 Genome variants. I strongly encourage you to do some rigorous testing.
paste <(bcftools view MyVariants.Norm.vcf |\
awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
!/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
awk 'BEGIN {print "nHomAlt"} {print gsub(/1\|1|1\/1/, "")}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
awk 'BEGIN {print "nHomRef"} {print gsub(/0\|0|0\/0/, "")}') \
\
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HetSamples"} \
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|1|1\|0|0\/1|1\/0/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
\
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesAlt"} \
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/1\|1|1\/1/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
\
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesRef"} \
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|0|0\/0/,"", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
\
| sed 's/,\t/\t/g' | sed 's/,$//g'
CHR POS ID REF ALT nHet nHomAlt nHomRef HetSamples HomSamplesAlt HomSamplesRef
1 10177 1:10177:10177:A:AC A AC 4 0 1 HG00096,HG00097,HG00099,HG00100 HG00101
1 10235 1:10235:10235:T:TA T TA 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 10352 1:10352:10352:T:TA T TA 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 10616 1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C CCGCCGTTGCAAAGGCGCGCCG C 0 5 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 10642 1:10642:10642:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11008 1:11008:11008:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11012 1:11012:11012:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11063 1:11063:11063:T:G T G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13110 1:13110:13110:G:A G A 1 0 4 HG00097 HG00096,HG00099,HG00100,HG00101
1 13116 1:13116:13116:T:G T G 2 0 3 HG00097,HG00101 HG00096,HG00099,HG00100
1 13118 1:13118:13118:A:G A G 2 0 3 HG00097,HG00101 HG00096,HG00099,HG00100
1 13273 1:13273:13273:G:C G C 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13284 1:13284:13284:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13380 1:13380:13380:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13483 1:13483:13483:G:C G C 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13494 1:13494:13494:A:G A G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13550 1:13550:13550:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 14464 1:14464:14464:A:T A T 1 1 3 HG00099 HG00096 HG00097,HG00100,HG00101
1 14599 1:14599:14599:T:A T A 2 0 3 HG00097,HG00099 HG00096,HG00100,HG00101
1 14604 1:14604:14604:A:G A G 2 0 3 HG00097,HG00099 HG00096,HG00100,HG00101
1 14930 1:14930:14930:A:G A G 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 14933 1:14933:14933:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 15211 1:15211:15211:T:G T G 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 15245 1:15245:15245:C:T C T 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 15274 1:15274:15274:A:G A G 3 0 2 HG00096,HG00100,HG00101 HG00097,HG00099
1 15274 1:15274:15274:A:T A T 3 2 0 HG00096,HG00100,HG00101 HG00097,HG00099
Dear Kevin/Sam, these are exactly what I need. However, I am wondering how to output the results into a file (e.g. a text file) instead of printing all of them in the screen. I am sorry that I am a true beginner of this field.
Hey, you just need to use the '
>
' character, which instructs the shell to re-direct the output of the command. The default is to send the command output toSTDOUT
, i.e., the screen.For example, look at these examples:
What is contained in out.out? Now do the same to the other commands that you are using.
Wonderful! This works out perfectly! I am able to save it into a text file by adding '>' at the end of the code. Thanks a million for your help! I have been bothered by the this task (get variant counts (0/1 or 1/1) per SNP and print the sample ID out) for the whole weekend. Finally I found this poster which saved my life!
Okay - great. If you have multi-allelic records in your VCF/BCF, though, then you should split these before running the above script. Multi-allelic records may look like this:
These should be split with
bcftools norm -m-any
and would then become:Also, please note that the script works for both phased and / or un-phased genotypes. Please check some of your output to ensure that it does exactly what you need.
Hi Kevin,
I used your code with some modification to add AC,AN, and AF as I change
(bcftools query -f '[\t%SAMPLE=%GT]\n
MyVariants.Norm.vcf to bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\t%AF[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf
| and also removing the first partBut it neither give me an error nor output. Could you please let me know is the modification right?
Thanks
Hi, I have noticed this code gives some non corrisponding sample numbers and names...