Last update: May 9, 2021
You must normalise your VCF / BCF first; otherwise, this script will not work as expected. You can do this with:
bcftools norm -m-any MyVariants.vcf -Ov > MyVariants.Norm.vcf
I probably should explain what's going on here, too: It is divided into 4 parts (each part indicated by the starting <(bcftools
on each line):
First awk
command: outputs the first few columns from the VCF/BCF
Three BCFtools query
commands: tabulate the number of samples having each variant type (as you
requested). This will work for phased and/or un-phased variants. The output is the 3 columns named nHet
, nHomAlt
, nHomRef
Three BCFtools view
commands: look through the file again, saving sample names into an
array called 'header', and then printing the indices of the array
where a particular field (i.e. sample) has a particular type of
variant. This is repeated for: a, heterozygous (HetSamples
), b, homozygous (alt) (HomSamplesAlt
), and
c, homozygouse (ref) (HomSamplesRef
).
I've tested it and verified results on a handful of 1000 Genome variants. I strongly encourage you to do some rigorous testing.
paste <(bcftools view MyVariants.Norm.vcf |\
awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
!/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
awk 'BEGIN {print "nHomAlt"} {print gsub(/1\|1|1\/1/, "")}') \
\
<(bcftools query -f '[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf |\
awk 'BEGIN {print "nHomRef"} {print gsub(/0\|0|0\/0/, "")}') \
\
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HetSamples"} \
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|1|1\|0|0\/1|1\/0/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
\
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesAlt"} \
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/1\|1|1\/1/, "", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
\
<(bcftools view MyVariants.Norm.vcf | awk -F"\t" '/^#CHROM/ {split($0, header, "\t"); print "HomSamplesRef"} \
!/^#CHROM/ {for (i=10; i<=NF; i++) {if (gsub(/0\|0|0\/0/,"", $(i))==1) {printf header[i]","}; if (i==NF) {printf "\n"}}}') \
\
| sed 's/,\t/\t/g' | sed 's/,$//g'
CHR POS ID REF ALT nHet nHomAlt nHomRef HetSamples HomSamplesAlt HomSamplesRef
1 10177 1:10177:10177:A:AC A AC 4 0 1 HG00096,HG00097,HG00099,HG00100 HG00101
1 10235 1:10235:10235:T:TA T TA 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 10352 1:10352:10352:T:TA T TA 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 10616 1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C CCGCCGTTGCAAAGGCGCGCCG C 0 5 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 10642 1:10642:10642:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11008 1:11008:11008:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11012 1:11012:11012:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 11063 1:11063:11063:T:G T G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13110 1:13110:13110:G:A G A 1 0 4 HG00097 HG00096,HG00099,HG00100,HG00101
1 13116 1:13116:13116:T:G T G 2 0 3 HG00097,HG00101 HG00096,HG00099,HG00100
1 13118 1:13118:13118:A:G A G 2 0 3 HG00097,HG00101 HG00096,HG00099,HG00100
1 13273 1:13273:13273:G:C G C 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13284 1:13284:13284:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13380 1:13380:13380:C:G C G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13483 1:13483:13483:G:C G C 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13494 1:13494:13494:A:G A G 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 13550 1:13550:13550:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 14464 1:14464:14464:A:T A T 1 1 3 HG00099 HG00096 HG00097,HG00100,HG00101
1 14599 1:14599:14599:T:A T A 2 0 3 HG00097,HG00099 HG00096,HG00100,HG00101
1 14604 1:14604:14604:A:G A G 2 0 3 HG00097,HG00099 HG00096,HG00100,HG00101
1 14930 1:14930:14930:A:G A G 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 14933 1:14933:14933:G:A G A 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 15211 1:15211:15211:T:G T G 5 0 0 HG00096,HG00097,HG00099,HG00100,HG00101
1 15245 1:15245:15245:C:T C T 0 0 5 HG00096,HG00097,HG00099,HG00100,HG00101
1 15274 1:15274:15274:A:G A G 3 0 2 HG00096,HG00100,HG00101 HG00097,HG00099
1 15274 1:15274:15274:A:T A T 3 2 0 HG00096,HG00100,HG00101 HG00097,HG00099
Dear Kevin/Sam, these are exactly what I need. However, I am wondering how to output the results into a file (e.g. a text file) instead of printing all of them in the screen. I am sorry that I am a true beginner of this field.
Hey, you just need to use the '
>
' character, which instructs the shell to re-direct the output of the command. The default is to send the command output toSTDOUT
, i.e., the screen.For example, look at these examples:
What is contained in out.out? Now do the same to the other commands that you are using.
Wonderful! This works out perfectly! I am able to save it into a text file by adding '>' at the end of the code. Thanks a million for your help! I have been bothered by the this task (get variant counts (0/1 or 1/1) per SNP and print the sample ID out) for the whole weekend. Finally I found this poster which saved my life!
Okay - great. If you have multi-allelic records in your VCF/BCF, though, then you should split these before running the above script. Multi-allelic records may look like this:
These should be split with
bcftools norm -m-any
and would then become:Also, please note that the script works for both phased and / or un-phased genotypes. Please check some of your output to ensure that it does exactly what you need.
Hi Kevin,
I used your code with some modification to add AC,AN, and AF as I change
(bcftools query -f '[\t%SAMPLE=%GT]\n
MyVariants.Norm.vcf to bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%AC\t%AN\t%AF[\t%SAMPLE=%GT]\n' MyVariants.Norm.vcf
| and also removing the first partBut it neither give me an error nor output. Could you please let me know is the modification right?
Thanks
Hi, I have noticed this code gives some non corrisponding sample numbers and names...
Hi Kevin, this is great. It helped me with a previous analysis. How can we get each sample with their respective genotype nHet, nHomAlt, nHomRef (or Aa, aa, AA), for each SNP?
Hi! I'm trying to parallelize your code (to run on a bunch of files), but it is not working because of multiple processes spawning. Do you have any suggestions on how to parallelize? I tried with for loops and parallel without success
It is difficult for me to comment, as I am not too informed on your IT infrastructure. What I can say is that it would be better to run these piped commands on a single CPU core and to just be patient.
Do you have any suggestions on how to run the commands on a single core for all files in a folder? A simple for loop (like for i in /folder/*vcf) is spawing all processes in a row for many cores, even if I add sleep. Im currently using a centos server without slurm. I usually just use parallel, but this is also spawning hundreds of processes even if I use parallel -j 1
I ended up using a function to run parallel. I also made a small change to reduce the number of child processes (each < creates a new process. In theory, your code could be further optimized to make each unique bcftools call unified with only one awk. I did this with the first part of counting the genotypes)
Great work - thank you for posting the follow-up.