Count number of occurrences an annotation in a multisample vcf split into multiple chunks
Entering edit mode
2.2 years ago
tacrolimus ▴ 140

Dear Biostars,

I have a NGS data for thousands of people merged into a multisample vcf which has been split into genomic chunks (roughly 1000 chunks). These have been annotated with VEP. One of those annotations is the LOFTEE "high confidence" marker.

I want to count per sample the number of LOFTEE "HC" hits there are per sample across all chunks so the final output would simple be the sample ID and a number representing the total number of high confidence markers that person has.

I have tried:

bcftools view -S $sample_list | \
bcftools +split-vep \
-d \
-f '%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Feature\t%Consequence\t%Existing_variation\t%LoF\n' \
-i "LOF='HC'" \
-o output_${i}.txt

where $sample file is an array over all the chunks but I am not getting it per sample ID.

Any help would be greatly appreciated.

bcftools vcf annotation • 783 views
Entering edit mode
2.2 years ago

Correct me if I'm wrong, but the LOFTEE annotation depends on the variant and not on the genotype, therefore the LOF='HC' annotation would be inside the INFO column. If that'd the case, I'd suggest a search+split+count strategy rather than a split+search+count one.

Firstly extract LOF='HC' variants to a new multisample file

bcftools view -i ''LOF='HC'" -Oz -o multisample.LOF_HC.vcf.gz multisample.vcf.gz

and secondly split that new file by sample and count private variants in it

for sample in `bcftools query -l multisample.LOF_HC.vcf.gz`; do
 echo -n "$sample "
 bcftools view -c1 -s $sample multisample.LOF_HC.vcf.gz | grep -cv ^#
done > samples.txt

Note that this code is not tested, and that the split loop can take quite a while if the number of samples is large. If that'd be the case, then I'd go for a programmed solution by parsing the file directly:

zgrep -v ^## multisample.vcf.gz \
| perl -lane 'if (/^#/) {
 foreach (@F) { push @h, $_; $s{$_} = 0 }
} elsif ($F[7] =~ /LOF='HC'/) {
 for ($i = 9; $i <= $#F; $i++) {
  ($gt = $F[$i]) =~ s/:.+//;
  if ($gt =~ /[1-9]/) { $s{$h[$i]}++ }
} END {
 foreach (@h[9..$#h]) { print "$_\t$s{$_}" }
}' > samples.txt

(This code has been successfully tested on 1000g multisample files)

Entering edit mode

Jorge Amigo thank you so much for taking the time to reply! You are right, this is a much better strategy. Will set it running and let you know if any troubles (its 88000 people in a chunked multisample vcf file).

All the best


Login before adding your answer.

Traffic: 2241 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6