Find groups of SNP in a multi sample VCF with bcftools
2
0
Entering edit mode
9 months ago
Axzd ▴ 70

Hello, I have a multisample VCF, 10 samples + one reference sample. I am interesting in group of samples. For examples, if there is this SNP: Chrom_3:1258955 0/1 for sample 1, sample2, sample3, sample4 but that the genotype of all other samples is 0/0, I would like to find it.

So far I have tried

bcftools view -i 'GT[0-3]="alt" & GT[4-10]="ref"' merged_snps_only.bcf.gz

However, this returns nothing. but I do know that there are lines in the vcf that actually satisfies this condition. I extracted the genotypes only and used awk to check that indeed there are SNP that are 0/1 for the first 4 samples and 0/0 for all the others.

But I am puzzled why bcftools doesn't return me the same as awk.

Any idea?

Thanks all :)

EDIT: this is what I did and showed me my condition is indeed satisfied

bcftools query -e'GT ="."' -f'%CHROM %POS [ %GT]\n' merged_snps_only.bcf.gz > merged_genotypes.only  

perl -pie 's/\|/\//g' merged_genotypes.only    #for convenience

awk '$3=="0/1" && $4=="0/1" && $5=="0/1" && $6=="0/1" && $7=="0/0" && $8=="0/0" && $9=="0/0" && $10=="0/0" && $11=="0/0" && $12=="0/0" && $13=="0/0"' merged_genotypes.only > group1_4ALT_others_ref

Output of awk

Chrom_3 1163847  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1163852  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1163862  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1163864  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1163867  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1163870  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1163880  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1343041  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1378584  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
Chrom_3 1565128  0/1 0/1 0/1 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0

Then doing

awk '{print $1"\t"$2}' group1_4ALT_others_ref > group1_4ALT_others_ref.coordinates

bcftools view -R group1_4ALT_others_ref.coordinates merged_snps_only.bcf.gz > group1_4ALT_others_ref.vcf

Gives me the vcf I need. I still suspect there must be a less convoluted way of doing this.

vcf SNP bcftools • 902 views
ADD COMMENT
2
Entering edit mode
9 months ago

using vcffilterjdk https://jvarkit.readthedocs.io/en/latest/VcfFilterJdk/

java -jar jvarkit.jar vcffilterjdk  -e 'final Set<String> samples= new HashSet<>(Arrays.asList("S1","S2","S3")); return samples.stream().map(S->variant.getGenotype(S)).allMatch(G->G.isHet()) && variant.getGenotypes().stream().filter(G->!samples.contains(G.getSampleName())).allMatch(G->G.isHomRef());'  in.vcf
ADD COMMENT
0
Entering edit mode

Hello Pierre, this works like a charm.

I am still wondering what's wrong with my bcftools command.

Could you give guidance as how to learn to use your program and form queries with it? It seems insanely powerful but I am a bit at loss with the syntax. Thanks

Edit, how do I exclude the sites which have more than 2 alleles? (phased or not).

ADD REPLY
2
Entering edit mode
9 months ago
pd3 ▴ 60

The symbol & in your expression requests that the condition must be satisfied within one sample. Use && instead. Note there is also the plugin bcftools +contrast which might is also suitable for tasks like this

ADD COMMENT
0
Entering edit mode

Hello, thanks for your answer and sorry for the very long delay, I had missed the notifications. I find the doc of +contrast quite confusing. What's the difference between

bcftools -s S1 -x all_merged.vcf > S1.private.vcf

And

bcftools +contrast -0 S0,S2,S3,S4 -1 S1 > S1.private.vcf

The contrast output is much, much bigger than the first command, actually, it contains almost all SNPs, which are common to all samples, so I don't understand what is contrasted. I feel I don't understand what the +contrast is doing. Thanks

EDIT: I think my problem was that I forgot the -a field. If I do the command

bcftools +contrast -a NOVELGT -0 S0,S2,S3,S4 -1 S1 > S1.private.vcf

This looks like a much better command ...

ADD REPLY

Login before adding your answer.

Traffic: 3126 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6