How to remove shared SNPs from a VCF with multiple individuals
15 months ago
cuamatzi • 0


I have a VCF produced by MafFilter with 29 samples, in the following format (trimmed to 5 strains for easier reading):

##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=gap,Description="At least one sequence contains a gap">
##FILTER=<ID=unk,Description="At least one sequence contains an unresolved character">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Reference   Strain01    Strain02    Strain03    Strain04    Strain05 
chr09   191 .   G   A   .   PASS    AC=1    GT  0   1   0   0   0   0
chr09   1229    .   T   C   .   PASS    AC=1    GT  0   0   0   1   0   0
chr09   1233    .   T   G   .   PASS    AC=1    GT  0   0   0   1   0   0
chr03   121013  .   G   T   .   PASS    AC=29   GT  0   1   1   1   1   1
chr03   121017  .   G   A   .   PASS    AC=29   GT  0   1   1   1   1   1
chr16   551745  .   T   A   .   PASS    AC=28   GT  0   0   1   1   1   1
chr16   552420  .   A   G   .   PASS    AC=26   GT  0   1   1   0   1   1

This VCF derives from a multiple genome alignment. Reference is my reference genome, Strain01 is a collection strain, and Strain02-29 are clones derived from Strain01 that were exposed to mutagens.

I'd like to remove all the SNPs present in Strain01 from the rest of my strains.

I used the following bcftools command:

bcftools view -e'AC=29' input.vcf.gz | bgzip -c > output.vcf.gz

This excludes all variants with AC=29 (that is, variants present in all 29 strains). However, in some cases one or more strains lack a SNP that Strain01 and the remaining strains carry (e.g. AC=26 or AC=28). I could set a threshold (e.g. 20) and use:

bcftools view -e'AC>20' input.vcf.gz | bgzip -c > output.vcf.gz

But some strains could still carry SNPs that are present in Strain01. I was thinking of splitting the VCF into individual VCF files, one per strain, and then using bcftools isec or vcf-isec, but I'd prefer to work with the full VCF.

Is there a tool or command where I can indicate Strain01 as my background and remove its contribution from all my strains?

Thank you in advance!

15 months ago

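I'm not sure I fully understand the question, but you can use jvarkit vcffilterjdk. The following command returns every variant where at least one of Strain02-29 has a genotype different from Strain01. The Reference column is skipped explicitly, because its genotype is always 0 and would otherwise count as "different" at every site where Strain01 carries the ALT allele:

java -jar ${JVARKIT_DIST}/vcffilterjdk.jar -e 'final Genotype g=variant.getGenotype("Strain01"); return variant.getGenotypes().stream().filter(G->!G.getSampleName().equals(g.getSampleName()) && !G.getSampleName().equals("Reference")).anyMatch(G->!G.sameGenotype(g));' in.vcf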
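If the goal is simply to drop every site where Strain01 itself carries an ALT allele, whatever the other strains show, a plain awk filter may be enough. This is only a minimal sketch under some assumptions: the VCF is tab-delimited, GT is the first FORMAT field, and the helper name `drop_background` is made up for illustration. Missing genotypes (`.`) in the background sample are treated as "not ALT", so those sites are kept.

```shell
# Hypothetical helper: remove every record where the background sample
# (first argument) carries a non-reference allele.
# Reads an uncompressed VCF on stdin, writes the filtered VCF to stdout.
drop_background() {
  awk -v bg="$1" '
    BEGIN { FS = OFS = "\t" }
    /^##/ { print; next }                     # meta lines pass through unchanged
    /^#CHROM/ {                               # header: find the background column
      for (i = 10; i <= NF; i++) if ($i == bg) col = i
      if (!col) { print "sample not found: " bg > "/dev/stderr"; exit 1 }
      print; next
    }
    {
      split($col, f, ":")                     # assumes GT is the first FORMAT field
      if (f[1] ~ /[1-9]/) next                # background has an ALT allele: drop site
      print
    }
  '
}
```

It could be used along these lines: `bcftools view input.vcf.gz | drop_background Strain01 | bgzip -c > output.vcf.gz`. Note that a site is removed for all samples at once, so a clone that reverted to the reference allele at a Strain01 SNP will not show up in the output; the jvarkit command above is the better fit if those sites matter.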
