Question: find homozygous SNPs between two samples's VCF files.
0
gravatar for pramodkp97
2.4 years ago by
pramodkp970
pramodkp970 wrote:

I have done variant calling by bcftools and got vcf files for 18 different samples. Now I want to get the number of SNPs between the samples. Ref or ALT allele should be in the homozygous condition. if one samples has homozygous Ref allele and other sample has homozygous allele then take them into consideration but if both the samples has homozygous ALT allele then ignore that becouse this would not be SNP between the samples. I want to ignore the hetero condition of alleles.

I tried vcf-compare that does not seems answering what i need.

snp vcffile assembly genome • 1.6k views
ADD COMMENTlink modified 2.4 years ago by William4.7k • written 2.4 years ago by pramodkp970

do you have 18 VCF or one VCF with 18 samples ? Also , do you have searched biostars.org for similar posts ?

ADD REPLYlink written 2.4 years ago by Pierre Lindenbaum129k

I have 18 vcf files and i also single vcf file for 18 samples made by vcf-merge command. I am okay to choose any one if the solution is there. I have tried searching in biostars.

ADD REPLYlink written 2.4 years ago by pramodkp970
1
gravatar for Pierre Lindenbaum
2.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

single vcf file for 18 sample

using bioalcidaejdk and a multi-sample vcf: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

loop over each genotype (gx) vs another genotype(gy) , check ( gx.isHomRef && gy.isHomVar) OR (gy.isHomRef && gx.isHomVar)

 gunzip -c input.vcf.gz  |\
java -jar dist/bioalcidaejdk.jar -e 'final Map<String,Long> count = new HashMap<>(); stream().forEach(V->{for(int x=0;x+1< V.getNSamples();x++) {Genotype gx=V.getGenotype(x);if(!gx.isHom()) continue; for(int y=x+1;y< V.getNSamples();++y) {Genotype gy=V.getGenotype(y);if(!gy.isHom()) continue; if(!((gx.isHomVar() && gy.isHomRef()) || (gx.isHomRef() && gy.isHomVar()))) continue; final String pair=gx.getSampleName()+"\t"+gy.getSampleName();long n=count.getOrDefault(pair,0L);count.put(pair,n+1L);}}});count.forEach((K,V)->println(K+"\t"+V));' -F VCF


SAMPLEHOOA  SAMPLEHOZI  10
SAMPLEHOQL  SAMPLE0PU4  2
SAMPLE0Q1Q  SAMPLE0Q3Z  2
SAMPLEHOST  SAMPLE0Q09  2
SAMPLEHONL  SAMPLE0Q1N  2
(...)
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Pierre Lindenbaum129k

Thank you for the help but there is problem whicle running this command.

java -jar bioalcidaejdk.jar -e 'final Map<String,Long> count = new HashMap<>(); stream().forEach(V->{for(int x=0;x+1< V.getNSamples();x++) {Genotype gx=V.getGenotype(x);if(!gx.isHom()) continue; for(int y=x+1;y< V.getNSamples();++y) {Genotype gy=V.getGenotype(y);if(!gy.isHom()) continue; if(!((gx.isHomVar() && gy.isHomRef()) || (gx.isHomRef() && gy.isHomVar()))) continue; final String pair=gx.getSampleName()+"\t"+gy.getSampleName();long n=count.getOrDefault(pair,0L);count.put(pair,n+1L);}}});count.forEach((K,V)->println(K+"\t"+V));' -F VCF all18_samples_merged.vcf

It generated following:

[DEBUG][BioAlcidaeJdk] Compiling : 1 import java.util.; 2 import java.util.stream.; 3 import java.util.regex.; 4 import java.util.function.; 5 import htsjdk.samtools.; 6 import htsjdk.samtools.util.; 7 import htsjdk.variant.variantcontext.; 8 import htsjdk.variant.vcf.; 9 import com.github.lindenb.jvarkit.util.bio.fasta.FastaSequence; 10 import javax.annotation.Generated; 11 import htsjdk.variant.vcf.; 12 /* begin user's packages / 13 /* end user's packages */ 14 @Generated(value="BioAlcidaeJdk",date="2018-03-27T15:38:35+0530") 15 public class BioAlcidaeJdkCustom461705101 extends com.github.lindenb.jvarkit.tools.bioalcidae.BioAlcidaeJdk.VcfHandler { 16 public BioAlcidaeJdkCustom461705101() { 17 } 18 @Override 19 public void execute() throws Exception { 20 // user's code starts here 21 final Map<string,long> count = new HashMap<>(); stream().forEach(V->{for(int x=0;x+1< V.getNSamples();x++) {Genotype gx=V.getGenotype(x);if(!gx.isHom()) continue; for(int y=x+1;y< V.getNSamples();++y) {Genotype gy=V.getGenotype(y);if(!gy.isHom()) continue; if(!((gx.isHomVar() && gy.isHomRef()) || (gx.isHomRef() && gy.isHomVar()))) continue; final String pair=gx.getSampleName()+"\t"+gy.getSampleName();long n=count.getOrDefault(pair,0L);count.put(pair,n+1L);}}});count.forEach((K,V)->println(K+"\t"+V)); 22 //user's code ends here 23 } 24 }

[WARN][SnpEffPredictionParser]no INFO[EFF] or no description. This VCF was probably NOT annotated with SnpEff (old version). But it's not a problem if this tool doesn't need to access SnpEff Annotations.

[WARN][VepPredictionParser]NO INFO[CSQ] found in header. This VCF was probably NOT annotated with VEP. But it's not a problem if this tool doesn't need to access VEP Annotations.

[WARN][AnnPredictionParser]no INFO[ANN] or no description This VCF was probably NOT annotated with SnpEff(ANN version) . But it's not a problem if this tool doesn't need to access SnpEff Annotations.

PLEASE give your suggestion what i am doing wrong.

Do i need to compress into .gz format and run exactly what you have suggested.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by pramodkp970

there is no problem, these are just some basic messages. You can hide them by redirecting stderr

java -jar bioalcidaejdk.jar -e '(...)' -F VCF all18_samples_merged.vcf 2> /dev/null

if there is no output, it means that there is no genotype matching your criteria.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Pierre Lindenbaum129k

after

... count.forEach((K,V)->println(K+"\t"+V));'

you can print the number of items in 'count'

... count.forEach((K,V)->println(K+"\t"+V)); System.out.println("SIZE:"+count.size());'
ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Pierre Lindenbaum129k

yes It is not generating any output but in reality there should be output of some SNPs between the samples.

ADD REPLYlink written 2.4 years ago by pramodkp970

show me a VCF line containing a positive test please.

ADD REPLYlink written 2.4 years ago by Pierre Lindenbaum129k
----------
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  filtered_16S_1-1_default_sort.bam       filtered_16S_2-1_default_sort.bam       filtered_16S_3-1_default_sort.bam       filtered_16S_4-1_default_sort.bam       filtered_16S_5-1_default_sort.bam       filtered_16S_6-1_default_sort.bam       filtered_16S_7-1_default_sort.bam       filtered_16S_8-1_default_sort.bam       filtered_16S_9-1_default_sort.bam       filtered_16S_10-1_default_sort.bam      filtered_16S_11-1_default_sort.bam      filtered_16S_12-1_default_sort.bam      filtered_16S_13-1_default_sort.bam      filtered_16S_14-1_default_sort.bam      filtered_16S_15-1_default_sort.bam      filtered_16S_16-1_default_sort.bam      filtered_16S_17-1_default_sort.bam      filtered_16S_18-1_default_sort.bam     
----------
#chr01   28836   .       C       T       71.60   .       AC1=2;AC=6;AF1=1;AN=6;DP4=0,0,12,0;DP=12;FQ=-33;MQ=42;SF=0,5,13;VDB=5.960000e-02        GT:PL:GQ        1/1:67,6,0:10   .       .       .       .       1/1:152,21,0:39 .       .       .       .       .       .       .       1/1:92,9,0:16   .       .       .       .

Note: here .(dot) represent that the genotype is homozygous reference allele i.e. dot==CC allele. here 3 samples having ALTERNATIVE homozygous allele (1/1) and remaining 15 samples are .(dot). So what i want is that it should be SNP between between sample # 1 and aall 15 samples which are dot. sample # 6 and samples which are having dot and also for sample # 14 and all 15 samples which are dott.

but it should not be considered as SNP between samples 1, 6 and 14 becoz all are 1/1 means homozygous ALT alllele.

ADD REPLYlink modified 2.4 years ago by Pierre Lindenbaum129k • written 2.4 years ago by pramodkp970

here .(dot) represent that the genotype is homozygous reference

how am I supposed to know that ? in the VCF spec this is missing data . You didn't mention it anywhere.

The API handling the genotypes is here: https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/variantcontext/Genotype.html : I leave it as an exercice to modify my script.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Pierre Lindenbaum129k
0
gravatar for William
2.4 years ago by
William4.7k
Europe
William4.7k wrote:

This can be done 2 two bcftools commands piped together.

The first command selects the two samples of interest out of a VCF that can have multiple samples The second commands filters on that 4 alleles have to be called in total, of which 2 alternative and excludes variants were any of the two samples is heterozygous.

bcftools view -s sample1,sample2 | bcftools view -i 'AN==4 & AC=2'  --genotype ^het
ADD COMMENTlink written 2.4 years ago by William4.7k

I have VCF file named all18_sample.vcf . for first command i tried bcftools view -s sample1.bam,sample2.bam all18_sample.vcf

This is what it shows : <bcf_hdr_subsam> 2 samples in the list but not in BCF

There is no output .

ADD REPLYlink written 2.4 years ago by pramodkp970
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1411 users visited in the last hour