Question

Bcftools isec output only contains one sample name - how to identify which samples share the same variants

0

Entering edit mode

4.4 years ago

Linda ▴ 80

I have read this post: How bcftools isec works ? to further understand the output generated by bcftools isec.

I'm trying to find variants that are common by location in at least 2 files:

##bcftools_isecCommand=isec -p isec_out -n +2 UNC2FT198_Imtechella_halotolerans.fltq.vcf.gz UNC2FT2114_Imtechella_halotolerans.fltq.vcf.gz UNC2FT4154_Imtechella_halotolerans.fltq.vcf.gz UNC2FT55199_Imtechella_halotolerans.fltq.vcf.gz UNC2FT7534A_Imtechella_halotolerans.fltq.vcf.gz UNC2lu13_Imtechella_halotolerans.fltq.vcf.gz; Date=Wed Jun 23 15:57:36 2021

After generating the intersections of multiple vcf files I noticed in the last file containing the intersections only one sample name is present. How can I know from which samples the variants are common for?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  UNC2lu13_vs_combined.sorted.bam
Imtechella_halotolerans_length_3113269  169317  .   TG  TGTAAG  148 PASS    INDEL;IDV=10;IMF=0.625;DP=16;VDB=0.999554;SGB=-0.670168;MQSB=0.0871515;MQ0F=0;AC=1;AN=1;DP4=1,5,10,0;MQ=24  GT:PL   1:175,0
Imtechella_halotolerans_length_3113269  1450920 .   G   T   141 PASS    DP=46;VDB=0.0516381;SGB=-0.692562;RPB=0.369907;MQB=0.00470369;MQSB=0.000385694;BQB=0.917285;MQ0F=0;AC=1;AN=1;DP4=17,3,7,15;MQ=12    GT:PL   1:168,0
Imtechella_halotolerans_length_3113269  1455993 .   A   T   195 PASS    DP=30;VDB=0.63518;SGB=-0.693021;RPB=1;MQB=1;MQSB=0.0374987;BQB=1;MQ0F=0;AC=1;AN=1;DP4=0,1,25,2;MQ=28    GT:PL   1:222,0
Imtechella_halotolerans_length_3113269  1633169 .   T   C   60  PASS    DP=34;VDB=0.0688269;SGB=-0.693127;MQSB=0.346427;MQ0F=0;AC=1;AN=1;DP4=0,0,5,28;MQ=9  GT:PL   1:90,0
Imtechella_halotolerans_length_3113269  2005523 .   C   T   195 PASS    DP=64;VDB=0.0037203;SGB=-0.693144;RPB=0.727216;MQB=7.61251e-07;MQSB=9.95686e-06;BQB=0.86319;MQ0F=0;AC=1;AN=1;DP4=2,19,25,14;MQ=17   GT:PL   1:222,0

If it was such a case where I was looking to find variants common across all samples I would run the command with n =6 in this instance but that isn't the case.

I'm just wondering if there's a more illuminating way to identify common variants by position across multiple samples. Any insight greatly appreciated, thanks!

bcftools vcf isec • 2.4k views

ADD COMMENT • link updated 4.4 years ago by Kevin Blighe 89k • written 4.4 years ago by Linda ▴ 80

score 2 · Accepted Answer · 2021-06-23

Hi Linda,

To be honest, I have become so displeased with bcftools isec over the years such that I never use it anymore.

To get what you want, I would do the following:

1, Normalise (split multi-allelic calls + left-align indels) each VCF and set a unique identifier for `ID` field:

bcftools norm \
  -m-any \
  --check-ref w -f ref.fasta \
  UNC2FT198_Imtechella_halotolerans.fltq.vcf.gz | \
bcftools annotate \
  -x ID -I +'%CHROM:%POS:%REF:%ALT' -Oz
  > UNC2FT198_Imtechella_halotolerans.fltq.norm.vcf.gz ;
tabix -p vcf UNC2FT198_Imtechella_halotolerans.fltq.norm.vcf.gz ;

et cetera

If you genuinely don't care about the variant (base change), and just care about the position, then use

  -x ID -I +'%CHROM:%POS

2, merge all files together

bcftools merge \
  UNC2FT198_Imtechella_halotolerans.fltq.norm.vcf.gz \
  UNC2FT2114_Imtechella_halotolerans.fltq.norm.vcf.gz \
  .. .. -Oz \
  > merge.vcf.gz ;
tabix -p vcf merge.vcf.gz ;

3, produce tabular output of REF and ALT het and hom allele counts

see my code here:

How to get sample names and genotype for SNP in multi-sample VCF file

Granted, there are likely other ways to do this.

Kevin

1, Normalise (split multi-allelic calls + left-align indels) each VCF and set a unique identifier for ID field:

2, merge all files together

3, produce tabular output of REF and ALT het and hom allele counts

1, Normalise (split multi-allelic calls + left-align indels) each VCF and set a unique identifier for `ID` field: