Bcftools isec output only contains one sample name - how to identify which samples share the same variants
1
0
Entering edit mode
2.8 years ago
Linda ▴ 80

I have read this post: How bcftools isec works ? to further understand the output generated by bcftools isec.

I'm trying to find variants that are common by location in at least 2 files:

##bcftools_isecCommand=isec -p isec_out -n +2 UNC2FT198_Imtechella_halotolerans.fltq.vcf.gz UNC2FT2114_Imtechella_halotolerans.fltq.vcf.gz UNC2FT4154_Imtechella_halotolerans.fltq.vcf.gz UNC2FT55199_Imtechella_halotolerans.fltq.vcf.gz UNC2FT7534A_Imtechella_halotolerans.fltq.vcf.gz UNC2lu13_Imtechella_halotolerans.fltq.vcf.gz; Date=Wed Jun 23 15:57:36 2021

After generating the intersections of multiple vcf files I noticed in the last file containing the intersections only one sample name is present. How can I know from which samples the variants are common for?

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  UNC2lu13_vs_combined.sorted.bam
Imtechella_halotolerans_length_3113269  169317  .   TG  TGTAAG  148 PASS    INDEL;IDV=10;IMF=0.625;DP=16;VDB=0.999554;SGB=-0.670168;MQSB=0.0871515;MQ0F=0;AC=1;AN=1;DP4=1,5,10,0;MQ=24  GT:PL   1:175,0
Imtechella_halotolerans_length_3113269  1450920 .   G   T   141 PASS    DP=46;VDB=0.0516381;SGB=-0.692562;RPB=0.369907;MQB=0.00470369;MQSB=0.000385694;BQB=0.917285;MQ0F=0;AC=1;AN=1;DP4=17,3,7,15;MQ=12    GT:PL   1:168,0
Imtechella_halotolerans_length_3113269  1455993 .   A   T   195 PASS    DP=30;VDB=0.63518;SGB=-0.693021;RPB=1;MQB=1;MQSB=0.0374987;BQB=1;MQ0F=0;AC=1;AN=1;DP4=0,1,25,2;MQ=28    GT:PL   1:222,0
Imtechella_halotolerans_length_3113269  1633169 .   T   C   60  PASS    DP=34;VDB=0.0688269;SGB=-0.693127;MQSB=0.346427;MQ0F=0;AC=1;AN=1;DP4=0,0,5,28;MQ=9  GT:PL   1:90,0
Imtechella_halotolerans_length_3113269  2005523 .   C   T   195 PASS    DP=64;VDB=0.0037203;SGB=-0.693144;RPB=0.727216;MQB=7.61251e-07;MQSB=9.95686e-06;BQB=0.86319;MQ0F=0;AC=1;AN=1;DP4=2,19,25,14;MQ=17   GT:PL   1:222,0

If it was such a case where I was looking to find variants common across all samples I would run the command with n =6 in this instance but that isn't the case.

I'm just wondering if there's a more illuminating way to identify common variants by position across multiple samples. Any insight greatly appreciated, thanks!

bcftools vcf isec • 1.6k views
ADD COMMENT
1
Entering edit mode
2.8 years ago

Hi Linda,

To be honest, I have become so displeased with bcftools isec over the years such that I never use it anymore.

To get what you want, I would do the following:

1, Normalise (split multi-allelic calls + left-align indels) each VCF and set a unique identifier for ID field:

bcftools norm \
  -m-any \
  --check-ref w -f ref.fasta \
  UNC2FT198_Imtechella_halotolerans.fltq.vcf.gz | \
bcftools annotate \
  -x ID -I +'%CHROM:%POS:%REF:%ALT' -Oz
  > UNC2FT198_Imtechella_halotolerans.fltq.norm.vcf.gz ;
tabix -p vcf UNC2FT198_Imtechella_halotolerans.fltq.norm.vcf.gz ;

et cetera

If you genuinely don't care about the variant (base change), and just care about the position, then use

  -x ID -I +'%CHROM:%POS

2, merge all files together

bcftools merge \
  UNC2FT198_Imtechella_halotolerans.fltq.norm.vcf.gz \
  UNC2FT2114_Imtechella_halotolerans.fltq.norm.vcf.gz \
  .. .. -Oz \
  > merge.vcf.gz ;
tabix -p vcf merge.vcf.gz ;

3, produce tabular output of REF and ALT het and hom allele counts

see my code here:



Granted, there are likely other ways to do this.

Kevin

ADD COMMENT
0
Entering edit mode

Brilliant thanks! I'm new to snp calling so am glad someone has had this dilemma before.

ADD REPLY
1
Entering edit mode

Cool, let me know if it works. The code for part 3 is well-tested but is quite an eye-sore.

ADD REPLY

Login before adding your answer.

Traffic: 2901 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6