Question: How bcftools isec works ?
1
gravatar for sacha
5.2 years ago by
sacha1.9k
France
sacha1.9k wrote:

I don't understand pretty well how bcftools isec works ?
When I try with this command , to make the interesection between two file

bcftools isec -d output/ A.vcf.gz B.vcf.gz 
I get 3 files... instead one. 

I think I have to use -n= parameter... but it's not really clear..  Could you help me to understand with a simple example like :

A.vcf 
A
B
C
B.vcf 
B
D
E

Intersect should have :   B

 

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by sacha1.9k

Good! It's now really clear! Thanks you very much!
 

ADD REPLYlink written 5.2 years ago by sacha1.9k

Hi all, what if we have more than two vcf files to compare, say 10 vcf files ?

I appreciate your help, thank you.

ADD REPLYlink written 3.7 years ago by abedkurdi1030

For intersecting more than two vcf files, you can use -n parameter (https://samtools.github.io/bcftools/bcftools.html#isec):

bcftools isec -n=3 A.vcf.gz B.vcf.gz C.vcf.gz
ADD REPLYlink written 3.7 years ago by Evgeniia Golovina1.0k
17
gravatar for Evgeniia Golovina
5.2 years ago by
New Zealand
Evgeniia Golovina1.0k wrote:

Hi,

Ok, for example, you have two files:
1) isec.a.vcf:

##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##test=<xx=A,yy=B,zz=C>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##readme=AAAAAA
##readme=BBBBBB
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    A
1    3062915    .    GTTT    G    1806    q10    DP=35;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:409:35:-20,-5,-20
1    3062915    .    G    T    1806    q10    DP=35;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:409:35:-20,-5,-20
1    3106154    .    CAAA    C    1792    PASS    DP=32;AN=2;AC=1    GT:GQ:DP    0/1:245:32
1    3106154    .    C    T,CT    1792    PASS    DP=32;AN=2;AC=1    GT:GQ:DP    0/1:245:32
1    3157410    .    GA    G    628    q10    DP=21;AN=2;AC=2    GT:GQ:DP    1/1:21:21
1    3162006    .    GAA    G    1016    PASS    DP=22;AN=2;AC=1    GT:GQ:DP    0/1:212:22
1    3177144    .    GT    G    727    PASS    DP=30;AN=2;AC=1    GT:GQ:DP    0/1:150:30
1    3184885    .    TAAAA    TA,T    246    PASS    DP=10;AN=2;AC=1,1    GT:GQ:DP    1/2:12:10
2    3199812    .    G    GTT,GT    481    PASS    DP=26;AN=2;AC=1,1    GT:GQ:DP    1/2:322:26
3    3212016    .    CTT    C,CT    565    PASS    DP=26;AN=2;AC=1,1    GT:GQ:DP    1/2:91:26
4    3212016    .    TACACACAC    T    325    PASS    DP=31;AN=2;AC=1    GT:GQ:DP    0/1:325:31
4    3258448    .    TACACACAC    T    325    PASS    DP=31;AN=2;AC=1    GT:GQ:DP    0/1:325:31

2) isec.b.vcf:

##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q20,Description="Mapping quality below 20">
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    B
1    3062915    .    G    A    376    q20    DP=14;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:376:14:-10,0,-10
1    3062915    .    GTTT    GT    376    q20    DP=14;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:376:14:-10,0,-10
1    3106154    .    C    T    677    PASS    DP=15;AN=2;AC=1    GT:GQ:DP:GL    0/1:277:15:-10,0,-10
1    3106154    .    CAAAA    C    677    PASS    DP=15;AN=2;AC=1    GT:GQ:DP:GL    0/1:277:15:-10,0,-10
1    3157410    .    GA    G    249    PASS    DP=11;AN=2;AC=1    GT:GQ:DP    0/1:49:11
1    3162006    .    GAA    G    663    PASS    DP=19;AN=2;AC=1    GT:GQ:DP    0/1:589:19
1    3177144    .    GT    G    460    PASS    DP=24;AN=2;AC=1    GT:GQ:DP    0/1:236:24
1    3184885    .    TAAA    T    598    PASS    DP=16;AN=2;AC=1    GT:GQ:DP    0/1:435:16
2    3188209    .    GA    G    162    .    DP=15;AN=2;AC=1    GT:GQ:DP    0/1:162:15
3    3199812    .    G    GTT,GT    353    PASS    DP=19;AN=2;AC=1,1    GT:GQ:DP    1/2:188:19
4    3212016    .    CTT    C    677    q20    DP=15;AN=2;AC=1    GT:GQ:DP    0/1:158:15

1. You know, that, first of all you need to bgzip them:

bgzip isec.a.vcf > isec.a.vcf.gz

bgzip isec.b.vcf > isec.a.vcf.gz

2. You should to build indexes for them:

bcftools index isec.a.vcf.gz > isec.a.vcf.gz.csi

bcftools index isec.b.vcf.gz > isec.b.vcf.gz.csi

3. Try to run isec:

bcftools isec isec.a.vcf.gz isec.b.vcf.gz -p dir

In your dir, you shouldsee README and three output files. README tells you a little about these three files. Let's look at outputs for more details:

So, I got 0000.vcf, 0001.vcf and 0002.vcf files:

0000.vcf file:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##test=<xx=A,yy=B,zz=C>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##readme=AAAAAA
##readme=BBBBBB
##contig=<ID=1,length=2147483647>
##contig=<ID=2,length=2147483647>
##contig=<ID=3,length=2147483647>
##contig=<ID=4,length=2147483647>
##bcftools_isecVersion=1.1+htslib-1.1
##bcftools_isecCommand=isec -p /Users/EVGENIIA_GOLOVINA/Desktop/mp/ala isec.a.vcf.gz isec.b.vcf.gz
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    A
1    3062915    .    GTTT    G    1806    q10    DP=35;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:409:35:-20,-5,-20
1    3062915    .    G    T    1806    q10    DP=35;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:409:35:-20,-5,-20
1    3106154    .    CAAA    C    1792    PASS    DP=32;AN=2;AC=1    GT:GQ:DP    0/1:245:32
1    3106154    .    C    T,CT    1792    PASS    DP=32;AN=2;AC=1    GT:GQ:DP    0/1:245:32
1    3184885    .    TAAAA    TA,T    246    PASS    DP=10;AN=2;AC=1,1    GT:GQ:DP    1/2:12:10
2    3199812    .    G    GTT,GT    481    PASS    DP=26;AN=2;AC=1,1    GT:GQ:DP    1/2:322:26
3    3212016    .    CTT    C,CT    565    PASS    DP=26;AN=2;AC=1,1    GT:GQ:DP    1/2:91:26
4    3212016    .    TACACACAC    T    325    PASS    DP=31;AN=2;AC=1    GT:GQ:DP    0/1:325:31
4    3258448    .    TACACACAC    T    325    PASS    DP=31;AN=2;AC=1    GT:GQ:DP    0/1:325:31

0001.vcf file:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q20,Description="Mapping quality below 20">
##contig=<ID=1,assembly=b37,length=249250621>
##contig=<ID=2,assembly=b37,length=243199373>
##contig=<ID=3,assembly=b37,length=198022430>
##contig=<ID=4,assembly=b37,length=191154276>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_isecVersion=1.1+htslib-1.1
##bcftools_isecCommand=isec -p /Users/EVGENIIA_GOLOVINA/Desktop/mp/ala isec.a.vcf.gz isec.b.vcf.gz
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    B
1    3062915    .    G    A    376    q20    DP=14;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:376:14:-10,0,-10
1    3062915    .    GTTT    GT    376    q20    DP=14;DP4=1,2,3,4;AN=2;AC=1    GT:GQ:DP:GL    0/1:376:14:-10,0,-10
1    3106154    .    C    T    677    PASS    DP=15;AN=2;AC=1    GT:GQ:DP:GL    0/1:277:15:-10,0,-10
1    3106154    .    CAAAA    C    677    PASS    DP=15;AN=2;AC=1    GT:GQ:DP:GL    0/1:277:15:-10,0,-10
1    3184885    .    TAAA    T    598    PASS    DP=16;AN=2;AC=1    GT:GQ:DP    0/1:435:16
2    3188209    .    GA    G    162    .    DP=15;AN=2;AC=1    GT:GQ:DP    0/1:162:15
3    3199812    .    G    GTT,GT    353    PASS    DP=19;AN=2;AC=1,1    GT:GQ:DP    1/2:188:19
4    3212016    .    CTT    C    677    q20    DP=15;AN=2;AC=1    GT:GQ:DP    0/1:158:15

0002.vcf file:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FILTER=<ID=q10,Description="Quality below 10">
##test=<xx=A,yy=B,zz=C>
##reference=file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##readme=AAAAAA
##readme=BBBBBB
##contig=<ID=1,length=2147483647>
##contig=<ID=2,length=2147483647>
##contig=<ID=3,length=2147483647>
##contig=<ID=4,length=2147483647>
##bcftools_isecVersion=1.1+htslib-1.1
##bcftools_isecCommand=isec -p /Users/EVGENIIA_GOLOVINA/Desktop/mp/ala isec.a.vcf.gz isec.b.vcf.gz
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    A
1    3157410    .    GA    G    628    q10    DP=21;AN=2;AC=2    GT:GQ:DP    1/1:21:21
1    3162006    .    GAA    G    1016    PASS    DP=22;AN=2;AC=1    GT:GQ:DP    0/1:212:22
1    3177144    .    GT    G    727    PASS    DP=30;AN=2;AC=1    GT:GQ:DP    0/1:150:30

So, if you look at output files, you'd see that the first file contains variants that unique for the first input vcf file (isec.a.vcf). The second output contains variants unique for the second input file (isec.b.vcf). The third file contains variants intersected for both first and second input files (isec.a.vcf and isec.b.vcf).

Hope, it helps. Sorry for the long text ;)

 
ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Evgeniia Golovina1.0k
3
gravatar for iraun
5.2 years ago by
iraun3.7k
Norway
iraun3.7k wrote:

I've never use bcftools isec for intersecting two or more vcf files but, have you tried something like this?

bcftools isec -p dir -n=2 A.vcf.gz B.vcf.gz

Using -n=2 argument, in principle, the output will be the common variants for the two input files. I guess you have already read this documentation about bcftools (https://samtools.github.io/bcftools/bcftools.html#isec), but just in case that is the link.

If you are not able to get your desired output, you can use vcf-isec (I usually work with this one). It is quite similar to bcftools. In this case the command should be:

vcf-isec -n =2 A.vcf.gz B.vcf.gz | bgzip -c > out.vcf.gz

Hope it helps.

 

 

ADD COMMENTlink written 5.2 years ago by iraun3.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1791 users visited in the last hour