Does a merged VCF file (from bcftools merge) contain the unique SNPs or INDELs from all the merged VCF files?
16 months ago
mohsamir2016 ▴ 30

I have 6 VCF files that contain SNPs only, produced by GATK. Each VCF represents one individual animal from breed X, so they are biological replicates. I also have another 6 files from breed Y.

I have merged them using bcftools merge:

bcftools merge -Oz L10A2_SNP.vcf.gz L10A_SNP.vcf.gz L10B_SNP.vcf.gz L10C_SNP.vcf.gz L10D_SNP.vcf.gz L10E_SNP.vcf.gz -o merged_L10_SNPs.vcf.gz --threads 16

When I checked the number of SNPs in the 6 files using

bcftools view -v snps RossA2_SNP.vcf.gz | grep -v -c '^#' 
bcftools view -v snps RossA_SNP.vcf.gz | grep -v -c '^#' 
bcftools view -v snps RossB_SNP.vcf.gz | grep -v -c '^#' 
bcftools view -v snps RossC_SNP.vcf.gz | grep -v -c '^#'
bcftools view -v snps RossD_SNP.vcf.gz | grep -v -c '^#' 
bcftools view -v snps RossE_SNP.vcf.gz | grep -v -c '^#'

The numbers of SNPs were:

RossA2_SNP.vcf.gz: 221337
RossA_SNP.vcf.gz: 225504
RossB_SNP.vcf.gz: 280209
RossC_SNP.vcf.gz: 426710
RossD_SNP.vcf.gz: 271401
RossE_SNP.vcf.gz: 306445

and for the merged file

bcftools view -v snps merged_Ross_SNP.vcf.gz | grep -v -c '^#'

The number is: 715116

The merged count is smaller than the sum of the six individual counts (1,731,606), so I assume the merged file contains unique records that are common to the 6 VCF files. Is that right?

First question: Does the merged file contain non-duplicate SNPs from the 6 files?

Second question: If I use bcftools isec, can I use the merged VCF files from breed X and breed Y to get the intersection, with each merged file representing all 6 replicates within its breed?

Thanks

16 months ago
barslmn ★ 2.1k

Does the merged file contain non-duplicate SNPs from the 6 files?

Bcftools offers control over the merge behavior with the -m/--merge parameter. By default it uses the both option, which allows both SNP records and indel records to be merged into multiallelic records.
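To see why a merged count can be lower than the sum of the per-file counts, here is a small sketch using only standard shell tools. The two site lists are made up for illustration (they are not taken from the files in the question), and the sketch treats a site as the same when CHROM, POS, REF and ALT are identical, which is a simplification of what bcftools merge does with multiallelic records.

```shell
#!/bin/sh
# Toy illustration with made-up sites (CHROM, POS, REF, ALT); not real data.
tmp=$(mktemp -d)
printf 'chr1\t100\tA\tG\nchr1\t200\tC\tT\nchr1\t300\tG\tA\n' > "$tmp/a.tsv"
printf 'chr1\t100\tA\tG\nchr1\t300\tG\tA\nchr1\t400\tT\tC\n' > "$tmp/b.tsv"

# Sum of per-file counts: a site shared by both files is counted twice.
sum_sites=$(cat "$tmp/a.tsv" "$tmp/b.tsv" | wc -l | tr -d ' ')

# Union of sites: each distinct site is counted once, as in a merged file.
union_sites=$(cat "$tmp/a.tsv" "$tmp/b.tsv" | sort -u | wc -l | tr -d ' ')

# Sites present in both files (the intersection).
shared_sites=$(sort "$tmp/a.tsv" "$tmp/b.tsv" | uniq -d | wc -l | tr -d ' ')

echo "sum=$sum_sites union=$union_sites shared=$shared_sites"
rm -rf "$tmp"
```

The merged file corresponds to the union, not to the intersection: a site carried by only one animal is still kept, and the animals that lack it get a missing genotype at that site.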

Second question: If I use bcftools isec, can I use the merged VCF files from breed X and breed Y to get the intersection, with each merged file representing all 6 replicates within its breed?

I think isec works with separate VCF files. If you want to compare X and Y as a whole, merge each group and compare the two merged files; if you want to compare every individual, supply each sample's file separately.
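A minimal sketch of the breed-level comparison, assuming each breed has already been merged into one bgzipped VCF. The function name and the file and directory names in the usage comment are hypothetical; the bcftools options used (index -f, isec -p, -n, -w) are documented ones.

```shell
#!/bin/sh
# Hypothetical helper: sites shared by the merged breed-X and breed-Y files.
shared_sites_between_breeds() {
    x_vcf=$1
    y_vcf=$2
    out_dir=$3
    # isec needs bgzipped, indexed inputs; -f overwrites an existing index.
    bcftools index -f "$x_vcf"
    bcftools index -f "$y_vcf"
    # -n=2 keeps sites present in both files; -w1 writes the records
    # taken from the first file into the output directory.
    bcftools isec -p "$out_dir" -n=2 -w1 "$x_vcf" "$y_vcf"
}

# Usage (hypothetical file names):
# shared_sites_between_breeds merged_X.vcf.gz merged_Y.vcf.gz isec_X_vs_Y
```

Note that a merged file holds every site seen in at least one replicate. To keep only sites present in all six replicates of a breed, the same -n option can be applied to the six per-animal files with -n=6.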

Also, you might want to look into the norm command.

https://samtools.github.io/bcftools/bcftools.html#merge

bcftools view -v snps merged_Ross_SNP.vcf.gz | grep -v -c '^#'

Another thing: you can run the view command with -H to suppress the header, so the grep step is not needed.
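For example, the count could be wrapped like this. The function name is made up for illustration; the file names in the usage comment are the ones from the question.

```shell
#!/bin/sh
# Count SNP records in one VCF; -H drops the header, so every output
# line is a record and wc -l gives the count directly.
count_snps() {
    bcftools view -H -v snps "$1" | wc -l | tr -d ' '
}

# Usage:
# for f in RossA2 RossA RossB RossC RossD RossE; do
#     printf '%s\t%s\n' "$f" "$(count_snps "${f}_SNP.vcf.gz")"
# done
```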

My general advice would be to read the docs, and test out with smaller files to see how all these work.

16 months ago
mohsamir2016 ▴ 30

Thanks a lot. Regarding this part: Does the merged file contain non-duplicate SNPs from the 6 files?

Bcftools offers control over the merge behavior with the -m/--merge parameter. By default it uses the both option, which allows both SNP records and indel records to be merged into multiallelic records.

I think you got me wrong. My question is not about the merge style, but about the number of SNPs I obtain. With the merged file I get fewer SNPs than the sum of the counts from the 6 individual VCF files. So I assumed that the merged file contains the unique SNPs rather than a simple concatenation of all of them. Any idea?


You can look into your merged file and individual files to see what's happening.
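One way to do that, as a rough sketch: tabulate how many sites in the merged file have a called genotype in 1, 2, ... N samples. The function name is hypothetical, and the sketch assumes missing genotypes are written as ./. , .|. or a single dot.

```shell
#!/bin/sh
# For a merged VCF, print "<samples with a call><TAB><number of sites>".
sites_by_sample_count() {
    # One line per site: CHROM, POS, then one GT column per sample.
    bcftools query -f '%CHROM\t%POS[\t%GT]\n' "$1" |
    awk -F'\t' '{
        n = 0
        # Count samples whose genotype is not missing (assumed encodings).
        for (i = 3; i <= NF; i++)
            if ($i != "./." && $i != ".|." && $i != ".") n++
        count[n]++
    }
    END { for (n in count) print n "\t" count[n] }' | sort -n
}

# Usage:
# sites_by_sample_count merged_Ross_SNP.vcf.gz
```

If the merged file is a union of sites, most rows will report fewer than six samples with a call, and the site counts across all rows add up to the merged total.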
