Identifying private SNPs between multi-sample VCF files
3.3 years ago
nataliagru1 ▴ 90

Dear Community,

Hope all is well. I am having difficulty finding the best way to quantify private SNPs between my multi-sample VCF files. I have 110 samples in a VCF file that I generated via cohort calling with GATK. I have split this VCF so that samples from the same genus are in the same file.

So I now have 4 VCF files (populations) that I would like to compare. For each population, I would like to know the total number of SNPs that are private to it relative to the other populations.

However, when I use a command such as bcftools isec:

bcftools isec Genus1.vcf.gz Genus2.vcf.gz -p /dir/out

It outputs the expected files but does not identify shared or private sites between the multi-sample VCFs.

When I used vcf-compare:

 vcf-compare -g Genus1.vcf.gz  Genus2.vcf.gz

it only outputs the total number of SNPs. It can't discern any differences between the multi-sample VCF files.

Note: When I run these commands on VCFs that each contain only one sample, they execute correctly and output appropriate data.

Note: I have compressed my files with bgzip and indexed them with tabix.

Can anyone offer any guidance on how to quantify the total number of private SNPs in one multi-sample VCF file compared to another multi-sample VCF file?

Thank you for taking the time to read my post and for your help!

vcftools bcftools vcf-compare bcftools isec • 2.0k views

I would like to add an update. "bcftools isec" works as it should. It was unable to identify private SNPs between my multi-sample VCF files (Genus1.vcf vs. Genus2.vcf) because I had originally split these files from a VCF file that contained all species (Genusall.vcf). I split my VCF file by genus using bcftools view.

For some reason, bcftools isec cannot identify private or shared SNPs in VCF files that were split using bcftools view. bcftools isec works fine when the files are merged rather than split from a master VCF file.

Update: This issue was caused by splitting my samples out of a multi-sample VCF file. bcftools isec was unable to differentiate between "sample1.vcf, sample2.vcf, sample3.vcf etc." split from a multi.vcf file using GATK. I had to generate VCF files separately for each sample and then assess private SNPs using bcftools isec.
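
One possible explanation for this behaviour, stated as an assumption rather than something confirmed in this thread: subsetting samples out of a joint-called VCF keeps every site record, including sites where none of the retained samples carries an alternate allele, so each subset file lists the same positions and a site-level comparison sees everything as shared. The sketch below illustrates a genotype-level alternative on a made-up toy VCF using plain awk. The sample names (A1, A2 standing in for one genus; B1, B2 for another), the column numbers, and the helper name count_private are all hypothetical.

```shell
# Build a tiny, made-up multi-sample VCF (columns 10-11 = group A, 12-13 = group B).
vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA1\tA2\tB1\tB2\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t0/0\t0/0\n' >> "$vcf"
printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t0/0\t1/1\t0/1\n' >> "$vcf"
printf 'chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t0/1\t1/1\t0/1\t0/0\n' >> "$vcf"
printf 'chr1\t400\t.\tT\tC\t50\tPASS\t.\tGT\t0/0\t./.\t0/0\t0/0\n' >> "$vcf"
printf 'chr1\t500\t.\tA\tC\t50\tPASS\t.\tGT\t0/0\t0|1\t./.\t0/0\n' >> "$vcf"

# count_private FILE A_COLS B_COLS
# Classifies each site by which group of sample columns carries a
# non-reference allele. Assumes GT is the first FORMAT subfield.
count_private() {
    awk -F'\t' -v a_cols="$2" -v b_cols="$3" '
    function has_alt(cols,   n, idx, i, gt) {
        n = split(cols, idx, ",")
        for (i = 1; i <= n; i++) {
            gt = $(idx[i])
            sub(/:.*/, "", gt)            # keep only the GT subfield
            if (gt ~ /[1-9]/) return 1    # any non-reference allele index
        }
        return 0
    }
    /^#/ { next }                         # skip header lines
    {
        a = has_alt(a_cols); b = has_alt(b_cols)
        if (a && !b)      pa++            # ALT seen only in group A
        else if (b && !a) pb++            # ALT seen only in group B
        else if (a && b)  sh++            # ALT seen in both groups
        else              none++          # no sample carries ALT
    }
    END { printf "private_A=%d private_B=%d shared=%d no_alt=%d\n", pa, pb, sh, none }
    ' "$1"
}

summary="$(count_private "$vcf" "10,11" "12,13")"
echo "$summary"
rm -f "$vcf"
```

If the assumption holds for your data, dropping sites with no alternate allele in each subset before intersecting (for example with bcftools view --min-ac 1) may be worth trying; check the bcftools manual for the exact behaviour in your version.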


I also have a multi-sample VCF file, and I want to find out how many SNPs are found in each sample. If you have any idea how to do this, please help.

In your case, did you finally compare the gVCF files of each sample?
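
For the per-sample question, a minimal sketch is to count, for every sample column, the records where that sample's genotype contains a non-reference allele. The toy VCF, the sample names, and the helper name per_sample_counts below are hypothetical, and the sketch assumes the file contains only SNP records with GT as the first FORMAT subfield.

```shell
# Build a tiny, made-up multi-sample VCF with four samples.
vcf="$(mktemp)"
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA1\tA2\tB1\tB2\n' >> "$vcf"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t0/0\t0/0\n' >> "$vcf"
printf 'chr1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t0/0\t1/1\t0/1\n' >> "$vcf"
printf 'chr1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t0/1\t1/1\t0/1\t0/0\n' >> "$vcf"
printf 'chr1\t400\t.\tT\tC\t50\tPASS\t.\tGT\t0/0\t./.\t0/0\t0/0\n' >> "$vcf"
printf 'chr1\t500\t.\tA\tC\t50\tPASS\t.\tGT\t0/0\t0|1\t./.\t0/0\n' >> "$vcf"

# per_sample_counts FILE
# Prints "sample<TAB>count", where count is the number of records in which
# the sample carries at least one non-reference allele.
per_sample_counts() {
    awk -F'\t' '
    /^##/ { next }                                   # skip meta lines
    /^#CHROM/ {                                      # remember sample names
        for (i = 10; i <= NF; i++) name[i] = $i
        ncol = NF
        next
    }
    {
        for (i = 10; i <= NF; i++) {
            gt = $i
            sub(/:.*/, "", gt)                       # keep only the GT subfield
            if (gt ~ /[1-9]/) count[i]++             # non-reference allele present
        }
    }
    END { for (i = 10; i <= ncol; i++) printf "%s\t%d\n", name[i], count[i] }
    ' "$1"
}

counts="$(per_sample_counts "$vcf")"
echo "$counts"
rm -f "$vcf"
```

On a real file you would first decompress it (for example with bcftools view) and pipe it into the same awk program, restricting to SNPs beforehand if the file also contains indels.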

