Obtain one vcf file of shared SNPs from input files with different samples using vcf-isec (vcftools)
5.2 years ago
weedy23 ▴ 70

I am new to Linux and programming and am trying to use vcftools. I have 3 vcf files; each one is a different population (i.e. with no shared individuals between the files). I am trying to use vcf-isec to merge the 3 files and end up with one vcf file that contains only the SNPs that are present in all 3 files. I have tried the following code:

vcf-isec -n =3 file1.vcf.gz file2.vcf.gz file3.vcf.gz -f -c > CombinedPops.vcf

and without -c :

vcf-isec -n =3 file1.vcf.gz file2.vcf.gz file3.vcf.gz -f > CombinedPops.vcf

but I keep ending up with one file with only the individuals from the first input file. It also gives me a warning that "the number of sample columns is different", but I read in another post that -f forces vcf-isec to output the file regardless. Could this warning be why I can't get a file with ALL the individuals listed? Can vcf-isec even do this?

Although I have read the vcf-isec documentation, I am still not sure exactly what the difference between the -c and -o options is, which may be part of my problem.

Any help is greatly appreciated!

vcftools vcf • 5.0k views
5.2 years ago
venu 6.8k
vcf-isec -n +3 A.vcf.gz B.vcf.gz C.vcf.gz | bgzip -c > out.vcf.gz


This gives a VCF file containing the variants present in all the input VCF files (shared by all 3 files). The -f flag should be included to force the program to continue despite the differing-column warnings. If, on the other hand, you want to merge the 3 VCF files into a single VCF file, use vcf-merge.

I don't think this vcftools program can separate SNPs and indels, but you can use vcf-annotate:

zcat file.vcf.gz | vcf-annotate --fill-type | bgzip -c > out.vcf.gz


This adds a variant TYPE field to the INFO column of your VCF file. You can then create a new VCF file containing only the SNPs. Finally, I can't find any -o flag for these programs.


Hi venu, thanks for your help. The vcf-isec code you wrote is basically what I did, but I just specified exactly 3 files rather than 3 or more, and an uncompressed output instead. However, this gave me a file with only the individuals in the first file in it (although it did give me the loci found only in all three files). I have looked at vcf-merge but I think this produces a file with ALL loci? I only want the ones common to all the specified files. I don't have indels, my data is simple SNP data, but I will look further at vcf-annotate. The -o flag is mentioned here: http://vcftools.sourceforge.net/perl_module.html#vcf-isec, under "Read More".


My bad, I edited my answer. So you need the SNPs shared by all 3 files (same chr#, position, base change, etc.), but not in the way vcf-isec does it?


Yep exactly.


Hi weedy23, did you finally manage to solve the issue? I am having the same problem.


Hi, sorry I only just saw your comment. I ended up using vcf-merge instead. However, it included ALL loci in the output file, not just loci present in all the files. So I had to go through the output file and delete all the loci that were missing for one or more of the populations. Not ideal but it didn't take too long in the end if you sort the file. Good luck!
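The manual clean-up described above could probably be scripted. Below is a minimal sketch, not a tested recipe for any particular dataset: keep_complete_sites is a hypothetical helper name, and it assumes that the merged file marks a locus absent from one population with a missing genotype (a GT subfield beginning with ".") in that population's sample columns, and that GT is the first subfield of each sample column. Note that it would also drop loci where an individual simply had a missing call in its original file, so it is stricter than "present in all three files".

```shell
# Hypothetical helper: read a VCF on stdin, print the header lines plus
# only those records in which every sample column has a called genotype.
keep_complete_sites() {
  awk -F'\t' '
    /^#/ { print; next }                  # pass header lines through unchanged
    {
      for (i = 10; i <= NF; i++) {        # sample columns start at field 10
        split($i, sub_fields, ":")        # GT is assumed to be the first subfield
        if (sub_fields[1] ~ /^\./) next   # missing genotype: skip this record
      }
      print                               # all samples called: keep the record
    }
  '
}

# Tiny made-up example: position 100 is called in both samples,
# position 200 is missing in the second sample and is dropped.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t1/1\n1\t200\t.\tC\tT\t.\t.\t.\tGT\t0/1\t.\n' \
  | keep_complete_sites
```

With real data this would be used on the merged output, e.g. `vcf-merge file1.vcf.gz file2.vcf.gz file3.vcf.gz | keep_complete_sites > CombinedPops.vcf` (vcf-merge expects bgzipped, tabix-indexed inputs). It is worth checking a few records by eye first to confirm how missing genotypes are written in your merged file.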


How can I extract the variants specific to A? Is the following command correct?

vcf-isec -c A.vcf.gz B.vcf.gz C.vcf.gz > specific_for_A.vcf