Hello All,
I have seven VCF files generated using different variant callers. The files are very large, approximately 50 GB each, because each VCF file contains SNP information for 50 samples. I want to show which SNPs are common to all of these variant callers, but because of the file sizes I am not able to use the UpSetR plotting package. Another goal is to use the files for DAPC in R, where we do not need the SNP POS information. With this goal in mind, and to reduce the file sizes, I filtered out missing SNP information, redundant SNPs and heterozygous SNPs using awk. The file sizes became manageable, but the files are no longer VCF. They are now plain text files, as shown below:
Sample1 Sample2 Sample3-------Sample50
G G G------G
A T A------C
Now I am able to use the files for DAPC, but I am not able to show the SNPs common to all files using any plotting software. I would appreciate any suggestions on how to proceed. Thank you!
Did you use standard tools such as bcftools (isec/merge), RTG Tools (vcfeval), or vcftools to intersect the VCFs? @evelyn
I did joint variant calling on all 50 sorted BAM files together, so I got a single VCF file with all the information and there was no need to merge individual VCF files.
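If the goal is only to count how many sites each combination of callers shares, the sample columns are not needed at all. Below is a minimal Python sketch, not a tested pipeline, that streams each original VCF and keeps only a (CHROM, POS, REF, ALT) key per record. It has to start from the original VCFs, because the awk-reduced text files no longer carry CHROM/POS and so cannot be matched across callers. The function names `variant_keys` and `membership_counts` and the caller labels are made up for illustration, and the sketch assumes that the callers report variants in a comparable (normalized) form and that the site keys fit in memory.

```python
import gzip
from collections import Counter


def variant_keys(path):
    """Yield (CHROM, POS, REF, ALT) for each record of a VCF, skipping header lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            # Only the first five columns are split off; the 50 sample
            # columns stay in the unsplit remainder and are ignored.
            chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
            yield (chrom, int(pos), ref, alt)


def membership_counts(vcf_paths):
    """Count sites per exact combination of callers.

    vcf_paths: dict mapping a caller label (hypothetical names) to a VCF path.
    Returns a Counter keyed by a tuple of caller labels.
    """
    names = list(vcf_paths)
    seen = {}  # site key -> bitmask of the callers that reported it
    for bit, name in enumerate(names):
        for key in variant_keys(vcf_paths[name]):
            seen[key] = seen.get(key, 0) | (1 << bit)

    counts = Counter()
    for mask in seen.values():
        combo = tuple(n for i, n in enumerate(names) if mask & (1 << i))
        counts[combo] += 1
    return counts
```

A call such as `membership_counts({"callerA": "a.vcf.gz", "callerB": "b.vcf.gz"})` (placeholder paths) returns the number of sites found by exactly each combination of callers. That small table of counts is much easier to plot than the full files. I believe UpSetR's `fromExpression` accepts counts named like `callerA&callerB`, but please check its documentation.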