Hello.
I have 708 genomic-interval GVCFs in total. I ran GATK GenotypeGVCFs and then SelectVariants on each genomic interval to create VCFs. I noticed a significant decrease in file size for one of the intervals between the GenotypeGVCFs and SelectVariants outputs, and I have tried several things to find the reason for the change.
chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz >>>>> 2.1GB
$ wc chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
20064 23261679 2204917451 chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz
chr1-4410000_chr1-8821000.SelectVariants.vcf.gz >>>>> 698MB
$ wc chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
1920129 14569274 731773426 chr1-4410000_chr1-8821000.SelectVariants.vcf.gz
Where is this file size difference coming from? What is happening in the VCF from SelectVariants?
One more thing: it is odd that the 708 unmerged VCFs total 450GB for GenotypeGVCFs and 420GB for SelectVariants. However, after merging with MergeVcfs at compression level 6, the merged VCFs for GenotypeGVCFs and SelectVariants are both 261GB. Any thoughts on this?
Well, SelectVariants is used to filter out variants, so unless you are not filtering anything with SelectVariants, I don't see where the problem is.
wc is not the right tool to get the size of a binary file; just use, for example, ls -l
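To expand on that: on a .vcf.gz, only the last number printed by wc (the byte count) is meaningful. The line and word counts are computed on the compressed bytes, so they say nothing about how many records the file holds. Here is a minimal sketch of how the two outputs could be compared instead, assuming the files are ordinary gzip/bgzip-compressed VCFs and that gzip and grep are available. The helper names vcf_bytes and vcf_records are made up for this example, not part of GATK or any other tool.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two compressed VCFs.

# Compressed size on disk in bytes (the same number ls -l shows in its size column).
vcf_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Number of variant records: decompress to stdout, then count the lines
# that are not header lines (header lines start with '#').
vcf_records() {
    gzip -dc "$1" | grep -vc '^#'
}

# File names taken from the question; files that are not present are skipped.
for f in chr1-4410000_chr1-8821000.GenotypeGVCFs.vcf.gz \
         chr1-4410000_chr1-8821000.SelectVariants.vcf.gz; do
    [ -f "$f" ] || continue
    printf '%s\t%s bytes\t%s records\n' "$f" "$(vcf_bytes "$f")" "$(vcf_records "$f")"
done
```

If the record count drops between the two files, SelectVariants removed sites. If the count is the same but the size still drops, the difference is more likely in what each record contains or in how the file was compressed.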