Say we take a 40x whole human genome BAM file of HiSeq reads (~100GB), call variants but do not annotate further, and create a VCF with every position called (even if that position matches the reference genome), then compress. How big will the VCF and BCF files be?
Under the assumption that each line will look similar to this one:
chr1 249250621 . A A 22 PASS 0/0
This means each line uses at most 45 bytes. Multiplied by the length of the human genome (roughly 3 billion positions), this gives a VCF file of at most about 125GB (45 x 3 x 10^9 bytes is 135 x 10^9 bytes, or about 125 GiB). The size of the header is left out of the calculation since it is insignificant compared to the rest of the file.
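The estimate above can be reproduced with a few lines of Python. This is only a rough sketch: the genome length of 3 billion positions and the tab-separated layout of the example line are assumptions, and the function name `estimate_vcf_bytes` is made up for illustration.

```python
# Back-of-the-envelope size of an uncompressed all-sites VCF.
GENOME_LENGTH = 3_000_000_000  # assumed: ~3 billion positions in the human genome
BYTES_PER_LINE = 45            # upper bound per record, as estimated above


def estimate_vcf_bytes(n_positions=GENOME_LENGTH, bytes_per_line=BYTES_PER_LINE):
    """Return the estimated uncompressed VCF body size in bytes (header ignored)."""
    return n_positions * bytes_per_line


size_bytes = estimate_vcf_bytes()
print(f"{size_bytes / 1e9:.1f} GB (decimal)")    # 135.0 GB (decimal)
print(f"{size_bytes / 2**30:.1f} GiB (binary)")  # 125.7 GiB (binary)

# The example record, assuming tab-separated fields and a trailing newline.
example = "chr1\t249250621\t.\tA\tA\t22\tPASS\t0/0\n"
print(len(example.encode("ascii")))              # 33
```

The example record itself comes to 33 bytes, so 45 bytes per line leaves some headroom for longer fields such as multi-digit quality scores.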
I don't know much about the BCF format or the effects of compression. A wild guess would be that compression reduces the size of the file to under 30GB.