I am currently working with the latest 1000 Genomes release, which comes as a single large (>60 GB) .vcf.gz file. I am having difficulty processing it the way I used to process .vcf.gz files, so I would like to split it into smaller files.
My first idea is to split it by chromosome, but I have checked the vcftools site thoroughly and haven't found a built-in way to do such a split. I know I can extract the lines for a single chromosome with vcftools, but if I query this large file once per chromosome, wouldn't it be accessed (hence read) 22 times for the 22 chromosomes I want?
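For illustration, this is the kind of per-chromosome loop I mean; as far as I can tell, each iteration would decompress and scan the entire 60+ GB file again (ALL.vcf.gz stands in for the real file name):

```bash
# One full pass over the big file per chromosome -- 22 scans in total.
for chr in {1..22}; do
    vcftools --gzvcf ALL.vcf.gz --chr ${chr} --recode --out chr${chr}
done
```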
I have a home-made Perl script that can do the split in a single pass by parsing the entire file and checking each line's contents, but I'm pretty sure it will be slow. Before I start processing the file and waiting for the results, I wanted to ask whether anyone can suggest anything more elegant.
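To make the question concrete, the single-pass idea is roughly the following awk sketch of what my Perl script does (untested; it assumes awk can keep a couple of dozen output files open at once, and the file names are placeholders):

```bash
# Read the compressed VCF once and route each record to a per-chromosome file.
zcat ALL.vcf.gz | awk '
    /^#/ { hdr = hdr $0 ORS; next }                                # buffer the header lines
    {
        out = "chr" $1 ".vcf"                                      # the CHROM column picks the output file
        if (!(out in seen)) { seen[out]; printf "%s", hdr > out }  # write the header on first use
        print > out                                                # awk keeps the file open, so this appends
    }'
```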
Simple and elegant, yet powerful. But for the A+ I would need to compress the results as I generate them, not at the end of the whole process. Would another pipe to gzip fit into that one-line command, along the lines of the sketch below?
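My guess at the on-the-fly version is this; it assumes awk will happily keep one open pipe to gzip per chromosome, and the file names are again placeholders:

```bash
# Same single pass, but each chromosome's stream is compressed as it is written.
zcat ALL.vcf.gz | awk '
    /^#/ { hdr = hdr $0 ORS; next }                                # buffer the header lines
    {
        cmd = "gzip -c > chr" $1 ".vcf.gz"                         # one persistent gzip pipe per chromosome
        if (!(cmd in seen)) { seen[cmd]; printf "%s", hdr | cmd }  # send the header down the pipe first
        print | cmd                                                # records are compressed as they arrive
    }'
```

If the split files need to stay tabix-indexable, I imagine swapping bgzip in for gzip in that command string would work the same way, though I haven't tried it.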