I have been working on a project that has given me a bit of a headache. I have made some progress, but the process is now simply too slow to be practical.
I have a merged VCF built from WGS VCFs for around 1,900 samples; the file is around 1.4 TB. My goal is to process this file with PLINK to make the binary files needed by a script I am using, CookHLA. Given the huge size of the file, PLINK will likely take an unreasonable amount of time, so I want to isolate the variants in a specific region and write them to a smaller VCF that can be processed much more quickly.
I have read about others doing similar work and saw that you can use tabix to index the file and then query it by region. I know the command I would have to use to build the index:
tabix -p vcf input.vcf.gz #create index file (.tbi)
The region file used for querying is formatted as one region per line: chr6 TAB start TAB end
My current disconnect is how to use this index to shrink the VCF, and then run the smaller file through PLINK to produce bed/bim/fam files via the following line of code:
plink2 --vcf in.vcf.gz --make-bed --out combined_subset --threads 4
Any assistance with shrinking the VCF would be appreciated, as that is the step I am having difficulty figuring out right now.
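For reference, here is the pipeline I have pieced together so far from the tabix and plink2 documentation. I am not confident about the extraction step, and the chr6 coordinates below are placeholders, not the real region I need:

```shell
#!/bin/sh
# Sketch of the extract-then-convert pipeline (placeholder coordinates).

# Region file: one region per line, tab-separated: chrom, start, end.
printf 'chr6\t28000000\t34000000\n' > region_file.bed

# tabix requires the VCF to be bgzip-compressed; -p vcf writes the
# input.vcf.gz.tbi index next to the file. The extraction step then
# uses -R (restrict output to regions listed in a file) and -h (keep
# the VCF header), and re-compresses the subset with bgzip.
if command -v tabix >/dev/null 2>&1; then
    tabix -p vcf input.vcf.gz
    tabix -h -R region_file.bed input.vcf.gz | bgzip > subset.vcf.gz

    # The subset should now be small enough for plink2 to convert quickly.
    plink2 --vcf subset.vcf.gz --make-bed --out combined_subset --threads 4
fi
```

If I understand the plink2 docs correctly, it may also be possible to skip tabix entirely and subset during conversion with variant filters like --chr 6 --from-bp and --to-bp, though I have not tested whether that is practical on a 1.4 TB input.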