I have been working on a project that has given me a bit of a headache. I have made some progress, but the process is now simply too slow to be practical.
I have a merged VCF built from WGS VCFs for around 1,900 samples; the file is around 1.4 TB. My goal is to process this file with PLINK to make the binary files needed by a script I am using, CookHLA. Given the huge size of the file, PLINK will likely take an unreasonable amount of time, so I want to isolate the variants in a specific region and write them to a smaller VCF that can be processed much more quickly.
I have read about others doing similar work and saw that you can use tabix to index the file and then query it by region. I know the command I would have to use to build the index:
tabix -p vcf input.vcf.gz #create index file (.tbi)
The region file used for querying is formatted as one region per line: chr6 TAB start TAB end
My current disconnect is how to use this index to shrink the VCF, and then run the smaller file through PLINK to produce bed/bim/fam files via the following line of code:
plink2 --vcf in.vcf.gz --make-bed --out combined_subset --threads 4
Any assistance with shrinking the VCF would be appreciated, as that is the step I am having difficulty figuring out right now.
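For reference, here is the pipeline I have pieced together so far from the tabix and plink2 documentation. I am not confident about the extraction step, and the chr6 coordinates below are placeholders, not the real region I need:

```shell
#!/bin/sh
# Sketch of the extract-then-convert pipeline (placeholder coordinates).

# Region file: one region per line, tab-separated: chrom, start, end.
printf 'chr6\t28000000\t34000000\n' > region_file.bed

# tabix requires the VCF to be bgzip-compressed; -p vcf writes the
# input.vcf.gz.tbi index next to the file. The extraction step then
# uses -R (restrict output to regions listed in a file) and -h (keep
# the VCF header), and re-compresses the subset with bgzip.
if command -v tabix >/dev/null 2>&1; then
    tabix -p vcf input.vcf.gz
    tabix -h -R region_file.bed input.vcf.gz | bgzip > subset.vcf.gz

    # The subset should now be small enough for plink2 to convert quickly.
    plink2 --vcf subset.vcf.gz --make-bed --out combined_subset --threads 4
fi
```

If I understand the plink2 docs correctly, it may also be possible to skip tabix entirely and subset during conversion with variant filters like --chr 6 --from-bp and --to-bp, though I have not tested whether that is practical on a 1.4 TB input.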