Isolate a Region in a VCF File to Make a Smaller VCF File
16 months ago
J ▴ 20

I have been working on a project that has caused a bit of a headache for me. While I have made some progress, now the process is simply too slow to be reasonable.

I have a merged VCF file built from the WGS VCF files of around 1900 samples. The file is around 1.4 TB. My goal is to use PLINK to process this file and make the binaries needed by a script that I am using, CookHLA. Given the size of the file, PLINK will likely take an unreasonable amount of time. I want to isolate the variants in a specific region of the VCF to make a smaller VCF that can be processed more quickly.

I have read into others trying to do similar work and saw that you can use tabix to index based on region. I know the command that I would have to use:

tabix -r region_file input.vcf.gz #create index file

The region_file is formatted as: chr6 TAB start TAB end
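
For illustration, a region file in that layout could be written as below. The coordinates are placeholder values, not taken from the real data, and the chromosome name has to match the naming used in the VCF (chr6 versus 6). As far as I understand the tabix documentation, a file without a .bed extension is read as 1-based, inclusive positions.

```shell
# Placeholder coordinates on chr6 (hypothetical; substitute the real interval).
# Columns are separated by literal TAB characters: chromosome, start, end.
printf 'chr6\t%s\t%s\n' 28000000 34000000 > region_file

# Show the file to confirm the three-column layout.
cat region_file
```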

My current disconnect is how to use this index file to shrink the VCF file and then run the smaller file through PLINK to produce bed/bim/fam files via the following line of code:

plink2 --vcf in.vcf.gz --make-bed --out combined_subset --threads 4

I would appreciate help with shrinking the VCF file, as that is the step I am having difficulty with right now.

plink vcftools tabix • 716 views

I'll try this too, thanks


Index the VCF:

bcftools index input.vcf.gz

Then query the indexed VCF:

bcftools view -O z -o subset.vcf.gz --regions-file intervals.bed input.vcf.gz 
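
Putting the two steps together with the plink2 call from the question, a minimal sketch might look like the following. The file names and the output prefix are placeholders, and the `run` helper and `DRY_RUN` variable are hypothetical conveniences added here so the commands are only printed until you set `DRY_RUN=0`. Note that bcftools treats a regions file ending in .bed as 0-based, half-open intervals, while other tab-delimited files are read as 1-based, inclusive.

```shell
#!/bin/sh
# Sketch only: file names are placeholders. DRY_RUN defaults to 1, so the
# commands are printed rather than executed until DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Usage: subset_region IN_VCF REGIONS OUT_VCF PLINK_PREFIX
subset_region() {
    in_vcf=$1; regions=$2; out_vcf=$3; prefix=$4
    # 1. Index the bgzipped input so the regions can be looked up directly.
    run bcftools index "$in_vcf"
    # 2. Write only the records in the listed regions to a new bgzipped VCF.
    run bcftools view --regions-file "$regions" -O z -o "$out_vcf" "$in_vcf"
    # 3. Convert the much smaller subset to bed/bim/fam.
    run plink2 --vcf "$out_vcf" --make-bed --out "$prefix" --threads 4
}

subset_region input.vcf.gz intervals.bed subset.vcf.gz combined_subset
```

Running it as is only echoes the three commands, which is a cheap way to check the paths before committing to the indexing step on a 1.4 TB file.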

I'll try this out, thanks!

