Extract subset of samples from multigenome vcf file
8.0 years ago
MAPK ★ 2.1k

I have a multi-sample (multigenome) VCF file. Suppose the file has samples A to Z, but I only want samples B to G, written out as a smaller VCF file. How can I create such a subset VCF file?

vcf • 40k views

@Jorge Amigo's answer in this thread is recent and may help: How To Split Multiple Samples In Vcf File Generated By Gatk?


@genomax2 Thanks, but this only explains how to extract one sample per file. Is there a way to provide a list of the samples I want (for example, samples B, C, D, E, F and G) and get a single subset file containing only those samples?


I don't know how to do it directly in VCF format, but you can convert to PLINK format (plink --double-id --vcf your.vcf --recode --make-bed --out your_plink), then pick the individuals you want from the generated .fam file and extract them with (plink --bfile your_plink --keep list_of_individuals --recode --out your_output). You can then convert back to VCF if you wish.
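
A minimal sketch of the PLINK round trip described above, assuming PLINK 1.9 is installed as `plink`. The file names (your.vcf, your_plink, keep.txt, subset) are placeholders. Because --double-id sets both the family ID and the individual ID to the VCF sample name, the --keep file lists each wanted sample name twice per line. Using `--recode vcf-iid` to write a VCF with the original sample names is my choice here, not part of the comment above, so check it against your PLINK version.

```shell
# Wanted samples (B..G from the question); keep.txt is a hypothetical file name.
# --keep expects two columns: family ID and individual ID.
# With --double-id both are the VCF sample name, so each name is written twice.
for s in B C D E F G; do
    printf '%s\t%s\n' "$s" "$s"
done > keep.txt

# Run the conversion only if plink and the input file are actually present.
if command -v plink >/dev/null 2>&1 && [ -f your.vcf ]; then
    plink --vcf your.vcf --double-id --make-bed --out your_plink
    plink --bfile your_plink --keep keep.txt --recode vcf-iid --out subset
fi
```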

8.0 years ago

from https://samtools.github.io/bcftools/bcftools.html#view

bcftools view -s samplelist

or

bcftools view -S samplefile

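would do the job (-s takes a comma-separated list of sample names, -S takes a file with one sample name per line). The docs are your friends ;)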
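
A short sketch of both forms for the samples in the question. The file names (samples.txt, all_samples.vcf.gz, subset.vcf.gz) are made up for illustration, and the bcftools call only runs if the tool and input file exist.

```shell
# One sample name per line, for use with -S (hypothetical file name).
printf '%s\n' B C D E F G > samples.txt

# The same names as a comma-separated list, for use with -s.
sample_list=$(paste -sd, samples.txt)
echo "$sample_list"

if command -v bcftools >/dev/null 2>&1 && [ -f all_samples.vcf.gz ]; then
    # File-based selection; -Oz writes bgzip-compressed VCF to the -o path.
    bcftools view -S samples.txt -Oz -o subset.vcf.gz all_samples.vcf.gz
    # List-based equivalent:
    # bcftools view -s "$sample_list" -Oz -o subset.vcf.gz all_samples.vcf.gz
fi
```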

8.0 years ago
MAPK ★ 2.1k

I wrote this bash loop to iterate over the input files (one per chromosome, or any set of VCF files) and used the vcf-subset tool to extract the subset from each. Here, sample.txt lists the wanted samples, one per line. With this method there is no need to bgzip or tabix the parent VCF files, but it is a bit slower.

for i in /path/dir/*.vcf; do
    vcf-subset -c sample.txt "$i" | bgzip  -c > /get/inthis/dir/output_"${i##*/}"_.vcf.gz
done
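
One thing to note about the loop above: `${i##*/}` keeps the `.vcf` extension, so an input named chr1.vcf ends up as output_chr1.vcf_.vcf.gz. A small sketch of one way to strip the extension first, using only shell parameter expansion; the path is a made-up example.

```shell
i=/path/dir/chr1.vcf          # hypothetical input path

name=${i##*/}                 # drop the directory part -> chr1.vcf
stem=${name%.vcf}             # drop the .vcf suffix    -> chr1

original_out="output_${name}_.vcf.gz"   # name produced by the loop above
tidy_out="output_${stem}.vcf.gz"        # name without the repeated extension

echo "$original_out"          # output_chr1.vcf_.vcf.gz
echo "$tidy_out"              # output_chr1.vcf.gz
```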

bcftools is faster than vcftools

for file in /path/dir/*.vcf; do
    # name the output after the loop variable ($file), not an undefined $i
    bcftools view -Oz -S sample.txt "$file" > /get/inthis/dir/output_"${file##*/}"_.vcf.gz
done

Although it is 10 times faster, the "problem" with bcftools is that it needs the variants VCF file to be bgzip-compressed and tabix-indexed. Before using the code above, you should do the following:

bgzip ALLsamples.vcf
tabix -p vcf ALLsamples.vcf.gz

true, although it shouldn't be a problem

for file in /path/dir/*.vcf; do
    # bgzip replaces $file with $file.gz; tabix then indexes the compressed file
    bgzip "$file"; tabix -p vcf "$file".gz
    bcftools view -Oz -S sample.txt "$file".gz > /get/inthis/dir/output_"${file##*/}"_.vcf.gz
done
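
To check that a subset file contains the samples you expect, the sample names can be read from the #CHROM header line, where they follow the nine fixed VCF columns. A sketch on a tiny made-up VCF (demo.vcf is hypothetical); for a bgzipped file, `bcftools query -l file.vcf.gz` should print the same list.

```shell
# Write a tiny, made-up VCF with three samples (tab-separated columns).
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tB\tC\tD\n'
    printf '1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t0/0\t1/1\n'
} > demo.vcf

# Columns 10 onward of the #CHROM line are the sample names.
samples=$(grep '^#CHROM' demo.vcf | head -n 1 | cut -f 10- | tr '\t' '\n')
echo "$samples"
```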
8.0 years ago

GATK SelectVariants (https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php) with the option

--exclude_sample_file (file)

or

--sample_file (file)

What would be the equivalent option in current GATK (4.1.2 or later)? Is --sample-name the option for extracting the samples on a wanted list?
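
For what it's worth, a sketch of how this could look in GATK4, where SelectVariants takes --sample-name (short form -sn) once per sample. The file names are placeholders, and the loop assumes sample names contain no whitespace. Check `gatk SelectVariants --help` for your version, including whether it accepts a file of names directly.

```shell
printf '%s\n' B C D E F G > samples.txt   # hypothetical sample list, one per line

# Build one --sample-name argument per non-empty line of the list.
sn_args=""
while IFS= read -r s; do
    if [ -n "$s" ]; then
        sn_args="${sn_args:+$sn_args }--sample-name $s"
    fi
done < samples.txt
echo "$sn_args"

if command -v gatk >/dev/null 2>&1 && [ -f all_samples.vcf.gz ]; then
    # $sn_args is left unquoted on purpose so it splits into separate arguments.
    gatk SelectVariants -V all_samples.vcf.gz $sn_args -O subset.vcf.gz
fi
```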
