Question: Extract subset of samples from multigenome vcf file
2
3.0 years ago by
MAPK1.4k
United States
MAPK1.4k wrote:

I have a multigenome VCF file. Suppose the file has samples A to Z, but I want to extract only samples B to G into a smaller VCF file. How can I create such a subset VCF file?

vcf • 11k views
modified 3.0 years ago • written 3.0 years ago by MAPK (1.4k)

@Jorge Amigo's answer in this thread would be recent: How To Split Multiple Samples In Vcf File Generated By Gatk?

written 3.0 years ago by genomax (64k)

@genomax2 Thanks, but that only explains how to extract one sample per file. Is there a way to supply the list of samples I want to extract (for example, samples B, C, D, E, F and G) and get a single subset file containing only these samples?

written 3.0 years ago by MAPK (1.4k)

I don't know how to do it in VCF format, but you can convert to PLINK format (plink --double-id --vcf your.vcf --recode --make-bed --out your_output), then select the individuals you want from the generated .fam file and extract them with (plink --bfile your_plink --keep list_of_individuals --recode --out your_output). Then you can convert back to VCF if you wish.

written 3.0 years ago by stolarek.ir (580)
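
A minimal sketch of the PLINK route described above, assuming PLINK 1.9 is installed. The sample IDs B to G come from the question; `your.vcf`, `your_plink`, and `your_subset` are placeholder names. The file passed to `--keep` needs two columns per line (family ID and individual ID); after importing with `--double-id`, both columns hold the VCF sample name.

```shell
#!/usr/bin/env bash
# Sketch only: file names and sample IDs are placeholders, not taken from a real dataset.
set -euo pipefail

keep_file="list_of_individuals"

# --keep expects "FID IID" on each line.
# With --double-id at import, both IDs equal the VCF sample name.
printf '%s\n' B C D E F G | awk '{ print $1, $1 }' > "$keep_file"
cat "$keep_file"

# Run PLINK only if it is installed and the placeholder input exists.
if command -v plink >/dev/null 2>&1 && [ -f your.vcf ]; then
    # Import the VCF into PLINK binary format (.bed/.bim/.fam).
    plink --double-id --vcf your.vcf --make-bed --out your_plink
    # Keep the listed individuals and write the result back out as VCF.
    plink --bfile your_plink --keep "$keep_file" --recode vcf --out your_subset
fi
```

The second PLINK call uses `--recode vcf`, so the subset is written directly as `your_subset.vcf` without a separate conversion step.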
6
3.0 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

From https://samtools.github.io/bcftools/bcftools.html#view

bcftools view -s samplelist

(a comma-separated list of sample names) or

bcftools view -S samplefile

(a file with one sample name per line) would do the job. Docs are your friends ;)

modified 3.0 years ago • written 3.0 years ago by Jorge Amigo (11k)
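
For the B to G case in the question, the two bcftools options could be used roughly as follows. This is a sketch, not a tested pipeline: `all_samples.vcf.gz` and the output names are placeholders, and the sample names are assumed to match the VCF header exactly.

```shell
#!/usr/bin/env bash
# Sketch only: input and output file names are placeholders.
set -euo pipefail

input="all_samples.vcf.gz"
samples_file="samples.txt"

# One sample name per line, the format read by -S.
printf '%s\n' B C D E F G > "$samples_file"

# The same names joined with commas, the format taken by -s.
sample_list=$(paste -sd, "$samples_file")
echo "$sample_list"

# Run bcftools only if it is installed and the placeholder input exists.
if command -v bcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    # -S: read the sample names from a file; -Oz: write bgzip-compressed VCF.
    bcftools view -S "$samples_file" -Oz -o subset_from_file.vcf.gz "$input"
    # -s: give the sample names directly on the command line.
    bcftools view -s "$sample_list" -Oz -o subset_from_list.vcf.gz "$input"
fi
```

Both calls should produce the same subset; the file form is usually more convenient once the list grows beyond a handful of samples.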
4
3.0 years ago by
MAPK1.4k
United States
MAPK1.4k wrote:

I wrote this bash loop to iterate over the files (one per chromosome, or any set of VCF files) and used the vcf-subset tool to extract the subset. Here, sample.txt lists the samples, one per line. With this method there is no need to bgzip or tabix-index the parent VCF files, but it is a bit slower.

for i in /path/dir/*.vcf; do
    vcf-subset -c sample.txt "$i" | bgzip  -c > /get/inthis/dir/output_"${i##*/}"_.vcf.gz
done
modified 3.0 years ago • written 3.0 years ago by MAPK (1.4k)
1

bcftools is faster than vcftools

for file in /path/dir/*.vcf; do
    bcftools view -Oz -S sample.txt "$file" > /get/inthis/dir/output_"${file##*/}"_.vcf.gz
done
modified 3.0 years ago • written 3.0 years ago by Jorge Amigo (11k)
1

Although it is 10 times faster, the "problem" with bcftools is that it needs the VCF file to be bgzip-compressed and tabix-indexed. Before using the code above, you should do the following:

bgzip ALLsamples.vcf
tabix -p vcf ALLsamples.vcf.gz
modified 3.0 years ago • written 3.0 years ago by MAPK (1.4k)

True, although it shouldn't be a problem:

for file in /path/dir/*.vcf; do
    bgzip "$file"; tabix -p vcf "$file.gz"
    bcftools view -Oz -S sample.txt "$file.gz" > /get/inthis/dir/output_"${file##*/}"_.vcf.gz
done
modified 3.0 years ago • written 3.0 years ago by Jorge Amigo (11k)
1
3.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

GATK SelectVariants (https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php) with the option

--exclude_sample_file (file)

or

--sample_file (file)
written 3.0 years ago by Pierre Lindenbaum (118k)