Splitting vcf files to individual samples
4.4 years ago
Hadeel ▴ 10

Hi everyone. I have a VCF file (.vcf.gz format) containing variants for multiple samples, and I would like to split it into individual VCF files so that each file contains variants from one sample only. Does anyone have an easy, quick, reliable way of doing this? Thanks

vcf • 12k views
4.4 years ago

There might be a better way to do this, but the following for loop should work:

# -env drops sites that are non-variant in the selected sample; -ef drops filtered sites
for sample in $(bcftools query -l yourfile.vcf)
do
java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V yourfile.vcf -o "${sample}.vcf" -sn "$sample" -env -ef
done

Or with vcftools

for sample in `bcftools query -l yourfile.vcf`
do
vcf-subset --exclude-ref -c $sample yourfile.vcf > ${sample}.vcf
done
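Both loops use the sample name directly as a file name, which can fail if a name contains spaces or slashes. A minimal sketch of one way to guard against that; `safe_name` is a made-up helper and the sample names are invented for illustration.

```shell
# Replace every character outside A-Z, a-z, 0-9, '.', '_' and '-' with '_'.
# safe_name is a hypothetical helper, not part of bcftools or vcftools.
safe_name() {
    printf '%s' "$1" | tr -c 'A-Za-z0-9._-' '_'
    echo
}

safe_name 'NA00001'       # NA00001
safe_name 'fam 1/child'   # fam_1_child
```

Inside the loop you would keep passing the original `$sample` to the tool and use `"$(safe_name "$sample").vcf"` only for the output file.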

Alternatively, using GNU parallel:

bcftools query -l yourfile.vcf | parallel -j 8 'java -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V yourfile.vcf -o {}.vcf -sn {} -env -ef'

or with vcftools

bcftools query -l yourfile.vcf | parallel -j 8 'vcf-subset --exclude-ref -c {} yourfile.vcf > {}.vcf'

Here 8 is the number of processes run in parallel; adapt it to your system.
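If you would rather not hard-code the job count, one option is to derive it from the number of CPUs. This is only a sketch: `pick_jobs` is a made-up helper, and it assumes `nproc` (GNU coreutils) or `getconf` is available, falling back to 1 otherwise.

```shell
# Suggest a job count: the number of online CPUs, capped at $1 (default 8).
# pick_jobs is a hypothetical helper, not part of GNU parallel.
pick_jobs() {
    cap=${1:-8}
    cpus=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$cpus" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$cpus"
    fi
}

njobs=$(pick_jobs 8)
echo "using $njobs parallel jobs"
```

The result can then be passed as `parallel -j "$njobs"`. Each GATK process also needs its own memory, so the CPU count is an upper bound rather than a recommendation.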

4.1 years ago

Already stated here and here:

# For every VCF in the directory, write one compressed VCF per sample
for file in *.vcf*; do
  for sample in $(bcftools query -l "$file"); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done
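The output name in the loop above comes from a bash-specific substitution, `${file/.vcf*/.$sample.vcf.gz}`. As a sketch of what it produces, here is a portable equivalent; the function name `per_sample_name` and the file and sample names are made up for illustration.

```shell
# Build the per-sample output name the way the loop above does:
# strip everything from the first ".vcf" onward, then append ".<sample>.vcf.gz".
# per_sample_name is a hypothetical helper, not part of bcftools.
per_sample_name() {
    file=$1
    sample=$2
    printf '%s.%s.vcf.gz\n' "${file%%.vcf*}" "$sample"
}

per_sample_name cohort.vcf.gz NA00001   # cohort.NA00001.vcf.gz
per_sample_name cohort.vcf NA00002      # cohort.NA00002.vcf.gz
```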

When I split the file, all the 0/0 entries are automatically eliminated, right? I'm just left with 0/1 and 1/1.


Exactly. If you split a multisample file, each per-sample file would otherwise contain sites that are variant in other samples but not in that one, so you are probably interested only in the variants that sample actually carries. This code does indeed remove the other samples' variants, which would be 0/0 in the individual sample.
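To make the effect concrete, here is a toy illustration in awk of which rows survive for a single sample. It is not how bcftools implements -c1; it only mimics the outcome on a made-up three-site VCF body, assuming GT is the first FORMAT field. `keep_carried` is a hypothetical helper.

```shell
# Toy data: tab-separated VCF body lines with one sample column (column 10).
# keep_carried is a hypothetical helper that mimics the outcome of -c1.
keep_carried() {
    awk -F'\t' '
        /^#/ { print; next }            # pass header lines through
        {
            split($10, f, ":")          # GT is assumed to be the first FORMAT field
            if (f[1] ~ /[1-9]/) print   # at least one non-reference allele
        }'
}

# Three made-up sites with genotypes 0/0, 0/1 and 1/1; only the last two are printed.
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/0\nchr1\t200\t.\tC\tT\t.\t.\t.\tGT\t0/1\nchr1\t300\t.\tG\tA\t.\t.\t.\tGT\t1/1\n' | keep_carried
```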


Excuse me, if I want to get the whole data after splitting, how should I adjust this code?


Just remove the -c1 option (short for --min-ac 1), which keeps only sites with at least one non-reference allele in the selected sample.


Thank you very much. However, I have one more question. Splitting by sample is equivalent to splitting by column. If I have too much data at a chromosome site, I want to break it down into several groups for training. How should the code be adjusted?


The for sample loop splits the file into one file per sample, for every sample present in the file. If you have a list of samples you'd like to restrict the output to, you only have to modify that loop, such as

for sample in sample1 sample2 sample3; do

or as

for sample in $(cat list_of_samples.txt); do

or, if you want to generate a new multisample VCF with just a few samples in it, you can remove the for sample loop completely and use a single bcftools command:

bcftools view -S list_of_samples.txt -Oz -o subset.vcf.gz large.vcf.gz
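If the goal is several smaller multisample files rather than one, one possible approach is to cut the sample list into fixed-size groups and run the same bcftools command once per group. This is only a sketch: the group size, the `group_` prefix, the output names and the `make_groups` helper are assumptions, and the bcftools step is skipped when bcftools or large.vcf.gz is not present.

```shell
# Split a sample list into groups of at most $2 samples each,
# writing group_aa, group_ab, ... into the directory $3.
# make_groups is a hypothetical helper; split(1) is the standard utility.
make_groups() {
    list=$1
    size=$2
    outdir=$3
    mkdir -p "$outdir"
    split -l "$size" "$list" "$outdir/group_"
}

# Demo with a made-up list of five samples, in groups of two.
workdir=$(mktemp -d)
printf '%s\n' sample1 sample2 sample3 sample4 sample5 > "$workdir/list_of_samples.txt"
make_groups "$workdir/list_of_samples.txt" 2 "$workdir/groups"

# One multisample VCF per group, only if bcftools and the input exist.
for group in "$workdir"/groups/group_*; do
    if command -v bcftools >/dev/null 2>&1 && [ -f large.vcf.gz ]; then
        bcftools view -S "$group" -Oz -o "$(basename "$group").vcf.gz" large.vcf.gz
    fi
done
```

Each group file is a plain list with one sample name per line, which is the format `bcftools view -S` expects.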
4.4 years ago

vcftools

vcf-subset --exclude-ref -c sample1 in.vcf > out.vcf

vcf-subset --exclude-ref -c sample1,sample2 in.vcf > out.vcf

Thanks. I tried this but got this error message: "Can't locate Vcf.pm @INC" Any idea how to get round this?


Run the following line, or add it to your .bashrc and then source .bashrc:

export PERL5LIB=/path/to/installation_dir/vcftools_0.1.13/perl/
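Before re-running vcf-subset, it can help to confirm that the directory you exported really contains Vcf.pm. A small sketch; `find_vcf_pm` is a hypothetical helper that only inspects the colon-separated list it is given.

```shell
# Print the first directory in a colon-separated path list that contains Vcf.pm.
# Returns non-zero if none does. find_vcf_pm is a hypothetical helper.
find_vcf_pm() {
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        if [ -n "$dir" ] && [ -f "$dir/Vcf.pm" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

find_vcf_pm "${PERL5LIB:-}" || echo "Vcf.pm not found in PERL5LIB" >&2
```

If nothing is printed, the path in the export line probably does not point at the perl/ directory of your vcftools installation.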


Both commands worked perfectly! Thanks!


It means that the tool can't find a perl module. You'll have to install it, but my knowledge of perl is quite limited. Someone else can probably help you better with that, or you can have a look at an alternative (GATK) solution in my post.

4.4 years ago

I wrote a tool that generates individual per-sample VCF files from a main VCF file:

$   curl -sL "https://raw.githubusercontent.com/arq5x/bedtools2/bc2f97d565c36a82c1a0b12f570fed4398001e5f/test/map/test.vcf" |\
    java -jar dist/biostar130456.jar -x -z -p "sample.__SAMPLE__.vcf.gz" 
sample.NA00003.vcf.gz
sample.NA00001.vcf.gz
sample.NA00002.vcf.gz
4.4 years ago
Ron ★ 1.0k

Check out these posts: Splitting A Vcf File

Split a VCF file into individual sample files
