Splitting A Vcf File
10.1 years ago
Sebastian ▴ 20

Hi, I downloaded a VCF file that contains data for multiple genomes (multiple samples). I want to split it into one VCF file per genome. I didn't find any script for this. If you have one, please share it with me.


AFAIK, encoding variants for multiple reference genomes in VCF is not supported, and therefore it is impossible to handle this case. It is not clear to me what you mean by "multiple genome data". Do you want to separate variant calls by sample? I don't know if that is possible or makes sense. Please provide an example of your file, so we can see how this concatenation is realized.

7.0 years ago

I know this is a very old question, but there's a very efficient way of doing this that hasn't been reported yet. Hope it helps:

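# Split every multisample VCF in the directory into one file per sample
for file in *.vcf.gz; do
  # Sample names are columns 10 onward of the #CHROM header line
  for sample in $(bcftools view -h "$file" | grep "^#CHROM" | cut -f10-); do
    # -c1 keeps only sites with at least one non-reference allele in this sample
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done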

EDIT: bcftools query -l lists all samples, so the fastest loop would be the following:

# *.vcf.gz rather than *.vcf*, so that .tbi index files are not picked up
for file in *.vcf.gz; do
  for sample in $(bcftools query -l "$file"); do
    bcftools view -c1 -Oz -s "$sample" -o "${file/.vcf*/.$sample.vcf.gz}" "$file"
  done
done
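
The output name in the loops above comes from the bash pattern substitution ${file/.vcf*/.$sample.vcf.gz}, which replaces everything from .vcf onward with the sample name plus .vcf.gz. As a minimal sketch, the same naming can be written with the POSIX suffix-removal form; the file name and sample name below are made up for illustration, and the two forms should agree for file names that contain .vcf only once:

```shell
# Hypothetical inputs, for illustration only
file="cohort.vcf.gz"
sample="NA12878"

# ${file%%.vcf*} strips the longest suffix matching ".vcf*", leaving the base name
out="${file%%.vcf*}.${sample}.vcf.gz"
echo "$out"
```

Printing the name like this before launching the real loop is a cheap way to confirm that the outputs will not overwrite the input file.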

I tried to use the above code with the new bcftools and keep getting the message that -h is not a valid parameter.

Any ideas?


-h IS a valid parameter: bcftools view -h outputs the header of a vcf.gz file. Are you sure you're using the latest bcftools version (currently 1.2, using htslib 1.2.1), and that you haven't made a typing error?


I can confirm that -h is a valid parameter. I was using an older version of bcftools; it's working now.
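
Since the -h problem came down to an outdated installation, it can help to check which bcftools is actually being picked up. A small sketch, assuming a bcftools 1.x build that supports the --version option (the variable name version_line is only illustrative):

```shell
# Report which bcftools, if any, is first on the PATH
if command -v bcftools >/dev/null 2>&1; then
  version_line=$(bcftools --version | head -n 1)
else
  version_line="bcftools not found on PATH"
fi
echo "$version_line"
```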

 


Hi Jorge,

Could you please write the command for my vcf file named (ALLsamples.vcf)?

Thank you
Bing


The "problem" with bcftools is that it expects the variants VCF file to be bgzip-compressed and tabix-indexed. Before using the code above, you should do the following:

bgzip ALLsamples.vcf
tabix -p vcf ALLsamples.vcf.gz

Hi Jorge,

I have used your script to split my VCF file (which contains 96 individuals). Before executing the script I followed the method above, which produced the files mydata.vcf.gz and mydata.vcf.gz.tbi. After running it, I got the error "Failed to open mydata.vcf.gz.tbi: Success", but results were produced (separate files for the 96 individuals). Could you let me know why this error occurs?

I am using bcftools v1.2 (using htslib 1.2.1).


It looks like the for loop is picking up both the vcf.gz and the vcf.gz.tbi files. You still get results because the vcf.gz files work and the vcf.gz.tbi files don't. You can solve it by making sure the for loop is well defined, starting it as follows: for file in *.vcf.gz; do
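
The difference between the two globs can be seen without bcftools at all. A minimal sketch using empty placeholder files in a scratch directory (the file names mirror the ones in the question; the variables loose and strict are illustrative):

```shell
# Scratch directory with a placeholder compressed VCF and its tabix index
tmpdir=$(mktemp -d)
touch "$tmpdir/mydata.vcf.gz" "$tmpdir/mydata.vcf.gz.tbi"

# Count how many files each glob expands to
loose=$(cd "$tmpdir" && set -- *.vcf* && echo "$#")
strict=$(cd "$tmpdir" && set -- *.vcf.gz && echo "$#")
echo "*.vcf* matches $loose files; *.vcf.gz matches $strict"

rm -r "$tmpdir"
```

The looser pattern also expands to the .tbi index, which is the file bcftools then fails to open as a VCF.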


Thanks for your reply; it works fine without errors. My original VCF file has 8020 SNPs across 96 samples. The output files show only 5300 SNPs per individual. Could you tell me why?


A multisample VCF file contains all the variants for all the samples. When you select a particular sample, you will find that not all of those variants actually vary in that sample. The option -c1 requires a variant to vary in that particular sample in order to be written to the output file. You may remove it if you don't mind having homozygous-reference sites in your new VCF files.
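
To see why the per-sample counts drop, here is a rough sketch on a toy two-sample VCF built inline. It only mimics the effect of -c1 by looking at the GT field with awk; it is not how bcftools implements the filter, and the sample names S1 and S2 and the helper count_variant_sites are invented for the example:

```shell
# Build a toy VCF: 3 sites, 2 samples (S1 in column 10, S2 in column 11)
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t0/0\n'
  printf '1\t200\t.\tC\tT\t.\t.\t.\tGT\t0/0\t1/1\n'
  printf '1\t300\t.\tG\tA\t.\t.\t.\tGT\t0/1\t0/1\n'
} > "$vcf"

# Count records where the sample in column $2 carries at least one ALT allele
count_variant_sites() {
  awk -F'\t' -v col="$2" \
    '!/^#/ { split($col, f, ":"); if (f[1] ~ /[1-9]/) n++ } END { print n + 0 }' "$1"
}

total=$(grep -vc '^#' "$vcf")
s1=$(count_variant_sites "$vcf" 10)
s2=$(count_variant_sites "$vcf" 11)
echo "all sites: $total, S1: $s1, S2: $s2"
rm "$vcf"
```

Each toy sample is homozygous reference at one of the three sites, so each keeps fewer sites than the combined file, which is the same pattern as 8020 versus 5300.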


Thanks for your clarification.


I tried the tabix command but I am getting this error:

[E::hts_open_format] Failed to open file vcf Could not read vcf

I am not able to understand the problem. Please reply


Share the command you used with us, so that we don't have to guess it.


tabix -p vcf ALLsamples.vcf.gz .....this one.


If ALLsamples.vcf is a well-formed multisample file and has already been bgzip-compressed, tabix -p vcf ALLsamples.vcf.gz should index it without complaining at all. Unfortunately, if that doesn't work for you, you'll have to figure it out yourself, as you're facing a local issue. Some ideas: is tabix properly installed? Do you have permissions on those files? Is the VCF file well formed? Good luck.


How do we check if the vcf file is well formed?
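
One rough way to start, offered as a sketch rather than a full validator: check that a #CHROM header line exists and that every record has the same number of tab-separated columns as that header. The function name check_vcf_columns is made up here, and passing this check does not prove the file follows the VCF specification.

```shell
# Read a VCF on stdin; report structural problems and return non-zero if any are found
check_vcf_columns() {
  awk -F'\t' '
    /^##/      { next }                       # meta-information lines
    /^#CHROM/  { ncol = NF; seen = 1; next }  # column header line
    !seen      { print "line " NR ": record before the #CHROM line"; bad = 1; next }
    NF != ncol { print "line " NR ": " NF " columns, expected " ncol; bad = 1 }
    END {
      if (!seen) { print "no #CHROM header line found"; bad = 1 }
      exit bad + 0
    }
  '
}

# Deliberately broken toy input: the record has fewer columns than the header
printf '#CHROM\tPOS\tID\n1\t100\n' | check_vcf_columns || echo "input is not well formed"
```

For a compressed file the same check can be fed with gzip -dc ALLsamples.vcf.gz | check_vcf_columns, since bgzip output is readable by gzip.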


Thank you Jorge! It worked.

10.1 years ago

I am assuming that you mean that you have multiple samples represented in your VCF file and that you want to get one file per sample. See the vcftools package for some possibilities. If my assumption was incorrect, please edit your question with more details.


vcftools includes a Perl script for this, called vcf-subset.

4.0 years ago
mplace ▴ 40

I know this is an old post, but this method, modified from Jorge's answer above, is much faster.

Get list of sample names:

  for sample in `bcftools view -h MyData.vcf.gz | grep "^#CHROM" | cut -f10-`; do echo $sample; done > sampleNames.txt

Split the VCF file faster:

  parallel -a sampleNames.txt  bcftools view -c1 -s {} -Oz --threads 8 -o {}.vcf.gz MyData.vcf.gz

This will use all available cores on the system.


I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.


Thanks a lot for this idea. It is indeed much faster (at least 3x in a local test splitting 3 exomes), and it can be condensed as follows:

bcftools query -l MyData.vcf.gz | parallel -a - \
bcftools view -c1 -s {} -Oz --threads 8 -o {}.vcf.gz MyData.vcf.gz
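
Before launching many jobs at once, it may be worth printing the commands first. The sketch below builds the per-sample command lines from a sample list without running them; the sample names are placeholders, and in practice the list would come from bcftools query -l MyData.vcf.gz. GNU parallel also has a --dry-run option that serves the same purpose.

```shell
# Placeholder sample list, standing in for the output of: bcftools query -l MyData.vcf.gz
samples=$(mktemp)
printf 'S1\nS2\nS3\n' > "$samples"

# Print, without running, the command that would be launched for each sample
cmds=$(while IFS= read -r s; do
  printf 'bcftools view -c1 -s %s -Oz -o %s.vcf.gz MyData.vcf.gz\n' "$s" "$s"
done < "$samples")
echo "$cmds"
rm "$samples"
```

Note that parallel starts one job per CPU core by default and --threads 8 adds compression threads to each job, so the job count may be worth capping with parallel's -j option on a shared machine.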
10.1 years ago
Casbon ★ 3.2k
cut -f1-9,n file.vcf

Where n is the column of the sample you want.


You probably want to include the header in each new file:

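# List one column number per sample (10, 11, 12, ...)
for col in 10 11 12
do
   # Keep the ## meta lines whole; cut the #CHROM line together with the records
   (grep '^##' file.vcf; grep -v '^##' file.vcf | cut -f 1-9,"$col") > "file.$col.vcf"
done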

or something like that...


This approach will almost certainly leave you with some serious inconsistencies in your resulting VCF. For example, you may end up with lines that have a variant, but with no support for that variant in the genome that you cut out of the original file. You will have missing and incorrect ALTs. Depending on which INFO and FORMAT tags you are using, you may see a host of other issues as well. You will need to recalculate these values for each line.


No need to handle the header separately if the header lines don't include tabs, which they normally don't: cut passes lines that contain no delimiter through unchanged.
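
A small sketch of that behaviour on a toy VCF built inline (the sample names S1 and S2 are invented): the ## line has no tabs and is passed through whole, while the #CHROM line and the record are trimmed to the nine fixed columns plus the chosen sample.

```shell
# Toy VCF with one meta line, the column header, and one record
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  printf '1\t100\t.\tA\tG\t.\t.\t.\tGT\t0/1\t0/0\n'
} > "$vcf"

# Keep the nine fixed columns plus column 11 (sample S2)
s2_only=$(cut -f1-9,11 "$vcf")
printf '%s\n' "$s2_only"
rm "$vcf"
```

The record kept for S2 still lists ALT G although S2 is 0/0 there, which is the kind of unsupported variant line the warning above describes.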

6.7 years ago
peter ▴ 10

Have a look at our Differ app. It's free and allows you to split VCF files using a GUI on OS X.

Differ is available from http://www.diploid.com/differ

9.4 years ago
user56 ▴ 300

I had a similar problem and I had to use Windows :-(.

If you are working with small-ish VCF files, you can use R to work with the data (e.g., split it).

To load the file use:

file='e:/d/genome/t300.txt'
v <- read.table(file,sep='\t',header = T,fileEncoding="utf-16")
str(v)

The UTF-16 encoding was particularly hard to troubleshoot. Eventually Notepad++ helped me to detect this encoding problem. It correctly ignores the header lines and detects column headers as well.

To remove the columns (except 1 genome), use this command:

v[11:ncol(v)]<-list(NULL)

In my case the file had 9 initial columns and column 10 had the first genome.

You can modify this to filter genomes 12,13, etc...

10 weeks ago
samuelandjw ▴ 120

As of now, bcftools 1.12 has a plugin named split. To split the vcf file so that each sample has its own vcf file, just use:

bcftools +split input.vcf.gz -Oz -o vcf_per_sample

All split vcf files will be in the vcf_per_sample folder.
