Question: How to split VCF files per variant, 1000 variants per file?
4
6.5 years ago by
rickyflintoff80 wrote:

I wanted to know if there is a tool that would split a VCF file into n number of smaller VCF files?

Per number of variants, say 1000 variants in each split VCF output.

vcf • 7.6k views
modified 15 months ago by zx87549.0k • written 6.5 years ago by rickyflintoff80

Split based on what? Chromosome, sample, or something else?

written 6.5 years ago by Ashutosh Pandey12k

Split on number of variants. Say 1000 variants per file or something like that.

written 6.5 years ago by rickyflintoff80
13
6.5 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:
# grab the header (assumes it fits within the first 10000 lines)
head -n 10000 my.vcf | grep "^#" > header
# grab the non-header lines
grep -v "^#" my.vcf > variants
# split into chunks of 1000 lines (split names them xaa, xab, ...)
split -l 1000 variants
# reattach the header to each chunk and clean up
for i in x*; do cat header "$i" > "$i.vcf" && rm -f "$i"; done
rm -f header variants
written 6.5 years ago by Chris Miller21k

Hello Chris, 

I have been looking for something similar, but I would like to split by scaffold, since that is how my VCF file is organized (one scaffold per VCF file). I would really appreciate your help with this.

Thank you

written 4.5 years ago by krp000120
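# grab the header (assumes it fits within the first 10000 lines)
head -n 10000 my.vcf | grep "^#" > header
# split into one file per chromosome/scaffold
grep -v "^#" my.vcf | cut -f 1 | sort -u | while read -r i; do
    cat header > "$i.vcf"
    # grep needs the input file here; without it, it would read from the pipe
    grep -w "^$i" my.vcf >> "$i.vcf"
done
rm -f header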
modified 4.5 years ago • written 4.5 years ago by Chris Miller21k
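
For files with many scaffolds, rescanning the VCF once per scaffold can get slow. A minimal single-pass sketch with awk follows. The function name split_vcf_by_contig is hypothetical, and it assumes scaffold names are safe to use as file names (no "/" in them) and that no scaffold is literally called like the input file.

```shell
# Sketch: one pass over the file, one output file per value in column 1.
# split_vcf_by_contig is an illustrative name, not an existing tool.
split_vcf_by_contig() {
    awk -F'\t' '
        /^#/      { header = header $0 "\n"; next }   # collect header lines
        $1 == ""  { next }                            # skip blank lines, if any
        $1 != current {
            # contig changed: close the previous file to limit open handles
            if (out != "") close(out)
            current = $1
            out = current ".vcf"
            if (!(current in seen)) {
                # first time we see this contig: start its file with the header
                printf "%s", header > out
                close(out)
                seen[current] = 1
            }
        }
        { print >> out }                              # append the variant line
    ' "$1"
}

# usage: split_vcf_by_contig my.vcf
```

Unlike grep -w, this compares the first column for exact equality, so a scaffold named sc1 never picks up lines from sc10. A scaffold that reappears later in an unsorted file is appended to its existing output rather than overwriting it.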

I split my VCF using these commands. The files were fine, and I was able to read them in R. But now I am having trouble merging them. I have tried joinx vcf-merge and also vcftools. I would really appreciate your help.

Using joinx vcf-merge xaa.vcf xab.vcf -o merged.vcf, I get this error:

Error while parsing header of xaa.vcf: Failed to extract token while parsing custom type ID=VT,Number=.,Type=String,Description="Alternate allele type. S=SNP, M=MNP, I=Indel">

modified 22 months ago • written 22 months ago by humeira.tayyab10
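
This does not explain the joinx parsing error, but chunks produced by the split commands above all carry the same header and hold consecutive slices of one file, so they can usually be re-joined by plain concatenation rather than a sample-level merge. A minimal sketch, where merge_vcf_chunks is a hypothetical helper name and the chunks are assumed to come from a single original VCF:

```shell
# Sketch: re-join chunks that were produced by splitting ONE VCF file.
# Assumes every chunk carries the same header and that the arguments
# (xaa.vcf, xab.vcf, ...) are given in their original order.
merge_vcf_chunks() {
    local out="$1"
    shift
    local f
    # header lines from the first chunk only
    grep '^#' "$1" > "$out"
    # variant lines from every chunk, in the order given
    for f in "$@"; do
        grep -v '^#' "$f" >> "$out"
    done
}

# usage: merge_vcf_chunks merged.vcf x??.vcf
```

The output name must not match the glob used for the inputs. bcftools concat is a dedicated tool for the same job and may be the better choice if it is installed.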
1
6.5 years ago by
Devon Ryan94k
Freiburg, Germany
Devon Ryan94k wrote:

If you just want it split arbitrarily, rather than by chromosome, just use the split command. If you want N files rather than N lines per file, use "wc -l" first to get the number of lines in the file, then divide by N and pass the result to "split -l ...". This could easily be scripted.
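
The division described above could be scripted roughly as follows. This is only a sketch: the function name split_vcf_into_n and the chunk_ prefix are made up for illustration, it counts variant lines with grep -vc rather than wc -l so that header lines do not skew the arithmetic, and it re-attaches the header to each piece.

```shell
# Sketch: split a VCF into at most N files, each with the header re-attached.
# split_vcf_into_n and the chunk_ prefix are illustrative assumptions.
split_vcf_into_n() {
    local vcf="$1" n="$2" prefix="${3:-chunk_}"
    local total per_file f
    # count only the variant (non-header) lines
    total=$(grep -vc '^#' "$vcf")
    # ceiling division, so that no more than n files are produced
    per_file=$(( (total + n - 1) / n ))
    [ "$per_file" -gt 0 ] || return 0   # no variants, nothing to split
    grep '^#' "$vcf" > "${prefix}header"
    grep -v '^#' "$vcf" | split -l "$per_file" - "${prefix}part_"
    # prepend the header to every piece and drop the headerless original
    for f in "${prefix}part_"*; do
        cat "${prefix}header" "$f" > "$f.vcf" && rm -f "$f"
    done
    rm -f "${prefix}header"
}

# usage: split_vcf_into_n my.vcf 4
```

Run it in an otherwise empty directory, since the loop picks up every file that starts with the chosen prefix, including output from an earlier run.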

See also this thread on stackoverflow, which contains an example script to do exactly this.

modified 6.5 years ago • written 6.5 years ago by Devon Ryan94k

Thank you, but this wouldn't add headers.

written 6.5 years ago by rickyflintoff80

Indeed, I forgot those! ashutoschmits' link gives the variant of my suggestion to use.

written 6.5 years ago by Devon Ryan94k
0
6.5 years ago by
Rm7.9k
Danville, PA
Rm7.9k wrote:

Assuming the header is the top 20 lines of the VCF, and splitting every 1000 variant lines after the header:

awk 'NR <= 20 { header = header $0 "\n"; next } (NR - 21) % 1000 == 0 { if (out) close(out); count++; out = "subset." count ".VCF"; printf "%s", header > out } { print > out }' test.VCF
written 6.5 years ago by Rm7.9k