Question: How to split VCF files per variant, 1000 variants per file?
5.8 years ago by
rickyflintoff60 wrote:

Is there a tool that splits a VCF file into n smaller VCF files?

The split should be by number of variants, say 1000 variants in each output VCF.

vcf • 6.5k views
modified 6 months ago by zx87547.3k • written 5.8 years ago by rickyflintoff60

Split based on what? Chromosome, sample, or something else?

written 5.8 years ago by Ashutosh Pandey11k

Split on number of variants. Say 1000 variants per file or something like that.

written 5.8 years ago by rickyflintoff60
5.8 years ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:
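# grab the header (assumes it ends within the first 10000 lines)
head -n 10000 my.vcf | grep "^#" > header
# grab the non-header lines
grep -v "^#" my.vcf > variants
# split into chunks of 1000 lines each (split names them xaa, xab, ...)
split -l 1000 variants
# reattach the header to each chunk and clean up
# (assumes no other files starting with "x" are in this directory)
for i in x*; do cat header "$i" > "$i.vcf" && rm -f "$i"; done
rm -f header variants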
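
The same steps can also be done in one pass over the file. The following is only a sketch, not part of the original answer: `split_vcf_by_count` and its output naming (`PREFIX.1.vcf`, `PREFIX.2.vcf`, ...) are hypothetical, and it assumes an uncompressed VCF in which every header line starts with "#".

```shell
# split_vcf_by_count INPUT N PREFIX   (hypothetical helper, not a standard tool)
# Writes PREFIX.1.vcf, PREFIX.2.vcf, ... with at most N variant lines each,
# repeating the full header at the top of every output file.
split_vcf_by_count() {
    awk -v n="$2" -v prefix="$3" '
        /^#/ { header = header $0 "\n"; next }   # collect header lines
        count % n == 0 {                         # start a new chunk every n variants
            if (out) close(out)                  # avoid keeping many files open
            out = prefix "." (++chunk) ".vcf"
            printf "%s", header > out
        }
        { print > out; count++ }                 # write the variant line
    ' "$1"
}
```

For example, `split_vcf_by_count my.vcf 1000 chunk` would write `chunk.1.vcf`, `chunk.2.vcf`, and so on, without the temporary `header` and `variants` files.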
written 5.8 years ago by Chris Miller20k

Hello Chris, 

I have been looking for something similar, but I would like to split by scaffold, since that is how my VCF file is organized (one scaffold per VCF file). I would really appreciate your help with this.

Thank you

written 3.7 years ago by krp000120
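# grab the header (assumes it ends within the first 10000 lines)
head -n 10000 my.vcf | grep "^#" > header
# split into one file per chromosome/scaffold
grep -v "^#" my.vcf | cut -f 1 | sort -u | while read -r i; do
  cat header > "$i.vcf"
  # exact string match on column 1, so "scaffold1" does not also pick up "scaffold1.1"
  awk -F '\t' -v c="$i" '($1 "") == c' my.vcf >> "$i.vcf"
done
rm -f header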
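
The loop above rereads my.vcf once per scaffold, which can be slow when there are thousands of scaffolds. A possible single-pass alternative is sketched below. It is not from the original reply: `split_vcf_by_chrom` is a hypothetical name, and the sketch assumes an uncompressed, tab-delimited VCF whose scaffold names are safe to use as file names (no "/" characters).

```shell
# split_vcf_by_chrom INPUT   (hypothetical helper, not a standard tool)
# Writes one file per value of column 1 (e.g. scaffold1.vcf), each with the header.
split_vcf_by_chrom() {
    awk -F '\t' '
        /^#/ { header = header $0 "\n"; next }   # collect header lines
        $1 != current {                          # scaffold changed
            if (out) close(out)
            current = $1
            out = current ".vcf"
            if (!(current in seen)) {            # first time: create file with header
                printf "%s", header > out
                close(out)
                seen[current] = 1
            }
        }
        { print >> out }                         # append the variant line
    ' "$1"
}
```

Only one output file is open at a time, so the sketch should also cope with input that is not sorted by scaffold.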
modified 3.7 years ago • written 3.7 years ago by Chris Miller20k

I split my VCF using these commands. The files were fine, and I was able to read them in R. But now I am having trouble merging them. I have tried joinx vcf-merge and also vcftools. I would really appreciate your help.

Using joinx vcf-merge xaa.vcf xab.vcf -o merged.vcf I get this error:

Error while parsing header of xaa.vcf: Failed to extract token while parsing custom type ID=VT,Number=.,Type=String,Description="Alternate allele type. S=SNP, M=MNP, I=Indel">
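
If the goal is simply to put the chunks back together in their original order, one option that avoids a merge tool is plain concatenation, since every chunk produced by the split commands carries the same header and sample columns. This is a sketch under that assumption and does not address the joinx error itself; `concat_vcf_chunks` is a hypothetical helper name.

```shell
# concat_vcf_chunks OUTPUT CHUNK...   (hypothetical helper, not a standard tool)
# Takes the header from the first chunk, then appends the variant lines of
# every chunk in the order given on the command line.
concat_vcf_chunks() {
    out=$1
    shift
    grep '^#' "$1" > "$out"            # header from the first chunk only
    for f in "$@"; do
        grep -v '^#' "$f" >> "$out"    # variant lines from each chunk
    done
}
```

Usage would look like `concat_vcf_chunks merged.vcf xaa.vcf xab.vcf`. The output file should not be one of the input chunks.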

modified 12 months ago • written 12 months ago by humeira.tayyab0
5.8 years ago by
Devon Ryan90k
Freiburg, Germany
Devon Ryan90k wrote:

If you just want it split arbitrarily, rather than by chromosome, just use the split command. If you want N files rather than N lines per file, use "wc -l" first to get the number of lines in the file, divide that by N, and pass the result to "split -l ...". This is easy to script.
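
The division can be sketched as follows. `lines_per_chunk` is a hypothetical helper, and the sketch assumes the input has no header lines (for a VCF, strip the "#" lines first and reattach them to each chunk afterwards). It rounds up, so the last chunk is never larger than the others; for some inputs this yields fewer than N files.

```shell
# lines_per_chunk FILE N   (hypothetical helper)
# Prints ceil(total_lines / N), the value to pass to "split -l".
lines_per_chunk() {
    total=$(( $(wc -l < "$1") ))       # arithmetic expansion strips any padding from wc
    echo $(( (total + $2 - 1) / $2 ))  # integer ceiling division
}
```

It could then be used as `split -l "$(lines_per_chunk variants 4)" variants`, where `variants` is a header-less file of variant lines.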

See also this thread on Stack Overflow, which contains an example script to do exactly this.

modified 5.8 years ago • written 5.8 years ago by Devon Ryan90k

Thank you, but this wouldn't add headers.

written 5.8 years ago by rickyflintoff60

Indeed, I forgot those! ashutoschmits' link gives the variant of my suggestion to use.

written 5.8 years ago by Devon Ryan90k
5.8 years ago by
Rm7.8k
Danville, PA
Rm7.8k wrote:

Assuming the header is the top 20 lines of the VCF, and splitting every 1000 lines after the header:

# h = number of header lines, n = variant lines per output file
awk -v n=1000 -v h=20 '
  NR <= h { header = header $0 "\n"; next }   # keep the header, do not copy it as data
  (NR - h - 1) % n == 0 {                     # first variant of each chunk
    if (out) close(out)
    out = "subset." (++count) ".VCF"
    printf "%s", header > out
  }
  { print > out }
' test.VCF
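
The fixed count of 20 header lines is specific to this example file. A small sketch for measuring it instead is given below; `count_header_lines` is a hypothetical helper and assumes the header is the leading run of lines that start with "#".

```shell
# count_header_lines FILE   (hypothetical helper)
# Prints how many lines at the top of FILE start with "#".
count_header_lines() {
    awk '!/^#/ { exit } { n++ } END { print n + 0 }' "$1"
}
```

The result can replace the hard-coded 20, for example `head -n "$(count_header_lines test.VCF)" test.VCF > header`.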
written 5.8 years ago by Rm7.8k