Question: Split VCF to individual scaffolds
1
gravatar for krp0001
3.7 years ago by
krp000120
Sweden
krp000120 wrote:

Dear Users,

I couldn't find any script that could split my vcf file that has scaffolds (10322) but not chromes number, i would like to split each scaffold into individual vcf file or a text file.

Thank you

vcf snp sequence assembly genome • 2.3k views
ADD COMMENTlink modified 5 months ago by zx87547.1k • written 3.7 years ago by krp000120
2
gravatar for geek_y
3.7 years ago by
geek_y9.4k
Barcelona/CRG/London/Imperial
geek_y9.4k wrote:

First get a list of uniq scaffolds:

bgzip merge-test.vcf
tabix merge-test.vcf.gz
gzcat merge-test.vcf.gz | grep -v "^#" | cut -f1 | sort | uniq > scaff_names​

Then Split by individual scaffolds:

parallel -a scaff_names "bcftools view merge-test.vcf.gz {} > {.}.vcf"​

or by for loop:

for scaff in `cat scaff_names`; do bcftools view merge-test.vcf.gz $scaff > $scaff.vcf; done​
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by geek_y9.4k

Hai Goutham,

thank you for your quick response, i have tired your script i don't what ''parallel -a'' is, i am running it on grid not in an local machine, i get an error -bash: parallel: command not found, 

 

thank you

ADD REPLYlink written 3.7 years ago by krp000120
1

For time being, you could use the 'for' loop option. You need to install GNU-Parallel. Its a wonderful tool.

ADD REPLYlink written 3.7 years ago by geek_y9.4k

hai again,

here is the few lines of output, missing headers in the vcf file.

##source_20150622.1=vcf-annotate(r953) -f Q=30

##contig=<ID=scaffold1>

##contig=<ID=scaffold2>

##contig=<ID=scaffold3>

##contig=<ID=scaffold4>

##contig=<ID=scaffold5>

i need something which has all the headers for each scaffold some thing like this: please see below

scaffold1       382     .       T       A       298.864 PASS    .      

scaffold2       460     .       C       T       2781.98 PASS    .

scaffold2       534     .       C       T       3079.93 PASS    .

scaffold3       828     .       C       T       3168.21 PASS    .

scaffold3       918     .       G       A       2002.13 PASS    .

scaffold4       929     .       C       T       1802.47 PASS    .

scaffold4       943     .       G       C       1848.41 PASS    .

scaffold5       995     .       G       A       897.007 PASS    .      

that has to be split into individual vcf files that has only one scaffold with variant information in each file. sorry if i haven't made it clear. 

ADD REPLYlink written 3.7 years ago by krp000120

run it on this file and let me know the problem you face. I could not understand from your data. 

ADD REPLYlink written 3.7 years ago by geek_y9.4k

Hai Gowtham, 

again looking for your help, i finally managed to split my large vcf file, with below script (from forums)

cut -f1 assm_1kb.fasta.fai | xargs -i mkfifo {}
cut -f1 assm_1kb.fasta.fai | xargs -i echo cat {} \| gzip \> {}.gz \&|sh
zcat fil-Q25-d10-QUAL30-PASS2.vcf.gz | awk '!/^#/{print>$1}'

the problem now is missing header for each scaffold, 

i managed to split the header with 

head -n 60 fil-Q25-d10-QUAL30-PASS2.vcf | grep "^#" >header

could you please help me with how to add the header information to each scaffold. please let me know if i am not clear. 

Thank you

ADD REPLYlink written 3.7 years ago by krp000120

If you are looking for very simplistic approach, I would do something like this:

grep -v "^#" input.vcf | cut -f1 | sort | uniq > scaff_names​

grep "^#" input.vcf > vcf_header

while read line
do
grep -v "^#" input.vcf  | awk -v scaff=$line '{ if ( $1==scaff ) print }' | cat vcf_header - > $line.vcf
done < scaff_names

This should also split VCF by chromosome/scaffold with header intact. Hope this helps.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by geek_y9.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1227 users visited in the last hour