Question: Splitting VCF file to decrease file size to run it on VEP and wANNOVAR
1
15 months ago by
S AR50
Pakistan
S AR50 wrote:

I have a VCF file from a GDM patient. It contains SNPs and indels from one sample only, and I want to split it so that each piece is small enough for the upload limits of these online tools, without breaking the VCF format. Any suggestions?

vcf • 1.1k views
ADD COMMENTlink modified 15 months ago by WouterDeCoster42k • written 15 months ago by S AR50
2

Related post: How to split vcf file by chromosome?

ADD REPLYlink written 14 months ago by zx87548.9k

Hello S AR,

Don't forget to follow up on your threads.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.


ADD REPLYlink modified 14 months ago • written 14 months ago by zx87548.9k
5
15 months ago by
finswimmer13k
Germany
finswimmer13k wrote:

Shorter:

bgzip variants.vcf 
tabix variants.vcf.gz 
tabix -l variants.vcf.gz | parallel -j 5 'tabix -h variants.vcf.gz {} > {}.vcf'

# annotate, creating annot_chr*.vcf
bcftools concat annot_chr*.vcf > annot_variants.vcf

From tabix manuals:

-l, --list-chroms List the sequence names stored in the index file.
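One caveat for the final `bcftools concat` step: the shell expands `annot_chr*.vcf` in lexical order (chr1, chr10, chr11, ..., chr2), so the merged file may not follow the original chromosome order. A minimal sketch of one way around this, assuming a `sort` that supports `-V` (version sort, as in GNU coreutils); the function name `list_annotated_in_order` is hypothetical:

```shell
# List annot_chr*.vcf files in natural chromosome order (chr1, chr2, ..., chr10)
# instead of the lexical order produced by the shell glob.
# $1: directory holding the annotated per-chromosome files
list_annotated_in_order() {
    for f in "$1"/annot_chr*.vcf; do
        # Skip the unexpanded pattern when no file matches
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | sort -V
}

# Usage (commented out so the sketch has no side effects):
# list_annotated_in_order . > file_list.txt
```

The resulting list can then be handed to `bcftools concat` through its `-f` (`--file-list`) option. Note that version sort places non-numeric names such as chrM, chrX and chrY alphabetically after the numbered chromosomes, which may differ from the contig order in your header; taking the order from `tabix -l` on the original file is an alternative.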

ADD COMMENTlink modified 14 months ago by zx87548.9k • written 15 months ago by finswimmer13k
4
15 months ago by
Belgium
WouterDeCoster42k wrote:

I split by chromosome for things like that, using bgzip, tabix, Unix commands, bcftools and GNU parallel (optional):

bgzip variants.vcf 
tabix -p vcf variants.vcf.gz 
zgrep -v '^#' variants.vcf.gz  | cut -f1 | sort -u > chromosomes.txt
cat chromosomes.txt | parallel -j 5 --bar 'tabix variants.vcf.gz {} > {}.prevcf'
zgrep '^#' variants.vcf.gz > header
ls *.prevcf | parallel -j 5 'cat header {} > {.}.vcf'
rm *.prevcf
# annotate, creating annot_chr*.vcf
bcftools concat annot_chr*.vcf > annot_variants.vcf
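If bgzip and tabix are not at hand, the same per-chromosome split with the header repeated in every piece can be sketched in a single awk pass over the uncompressed file. This is only an illustration under stated assumptions: the function name `split_vcf_by_chrom` is hypothetical, the input is a plain-text VCF, and the number of distinct contigs is small enough for awk to keep one output file open per contig.

```shell
# Split an uncompressed VCF into one file per chromosome, copying the
# header lines (those starting with '#') to the top of every output file.
# $1: input VCF; $2: output directory (created if missing)
split_vcf_by_chrom() {
    mkdir -p "$2"
    awk -F'\t' -v dir="$2" '
        # Collect all header lines before the first record
        /^#/ { header = header $0 "\n"; next }
        {
            out = dir "/" $1 ".vcf"
            # First record for this chromosome: write the header first
            if (!(out in seen)) { printf "%s", header > out; seen[out] = 1 }
            print > out
        }
    ' "$1"
}

# Usage (commented out so the sketch has no side effects):
# split_vcf_by_chrom variants.vcf split_out
```

Each file in the output directory is then a valid VCF on its own and can be uploaded separately. Records keep the order they had in the input.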
ADD COMMENTlink written 15 months ago by WouterDeCoster42k
1

WouterDeCoster, someone posted a cool trick for getting the chromosomes: after indexing, running tabix -l variants.vcf.gz lists the chromosomes in the VCF.

Edit: It is Fin :).

ADD REPLYlink modified 14 months ago • written 14 months ago by cpad011212k

Yes, I'd definitely recommend finswimmer's answer: C: Splitting VCF file to decrease file size to run it on VEP and wANNOVAR

ADD REPLYlink written 14 months ago by WouterDeCoster42k

Wow, that's great. I will try this and then I'll update here. Thank you so much.

ADD REPLYlink written 15 months ago by S AR50
3
15 months ago by
cpad011212k
India
cpad011212k wrote:

Try vcftools:

for i in chr{1..22};do echo vcftools --chr $i --vcf input.vcf --recode --recode-INFO-all --out $i;done

Remove echo when you are ready to execute.

If you are okay with gnu-parallel and vcftools, you can try this:

$ parallel --dry-run vcftools --chr {} --vcf input.vcf --recode --recode-INFO-all --out {} ::: chr{1..22}

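Remove --dry-run when you are ready to execute. vcftools appends .recode.vcf to the --out prefix, so the results are named chr1.recode.vcf, chr2.recode.vcf and so on.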
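The commands above assume chromosomes named chr1 to chr22, which skips chrX, chrY and chrM and matches nothing if the file uses names like 1, 2, X. A hedged sketch that reads the names from the VCF itself and, like the echo version above, only prints the vcftools commands; the function names `list_chroms` and `print_vcftools_cmds` are hypothetical:

```shell
# Print each distinct chromosome name found in the record lines of a VCF,
# in order of first appearance.
# $1: uncompressed VCF
list_chroms() {
    awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }' "$1"
}

# Print (do not run) one vcftools command per chromosome found in the file.
# $1: uncompressed VCF
print_vcftools_cmds() {
    list_chroms "$1" | while IFS= read -r chrom; do
        echo vcftools --vcf "$1" --chr "$chrom" --recode --recode-INFO-all --out "$chrom"
    done
}

# Usage (commented out so the sketch has no side effects):
# print_vcftools_cmds input.vcf
```

Once the printed commands look right, piping them to `sh` (or removing the `echo`) would execute them.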

ADD COMMENTlink modified 15 months ago • written 15 months ago by cpad011212k