Question

Easy way to split VCF file by chromosome

2

Entering edit mode

2.3 years ago

ConvolutedGenome ▴ 30

Hi,

Im trying to submit a job on the TOPMED/Michigan imputation server, but it returns an error saying that I need to split my VCF by chromosome

Is there an easy way to do this? Will bcftools help?

sequence TOPMED chromosome Imputation SNP • 15k views

ADD COMMENT • link updated 9 months ago by Pierre Lindenbaum 161k • written 2.3 years ago by ConvolutedGenome ▴ 30

Ram · Answer 1 · 2022-01-19

9

Entering edit mode

2.3 years ago

Pierre Lindenbaum 161k

bcftools index -s in.vcf.gz | cut -f 1 | while read C; do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}" ; done

ADD COMMENT • link 2.3 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

omg thanks!!

ADD REPLY • link 2.3 years ago by ConvolutedGenome ▴ 30

0

Entering edit mode

Hello,

I tried to do this

bcftools index -s /scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz | cut -f 1 | while read C; do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}" ; done

but it throws an error:

[E::hts_open_format] Failed to open file in.vcf.gz
Failed to open in.vcf.gz: No such file or directory

ADD REPLY • link updated 14 months ago by Ram 43k • written 2.2 years ago by rheab1230 ▴ 140

1

Entering edit mode

of course it doesn't work. You should have a look at my command line and _understand_ how it works.

ADD REPLY • link 2.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

#!/bin/bash
source ~/miniconda3/bin/activate
conda activate train
cd /scratch/downloaded/dbGAP/
for C in {1 .. 22}
do
  bcftools index -s in.vcf.gz | cut -f 1 | while read C
  do
    bcftools view -O z -o /home/data_download/split.${C}.vcf.gz in.vcf.gz "${C}"
  done

I tried to do this but its throwing error:

line 9: syntax error: unexpected end of file

ADD REPLY • link updated 14 months ago by Ram 43k • written 2.2 years ago by rheab1230 ▴ 140

0

Entering edit mode

This will Work! enter image description here

ADD REPLY • link 9 months ago by nitin • 0

0

Entering edit mode

why a screenshot when you can just copy and paste the text ?

ADD REPLY • link 9 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

This is just wrong. Why do you use two loops when all the chromosomes are available in bcftools index -s

ADD REPLY • link 9 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

I also tried this, but I am getting the following error:

[E::idx_find_and_load] Could not retrieve index file for 'in.vcf.gz'
Could not load index for VCF: in.vcf.gz

I saw elsewhere bcftools might be bugged (https://github.com/samtools/bcftools/issues/881)? Or am I doing this wrong? The workarounds seem a bit too involved for me to understand.

The command do bcftools view -O z -o split.${C}.vcf.gz in.vcf.gz "${C}" works great. I think it's just something with bcftools index.

For more context, I had some Illumina genotyping array data (.bed, .bim, .fam, .map, .ped) that I converted to .vcf using Plink (and then bgzip). Maybe that's why I'm losing the index somewhere?

Thanks for the help.

ADD REPLY • link 18 months ago by BroNobo • 0

0

Entering edit mode

You need to run bcftools index on your vcf file before running the suggested command.

ADD REPLY • link 13 months ago by Joana • 0

0

Entering edit mode

Or the more popular tabix -p vcf <vcf_file>.

ADD REPLY • link 13 months ago by Ram 43k

Ram · Answer 2 · 2022-07-14

3

Entering edit mode

21 months ago

LTDavid ▴ 50

See the explanation below the code

vcf_in=/scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz

vcf_out_stem=/scratch/Databases/GTEx/downloaded/dbGAP/by_chrom

for i in {1..22}
do
bcftools view ${vcf_in} --regions ${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz
done

First, define vcf_in using your path to your input file. Here, I assumed that your input file was /scratch/Databases/GTEx/downloaded/dbGAP/GTEx_Analysis_2017-06-05_v8_WholeExomeSeq_979Indiv_VEP_annot.vcf.gz.
Deine vcf_out_stem using the path of the new file. vcf_out_stem should only include the portion of the path and file name leading up to the chromosome number, assuming you want the chromosome number in the path name. Here, I assumed that it would have the same path as the input vcf.
You don't need to make changes to the rest of the code. The for line means that it's going to repeat the code 22 times. The range {1..22} corresponds to the chromosomes you want to subset. This is going to loop through; each time, it's going to take your input file and subset the chromosome using --regions ${i}. If your naming convention for your chromosome is different, you may need to change it to that. For example, if your chromosome name in your input put is chr1, then you may need to use --regions chr${i}. And then -o ${vcf_out_stem}_${i}.vcf.gz is going to give you the new file name by using whatever you defined for vcf_out_stem and adding on the chromosome number using i.

ADD COMMENT • link 21 months ago by LTDavid ▴ 50

0

Entering edit mode

LTDavid:

My data have chromosomes 1..22, X, Y, and MT.

How do I modify the syntax for i in {1..22}

Thanks

ADD REPLY • link updated 14 months ago by Ram 43k • written 14 months ago by gebk • 0

2

Entering edit mode

To stick to the for, understand that {1..5} (for example) simply expands to 1 2 3 4 5, which means you can always add more space-separated values like {1..22} X Y MT.

for i in {1..22} X Y MT
do
  bcftools view ${vcf_in} --regions ${i} -o ${vcf_out_stem}_${i}.vcf.gz -Oz
done

ADD REPLY • link 14 months ago by Ram 43k

0

Entering edit mode

Thank you! I will let you know.

ADD REPLY • link 14 months ago by gebk • 0

0

Entering edit mode

The code works good. It created vcf.gz files by chromosome. I'm using Michigan Imputation Server. I uploaded all the files and run the imputation.

During the input validation phase, I got the following error message.

No valid chromosomes found!

Any idea why it gives the error message?

Thanks, again

ADD REPLY • link updated 14 months ago by Ram 43k • written 14 months ago by gebk • 0

1

Entering edit mode

it's unrelated to the original question.

ADD REPLY • link 14 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Yes, it is but the next step after my earlier question. Shall I remove it and repost it on its own?

ADD REPLY • link 14 months ago by gebk • 0

0

Entering edit mode

Hi Ram, could you please comment on this too? I want to try convert 1..22 to chr1..chr22 and run the imputation. How do I modify the above syntax? I really appreciate your help.

ADD REPLY • link 14 months ago by gebk • 0

0

Entering edit mode

If you want to use chr1...chr22, change the command text like so:

for i in {1..22} X Y MT
do
  bcftools view ${vcf_in} --regions chr${i} -o ${vcf_out_stem}_chr${i}.vcf.gz -Oz
done

Just be aware that you'll probably need to use M and not MT as MT becomes chrM.

ADD REPLY • link 14 months ago by Ram 43k

0

Entering edit mode

Thank you very much, Ram.

ADD REPLY • link 14 months ago by gebk • 0

0

Entering edit mode

Ram, can you look at the syntax again? Something is not right. All the separate files have the same size which are probably close to empty. Thanks.

ADD REPLY • link 14 months ago by gebk • 0

0

Entering edit mode

Run the bcftools command without the loop just for chr22 and see the results. If things look OK, add an echo before the bcftools command in the loop and check that the echo'd commands look fine. This is a great opportunity for you to get debugging.

ADD REPLY • link 14 months ago by Ram 43k

0

Entering edit mode

(seq 1 22  && echo X Y MT) | while read i; do ....

ADD REPLY • link 14 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

Please do not add answers unless you're answering the top level question. Use Add Comment or Add Reply as appropriate. I've moved your post to a comment this time.

ADD REPLY • link 14 months ago by Ram 43k