Question: How to split vcf file by chromosome?
3
gravatar for MAPK
3.3 years ago by
MAPK1.4k
United States
MAPK1.4k wrote:

I have tried several options on the web including a few post her on Biostar to split my VCF file by chromosome, but could not do it properly. Say  I have a vcfile called myvcf.vcf.gz and want to split that per chromosome, what would be the best way to split it by chromosome?

 

vcf • 13k views
ADD COMMENTlink modified 10 months ago by cartoonist50 • written 3.3 years ago by MAPK1.4k

Related post: Splitting VCF file to decrease file size to run it on VEP and wANNOVAR

ADD REPLYlink written 6 months ago by zx87547.3k
16
gravatar for venu
3.3 years ago by
venu6.1k
Germany
venu6.1k wrote:
bgzip -c myvcf.vcf > myvcf.vcf.gz
tabix -p vcf myvcf.vcf.gz
tabix myvcf.vcf.gz chr1 > chr1.vcf

It will give chr1.vcf file containing variants for chr1. You can loop the last command over all the chromosomes. If you need vcf header also, use -h flag with last command.

ADD COMMENTlink modified 8 months ago by RamRS21k • written 3.3 years ago by venu6.1k
2

How do you extract chrX, chrY and chrM? Doesn't seem to work for those.

ADD REPLYlink written 2.7 years ago by MAPK1.4k

Then your chromosomes are either called something else or missing.

ADD REPLYlink written 13 months ago by Click downvote670

Same problem here for the chromosome Y.

ADD REPLYlink written 4 weeks ago by krt0
5
gravatar for cartoonist
10 months ago by
cartoonist50
Germany
cartoonist50 wrote:

If the VCF file is tabix-indexed, extracting a subset by region name, let say chromosome 1, can be done by bcftools:

bcftools view -r 1 file.vcf.gz
ADD COMMENTlink written 10 months ago by cartoonist50
3
gravatar for ricardo
2.4 years ago by
ricardo30
Brazil/Fiocruz/Minas Gerais
ricardo30 wrote:

You can use the snpSift splitchr command. It separates a vcf file according to the chromosomes.

The command is simple:

java -jar snpSift splitChr file.vcf
ADD COMMENTlink modified 8 months ago by RamRS21k • written 2.4 years ago by ricardo30
2

I dont know what version you used but right now this function is used like this :

  java -jar SnpSift.jar split file.vcf

great set of tools btw.

ADD REPLYlink modified 13 months ago • written 24 months ago by tiago2112871.1k

Yeah! You tell him!!

ADD REPLYlink written 4 weeks ago by williamsbrian5064170
1
gravatar for pyjiang2
14 months ago by
pyjiang230
United States
pyjiang230 wrote:

You can use vcftools. For example, if total 16 chromosomes

for i in {1..16};
do vcftools  --vcf  VCF_FILE  --chr $i  --recode --recode-INFO-all --out  VCF_$i;
done
ADD COMMENTlink modified 14 months ago • written 14 months ago by pyjiang230
0
gravatar for willgilks
2.8 years ago by
willgilks260
United Kingdom
willgilks260 wrote:

The question can be answered in a single command, using a bash loop to feed chromosome names into GATK's "select variants" command as shown below, whereby "-L" specifies which chromosome to select (https://software.broadinstitute.org/gatk/). This has the advantage over other methods in that index files are generated "on-the-fly". Also GATK is usually pretty robust. If using a genome with many chromosomes named e.g 1-22, users should modify the loop parameters, to something like "for i in seq 1 22;do"

for i in chr2L chr2R chr3L chr3R chr4 chrX;do
        GenomeAnalysisTK -R ${ref_seq} \
            -T SelectVariants \
            -V my_flies.vcf \
            -L $i \
                -o my_flies.${i}.vcf
                        done;
ADD COMMENTlink written 2.8 years ago by willgilks260

Is this a python script?

ADD REPLYlink written 24 months ago by kirannbishwa01980

That's a bash loop executing a GATK java program.

ADD REPLYlink written 24 months ago by WouterDeCoster39k

Thanks much ! Btw, I just stumbled upon https://gigabaseorgigabyte.wordpress.com/2017/05/02/an-orphan-bioinformatician/ while following your profile in biostars and then wordpress.

I am mainly a evolutionary biologist with a huge transition into Bioinformatics. If you would like to discuss about your problem I hope I can help, though I am not a full fledged programmer at this point. I have though managed to prepare a pipeline (or programme, whatever you may call it, lol) using python which is going to need a lots of cleaning and making it efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

Interesting fact is that I also consider myself a orphan bioinformatician (I like this word !). My situation was even terrible; I started PhD using genome and RNASeq data analyses, then came upon this problem of haplotype phasing in F1 hybrids, which wasn't solvable using any tools available. Additionally, there was null support on the matters of programming in my department and lab, but learning in a hardway has opened my pathway to writing my own program.

Your Biostars rep is quite high so I am thinking you might be quite ahead of me, but I think it would not hurt to discuss.

Thanks, - Bishwa K.

ADD REPLYlink written 24 months ago by kirannbishwa01980

Thanks for reading. Feel free to comment on my blog post if you have suggestions, remarks or something else to add.

I have though managed to prepare a pipeline (or programme, whatever you may call it, lol) using python which is going to need a lots of cleaning and making it efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

That's the thing, you always keep learning and improving. Good luck writing your script!
Every now and then I have a look at old code and give it a makeover, it's surprising how much you learn in a few months.

ADD REPLYlink written 24 months ago by WouterDeCoster39k
1

I think the best way to help and discuss is to put your tool on github (not matter how much small it might be). I just helps me or other people to create a branch and suggest the changes. Depending on how it works out for you or how it achieves the goal of the pipeline, you can merge it into the master branch.

Are you on github? Else, I have found these to be quite helpful :)

http://product.hubspot.com/blog/git-and-github-tutorial-for-beginners

https://git-scm.com/docs/git-push

https://github.com/cubeton/git101/tree/master/TurtorialInfo

ADD REPLYlink modified 24 months ago • written 24 months ago by kirannbishwa01980

We are going fairly off topic here...
Thanks for the links, but I already have a github account and a few repositories.

ADD REPLYlink written 24 months ago by WouterDeCoster39k
0
gravatar for ATpoint
11 months ago by
ATpoint16k
Germany
ATpoint16k wrote:

You can use this code to split any VCF by all chromosomes present in the VCF, using parallel and tabix:

#!/bin/bash

## Assumes uncompressed VCF, extracts entries by chromosomes using tabix and parallel:
if [[ ${1: -4} != ".vcf" ]]
  then
  echo '[ERROR]: Please provide an uncompressed vcf file'
  exit; fi

## Get chromosomes that are present in the VCF:
cat $1 | mawk '$1 ~ /^#/ {next} {print $1 | "sort -k1,1 -u"}' > ${1%.vcf}_chrs.txt

## Compress & index file with tabix:
cat $1 | mawk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | bgzip -c > ${1%.vcf}_sorted.vcf.gz
tabix -p vcf ${1%.vcf}_sorted.vcf.gz

## Extract files by chr:
cat ${1%.vcf}_chrs.txt | parallel "tabix -h ${1%.vcf}_sorted.vcf.gz {} > ${1%.vcf}_{}.vcf"

## Usage: ./split_vcf_by_chr.sh input.vcf
ADD COMMENTlink modified 11 months ago • written 11 months ago by ATpoint16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1694 users visited in the last hour