Question: How to split vcf file by chromosome?
3
gravatar for MAPK
4.5 years ago by
MAPK1.6k
MAPK1.6k wrote:

I have tried several options on the web including a few post her on Biostar to split my VCF file by chromosome, but could not do it properly. Say  I have a vcfile called myvcf.vcf.gz and want to split that per chromosome, what would be the best way to split it by chromosome?

 

vcf • 19k views
ADD COMMENTlink modified 2.0 years ago by cartoonist80 • written 4.5 years ago by MAPK1.6k

Related post: Splitting VCF file to decrease file size to run it on VEP and wANNOVAR

ADD REPLYlink written 20 months ago by zx87549.4k
19
gravatar for venu
4.5 years ago by
venu6.7k
Germany
venu6.7k wrote:
bgzip -c myvcf.vcf > myvcf.vcf.gz
tabix -p vcf myvcf.vcf.gz
tabix myvcf.vcf.gz chr1 > chr1.vcf

It will give chr1.vcf file containing variants for chr1. You can loop the last command over all the chromosomes. If you need vcf header also, use -h flag with last command.

ADD COMMENTlink modified 23 months ago by RamRS28k • written 4.5 years ago by venu6.7k
3

How do you extract chrX, chrY and chrM? Doesn't seem to work for those.

ADD REPLYlink written 3.9 years ago by MAPK1.6k

Then your chromosomes are either called something else or missing.

ADD REPLYlink written 2.3 years ago by Click downvote690

Same problem here for the chromosome Y.

ADD REPLYlink written 15 months ago by krt0
1

For those asking how to get ALL the chromosomes into separate files:

VCF="yourfile.vcf"
VCFGZ="${VCF##*/}.gz"  # get basename and add .gz compression extension

bgzip -c $VCF > $VCFGZ  #compress vcf
tabix -p vcf $VCFGZ  # index compressed vcf
tabix --list-chroms $VCFGZ > chromosomes.txt  # save all the chromosome names into a file

while IFS= read -r line; do
  tabix $VCFGZ $line > $line.vcf;
done < chromosomes.txt  # make an individual vcf for each chromosome

Also, you can use that --list-chroms option just to find out the names of your chromosomes.

ADD REPLYlink modified 6 months ago • written 6 months ago by juniper.lake10
8
gravatar for cartoonist
2.0 years ago by
cartoonist80
Germany
cartoonist80 wrote:

If the VCF file is tabix-indexed, extracting a subset by region name, let say chromosome 1, can be done by bcftools:

bcftools view -r 1 file.vcf.gz
ADD COMMENTlink written 2.0 years ago by cartoonist80
4
gravatar for ricardo
3.6 years ago by
ricardo40
Brazil/Fiocruz/Minas Gerais
ricardo40 wrote:

You can use the snpSift splitchr command. It separates a vcf file according to the chromosomes.

The command is simple:

java -jar snpSift splitChr file.vcf
ADD COMMENTlink modified 23 months ago by RamRS28k • written 3.6 years ago by ricardo40
4

I dont know what version you used but right now this function is used like this :

  java -jar SnpSift.jar split file.vcf

great set of tools btw.

ADD REPLYlink modified 2.3 years ago • written 3.2 years ago by tiago2112871.2k

Yeah! You tell him!!

ADD REPLYlink written 15 months ago by williamsbrian5064260
1
gravatar for pyjiang2
2.4 years ago by
pyjiang240
United States
pyjiang240 wrote:

You can use vcftools. For example, if total 16 chromosomes

for i in {1..16};
do vcftools  --vcf  VCF_FILE  --chr $i  --recode --recode-INFO-all --out  VCF_$i;
done
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by pyjiang240
1
gravatar for ATpoint
2.1 years ago by
ATpoint36k
Germany
ATpoint36k wrote:

You can use this code to split any VCF by all chromosomes present in the VCF, using parallel and tabix:

#!/bin/bash

## Assumes uncompressed VCF, extracts entries by chromosomes using tabix and parallel:
if [[ ${1: -4} != ".vcf" ]]
  then
  echo '[ERROR]: Please provide an uncompressed vcf file'
  exit; fi

## Get chromosomes that are present in the VCF:
cat $1 | mawk '$1 ~ /^#/ {next} {print $1 | "sort -k1,1 -u"}' > ${1%.vcf}_chrs.txt

## Compress & index file with tabix:
cat $1 | mawk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k2,2n"}' | bgzip -c > ${1%.vcf}_sorted.vcf.gz
tabix -p vcf ${1%.vcf}_sorted.vcf.gz

## Extract files by chr:
cat ${1%.vcf}_chrs.txt | parallel "tabix -h ${1%.vcf}_sorted.vcf.gz {} > ${1%.vcf}_{}.vcf"

## Usage: ./split_vcf_by_chr.sh input.vcf
ADD COMMENTlink modified 2.1 years ago • written 2.1 years ago by ATpoint36k
0
gravatar for willgilks
4.0 years ago by
willgilks300
United Kingdom
willgilks300 wrote:

The question can be answered in a single command, using a bash loop to feed chromosome names into GATK's "select variants" command as shown below, whereby "-L" specifies which chromosome to select (https://software.broadinstitute.org/gatk/). This has the advantage over other methods in that index files are generated "on-the-fly". Also GATK is usually pretty robust. If using a genome with many chromosomes named e.g 1-22, users should modify the loop parameters, to something like "for i in seq 1 22;do"

for i in chr2L chr2R chr3L chr3R chr4 chrX;do
        GenomeAnalysisTK -R ${ref_seq} \
            -T SelectVariants \
            -V my_flies.vcf \
            -L $i \
                -o my_flies.${i}.vcf
                        done;
ADD COMMENTlink written 4.0 years ago by willgilks300

Is this a python script?

ADD REPLYlink written 3.2 years ago by kirannbishwa011.2k

That's a bash loop executing a GATK java program.

ADD REPLYlink written 3.2 years ago by WouterDeCoster44k

Thanks much ! Btw, I just stumbled upon https://gigabaseorgigabyte.wordpress.com/2017/05/02/an-orphan-bioinformatician/ while following your profile in biostars and then wordpress.

I am mainly a evolutionary biologist with a huge transition into Bioinformatics. If you would like to discuss about your problem I hope I can help, though I am not a full fledged programmer at this point. I have though managed to prepare a pipeline (or programme, whatever you may call it, lol) using python which is going to need a lots of cleaning and making it efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

Interesting fact is that I also consider myself a orphan bioinformatician (I like this word !). My situation was even terrible; I started PhD using genome and RNASeq data analyses, then came upon this problem of haplotype phasing in F1 hybrids, which wasn't solvable using any tools available. Additionally, there was null support on the matters of programming in my department and lab, but learning in a hardway has opened my pathway to writing my own program.

Your Biostars rep is quite high so I am thinking you might be quite ahead of me, but I think it would not hurt to discuss.

Thanks, - Bishwa K.

ADD REPLYlink written 3.2 years ago by kirannbishwa011.2k

Thanks for reading. Feel free to comment on my blog post if you have suggestions, remarks or something else to add.

I have though managed to prepare a pipeline (or programme, whatever you may call it, lol) using python which is going to need a lots of cleaning and making it efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

That's the thing, you always keep learning and improving. Good luck writing your script!
Every now and then I have a look at old code and give it a makeover, it's surprising how much you learn in a few months.

ADD REPLYlink written 3.2 years ago by WouterDeCoster44k
1

I think the best way to help and discuss is to put your tool on github (not matter how much small it might be). I just helps me or other people to create a branch and suggest the changes. Depending on how it works out for you or how it achieves the goal of the pipeline, you can merge it into the master branch.

Are you on github? Else, I have found these to be quite helpful :)

http://product.hubspot.com/blog/git-and-github-tutorial-for-beginners

https://git-scm.com/docs/git-push

https://github.com/cubeton/git101/tree/master/TurtorialInfo

ADD REPLYlink modified 3.2 years ago • written 3.2 years ago by kirannbishwa011.2k

We are going fairly off topic here...
Thanks for the links, but I already have a github account and a few repositories.

ADD REPLYlink written 3.2 years ago by WouterDeCoster44k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1081 users visited in the last hour