Question: How to split a VCF file by chromosome?
2
2.4 years ago by
MAPK1.2k
United States
MAPK1.2k wrote:

I have tried several options from the web, including a few posts here on Biostars, to split my VCF file by chromosome, but could not get any of them to work properly. Say I have a VCF file called myvcf.vcf.gz: what would be the best way to split it into one file per chromosome?

 

vcf • 7.9k views
modified 3 months ago by pyjiang220 • written 2.4 years ago by MAPK1.2k
9
2.4 years ago by
venu5.3k
Germany
venu5.3k wrote:
bgzip -c myvcf.vcf > myvcf.vcf.gz

tabix -p vcf myvcf.vcf.gz

tabix myvcf.vcf.gz chr1 > chr1.vcf

This gives a chr1.vcf file containing the variants for chr1. You can loop the last command over all the chromosomes. If you also need the VCF header, add the -h flag to the last command.
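
The loop mentioned above could look like the following. This is only a minimal sketch: the function name split_by_chrom and the output file naming are assumptions made for illustration. It relies on `tabix -l`, which lists the sequence names recorded in the index, so no chromosome names have to be hard-coded.

```shell
#!/usr/bin/env bash
# Sketch: write one VCF (with header) per chromosome found in the tabix index.
# split_by_chrom is a hypothetical helper name, not part of tabix.
split_by_chrom() {
    local vcf="$1"   # bgzip-compressed, tabix-indexed VCF
    local chrom
    # `tabix -l` prints the sequence names present in the index, one per line
    for chrom in $(tabix -l "$vcf"); do
        # -h keeps the VCF header in each per-chromosome file
        tabix -h "$vcf" "$chrom" > "${chrom}.vcf"
    done
}

# Usage (assumes myvcf.vcf.gz and its .tbi index already exist):
# split_by_chrom myvcf.vcf.gz
```

Taking the names from the index rather than from a typed list also sidesteps naming mismatches such as chrM versus MT.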

modified 2.4 years ago • written 2.4 years ago by venu5.3k
1

How do you extract chrX, chrY and chrM? It doesn't seem to work for those.

written 21 months ago by MAPK1.2k

Then your chromosomes are either called something else or missing.
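
One way to check what the chromosomes are actually called is to list the distinct values in the first column of the VCF. The following is a minimal sketch using only standard tools; list_chroms is a hypothetical helper name. With an indexed file, `tabix -l myvcf.vcf.gz` should give the same information from the index.

```shell
# Sketch: print the distinct chromosome/contig names in a gzipped VCF.
# list_chroms is a hypothetical helper name used for illustration.
list_chroms() {
    # decompress, drop header lines starting with '#', keep column 1, de-duplicate
    gzip -dc "$1" | grep -v '^#' | cut -f1 | sort -u
}

# Usage:
# list_chroms myvcf.vcf.gz
```

If this prints names such as X, Y and MT rather than chrX, chrY and chrM, query those names instead.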

written 8 weeks ago by Click downvote600
2
18 months ago by
ricardo20
Brazil/Fiocruz/Minas Gerais
ricardo20 wrote:

You can use the SnpSift splitChr command. It separates a VCF file by chromosome.

The command is simple:

java -jar SnpSift.jar splitChr file.vcf

written 18 months ago by ricardo20
1

I don't know which version you used, but right now this function is invoked like this:

  java -jar SnpSift.jar split file.vcf

Great set of tools, btw.

modified 8 weeks ago • written 13 months ago by tiago211287990
0
22 months ago by
willgilks240
United Kingdom
willgilks240 wrote:

The question can be answered in a single command: a bash loop feeds chromosome names into GATK's SelectVariants tool, as shown below, where "-L" specifies which chromosome to select (https://software.broadinstitute.org/gatk/). The advantage over other methods is that index files are generated on the fly. GATK is also usually pretty robust. For a genome with many chromosomes named, e.g., 1-22, modify the loop to something like "for i in $(seq 1 22); do".

for i in chr2L chr2R chr3L chr3R chr4 chrX; do
    GenomeAnalysisTK -R ${ref_seq} \
        -T SelectVariants \
        -V my_flies.vcf \
        -L $i \
        -o my_flies.${i}.vcf
done
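
For numbered chromosomes, the list of names can be generated rather than typed out. The following is a minimal sketch, assuming a naming scheme of 1-22 plus X and Y (adjust it to the contig names in your reference); chrom_names is a hypothetical helper, and the `echo` only stands in for the GATK call above.

```shell
# Sketch: generate chromosome names 1..22 plus X and Y (an assumed naming scheme).
# chrom_names is a hypothetical helper name used for illustration.
chrom_names() {
    seq 1 22            # prints 1 through 22, one per line
    printf '%s\n' X Y   # append the sex chromosomes
}

for i in $(chrom_names); do
    # placeholder for the SelectVariants call; prints the per-chromosome output name
    echo "out.${i}.vcf"
done
```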
written 22 months ago by willgilks240

Is this a python script?

written 12 months ago by kirannbishwa01790

That's a bash loop executing a GATK java program.

written 12 months ago by WouterDeCoster29k

Thanks much! Btw, I just stumbled upon https://gigabaseorgigabyte.wordpress.com/2017/05/02/an-orphan-bioinformatician/ while following your profile on Biostars and then WordPress.

I am mainly an evolutionary biologist making a big transition into bioinformatics. If you would like to discuss your problem, I hope I can help, though I am not a full-fledged programmer at this point. I have, though, managed to prepare a pipeline (or programme, whatever you may call it, lol) using Python, which is going to need a lot of cleaning up and making more efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

An interesting fact is that I also consider myself an orphan bioinformatician (I like this term!). My situation was even worse: I started my PhD with genome and RNA-Seq data analyses, then ran into the problem of haplotype phasing in F1 hybrids, which wasn't solvable with any of the available tools. Additionally, there was no programming support in my department or lab, but learning the hard way has opened the path to writing my own program.

Your Biostars rep is quite high, so I think you might be well ahead of me, but it would not hurt to discuss.

Thanks, - Bishwa K.

written 12 months ago by kirannbishwa01790

Thanks for reading. Feel free to comment on my blog post if you have suggestions, remarks or something else to add.

I have, though, managed to prepare a pipeline (or programme, whatever you may call it, lol) using Python, which is going to need a lot of cleaning up and making more efficient, so I am learning. https://github.com/everestial/pHASE-Stitcher

That's the thing: you always keep learning and improving. Good luck writing your script!
Every now and then I have a look at old code and give it a makeover; it's surprising how much you learn in a few months.

written 12 months ago by WouterDeCoster29k
1

I think the best way to get help and discuss is to put your tool on GitHub (no matter how small it might be). It lets me or other people create a branch and suggest changes. Depending on how that works out for you and whether it achieves the goal of the pipeline, you can merge it into the master branch.

Are you on GitHub? If not, I have found these to be quite helpful :)

http://product.hubspot.com/blog/git-and-github-tutorial-for-beginners

https://git-scm.com/docs/git-push

https://github.com/cubeton/git101/tree/master/TurtorialInfo

modified 12 months ago • written 12 months ago by kirannbishwa01790

We are going fairly off topic here...
Thanks for the links, but I already have a github account and a few repositories.

written 12 months ago by WouterDeCoster29k
0
3 months ago by
pyjiang220
United States
pyjiang220 wrote:

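You can use vcftools. For example, with 16 chromosomes in total, named 1 to 16 (each run writes VCF_$i.recode.vcf; for a gzipped input, use --gzvcf instead of --vcf):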

for i in {1..16}; do
    vcftools --vcf VCF_FILE --chr $i --recode --recode-INFO-all --out VCF_$i
done
modified 3 months ago • written 3 months ago by pyjiang220