How to extract specific chromosome from vcf file
3
14
Entering edit mode
6.4 years ago
MAPK ★ 2.0k

I have a vcf file with 23 chromsomes and other unwanted contigs. I want to extract a VCF file with chromsome 1 to chromsome 5 in one file. I want to include the header line as well. How can I do this in the most efficient way? Thanks

vcf • 38k views
6
Entering edit mode
grep -w '^#\|^[1-5]' my.vcf > my_new.vcf


or if your chromosomes have a chr prefix:

grep -w '^#\|^chr[1-5]' my.vcf > my_new.vcf

4
Entering edit mode

Better extend the pattern string by #CHROM to retain the column names. If this is missing, tools like VCFtools will complain.

grep -w '^#\|^#CHROM\|^chr[1-5]' my.vcf > my_new.vcf

0
Entering edit mode

Hi, i am using this command line to extract chr 1 to 22, X and Y but it gets me only chr1, 2 X and Y. What is wrong?

grep -w '^#\|^#CHROM\|^chr[1-22,X,Y]' in.vcf > out.vcf


EDIT: I just saw your solution under. So to grep all chr from 1-22 X and Y, I should do like this right?

grep -w '^#\|^#CHROM\|chr[1-9]\|chr[1-2][0-9]\|chr[X]\|chr[Y]'


thanks

0
Entering edit mode

Thanks, how can I update the vcf header?

1
Entering edit mode
0
Entering edit mode

Thanks, but this only extracts per chromsome, right? I want chr1 to chr5 in one file.

10
Entering edit mode
4.7 years ago
mg ▴ 170

bcftools can be used, and this will preserve the header as well.

bcftools view input.vcf.gz --regions chr1


To extract mutiple chromosomes pass them as comma separated. eg. --regions chr1,chr5

1
Entering edit mode

Note that this method is better than grep as it includes the VCF header. However, it won't change the header of the VCF file so the unselected chromosomes will still have their ID line, e.g ##contig=<id=chr1>. So don't rely on bcftools view -h subset.vcf to verify what chromosomes are left in your VCF file.

1
Entering edit mode

This worked well for me, too! For many chromosomes, do: -R, --regions-file <file> restrict to regions listed in a file

4
Entering edit mode
5.2 years ago
ATpoint 66k

Keep in mind that the posted solution only works for single-digit chromosomes, so chr1, chr2, chr3 (...), but not chr10-22 and X. Using chr[1-22] will also not work, as you have to specify to search for double digits. If you want all regular chromosomes, so 1-22 and X, but discard U, random contigs and stuff from a VCF, use:

grep -w '^#\|chr[1-9]\|chr[1-2][0-9]\|chr[X]' in.vcf

0
Entering edit mode

Hi I want to separate only chr21 from vcf how to do it. I tried above commands its not generating. can any one help me. Thanks for your suggestions.

0
Entering edit mode

You could do it from compressed VCFs also: zgrep -w '^CHRNAME' compressed.vcf.gz > out.vcf and this can be done in parallel zcat compressed.vcf.gz | parallel --pipe --block 2M grep {1} ::: -w '^CHRNAME' > subsampled.vcf

3
Entering edit mode
6.4 years ago
Vincent Laufer ★ 2.5k

In addition to the solutions already posted, you might try VCF Tools:

At this URL note the following ability:

    SITE FILTERING OPTIONS
These options are used to include or exclude certain sites from any analysis being performed by the program.

POSITION FILTERING

--chr <chromosome>
--not-chr <chromosome>

Includes or excludes sites with indentifiers matching <chromosome>. **These options may be used multiple times to include or exclude more than one chromosome.**


This will preserve the header of course. In addition, the code posted above in the comments will also get the header as it is getting lines with # as well as chr[1-5] (the statement includes an or that will grab lines starting with # or with chr1, chr2, chr3, etc.

0
Entering edit mode

Can you do something like --chr 1-23 ?

0
Entering edit mode

This won't work, separating chromosome names by commas neither (though it works for bcftools view --regions 1,2,3). Do rather vcftools --gzvcf <file.vcf.gz> --chr 1 --chr 2 [ etc. until 23] --recode --out subset_chr1-23