I have a VCF file with 23 chromosomes and other unwanted contigs. I want to extract chromosomes 1 to 5 into a single VCF file, keeping the header lines as well. What is the most efficient way to do this? Thanks
bcftools can be used, and it preserves the header as well. Note that --regions needs a bgzip-compressed, indexed input file.
bcftools view input.vcf.gz --regions chr1
To extract multiple chromosomes, pass them as a comma-separated list, e.g. --regions chr1,chr5
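For the chr1 to chr5 case in the question, something along these lines should work. It is only a sketch: the function name extract_chroms and the file names are made up for illustration, and it assumes the contigs are named chr1 ... chr5 rather than 1 ... 5.

```shell
# Hypothetical helper: extract_chroms IN.vcf.gz OUT.vcf.gz REGIONS
# REGIONS is a comma-separated list such as chr1,chr2,chr3,chr4,chr5.
extract_chroms() {
    # -r/--regions needs an index next to the bgzip-compressed input;
    # build one if neither a .csi nor a .tbi index exists yet.
    if [ ! -e "$1.csi" ] && [ ! -e "$1.tbi" ]; then
        bcftools index "$1" || return 1
    fi
    # -O z writes bgzip-compressed VCF; the header is carried over.
    bcftools view -r "$3" -O z -o "$2" "$1"
}

# Example call (assumed file names):
# extract_chroms input.vcf.gz chr1-5.vcf.gz chr1,chr2,chr3,chr4,chr5
```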
Note that this is more convenient than a plain grep because the VCF header is included automatically. However, it does not change the header, so the unselected chromosomes will still have their contig lines, e.g. ##contig=<ID=chr2>. So don't rely on bcftools view -h subset.vcf to verify which chromosomes are left in your VCF file.
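A check that does not depend on the header is to look at the CHROM column of the records themselves. A minimal sketch, assuming an uncompressed VCF on disk or on stdin (the helper name list_chroms and the file names are made up):

```shell
# Hypothetical helper: print each distinct CHROM value found in the
# records (non-header lines) of an uncompressed VCF.
# Reads the files given as arguments, or stdin if there are none.
list_chroms() {
    grep -v '^#' "$@" | cut -f1 | sort -u
}

# Example calls (assumed file names):
# list_chroms subset.vcf
# gzip -dc subset.vcf.gz | list_chroms
```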
In addition to the solutions already posted, you might try VCFtools:
http://vcftools.sourceforge.net/man_latest.html
The manual at this URL describes the following options:
SITE FILTERING OPTIONS
These options are used to include or exclude certain sites from any analysis being performed by the program.
POSITION FILTERING
--chr <chromosome>
--not-chr <chromosome>
Includes or excludes sites with indentifiers matching <chromosome>. **These options may be used multiple times to include or exclude more than one chromosome.**
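Put together for chr1 to chr5, a call could look like the sketch below. The helper chr_flags and the file names are made up for illustration; --gzvcf, --chr, --recode, --recode-INFO-all and --out are VCFtools options from the manual, and the chr prefix is an assumption about how your contigs are named.

```shell
# Hypothetical helper: print "--chr chrN " for every number given,
# so `chr_flags 1 2` prints "--chr chr1 --chr chr2 ".
chr_flags() {
    for n in "$@"; do
        printf '%s %s ' '--chr' "chr$n"
    done
}

# Example call (assumed file names); writes chr1-5.recode.vcf:
# vcftools --gzvcf input.vcf.gz $(chr_flags 1 2 3 4 5) \
#     --recode --recode-INFO-all --out chr1-5
```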
This will preserve the header, of course. The grep command posted in the comments above also keeps the header, because it matches lines starting with # as well as lines containing chr[1-5] (the pattern is an alternation that grabs lines starting with # or with chr1, chr2, chr3, etc.).
Keep in mind that the posted solution only works for single-digit chromosomes, so chr1, chr2, chr3 (...), but not chr10-22 and X. Using chr[1-22] will not work either: a bracket expression matches a single character, so [1-22] only means "1 or 2", and double digits need their own alternative. If you want all regular chromosomes, so 1-22 and X, but want to discard chrUn, random contigs and the like from a VCF, use:
grep -w '^#\|chr[1-9]\|chr[1-2][0-9]\|chr[X]' in.vcf
This pattern assumes your chromosome names have a chr prefix.
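Better to extend the pattern with ^#CHROM to retain the column-name line: with -w, ^# alone does not match #CHROM, because the # is followed directly by a word character. If that line is missing, tools like VCFtools will complain.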
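Because the grep pattern is not anchored to the first column, it can also pick up records where chr followed by a digit happens to appear elsewhere in the line. One way around that is to test only the CHROM field with awk. This is a sketch rather than a drop-in replacement: the function name keep_main_chroms is made up, and the pattern assumes chromosomes named 1-22 and X with an optional chr prefix.

```shell
# Hypothetical helper: keep every header line (including #CHROM) plus
# records whose first column is 1-22 or X, with or without "chr".
keep_main_chroms() {
    awk -F'\t' '/^#/ || $1 ~ /^(chr)?([1-9]|1[0-9]|2[0-2]|X)$/' "$@"
}

# Example call (assumed file names):
# keep_main_chroms in.vcf > main_chroms.vcf
# For chr1 to chr5 only, the test would become: $1 ~ /^chr[1-5]$/
```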
Thanks, how can I update the VCF header?
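One possible approach is to edit the header text and write it back with bcftools reheader. A rough sketch, assuming chr1 to chr5 are the contigs you kept; the helper name drop_other_contigs and the file names are made up:

```shell
# Hypothetical helper: read a VCF header on stdin and drop the ##contig
# lines of every chromosome except chr1..chr5; other lines pass through.
drop_other_contigs() {
    awk '!/^##contig=/ || /^##contig=<ID=chr[1-5][,>]/'
}

# Example calls (assumed file names):
# bcftools view -h subset.vcf.gz | drop_other_contigs > new_header.txt
# bcftools reheader -h new_header.txt -o subset.reheader.vcf.gz subset.vcf.gz
```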
How to split vcf file by chromosome?
Thanks, but this only extracts one chromosome per file, right? I want chr1 to chr5 in one file.