I have a VCF file with 23 chromosomes and other unwanted contigs. I want to extract chromosomes 1 to 5 into a single VCF file, including the header lines. How can I do this in the most efficient way? Thanks
In addition to the solutions already posted, you might try VCF Tools:
Note the following capability in its documentation:
SITE FILTERING OPTIONS
These options are used to include or exclude certain sites from any analysis being performed by the program.

POSITION FILTERING
--chr <chromosome>
--not-chr <chromosome>
Includes or excludes sites with identifiers matching <chromosome>. **These options may be used multiple times to include or exclude more than one chromosome.**
This will preserve the header, of course. In addition, the code posted above in the comments will also keep the header, as it matches lines starting with # as well as chr[1-5] (the pattern includes an "or" that grabs lines starting with # or with chr1, chr2, chr3, etc.).
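For the VCFtools route, a minimal sketch of what the call could look like is below. It assumes the contigs are named with a "chr" prefix, and in.vcf and the chr1_5 output prefix are placeholder names, not anything from the original post. The --recode option asks VCFtools to write a new VCF (here chr1_5.recode.vcf), and --recode-INFO-all keeps the INFO fields.

```shell
# Build one --chr flag per wanted chromosome (assumes "chr"-prefixed names).
chr_flags=""
for i in 1 2 3 4 5; do
    chr_flags="$chr_flags --chr chr$i"
done

# in.vcf and chr1_5 are placeholder names; the block is skipped if
# vcftools is not installed or the input file does not exist.
if command -v vcftools >/dev/null 2>&1 && [ -f in.vcf ]; then
    # $chr_flags is left unquoted on purpose so it splits into separate arguments.
    vcftools --vcf in.vcf $chr_flags --recode --recode-INFO-all --out chr1_5
fi
```

If your file uses plain names (1, 2, 3, ...) instead of chr1, chr2, ..., drop the "chr" prefix in the loop.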
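Keep in mind that the posted solution only works for single-digit chromosomes, so chr1, chr2, chr3 (...), but not chr10-22 and X. Using chr[1-22] will also not work: a bracket expression matches exactly one character, so [1-22] is the same as [12], and double digits have to be spelled out separately. Also avoid combining -w with ^#, because the #CHROM column-header line would then be dropped (the # is directly followed by a letter, so the match does not count as a whole word). If you want all regular chromosomes, so 1-22 and X, but discard unplaced (Un) and random contigs and the like from a VCF, anchor the pattern to the start of the line and use:
grep -E '^(#|chr([1-9]|1[0-9]|2[0-2]|X)[[:space:]])' in.vcf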
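For the original question (chr1 to chr5 only), the same idea can be narrowed down. The sketch below builds a tiny made-up VCF just to show which lines survive; example.vcf and chr1_5.vcf are hypothetical file names, and the records are invented for illustration.

```shell
# Create a small example VCF (invented records) with a header, two wanted
# chromosomes, one double-digit chromosome and one random contig.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t100\t.\tA\tG\t.\t.\t.\nchr5\t200\t.\tC\tT\t.\t.\t.\nchr10\t300\t.\tG\tA\t.\t.\t.\nchr1_gl000191_random\t400\t.\tT\tC\t.\t.\t.\n' > example.vcf

# Keep header lines (starting with #) and records whose first column is
# exactly chr1..chr5; the trailing [[:space:]] stops chr1 from matching
# chr10 or chr1_gl000191_random.
grep -E '^(#|chr[1-5][[:space:]])' example.vcf > chr1_5.vcf
```

Here chr1_5.vcf should contain both header lines plus the chr1 and chr5 records, while chr10 and the random contig are left out.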