I need to subset/filter a SNP vcf file by a long list of non-sequential contig IDs, which appear in the CHR column. My VCF file contains 13,971 contigs currently, and I want to retain a specific set of 7,748 contigs and everything associated with those contigs (headers, all variants and genotype information etc.).
My contig list looks like:
I am considering the following script:
vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs
where I list every contig ID individually with a --chr flag before. For this --chr flag, I cannot feed it a text file of contig IDs to keep, which would be ideal. If I list all contigs individually, it'll create a massive script in the command line.
I’ve seen options for filtering by a list of individuals, but not any clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my vcf file by CHR/contig?
The solution does not have to be strictly vcftools related. I'm open to exploring suitable awk/mawk/grep (etc.) options as well.