Question: How to filter a VCF file with a list of CHR or contig IDs?
gravatar for lexic2713
22 days ago by
lexic27130 wrote:

I need to subset/filter a SNP vcf file by a long list of non-sequential contig IDs, which appear in the CHR column. My VCF file contains 13,971 contigs currently, and I want to retain a specific set of 7,748 contigs and everything associated with those contigs (headers, all variants and genotype information etc.).

My contig list looks like:



dDocent_Contig_10000 etc.

I am considering the following script:

vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs

where I list every contig ID individually with a --chr flag before. For this --chr flag, I cannot feed it a text file of contig IDs to keep, which would be ideal. If I list all contigs individually, it'll create a massive script in the command line.

I’ve seen options for filtering by a list of individuals, but not any clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my vcf file by CHR/contig?

The solution does not have to be strictly vcftools related. I'm open to exploring suitable awk/mawk/grep (etc.) options as well.

ADD COMMENTlink modified 22 days ago by drkennetz500 • written 22 days ago by lexic27130
gravatar for drkennetz
22 days ago by
drkennetz500 wrote:

If you had one contig per line (I'll call it contigs.txt) you could do a simple bash script using what you posted above:

cat contigs.txt | while read CONTIG; do
    vcftools --vcf TotalRawSNPs.vcf --chr $CONTIG --recode --recode-INFO-all --out $CONTIG.out

cat *.out >> FinalRawSNPs.vcf
rm -rf *.out

This would run the script for each contig in the list. I'm sure there are other ways to do this but this is a simple enough way that should be fairly fast.

ADD COMMENTlink written 22 days ago by drkennetz500

I have a question regarding this solution:

after concatenating all *.out into one file (FinalRawSNPs.vcf), is it really a vcf? Especially, the headers from *.out would be all over the lines. Maybe, one has to use bcftools merge like in this answer

ADD REPLYlink modified 22 days ago • written 22 days ago by e.rempel890

Oh, that's a good point I didn't really think of! I think rather than catting you'd need to join them back together with some tool that understands the vcf format, like bcftools.


ADD REPLYlink written 22 days ago by drkennetz500

Hi, thank you both for the very helpful comments! This bash script worked out well. Since I was using vcf files, after creating the CHR output files, I bgzipped then indexed them all before using bcftools concat to combine all of the CHR files back together. Created some messy long text list of Contig IDs towards the top of the final concatenated VCF file, but all the information, headers and INFO flags were retained just fine. Thanks!

ADD REPLYlink written 6 days ago by lexic27130
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1662 users visited in the last hour