How to filter a VCF file with a list of CHR or contig IDs?
1
0
Entering edit mode
10 months ago
acoles ▴ 10

I need to subset/filter a SNP vcf file by a long list of non-sequential contig IDs, which appear in the CHR column. My VCF file contains 13,971 contigs currently, and I want to retain a specific set of 7,748 contigs and everything associated with those contigs (headers, all variants and genotype information etc.).

My contig list looks like:

dDocent_Contig_1

dDocent_Contig_100

dDocent_Contig_10000 etc.

I am considering the following script:

vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs

where I list every contig ID individually with a --chr flag before. For this --chr flag, I cannot feed it a text file of contig IDs to keep, which would be ideal. If I list all contigs individually, it'll create a massive script in the command line.

I’ve seen options for filtering by a list of individuals, but not any clear option for filtering by CHR/contig IDs only. Is there a more efficient way to filter my vcf file by CHR/contig?

The solution does not have to be strictly vcftools related. I'm open to exploring suitable awk/mawk/grep (etc.) options as well.

vcftools vcf chromosome contigs awk • 603 views
ADD COMMENT
2
Entering edit mode
10 months ago
drkennetz ▴ 510

If you had one contig per line (I'll call it contigs.txt) you could do a simple bash script using what you posted above:

cat contigs.txt | while read CONTIG; do
    vcftools --vcf TotalRawSNPs.vcf --chr $CONTIG --recode --recode-INFO-all --out $CONTIG.out
done

cat *.out >> FinalRawSNPs.vcf
rm -rf *.out

This would run the script for each contig in the list. I'm sure there are other ways to do this but this is a simple enough way that should be fairly fast.

ADD COMMENT
1
Entering edit mode

I have a question regarding this solution:

after concatenating all *.out into one file (FinalRawSNPs.vcf), is it really a vcf? Especially, the headers from *.out would be all over the lines. Maybe, one has to use bcftools merge like in this answer

ADD REPLY
0
Entering edit mode

Oh, that's a good point I didn't really think of! I think rather than catting you'd need to join them back together with some tool that understands the vcf format, like bcftools.

Thanks!

ADD REPLY
1
Entering edit mode

Hi, thank you both for the very helpful comments! This bash script worked out well. Since I was using vcf files, after creating the CHR output files, I bgzipped then indexed them all before using bcftools concat to combine all of the CHR files back together. Created some messy long text list of Contig IDs towards the top of the final concatenated VCF file, but all the information, headers and INFO flags were retained just fine. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1973 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6