Hello, I need to collect a set of ~330,000 SNPs from a set of extended-VCF files (available at http://cdna.eva.mpg.de/denisova/VCF/hg19_1000g/ and http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/). The only information I have on these SNPs is their dbSNP rs numbers.
I've tried to do this using vcftools, but vcftools mangled the extended-format parts of the VCF files, so the output was useless.
One of the people who generated the original files told me that I absolutely cannot use vcftools or GATK on them, because those tools break the extended format, and that I need to use "a simple VCF parser (SAMtools library/tabix implementation, e.g. pysam)". I am a noob, though, and don't know how to do this or what most of it means. I've asked my Python/computational biology professor for help, and we could not figure it out. There is more filtering to do beyond the rsIDs (e.g., keeping only modernC-archaicT or modernG-archaicA sites), but I am holding off on that for now, since I want to get the simple task done first.
I need some pysam Python code to use as a template for filtering these VCF files. Can someone please help? Thanks!
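To make the request concrete, a template might look roughly like the sketch below. It is not pysam: it uses only the standard-library `gzip` module to stream the file and copies matching lines through verbatim, which is one way to leave the extended-format fields untouched. It assumes the rs numbers appear in the third (ID) column of each record, which I have not verified for these files, and the function names and the one-rsID-per-line input file are hypothetical. A tabix index is keyed by position rather than by ID, so with only rs numbers the whole file has to be scanned either way; a pysam reader that yields raw lines could presumably replace the `gzip` loop.

```python
import gzip


def load_rsids(path):
    """Read rs IDs from a text file, one per line (hypothetical layout)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_vcf_by_rsid(vcf_path, rsids, out_path):
    """Copy header lines plus every record whose ID column matches an rs ID.

    Lines are written back exactly as read, so non-standard INFO/FORMAT
    content is never parsed or rewritten. Returns the number of records kept.
    """
    # Handle both plain-text and gzip/bgzip-compressed VCF files.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    kept = 0
    with opener(vcf_path, "rt") as vcf, open(out_path, "w") as out:
        for line in vcf:
            if line.startswith("#"):
                out.write(line)  # keep meta-information and column header
                continue
            # Only the first three columns are needed: CHROM, POS, ID.
            fields = line.split("\t", 3)
            if len(fields) < 3:
                continue  # skip malformed or empty lines
            # Assumption: the ID column may hold several IDs joined by ';'.
            if any(rsid in rsids for rsid in fields[2].split(";")):
                out.write(line)
                kept += 1
    return kept
```

With hypothetical file names, a call would look like `filter_vcf_by_rsid("chr21.vcf.gz", load_rsids("rsids.txt"), "chr21.filtered.vcf")`, repeated once per chromosome file. The allele-based filtering (modernC-archaicT and so on) could later be added as a second check on the REF and ALT columns inside the same loop.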