Here is a simple method, using only Python:
from cyvcf2 import VCF
import pandas as pd

cv = VCF("clinvar.vcf.gz")  # get file from ClinVar and make sure you have the .tbi file too
gen = VCF("yourGenome.vcf.gz")  # get this from your WGS provider with the .tbi file

def compare_vcf(cv, usr):
    # Walk two position-sorted variant iterators in step and collect the
    # ClinVar records whose POS, REF and ALT all match a variant in your genome.
    variants = {}
    try:
        cvv = next(cv)
        usrv = next(usr)
    except StopIteration:
        return variants
    while True:
        try:
            if cvv.POS > usrv.POS:
                # ClinVar is ahead: advance your genome
                usrv = next(usr)
            elif cvv.POS < usrv.POS:
                # your genome is ahead: advance ClinVar
                cvv = next(cv)
            else:
                # same position: keep the record only if the alleles match too
                if cvv.REF == usrv.REF and cvv.ALT == usrv.ALT:
                    variants[cvv.ID] = [cvv.POS, cvv.REF, cvv.ALT, cvv.INFO.get('CLNSIG')]
                cvv = next(cv)
        except StopIteration:
            # either file is exhausted, so no further matches are possible
            return variants

chroms = [str(c) for c in range(1, 23)] + ['X', 'Y']
for chrom in chroms:
    cv_chrom = cv(chrom)  # region query, this is why the .tbi index is needed
    gen_chrom = gen('chr' + chrom)  # make sure the format is correct for your genome
    variants = compare_vcf(cv_chrom, gen_chrom)
    output = pd.DataFrame.from_dict(variants, orient='index', columns=['POS', 'REF', 'ALT', 'CLNSIG'])
    output.to_csv('chr' + chrom + '.csv')
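
The comparison above is essentially a merge join over two position-sorted streams. As a minimal sketch of that idea, here is the same walk on plain tuples instead of cyvcf2 records, so it runs without any VCF files. The function name match_sorted and the example records are made up for illustration. Unlike the loop above, this variant groups records by position first, on the assumption that either file may contain several records at the same position.

```python
from itertools import groupby
from operator import itemgetter

def match_sorted(clinvar_records, user_records):
    """Yield ClinVar tuples whose (pos, ref, alt) also occurs in user_records.

    Both inputs must be sorted by position.
    clinvar_records: (pos, ref, alt, clnsig) tuples
    user_records:    (pos, ref, alt) tuples
    """
    done = (None, None)
    cv_groups = groupby(clinvar_records, key=itemgetter(0))
    usr_groups = groupby(user_records, key=itemgetter(0))
    cv_pos, cv_group = next(cv_groups, done)
    usr_pos, usr_group = next(usr_groups, done)
    while cv_pos is not None and usr_pos is not None:
        if cv_pos < usr_pos:
            cv_pos, cv_group = next(cv_groups, done)
        elif cv_pos > usr_pos:
            usr_pos, usr_group = next(usr_groups, done)
        else:
            # Same position: compare every ClinVar record against every
            # allele pair seen in the user data at this position.
            user_alleles = {(ref, alt) for _, ref, alt in usr_group}
            for pos, ref, alt, clnsig in cv_group:
                if (ref, alt) in user_alleles:
                    yield (pos, ref, alt, clnsig)
            cv_pos, cv_group = next(cv_groups, done)
            usr_pos, usr_group = next(usr_groups, done)

# Hypothetical records, for illustration only
clinvar = [
    (100, "A", "G", "Benign"),
    (200, "C", "T", "Pathogenic"),
    (200, "C", "G", "Likely_benign"),
    (300, "G", "A", "Uncertain_significance"),
]
user = [(150, "T", "C"), (200, "C", "G"), (300, "G", "T"), (400, "A", "C")]

matches = list(match_sorted(clinvar, user))
print(matches)
```

Only the record at position 200 with REF C and ALT G is reported, because a shared position alone is not enough: the alleles have to match as well.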
Wow!
On Windows, at least, there are always errors.
Which errors? Please elaborate on what you have found. It should be no issue at all to compare your input VCF to a ClinVar VCF.
You need special hardware and software for bioinformatics. Usually that means "not Windows", at least 32 GB of RAM, a large HDD to keep all the databases, and a multi-core processor. Then these tools from GitHub will work.