I am calling SNPs on mouse exome sequencing data and using BaseRecalibrator from GATK. It requires a mouse SNP database in VCF format as input. Does anyone have experience with this? What I have is snp137.txt, downloaded from the Illumina resource. Thank you.
You can use the SNP VCF file from the Mouse Genome Project (ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/). They also have an mm9 version, so look carefully on their page in case you need mm9. You can use the whole file, or extract the SNP calls for your strain of interest and use only those. I expect the MGP SNP VCF to be about as comprehensive as dbSNP in terms of number of SNPs. I work on one particular mouse strain, so I don't use the full file, only the SNPs between that strain and the reference strain B6.
If you use the VCF file from MGP, you still have to add the string "chr" to the front of every line except the header lines, because the MGP file names chromosomes 1, 2, ... rather than chr1, chr2, .... Also make sure that the chromosome order in the VCF file is the same as in the BAM file (both the order in the header and the order of the records). Most likely it is not. To get the SNPs for your particular strain you can use the code below, which is not the best but will still work. The usage is: python code.py VCFfile_MGP number_of_column_for_that_strain New_vcf_file, where the column number counts from 0 (129P2 is 9, 129S1 is 10).
    import sys

    # Usage: python code.py VCFfile_MGP strain_column New_vcf_file
    filepath, strain_column, outpath = sys.argv[1], int(sys.argv[2]), sys.argv[3]

    # Homozygous non-reference genotypes to keep: 1/1 through 7/7
    hom_alt = {"%d/%d" % (i, i) for i in range(1, 8)}

    with open(filepath) as vcf, open(outpath, "w") as newfile:
        for line in vcf:
            if line.startswith("##"):
                # Meta-information lines are copied unchanged
                newfile.write(line)
                continue
            row = line.rstrip("\n").split("\t")
            # First 9 fixed VCF columns plus the column of the chosen strain
            kept = "\t".join(row[0:9]) + "\t" + row[strain_column] + "\n"
            if line.startswith("#"):
                # Column header line (#CHROM ...): no "chr" prefix here
                newfile.write(kept)
                continue
            genotype = row[strain_column].split(":")
            # Keep a call only if its last subfield is "1" and the
            # genotype itself is homozygous non-reference
            if genotype[-1] == "1" and genotype[0] in hom_alt:
                newfile.write("chr" + kept)
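For the chromosome naming and ordering point, here is a rough sketch of one way to do it, not a tested pipeline. It assumes the reference has a .fai index (for example from samtools faidx), that the reference uses UCSC-style names (chr1, ..., chrM), and that the records are already sorted by position within each chromosome. The function names are made up for illustration. It holds all records in memory, so it is better suited to a single-strain subset than to the full MGP file.

```python
from collections import defaultdict


def reference_order(fai_lines):
    """Contig names in reference order, taken from column 1 of a .fai index."""
    return [line.split("\t", 1)[0].strip() for line in fai_lines if line.strip()]


def ucsc_name(contig):
    """Assumed mapping from Ensembl-style to UCSC-style names (1 -> chr1, MT -> chrM)."""
    name = contig[3:] if contig.startswith("chr") else contig
    return "chrM" if name in ("MT", "M") else "chr" + name


def rename_and_reorder(vcf_lines, contig_order):
    """Yield VCF lines with renamed contigs, grouped in the reference's contig order."""
    header = []
    records = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            # Drop ##contig lines so old names cannot contradict the records
            # (a choice made for this sketch, not a GATK requirement I can confirm)
            if not line.startswith("##contig"):
                header.append(line)
            continue
        chrom, rest = line.split("\t", 1)
        records[ucsc_name(chrom)].append(rest)
    for line in header:
        yield line
    for contig in contig_order:
        # Contigs that are absent from the reference are dropped
        for rest in records.get(contig, []):
            yield contig + "\t" + rest
```

With real files this could be called as out.writelines(rename_and_reorder(open("strain.vcf"), reference_order(open("genome.fa.fai")))), where both file names are placeholders.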
Thank you, Rm. I used the mouse SNP data in VCF format from the link you provided. However, when I run GATK, I get the following error:
ERROR MESSAGE: Input files /raid1/rzeng/reference/mousesnp.vcf and reference have incompatible contigs: No overlapping contigs found
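"No overlapping contigs" means that none of the chromosome names in the VCF match a name in the reference. A quick way to see what differs is to list the names on both sides. The sketch below assumes the reference has a .fai index next to the FASTA; the function names are hypothetical.

```python
def vcf_contigs(vcf_lines):
    """Contig names used by the VCF records, in order of first appearance."""
    seen = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        seen.setdefault(line.split("\t", 1)[0], None)
    return list(seen)


def fai_contigs(fai_lines):
    """Contig names from column 1 of a .fai index, in reference order."""
    return [line.split("\t", 1)[0].strip() for line in fai_lines if line.strip()]


def compare_contigs(vcf_names, ref_names):
    """Split the VCF contig names into those shared with the reference and those not."""
    ref = set(ref_names)
    shared = [name for name in vcf_names if name in ref]
    vcf_only = [name for name in vcf_names if name not in ref]
    return shared, vcf_only
```

If shared comes back empty and vcf_only holds names like 1, 2, X while the reference has chr1, chr2, chrX, the problem is the missing "chr" prefix rather than a different genome build.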
It seems that the reference sequence I downloaded from Illumina and the mouse SNP database in VCF format are not compatible. I may try converting mouse snp137.txt to VCF format with some software. Does anyone have experience with this?
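On converting snp137.txt: a minimal sketch is below, under the assumption that the file follows the UCSC snp table layout (tab-separated, with chrom, chromStart, name, strand, refUCSC, observed and class in columns 2, 3, 5, 7, 9, 10 and 12, counting from 1). Please check that against the actual file first. It keeps only simple single-base SNPs and writes just the eight fixed VCF columns. The function names are made up, and the output may still need sorting into the reference's chromosome order.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

VCF_HEADER = "##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"


def snp_row_to_vcf(row):
    """Convert one snp137 row to a VCF record line, or return None if it is skipped."""
    f = row.rstrip("\n").split("\t")
    chrom, start, name, strand = f[1], int(f[2]), f[4], f[6]
    ref, observed, snp_class = f[8].upper(), f[9].upper(), f[11]
    if snp_class != "single" or ref not in COMPLEMENT:
        return None
    alleles = observed.split("/")
    if any(a not in COMPLEMENT for a in alleles):
        return None
    if strand == "-":
        # Observed alleles are given on the SNP's strand; move them to the + strand
        alleles = [COMPLEMENT[a] for a in alleles]
    alts = [a for a in alleles if a != ref]
    if ref not in alleles or not alts:
        return None
    # chromStart is 0-based, VCF POS is 1-based
    fields = [chrom, str(start + 1), name, ref, ",".join(alts), ".", ".", "."]
    return "\t".join(fields) + "\n"


def snp_table_to_vcf(rows):
    """Yield a minimal VCF (header plus records) from snp137 rows."""
    yield VCF_HEADER
    for row in rows:
        record = snp_row_to_vcf(row)
        if record is not None:
            yield record
```

For base recalibration the known-sites file is used to mask known variant positions, so dropping indels and ambiguous rows here loses some sites but should not produce wrong ones.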
I will keep you updated on my progress, thanks!