Where To Download Mouse Mm10 Dbsnp Database With Vcf Format
3
3
Entering edit mode
8.6 years ago
Tonyzeng ▴ 310

I am doing mouse exome sequencing SNP calling using BaseRecalibrator from GATK. It requires input of mouse snp database with VCF format, who has experience on this? What I have is snp137.txt downloaded from Illumina resource. Thank you,

vcf • 15k views
ADD COMMENT
7
Entering edit mode
8.6 years ago

You can use vcf file for SNPs from Mouse Genome Project P (ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/). They also have mm9 version. Look carefully on their page in case you need mm9. You can use the whole file or extract the SNP calls for strain of your interest and use it. I am sure the SNP vcf file from MGP should be as comprehensive as dbSNP in terms of number of SNPs. I work on a particular mouse strain so i don't use the full file but SNPs between that particular strain and reference strain B6.

ADD COMMENT
0
Entering edit mode

thank you, ashutoshmits, I downloaded mgp.v3.snps.rsIDdbSNPv137.vcf.gz that I think as a whole file. Am I doing right? As for strain, I prefer to use B6 and 129, how can I download this snp data from the linker?

ADD REPLY
3
Entering edit mode
8.6 years ago

Hi Tony,

In case you are using VCF file from MGP, you still have to add "chr" string in front of each line except the header information. Also, make sure that the order of chromosome in VCF file is same as order of chromosome in BAM file (both chromosome order in header and chromosome order in sequences). I am sure it is not. To get the SNPs for your particular strain you can use the below code which is not the best one but will still work. The usage is python code.py VCFfile_MGP number_of_column_for_that_strain(129P2 is 9, 129S1 is 10) New_vcf_file



import re,sys,fileinput
Argument = []
Argument = sys.argv[1:] 
Filepath = Argument[0]
Strain_column = int(Argument[1])
Outpath = Argument[2]
newfile = open(str(Outpath),"w")
for line in fileinput.input([Filepath]):
        if line.startswith("#"):
                if line.startswith("##"):
                        newfile.write(str(line))
                        continue
                else:
                        header = ""
                        header = ("\t".join(line.split("\t")[0:9]))+"\t"+line.split("\t")[Strain_column]+"\n"
                        newfile.write(str(header))

        rowlist = []
        rowlist = line.split("\t")

        genotype = []
        genotype = rowlist[Strain_column].split(":")

        if genotype[-1] == "1":
                if genotype[0] == "1/1" or genotype[0] == "2/2" or genotype[0] == "3/3" or genotype[0] == "4/4" or genotype[0] == "5/5" or genotype[0] == "6/6" or genotype[0] == "7/7":
                        newline = ""
                        newline = "chr"+("\t".join(rowlist[0:9]))+"\t"+rowlist[Strain_column]+"\n"
                        newfile.write(str(newline))

newfile.close()

ADD COMMENT
0
Entering edit mode

ashutoshmits, I am trying make the VCF file "compatible" with the genome.fa ( indexed reference file downloaded from Illumina based on mouse mm10). HOwever, it did not work easily due to my limited experience on this, could you suggest me how to make it? what kind of tools you use to like add "chr" to each line and order the chromosome which is corresponding to the genome.fa chromosome order, et ac. Or if possible, would u like to share with me for variant VCF file (C57BL/6J and 129S6/SvEvTac background) which are ready for using?

thank you so much!

ADD REPLY
2
Entering edit mode

Hi Tony, I have changed my script so that you will have "chr" character appended in front of each genotype line. Try using that script it should be pretty easy. You won't know how easy it is until you try.

ADD REPLY
0
Entering edit mode

Thank you for the reply, Ashutoshmits, will you send scrip here or for somewhere else then i can get access to use?

ADD REPLY
0
Entering edit mode

The script above is the script you need to use. Just copy and paste it in a file and save the file as script.py . Then put the script in a system that has python and use command: python script.py MGP_all_strains.vcf_file (name of the big vcf file) column_number_strain New_output_file

ADD REPLY
0
Entering edit mode

Thank you Ashutoshmits, I just did not realize that you have made change for the above script.

ADD REPLY
0
Entering edit mode

Ashutosh, 

I ran the script as indicated and it seems to have appended the contig as indicated, and also selected the strain of interested (e.g. 16 = C57BL/6NJ. However, I noticed that the size of the new .vcf file is much diminished as compared to the original, and when looking at the file itself the number of lines has decreased by multiple orders of magnitude. When writing the new file does your script remove entries/rows that were present in the original. If so what criteria are used to determine when an entry/row is to be eliminated?

Thanks!

Regards,
John 

ADD REPLY
1
Entering edit mode

Dear John,

The big vcf file contains a comprehensive list of all variants that have been identified in different mouse strains. A large number of these variants are only present in a few strains and absent from the others. Some of them are even unique (private variants). For example, those variants that are unique to the reference genome (C57BL/6J) will appear in all the other strains because sequence data from all the other strains was compared against the reference genome. My script will fetch all the variants that have been identified for a strain irrespective of if they are unique to that particular strain or also present in other strains. Now coming back to your question, C57BL/6NJ is a substrain of C57BL/6J (reference) so they share a large fraction of IBD (identical by descent) regions. In other words, these strains have been recently separated and that's why don't show lots of differences when compared to each other. Both strains had a fairly recent common ancestor. Researchers started maintaining them (C57BL/6J and C57BL/6NJ) separately at Jackson and NIH and during this time both of these strains acquired spontaneous mutations. Nevertheless they shouldn't differ much and if I remember correctly they have ~20,000 variants (a fairly small file compared to the big file) between them. Actually, the real number should be even less than 20,000 if you remove the low quality variants.

ADD REPLY
0
Entering edit mode
8.6 years ago
Tonyzeng ▴ 310

Thank you, Rm, I did use the mouse snp data with VCF format from the link that you provided. However, when I run GATK, there is error as follows,

ERROR MESSAGE: Input files /raid1/rzeng/reference/mousesnp.vcf and reference have incompatible contigs: No overlapping contigs found

Seems like the reference sequence I downloaded from Illumina and VCF format mouse snp database are not compatible. I may consider to convert mouse snp137.txt to VCF format with whatever softwares. Any one has experience on this?

I will keep updated on my progress on this, thanks!

ADD COMMENT
1
Entering edit mode

it's usually a problem with the chromosome names. Check if the chromosomes in the REF and in the VCF have a prefix 'chr' or not

ADD REPLY
0
Entering edit mode

thank you for the suggestions! Pierre, could you suggest me that which software with which function can help me read the VCF format file?

ADD REPLY
0
Entering edit mode

gunzip -c file.vcf.gz |more

ADD REPLY

Login before adding your answer.

Traffic: 594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6