Question: Where To Download Mouse Mm10 Dbsnp Database With Vcf Format
3
gravatar for Tonyzeng
7.1 years ago by
Tonyzeng300
Tonyzeng300 wrote:

I am doing mouse exome sequencing SNP calling using BaseRecalibrator from GATK. It requires input of mouse snp database with VCF format, who has experience on this? What I have is snp137.txt downloaded from Illumina resource. Thank you,

vcf • 13k views
ADD COMMENTlink modified 7.1 years ago by Ashutosh Pandey12k • written 7.1 years ago by Tonyzeng300
2

ftp://ftp.ncbi.nih.gov/snp/organisms/mouse_10090/VCF/00--README.txt

ADD REPLYlink written 7.1 years ago by Rm8.0k
7
gravatar for Ashutosh Pandey
7.1 years ago by
Philadelphia
Ashutosh Pandey12k wrote:

You can use vcf file for SNPs from Mouse Genome Project P (ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/). They also have mm9 version. Look carefully on their page in case you need mm9. You can use the whole file or extract the SNP calls for strain of your interest and use it. I am sure the SNP vcf file from MGP should be as comprehensive as dbSNP in terms of number of SNPs. I work on a particular mouse strain so i don't use the full file but SNPs between that particular strain and reference strain B6.

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Ashutosh Pandey12k

thank you, ashutoshmits, I downloaded mgp.v3.snps.rsIDdbSNPv137.vcf.gz that I think as a whole file. Am I doing right? As for strain, I prefer to use B6 and 129, how can I download this snp data from the linker?

ADD REPLYlink written 7.1 years ago by Tonyzeng300
3
gravatar for Ashutosh Pandey
7.1 years ago by
Philadelphia
Ashutosh Pandey12k wrote:

Hi Tony,

In case you are using VCF file from MGP, you still have to add "chr" string in front of each line except the header information. Also, make sure that the order of chromosome in VCF file is same as order of chromosome in BAM file (both chromosome order in header and chromosome order in sequences). I am sure it is not. To get the SNPs for your particular strain you can use the below code which is not the best one but will still work. The usage is python code.py VCFfile_MGP number_of_column_for_that_strain(129P2 is 9, 129S1 is 10) New_vcf_file



import re,sys,fileinput
Argument = []
Argument = sys.argv[1:] 
Filepath = Argument[0]
Strain_column = int(Argument[1])
Outpath = Argument[2]
newfile = open(str(Outpath),"w")
for line in fileinput.input([Filepath]):
        if line.startswith("#"):
                if line.startswith("##"):
                        newfile.write(str(line))
                        continue
                else:
                        header = ""
                        header = ("\t".join(line.split("\t")[0:9]))+"\t"+line.split("\t")[Strain_column]+"\n"
                        newfile.write(str(header))

        rowlist = []
        rowlist = line.split("\t")

        genotype = []
        genotype = rowlist[Strain_column].split(":")

        if genotype[-1] == "1":
                if genotype[0] == "1/1" or genotype[0] == "2/2" or genotype[0] == "3/3" or genotype[0] == "4/4" or genotype[0] == "5/5" or genotype[0] == "6/6" or genotype[0] == "7/7":
                        newline = ""
                        newline = "chr"+("\t".join(rowlist[0:9]))+"\t"+rowlist[Strain_column]+"\n"
                        newfile.write(str(newline))

newfile.close()

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Ashutosh Pandey12k

ashutoshmits, I am trying make the VCF file "compatible" with the genome.fa ( indexed reference file downloaded from Illumina based on mouse mm10). HOwever, it did not work easily due to my limited experience on this, could you suggest me how to make it? what kind of tools you use to like add "chr" to each line and order the chromosome which is corresponding to the genome.fa chromosome order, et ac. Or if possible, would u like to share with me for variant VCF file (C57BL/6J and 129S6/SvEvTac background) which are ready for using?

thank you so much!

ADD REPLYlink written 7.1 years ago by Tonyzeng300
2

Hi Tony, I have changed my script so that you will have "chr" character appended in front of each genotype line. Try using that script it should be pretty easy. You won't know how easy it is until you try.

ADD REPLYlink written 7.1 years ago by Ashutosh Pandey12k

Thank you for the reply, Ashutoshmits, will you send scrip here or for somewhere else then i can get access to use?

ADD REPLYlink written 7.1 years ago by Tonyzeng300

The script above is the script you need to use. Just copy and paste it in a file and save the file as script.py . Then put the script in a system that has python and use command: python script.py MGP_all_strains.vcf_file (name of the big vcf file) column_number_strain New_output_file

ADD REPLYlink written 7.1 years ago by Ashutosh Pandey12k

Thank you Ashutoshmits, I just did not realize that you have made change for the above script.

ADD REPLYlink written 7.1 years ago by Tonyzeng300

Ashutosh, 

I ran the script as indicated and it seems to have appended the contig as indicated, and also selected the strain of interested (e.g. 16 = C57BL/6NJ. However, I noticed that the size of the new .vcf file is much diminished as compared to the original, and when looking at the file itself the number of lines has decreased by multiple orders of magnitude. When writing the new file does your script remove entries/rows that were present in the original. If so what criteria are used to determine when an entry/row is to be eliminated?

Thanks!

Regards,
John 

ADD REPLYlink written 5.5 years ago by john.finnigan0
1

Dear John,

The big vcf file contains a comprehensive list of all variants that have been identified in different mouse strains. A large number of these variants are only present in a few strains and absent from the others. Some of them are even unique (private variants). For example, those variants that are unique to the reference genome (C57BL/6J) will appear in all the other strains because sequence data from all the other strains was compared against the reference genome. My script will fetch all the variants that have been identified for a strain irrespective of if they are unique to that particular strain or also present in other strains. Now coming back to your question, C57BL/6NJ is a substrain of C57BL/6J (reference) so they share a large fraction of IBD (identical by descent) regions. In other words, these strains have been recently separated and that's why don't show lots of differences when compared to each other. Both strains had a fairly recent common ancestor. Researchers started maintaining them (C57BL/6J and C57BL/6NJ) separately at Jackson and NIH and during this time both of these strains acquired spontaneous mutations. Nevertheless they shouldn't differ much and if I remember correctly they have ~20,000 variants (a fairly small file compared to the big file) between them. Actually, the real number should be even less than 20,000 if you remove the low quality variants.

ADD REPLYlink modified 13 months ago by _r_am31k • written 5.5 years ago by Ashutosh Pandey12k
0
gravatar for Tonyzeng
7.1 years ago by
Tonyzeng300
Tonyzeng300 wrote:

Thank you, Rm, I did use the mouse snp data with VCF format from the link that you provided. However, when I run GATK, there is error as follows,

ERROR MESSAGE: Input files /raid1/rzeng/reference/mousesnp.vcf and reference have incompatible contigs: No overlapping contigs found

Seems like the reference sequence I downloaded from Illumina and VCF format mouse snp database are not compatible. I may consider to convert mouse snp137.txt to VCF format with whatever softwares. Any one has experience on this?

I will keep updated on my progress on this, thanks!

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Tonyzeng300
1

it's usually a problem with the chromosome names. Check if the chromosomes in the REF and in the VCF have a prefix 'chr' or not

ADD REPLYlink written 7.1 years ago by Pierre Lindenbaum132k

thank you for the suggestions! Pierre, could you suggest me that which software with which function can help me read the VCF format file?

ADD REPLYlink written 7.1 years ago by Tonyzeng300

gunzip -c file.vcf.gz |more

ADD REPLYlink written 7.1 years ago by Pierre Lindenbaum132k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1137 users visited in the last hour