Hi, I am new to next-generation sequencing data analysis (exome sequencing) in mouse. I would very much appreciate any suggestions from anyone who has done, or is currently doing, mouse sequencing work.
Here is what I have done so far.
Quality control (no problem)
Short-read alignment with BWA. I downloaded the indexed mouse reference (build mm10) from Illumina, which includes genome.fa and genome.fa.amb/ann/bwt/fai/pac. The alignment ran without problems using this pre-built index. (I used the same version of BWA as was used for the Illumina data.)
Creating the table of indel realignment targets with GATK RealignerTargetCreator. (To use GATK, I generated the genome.dict and genome.fai files using Picard.)
Realigning reads around indels. I used GATK IndelRealigner, and it has run without problems so far.
Base quality score recalibration. I used GATK BaseRecalibrator to recalibrate the base qualities. For this step I downloaded mouse variant data (snp137.vcf) from Sanger. This is where I am stuck: the VCF data and the reference data (genome.fa from Illumina) have incompatible contigs.
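One way to see exactly which contigs disagree is to compare the sequence names in genome.fa.fai with those in the VCF. The following is a minimal sketch rather than a tested part of the pipeline: the helper names (fai_contigs, vcf_contigs, compare_contigs) are hypothetical, and it only compares names and their relative order, not contig lengths. A typical cause of this kind of mismatch is a naming difference such as "chr1" versus "1", but that is an assumption here and not something confirmed for these particular files.

```python
import re
from typing import Dict, Iterable, List


def fai_contigs(lines: Iterable[str]) -> List[str]:
    """Return contig names from .fai lines (first tab-separated column)."""
    return [line.split("\t", 1)[0].strip() for line in lines if line.strip()]


def vcf_contigs(lines: Iterable[str]) -> List[str]:
    """Return contig names seen in a VCF, in first-seen order.

    Names come from '##contig=<ID=...>' header lines and from the
    CHROM column (first column) of the variant records.
    """
    seen: List[str] = []
    for line in lines:
        if line.startswith("##contig="):
            match = re.search(r"ID=([^,>]+)", line)
            name = match.group(1) if match else None
        elif line.startswith("#") or not line.strip():
            continue  # other header lines and blank lines carry no contig
        else:
            name = line.split("\t", 1)[0]
        if name and name not in seen:
            seen.append(name)
    return seen


def compare_contigs(ref: List[str], vcf: List[str]) -> Dict[str, object]:
    """Summarise which contig names are shared and which are not."""
    ref_set, vcf_set = set(ref), set(vcf)
    shared_in_ref_order = [c for c in ref if c in vcf_set]
    shared_in_vcf_order = [c for c in vcf if c in ref_set]
    return {
        "shared": shared_in_ref_order,
        "only_in_ref": [c for c in ref if c not in vcf_set],
        "only_in_vcf": [c for c in vcf if c not in ref_set],
        "same_order": shared_in_ref_order == shared_in_vcf_order,
    }


def contig_report(fai_path: str, vcf_path: str) -> Dict[str, object]:
    """Compare an uncompressed .fai index and an uncompressed VCF on disk."""
    with open(fai_path) as fai, open(vcf_path) as vcf:
        return compare_contigs(fai_contigs(fai), vcf_contigs(vcf))
```

For example, contig_report("genome.fa.fai", "snp137.vcf") would return a dictionary listing the names found only in the reference, the names found only in the VCF, and whether the shared names appear in the same order. An empty "shared" list together with names that differ only by a "chr" prefix would point to a naming convention difference rather than a different genome build.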
I need suggestions on the following:
Do I need to index the mouse mm10 reference with BWA myself, instead of using the pre-indexed data from Illumina from the start? Or is it better to use data downloaded from a single source only? As you can see, my pipeline uses the indexed reference data and the SNP data from two different sources (Illumina and Sanger).
Where can I download compatible, ready-to-use mouse reference data and SNP data in VCF format (build mm10)? The SNP data I have collected on my computer so far are a mouse snp137.txt file (from Illumina), dbsnp137.vcf (from Sanger), and SC_MOUSE_GENOMES.genotype.vcf (from NCBI). As for the reference, I am only using the genome.fa downloaded from Illumina.
I really need to be consistent about which reference data (NCBI, Sanger, or Illumina) and which dbSNP database (NCBI, Sanger, or Illumina) are used in the analysis pipeline, since that will make my analysis more straightforward.