Synopsis of an unusual problem I have one whole-genome sequencing dataset (C. elegans) that I aligned to a reference genome and called SNPs on. The calls included a large number of SNPs that were not expected (the problematic sample was processed in exactly the same way, and at the same time, as my other samples, and the other samples did not have this problem). I PCR-amplified a 600 bp region containing 6 of these SNPs from the same DNA sample and sent it for Sanger sequencing; none of the SNPs was present.
Another strange thing is that both runs produced a number of anomalous read pairs in the two samples, in which both mates point in the same direction. I was told that such pairs cannot normally occur in Illumina data unless there is an inversion on the sequenced chromosome, but that would mean I would have to postulate a lot of inversions in my worms, which is just ridiculous.
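To put a number on the same-direction pairs, one option is to read the strand bits of the SAM FLAG field directly. The following is only a minimal sketch: the bit values come from the SAM specification, while the function names and the choice to count only primary alignments whose mate maps to the same chromosome are my own assumptions, not part of the pipeline described here.

```python
# SAM FLAG bits, as defined in the SAM specification
PAIRED = 0x1
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
MATE_REVERSE = 0x20
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def same_strand_pair(flag):
    """True if the read is paired, both mates are mapped,
    and both mates align to the same strand."""
    if not flag & PAIRED:
        return False
    if flag & (UNMAPPED | MATE_UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    return bool(flag & REVERSE) == bool(flag & MATE_REVERSE)


def count_same_strand(sam_lines):
    """Count same-strand records per chromosome from SAM text lines.

    Only records whose mate maps to the same chromosome are counted
    (RNEXT is '=' or equal to RNAME). Each pair contributes two records.
    """
    counts = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rname, rnext = int(fields[1]), fields[2], fields[6]
        if rnext not in ("=", rname):
            continue
        if same_strand_pair(flag):
            counts[rname] = counts.get(rname, 0) + 1
    return counts
```

The lines can come from any SAM text source, for example the output of `samtools view` on the aligned BAM file. Comparing the counts between the problematic samples and the normal ones would show how large the excess actually is.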
Details of my experiments We performed a forward EMS mutagenesis screen in C. elegans and selected for a phenotype of interest. To identify the causative mutation we used variant discovery mapping (a.k.a. bulk-segregant mapping), which involves backcrossing the EMS mutant once to the unmutagenised strain and pooling the phenotypic recombinant F2s for DNA extraction. WGS (average genome coverage 30-60x) was performed on an Illumina NextSeq 500 V2 (outsourced). The allele frequency at each SNP position is then extracted; the chromosomal region containing SNPs with high allele frequency is the mapped region.
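The allele-frequency step can be sketched as follows. This is an illustration of the idea only, not what the mapping software does internally: the input tuple layout, the function names and the 1 Mb bin size are all assumptions made for the example.

```python
from collections import defaultdict


def allele_frequency(ref_count, alt_count):
    """Fraction of reads supporting the alternate allele, or None if no reads."""
    total = ref_count + alt_count
    return alt_count / total if total else None


def binned_frequency(snps, bin_size=1_000_000):
    """Mean alternate-allele frequency per chromosome bin.

    snps: iterable of (chrom, pos, ref_count, alt_count) tuples.
    Returns {(chrom, bin_start): mean allele frequency}.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [sum of AF, number of SNPs]
    for chrom, pos, ref_n, alt_n in snps:
        af = allele_frequency(ref_n, alt_n)
        if af is None:
            continue  # no coverage at this site
        key = (chrom, (pos // bin_size) * bin_size)
        sums[key][0] += af
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}
```

In a pooled F2 sample, bins linked to the causative mutation should approach a mean frequency of 1, while unlinked bins should sit well below that.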
I isolated multiple strains from the screen and performed alignment and SNP calling on all of them in exactly the same way, using a package called MiModD, which handles samples from the same screen in bulk. Only 2 of my samples had a high number of unexpected SNPs and anomalous reads.
About the weird SNPs Filtering for high genotype quality does not significantly affect the unusual SNPs. The SNPs are highly enriched towards the ends of all chromosomes, which is not characteristic of EMS-induced SNPs but is characteristic of wild C. elegans strains. We considered a possible contamination of the sample with a wild C. elegans strain (sequence known), but the positions of the SNPs do not match up at all.
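The enrichment towards chromosome ends can be quantified with a simple fraction. This is a rough sketch under stated assumptions: the function name is hypothetical, the 20% cutoff for a "terminal" region is arbitrary, and the chromosome length has to be supplied from the reference used for alignment.

```python
def fraction_in_terminal_regions(positions, chrom_length, terminal_fraction=0.2):
    """Fraction of SNP positions lying within `terminal_fraction` of the
    chromosome length from either end. Returns None for an empty list.

    If SNPs were spread uniformly, the expected value would be
    2 * terminal_fraction.
    """
    if not positions:
        return None
    edge = chrom_length * terminal_fraction
    near_ends = sum(1 for p in positions if p <= edge or p > chrom_length - edge)
    return near_ends / len(positions)
```

With the default cutoff, a value far above 0.4 on every chromosome would support the impression of end enrichment; running the same function on the normal samples gives a baseline for comparison.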
We asked the sequencing facility to repeat the sequencing and to reconstruct the library (from the same DNA, because the whole process of backcrossing and isolating recombinant F2s is laborious). The different runs on the same sample produced almost equally large numbers of SNPs, always enriched towards the ends of the chromosomes, but only approximately 60% of the SNPs are at exactly identical positions. The other 40% of SNPs were present in only one run or the other.
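The overlap between the two runs can be computed with plain set operations. This is a minimal sketch, assuming each call is reduced to a (chromosome, position, alternate allele) tuple; the function name and the returned keys are made up for the example. Because a figure such as "60% shared" depends on the denominator, the sketch reports the shared fraction relative to each run as well as to the union.

```python
def run_concordance(run_a, run_b):
    """Compare two SNP call sets given as iterables of
    (chrom, pos, alt_allele) tuples."""
    a, b = set(run_a), set(run_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # fraction of each run's calls that the other run reproduces
        "frac_of_a_shared": len(shared) / len(a) if a else None,
        "frac_of_b_shared": len(shared) / len(b) if b else None,
        # shared calls relative to all calls seen in either run
        "jaccard": len(shared) / len(union) if union else None,
    }
```

Applying the same comparison to two runs of a normal sample, if available, would show whether 60% concordance is unusual for this pipeline or simply reflects calls near the detection threshold.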
Additional information This is our lab's first attempt at this kind of bioinformatics analysis. We have asked Dr. Wolfgang Maier from the University of Freiburg for advice on this problem. He said: 'I have found no evidence of any problem I've seen before, like a mix-up during demultiplexing, a specific bad flowcell or run.'
Any suggestions about what may cause this unusual observation will be greatly appreciated.
Best regards, June Deng, Pocock Lab, Monash University, Australia