I'm asking for help because I'm stuck. I am working on a project that aims to find somatic mutations in a certain tumor. To achieve this, whole-exome sequencing (WES) was performed on paired tumor-blood samples. I joined the project after the sequencing had been done, and my job was to analyse the data. After a while I was still working on it because, despite the apparently good quality of the experiment, something seemed wrong with the mutations I called.
After asking the lab head, I found out that the WES was not run with the blood and tumor samples paired. Instead, they did two separate runs: one with the blood samples from all patients and one with all the tumor DNA samples.
I guess the intrinsic bias of the machine between runs is relevant in this case, and I really do not know how to handle it. I suspect this is why many of the somatic mutations I called did not pass validation (qPCR and ddPCR).
Does anyone have tips on how to "clean" the data? Thank you so much.
Sounds like poor experimental design. Do you know if they at least processed all the samples identically, i.e. same DNA extraction kit and same library prep kit? Were the sequencing platforms and sequencing depths comparable between the groups? How did you call the mutations?
The libraries were prepared on the same day, despite being run at different times. All the reagents were the same. The sequencing depth is sometimes similar and sometimes different between the two groups, depending on the genomic region. I used VarScan to call tumor somatic mutations, after generating mpileups with samtools from bwa-generated BAM files.
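For readers following along, the calling steps described above could be sketched roughly as below. This is only an illustration, not the exact commands that were run: the file names, the `-q 1` mapping-quality cutoff, and the use of a combined normal/tumor mpileup with VarScan's `--mpileup 1` option are all assumptions. The helper functions only build the command lines, so nothing is executed until you pass them to `subprocess.run` yourself.

```python
import shlex
from typing import List


def mpileup_cmd(reference: str, normal_bam: str, tumor_bam: str) -> List[str]:
    """Build a samtools mpileup command for one blood/tumor pair.

    The normal (blood) BAM is listed first and the tumor BAM second,
    which is the order VarScan expects for a combined mpileup.
    """
    return [
        "samtools", "mpileup",
        "-f", reference,   # reference FASTA used for the alignment
        "-q", "1",         # assumption: skip reads with mapping quality 0
        normal_bam,
        tumor_bam,
    ]


def varscan_somatic_cmd(varscan_jar: str, mpileup: str, out_prefix: str) -> List[str]:
    """Build a VarScan 'somatic' command that reads a combined mpileup."""
    return [
        "java", "-jar", varscan_jar,
        "somatic", mpileup, out_prefix,
        "--mpileup", "1",  # input holds normal and tumor in one mpileup file
    ]


if __name__ == "__main__":
    # Hypothetical file names for a single patient.
    pile = mpileup_cmd("ref.fa", "patient1_blood.bam", "patient1_tumor.bam")
    call = varscan_somatic_cmd("VarScan.jar", "patient1.mpileup", "patient1")
    # The mpileup output would be redirected to patient1.mpileup in a shell.
    print(shlex.join(pile) + " > patient1.mpileup")
    print(shlex.join(call))
```

Building the commands per patient keeps the blood/tumor pairing explicit, which matters here because the two sample types came from different runs.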
Ah ok, so then the batch effect is probably smaller than I thought. Differences between runs are typically small; the most important thing is that they used the same kits for everything and did it on the same day. Did you use fpfilter from VarScan? It applies many heuristic filters that are recommended. You can also try different tools. VarScan is OK, but it is old and no longer maintained.
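To give an idea of what heuristic filtering of somatic calls looks like, here is a minimal sketch. It is not VarScan's fpfilter and does not reproduce its defaults: the `Call` record, its field names, and every threshold are hypothetical and chosen only for illustration. The depth check in both samples is included because the coverage reportedly differs between the blood and tumor runs in some regions.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One candidate somatic call with read counts (hypothetical layout)."""
    chrom: str
    pos: int
    tumor_ref_reads: int
    tumor_var_reads: int
    tumor_var_plus: int    # variant-supporting reads on the forward strand
    tumor_var_minus: int   # variant-supporting reads on the reverse strand
    normal_ref_reads: int
    normal_var_reads: int


def passes_filters(
    c: Call,
    min_depth: int = 20,                # illustrative, applied to both samples
    min_var_reads: int = 4,             # illustrative
    min_tumor_vaf: float = 0.05,        # illustrative
    max_normal_vaf: float = 0.02,       # illustrative
    min_strand_fraction: float = 0.01,  # illustrative
) -> bool:
    tumor_depth = c.tumor_ref_reads + c.tumor_var_reads
    normal_depth = c.normal_ref_reads + c.normal_var_reads

    # Require adequate coverage in BOTH samples, so that a variant is not
    # labelled somatic just because the blood sample was poorly covered.
    if tumor_depth < min_depth or normal_depth < min_depth:
        return False
    # Require enough supporting reads and a minimum allele fraction in tumor.
    if c.tumor_var_reads < min_var_reads:
        return False
    if c.tumor_var_reads / tumor_depth < min_tumor_vaf:
        return False
    # Reject calls with appreciable variant signal in the matched blood.
    if c.normal_var_reads / normal_depth > max_normal_vaf:
        return False
    # Reject calls supported by (almost) one strand only.
    stranded = c.tumor_var_plus + c.tumor_var_minus
    if stranded == 0:
        return False
    if min(c.tumor_var_plus, c.tumor_var_minus) / stranded < min_strand_fraction:
        return False
    return True


if __name__ == "__main__":
    example = Call("chr1", 100, 40, 10, 6, 4, 50, 0)
    print(passes_filters(example))
```

Thresholds like these should be tuned against the calls that already passed or failed ddPCR validation rather than taken at face value.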