Hello all, I am working with Illumina sequencing data and trying to identify germline mutations that are relevant to the cancer. To filter out noise, I have applied the following criteria: more than 25% of the reads at a site must support the variant in both the tumor and the normal sample; known dbSNP variants are removed; synonymous mutations are removed (except those at splice sites); and any variant whose overall depth is less than 15 reads is removed. I then cross-referenced the remaining variants with COSMIC to see whether mutations have been reported in the same gene.
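A minimal sketch of how the filters above might be expressed, in case it helps to pin down exactly what is being asked. The `Variant` record and all of its field names are hypothetical and do not come from any particular variant caller. It also assumes that "overall depth" means the read depth at the site in each sample, which may not match the actual pipeline.

```python
from dataclasses import dataclass

MIN_ALT_FRACTION = 0.25  # variant must be supported by MORE than 25% of reads
MIN_DEPTH = 15           # variants with depth below 15 reads are dropped


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative only.
    gene: str
    tumor_alt_reads: int
    tumor_depth: int
    normal_alt_reads: int
    normal_depth: int
    in_dbsnp: bool
    is_synonymous: bool
    is_splice_site: bool


def alt_fraction(alt_reads: int, depth: int) -> float:
    """Fraction of reads supporting the variant (0.0 if there is no coverage)."""
    return alt_reads / depth if depth > 0 else 0.0


def passes_filters(v: Variant) -> bool:
    """Apply the depth, allele-fraction, dbSNP and synonymous filters."""
    # Assumption: the depth threshold is applied to each sample separately.
    if v.tumor_depth < MIN_DEPTH or v.normal_depth < MIN_DEPTH:
        return False
    # Germline candidates must be well supported in both tumor and normal.
    if alt_fraction(v.tumor_alt_reads, v.tumor_depth) <= MIN_ALT_FRACTION:
        return False
    if alt_fraction(v.normal_alt_reads, v.normal_depth) <= MIN_ALT_FRACTION:
        return False
    # Drop known dbSNP variants.
    if v.in_dbsnp:
        return False
    # Drop synonymous changes unless they fall on a splice site.
    if v.is_synonymous and not v.is_splice_site:
        return False
    return True
```

Usage would be something like `kept = [v for v in variants if passes_filters(v)]`, with the COSMIC lookup done afterwards on the genes of the surviving variants.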
Recurrence analysis (counting how many samples each gene is mutated in) shows that many genes are mutated across many of the samples I have pooled... but the list is still far too large to be useful. I need a way to filter it further to remove sequencing artifacts and other systematic errors. One thing I tried was using the synonymous mutations as a negative control: if a gene is relevant to the disease, then the number of samples with non-synonymous mutations in it should be significantly higher than the number of samples with synonymous mutations (assuming that the latter are due to chance alone). This filter did not work well for my purposes. Does anyone have suggestions for things I could try, papers I could read, or statistics I could use to help prioritize genes that are likely to be important? Any help at all would be greatly appreciated. Thanks again.
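For reference, here is a rough sketch of one way the recurrence count and the synonymous negative control described above could be computed. The post does not say which test was used, so the one-sided Fisher's exact test here is an assumption, and the function names and the `(sample_id, gene, is_synonymous)` input format are hypothetical. The test also treats the two counts as independent, which is only an approximation when both come from the same set of samples.

```python
from collections import defaultdict
from math import comb


def samples_per_gene(calls):
    """Recurrence: collect the set of samples mutated in each gene.

    `calls` is an iterable of (sample_id, gene, is_synonymous) tuples
    (a hypothetical input format). Returns two dicts mapping
    gene -> set of sample ids: non-synonymous first, then synonymous.
    """
    nonsyn = defaultdict(set)
    syn = defaultdict(set)
    for sample_id, gene, is_synonymous in calls:
        (syn if is_synonymous else nonsyn)[gene].add(sample_id)
    return nonsyn, syn


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the chance of
    seeing at least `a` in the top-left cell given the table margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / total


def gene_p_value(gene, nonsyn, syn, n_samples):
    """Compare samples with non-synonymous vs synonymous hits in one gene."""
    a = len(nonsyn.get(gene, ()))  # samples with a non-synonymous mutation
    c = len(syn.get(gene, ()))     # samples with a synonymous mutation
    return fisher_exact_greater(a, n_samples - a, c, n_samples - c)
```

Genes could then be ranked by `gene_p_value`, ideally with a multiple-testing correction across genes. Note that this sketch ignores gene length and the different numbers of possible synonymous and non-synonymous sites.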