Hi, I'm Francesco. This is my first post on your nice forum, so I suppose I should introduce myself. I'm a young MD with a strong interest in basic research, and I'm trying to move into the world of bioinformatics. I hope you can give me some insight into the right direction to pursue, and point me to bioinformatics tools I'm not aware of, to solve the following problem:
I got involved in a cancer research project in which we are currently using 7 cell lines. 4 of them are in common use, while the other 3 were derived by a colleague and have no genomic information available yet. For this reason we sent all of them for whole exome sequencing.
The data the company sent us are already well annotated: type of mutation, presence in the COSMIC or ClinVar databases, frequency in the 1000 Genomes database, SIFT, PolyPhen-2, VEST3, CADD and FATHMM scores, etc. Nevertheless, I'm still having difficulty making sense of it.
My first approach was to filter out all silent variants and all variants with a frequency greater than 1% in the 1000 Genomes database, in order to remove silent mutations and common polymorphisms. From there it is easy for me to guess that frameshift, stop_gained and start_loss mutations lead to a loss of gene function, but how do I deal with missense, inframe_indel and splice variants? Unfortunately, most of the variants are not present in the COSMIC database, and those that are have often only been flagged as somatic in other studies, or reported by others who analyzed the same cell lines, with no experimental confirmation or description of the role of the mutation.
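As a minimal sketch of the filtering described above (not the actual pipeline): each variant is assumed to be a dict, and the key names "consequence" and "af_1000g", the label "synonymous" for silent variants, and the function name triage are all hypothetical and would need to match whatever the annotation file really uses.

```python
# Hypothetical labels: adjust to the consequence terms in the real annotation.
SILENT = {"synonymous"}
LIKELY_LOF = {"frameshift", "stop_gained", "start_loss"}
NEEDS_REVIEW = {"missense", "inframe_indel", "splice_variant"}


def triage(variants, max_af=0.01):
    """Drop silent and common variants, then bin the rest by consequence.

    variants: iterable of dicts with a "consequence" key and an optional
    "af_1000g" key (allele frequency; missing or None = not in 1000 Genomes).
    """
    bins = {"likely_lof": [], "needs_review": [], "other": []}
    for v in variants:
        if v["consequence"] in SILENT:
            continue  # silent variants are removed up front
        af = v.get("af_1000g")
        if af is not None and af > max_af:
            continue  # frequency above 1%: treated as a polymorphism
        if v["consequence"] in LIKELY_LOF:
            bins["likely_lof"].append(v)
        elif v["consequence"] in NEEDS_REVIEW:
            bins["needs_review"].append(v)
        else:
            bins["other"].append(v)
    return bins
```

The "needs_review" bin is exactly the set of variants the question is about: the sketch only separates them out, it does not say anything about their functional impact.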
I also tried to use SIFT, CADD and other scores to evaluate the missense mutations, but I often find mutations that are predicted to damage the protein structure, and yet the exact same mutation is shared across several of my cell lines. Is that meaningful, or should I consider it an error in the sequencing/alignment process? Since I'm working on cell lines, I don't have matched normal samples to distinguish germline variants from somatic mutations, nor enough samples to use a frequentist approach to identify recurrently mutated genes.
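To list the variants that show up in more than one cell line, something along these lines could be used. It is only a sketch: variants are assumed to be identified by a (chrom, pos, ref, alt) tuple, and the function name shared_variants and the input layout are made up for illustration. It does not decide whether a shared variant is real or an artifact, it only collects them for inspection.

```python
from collections import defaultdict


def shared_variants(calls_by_line, min_lines=2):
    """Return variants seen in at least `min_lines` cell lines.

    calls_by_line: dict mapping a cell line name to an iterable of
    (chrom, pos, ref, alt) tuples (hypothetical layout).
    Result: dict mapping each shared variant to the sorted list of
    cell lines that carry it.
    """
    carriers = defaultdict(set)
    for line_name, variants in calls_by_line.items():
        for key in variants:
            carriers[key].add(line_name)
    return {
        key: sorted(lines)
        for key, lines in carriers.items()
        if len(lines) >= min_lines
    }
```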
1. How do I discern passenger mutations from important mutations? That question is especially important for oncogenes, since most gain-of-function mutations are missense mutations. The only approach that comes to my mind is to check every mutation by hand and verify whether it is located in a domain already described as being affected by gain-of-function mutations.
2. Can I predict CNVs with as few samples as 7 cell lines? I see that different approaches lead to different resolutions and limitations. The XHMM approach requires at least 50 samples to properly remove the noise from the read counts, while CoNIFER should work with as few as 8 samples. Are there other tools available that I should evaluate? Since CNV information is already available for 4 of my cell lines, can I use it to remove the read-count noise from my samples and then estimate the CNVs in the remaining 3 cell lines?
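To make the idea in question 2 more concrete, here is a rough sketch of what I have in mind. It is not what XHMM or CoNIFER do: it simply scales per-exon read counts by library size, and for each exon builds a baseline from the reference cell lines whose known copy number equals the assumed ploidy at that exon. The function names, the input layout and the default ploidy of 2 are all assumptions, and library-size scaling is itself a crude choice for samples that carry large CNVs.

```python
import math
from statistics import median


def scale_by_library_size(counts):
    """Convert raw per-exon read counts to fractions of the total."""
    total = sum(counts)
    return [c / total for c in counts]


def log2_ratios(test_counts, ref_counts, ref_copy_number, ploidy=2):
    """Per-exon log2 ratio of a test sample against a copy-neutral baseline.

    test_counts:     list of read counts per exon for the uncharacterized line
    ref_counts:      dict line name -> list of read counts per exon
    ref_copy_number: dict line name -> list of known copy number per exon
    Returns a list with one log2 ratio per exon, or None where no
    copy-neutral reference is available or the test sample has no reads.
    """
    test = scale_by_library_size(test_counts)
    refs = {name: scale_by_library_size(c) for name, c in ref_counts.items()}
    ratios = []
    for i, t in enumerate(test):
        # Only reference lines known to be copy-neutral at this exon
        baseline = [
            refs[name][i]
            for name in refs
            if ref_copy_number[name][i] == ploidy and refs[name][i] > 0
        ]
        if not baseline or t == 0:
            ratios.append(None)
        else:
            ratios.append(math.log2(t / median(baseline)))
    return ratios
```

With only 4 reference lines the baseline at each exon rests on very few values, which is exactly the limitation I am asking about.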
Thanks in advance for your help.