Dear Community,
I would like to ask about possibly integrating the COSMIC database into a somatic filtering and annotation pipeline that I have developed for WES data. The pipeline processes paired samples from cancer patients and was briefly described in a previous post on a different question (https://www.biostars.org/p/320050/#320470).
Based on the COSMIC download section (https://cancer.sanger.ac.uk/cosmic/download), my questions are the following:
1) Which file type/format should I download, or which would be most appropriate? The txt file “COSMIC Mutation Data (Genome Screens)”? Or does the VCF file of all coding mutations (VCF/CosmicCodingMuts.vcf.gz) include more information?
2) Which filters should or could be applied to remove “putative benign” or germline variants? For example, the columns MUTATION_SOMATIC_STATUS and MUTATION_VERIFICATION_STATUS in the txt file above? Could some kind of variant frequency criterion also be applied?
3) Concerning the specific cancer studied: since the data analyzed are whole exome sequencing and the cancer is small cell lung cancer, would it be appropriate, and possible, to subset the data to keep only WES data with lung as the primary site?
4) As an alternative source, I have found another very useful database that contains WES data filtered from COSMIC, although it is based on an older version: https://www.cancerrxgene.org/downloads
Any suggestions or ideas would be greatly appreciated!
Best,
Efstathios
In general, all variants (external or internal) should share the same storage schema/method (variant, source, version, metadata). Otherwise, it will confuse developers/programmers. Once you have figured out storage, you also need to think about the matching logic and post-matching operations.
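
To make the storage and matching idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration rather than a COSMIC or pipeline API: the VariantRecord class, the build_index and annotate helpers, the chrom/pos/ref/alt matching key, and the placeholder values ("vX", "run1", "primary_site") are all hypothetical. It only shows one common record shape for every source, an exact-match lookup, and a simple post-matching filter.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Hypothetical matching key: (chromosome, position, ref allele, alt allele).
VariantKey = Tuple[str, int, str, str]


@dataclass
class VariantRecord:
    """One schema for every variant, whether external (e.g. COSMIC) or internal."""
    chrom: str
    pos: int
    ref: str
    alt: str
    source: str    # where the record came from
    version: str   # release or run identifier of that source
    metadata: Dict[str, str] = field(default_factory=dict)

    def key(self) -> VariantKey:
        # Normalise "chr1" vs "1" and allele case so both sides match consistently.
        chrom = self.chrom[3:] if self.chrom.lower().startswith("chr") else self.chrom
        return (chrom, self.pos, self.ref.upper(), self.alt.upper())


def build_index(records: Iterable[VariantRecord]) -> Dict[VariantKey, List[VariantRecord]]:
    """Group external records by matching key for fast lookup."""
    index: Dict[VariantKey, List[VariantRecord]] = {}
    for rec in records:
        index.setdefault(rec.key(), []).append(rec)
    return index


def annotate(calls: Iterable[VariantRecord],
             index: Dict[VariantKey, List[VariantRecord]]):
    """Matching step: pair each internal call with its external hits (possibly none)."""
    return [(call, index.get(call.key(), [])) for call in calls]


# Placeholder data, not real COSMIC content.
external = [
    VariantRecord("chr1", 100, "A", "T", source="COSMIC", version="vX",
                  metadata={"primary_site": "lung"}),
]
internal = [
    VariantRecord("1", 100, "a", "t", source="pipeline", version="run1"),
    VariantRecord("2", 200, "G", "C", source="pipeline", version="run1"),
]

annotated = annotate(internal, build_index(external))

# Post-matching operation: keep calls with at least one hit whose primary site is lung.
lung_hits = [
    call for call, hits in annotated
    if any(h.metadata.get("primary_site") == "lung" for h in hits)
]
```

Because every record carries its own source and version, the same index can hold several databases or releases at once, and the post-matching step can decide which of them to trust. Exact key matching is the simplest choice here; indels would likely need left-alignment/normalisation before the comparison.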