Using the COSMIC database in a somatic variant annotation and filtering pipeline for WES cancer data
3.3 years ago
svlachavas ▴ 750

Dear Community,

I would like to ask about incorporating the COSMIC database into a somatic filtering and annotation pipeline that we have developed for WES data. The pipeline processes paired samples from WES data of cancer patients, and was briefly described in a previous post on a different question (https://www.biostars.org/p/320050/#320470).

1) Which file type/format should I download, or which would be more appropriate? The txt-format “COSMIC Mutation Data (Genome Screens)”? Or does the VCF file of all coding mutations (VCF/CosmicCodingMuts.vcf.gz) include more information?

2) Which filters should or could be applied to remove “putatively benign” or germline variants? For example, the columns MUTATION_SOMATIC_STATUS and MUTATION_VERIFICATION_STATUS in the txt file above? Could some kind of variant frequency filter also be applied?

3) Concerning the specific cancer under study: as the data analyzed are whole exome sequencing, and the cancer is small cell lung cancer, would it be appropriate, and also possible, to subset the COSMIC data to keep only WES data and only records with lung as the primary site?

4) As an alternative source for my purpose, I have found another very useful database that contains WES data filtered from COSMIC; however, it is based on an older version: https://www.cancerrxgene.org/downloads

Any suggestions or ideas would be greatly appreciated!

Best,

Efstathios

somatic variants COSMIC variant annotation WES • 2.5k views

In general, all variants (external or internal) should share the same schema/method of storage (variant, source, version, metadata). Otherwise, it will cause confusion for developers/programmers. Once you have settled on storage, you also need to think about the matching logic and post-matching operations.
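As a rough illustration of the shared schema and matching step mentioned above, here is a minimal Python sketch. The class name `VariantRecord`, the function `match_variants`, and the choice of (chromosome, position, ref, alt) as the matching key are assumptions made for this example, not something prescribed by COSMIC or by the pipeline in question.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple

# Hypothetical matching key: (chromosome without "chr" prefix, position, ref, alt).
VariantKey = Tuple[str, int, str, str]


@dataclass(frozen=True)
class VariantRecord:
    """One variant stored with the same fields whatever its origin."""
    chrom: str
    pos: int
    ref: str
    alt: str
    source: str    # e.g. "internal" or "COSMIC"
    version: str   # pipeline run or database release
    metadata: Dict[str, str] = field(default_factory=dict, compare=False, hash=False)

    @property
    def key(self) -> VariantKey:
        # Normalise the chromosome name so "chr7" and "7" match each other.
        chrom = self.chrom[3:] if self.chrom.lower().startswith("chr") else self.chrom
        return (chrom, self.pos, self.ref.upper(), self.alt.upper())


def match_variants(internal: Iterable[VariantRecord],
                   external: Iterable[VariantRecord]) -> Dict[VariantKey, List[VariantRecord]]:
    """Return, for each internal variant found externally, the external records."""
    index: Dict[VariantKey, List[VariantRecord]] = {}
    for rec in external:
        index.setdefault(rec.key, []).append(rec)
    return {rec.key: index[rec.key] for rec in internal if rec.key in index}
```

Post-matching operations (for example, copying the external `metadata` onto the internal record) can then work on the returned dictionary without caring which database a record came from.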

3.3 years ago

1) Which file type/format should I download, or which would be more appropriate? The txt-format “COSMIC Mutation Data (Genome Screens)”? Or does the VCF file of all coding mutations (VCF/CosmicCodingMuts.vcf.gz) include more information?

It depends on what data you want to use. CosmicCodingMuts.vcf.gz contains only coding mutations. I would obtain the COSMIC Mutation Data (Genome Screens) file because it contains variants in both coding and non-coding regions.

2) Which filters should or could be applied to remove “putatively benign” or germline variants? For example, the columns MUTATION_SOMATIC_STATUS and MUTATION_VERIFICATION_STATUS in the txt file above? Could some kind of variant frequency filter also be applied?

I do not believe COSMIC explicitly defines anything as benign; however, what you can select are variants that are definitively somatic variants. Here is what they say for MUTATION_SOMATIC_STATUS:

How do you define Mutation somatic status?

• Confirmed Somatic: The variant allele from the tumour sample differs from the germ-line alleles of the same individual who provided the tumour sample.

• Previously Reported: There is no germ-line allele information provided for the tumour sample for the same individual, but the same variant has been found to be 'Confirmed Somatic' variant in a normal-tumour sample pair from another patient. Please, note that the same variant from multiple samples from the same patient should always get the same somatic status, because all the samples share the same germ-line alleles in individuals who are not genetic mosaics.

• Variant of unknown origin: There is no information provided on the germ-line alleles in the data source to help determine if it is either a germ-line or somatic variant.

[source: https://cancer.sanger.ac.uk/cosmic/help/faq]

By selecting variants in the first two categories, you largely exclude germline variants from your dataset, but you do not exclude benign ones. Keep in mind the following situation: a particular base has 2 different alleles in Caucasians and Asians. An Asian develops cancer and, in the tumour, his/her base is mutated into the allele that is found in Caucasians. So, whilst this is definitively a somatic mutation in this individual, it would be seen as benign in the Caucasian population. So, even a confirmed somatic status is not evidence that a variant is in any way pathogenic.
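A minimal sketch of that selection in Python, assuming the txt download is tab-separated and has a column named MUTATION_SOMATIC_STATUS as in the question. The status strings below are taken from the FAQ headings; the actual values in a given COSMIC release may be worded differently, so check the file and adjust `SOMATIC_STATUSES` accordingly. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Set

# Assumption: status labels as worded in the FAQ, compared case-insensitively.
SOMATIC_STATUSES: Set[str] = {"confirmed somatic", "previously reported"}


def read_tsv(handle) -> Iterator[Dict[str, str]]:
    """Yield each row of a tab-separated file as a dict keyed by column name."""
    return csv.DictReader(handle, delimiter="\t")


def keep_somatic(rows: Iterable[Dict[str, str]],
                 column: str = "MUTATION_SOMATIC_STATUS",
                 allowed: Set[str] = SOMATIC_STATUSES) -> Iterator[Dict[str, str]]:
    """Keep only rows whose somatic status is in the allowed set."""
    for row in rows:
        if (row.get(column) or "").strip().lower() in allowed:
            yield row


if __name__ == "__main__":
    # Tiny in-memory example; GENE and its values are placeholders.
    example = ("GENE\tMUTATION_SOMATIC_STATUS\n"
               "A\tConfirmed Somatic\n"
               "B\tVariant of unknown origin\n")
    for row in keep_somatic(read_tsv(io.StringIO(example))):
        print(row["GENE"])
```

For the real file, open it in text mode (via `gzip.open(path, "rt")` if it is compressed) and pass the handle to `read_tsv`.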

3) Concerning the specific cancer under study: as the data analyzed are whole exome sequencing, and the cancer is small cell lung cancer, would it be appropriate, and also possible, to subset the COSMIC data to keep only WES data and only records with lung as the primary site?

Sure, but WES data also cover the 5' and 3' UTR regions, so I would still look at the COSMIC genomic variant listing.
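If you do decide to subset, it could look something like the sketch below. The column names `PRIMARY_SITE` and `GENOME_WIDE_SCREEN` and the value `"y"` are assumptions for illustration only; use whatever the header of your downloaded file actually contains.

```python
from typing import Dict, Iterable, List, Optional


def subset_by_site(rows: Iterable[Dict[str, str]],
                   site: str = "lung",
                   site_column: str = "PRIMARY_SITE",
                   screen_column: Optional[str] = None,
                   screen_value: str = "y") -> List[Dict[str, str]]:
    """Keep rows for one primary site, optionally also requiring a screen flag."""
    kept = []
    for row in rows:
        # Case-insensitive match on the primary site.
        if (row.get(site_column) or "").strip().lower() != site.lower():
            continue
        # Optional second condition, e.g. a genome-wide/exome screen flag.
        if screen_column is not None:
            if (row.get(screen_column) or "").strip().lower() != screen_value.lower():
                continue
        kept.append(row)
    return kept
```

Comparing `len(kept)` with the total row count before filtering shows how much of the database such a subset would discard.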

4) As an alternative source for my purpose, I have found another very useful database that contains WES data filtered from COSMIC; however, it is based on an older version: https://www.cancerrxgene.org/downloads

That looks good, and it is a collaboration that includes the institute where COSMIC was developed (Wellcome Trust Sanger Institute); so, please feel free to use it.

Kevin


Dear Kevin,

1) So, in your opinion, even though I have WES data, I should download the COSMIC Mutation Data, for reasons such as the 5' and 3' UTR regions you mention? Our pipeline keeps only coding variants before the COSMIC step; would that be a problem? Or, as we want to focus only on coding variants, would CosmicCodingMuts.vcf.gz be the better choice?

2) Thank you also for the suggestions concerning MUTATION_SOMATIC_STATUS and MUTATION_VERIFICATION_STATUS:

so, even in the scenario you described, these columns would still be a basic filter to consider for a putative somatic variant?

For example, would I simply keep only those variants that have a specific value in these 2 columns?

3) In conjunction with my previous question:

what is your opinion on the "tissue filtering question"? Should I subset and focus on variants with lung as the primary site, in order to make my analysis more focused?

4) Finally, regarding the alternative source I included in my post:

the downloaded txt file with the "List of genomic variants found in Cell lines by whole exome sequencing" states:

This data is provided from the COSMIC database (http://cancer.sanger.ac.uk), reflecting v71 released Sept 2014.

SAMPLE             Sample identifier provided in study
COSMIC_ID          Unique numerical identifier for the cell lines used in COSMIC
Cancer Type        TCGA tissue classification
Gene               Gene name from Ensembl version 56
Transcript         Transcript identifier from Ensembl version 56
cDNA               Variant position and nucleotide change relating to the cDNA
AA                 Amino acid position and alteration
Classification     Summary of variant type
Gene List          Identifier for variants within the 470 genes used in the study
Recurrence Filter  Based on frequency observed in COSMIC (v68) (Recurrence filter: see Extended Experimental Procedures) plus fusion gene data
Subs               Missense/substitution variants occurring in codons mutated in the systematic screen data in COSMIC (v68) (select >=3)
Truncating         Truncating variant count from the systematic screen data in COSMIC (v68) (select >10)
inframe            Inframe indel alterations occurring in codons mutated in the systematic screen data in COSMIC (v68) (select >=3)
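
One possible reading of the "select" thresholds in the field list above, sketched in Python. It assumes that the Subs, Truncating and inframe columns hold integer counts and that a variant passes if any one threshold is met; neither assumption is confirmed by the file description, so treat this as an illustration rather than the filter used in the original study. The function name is hypothetical.

```python
from typing import Dict


def passes_recurrence(row: Dict[str, str]) -> bool:
    """Apply the thresholds quoted in the field list (assumed to be OR-combined)."""

    def count(column: str) -> int:
        # Missing or non-numeric entries are treated as a count of zero.
        try:
            return int(row.get(column) or 0)
        except ValueError:
            return 0

    return (count("Subs") >= 3          # "select >=3"
            or count("Truncating") > 10  # "select >10"
            or count("inframe") >= 3)    # "select >=3"
```

Rows read with `csv.DictReader(handle, delimiter="\t")` can be passed to this function directly, assuming the file is tab-separated.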


Overall, do you believe that this file could be used for filtering purposes, even though it derives from an older version of COSMIC (v68), given the specific filters it includes (such as the Recurrence Filter mentioned above) and the fact that it contains only WES data? This despite the COSMIC version dating from 2014?

Best,

Efstathios


1) So, in your opinion, even though I have WES data, I should download the COSMIC Mutation Data, for reasons such as the 5' and 3' UTR regions you mention? Our pipeline keeps only coding variants before the COSMIC step; would that be a problem? Or, as we want to focus only on coding variants, would CosmicCodingMuts.vcf.gz be the better choice?

Yes, if you are filtering your own data to include just coding variants, then you should use the CosmicCodingMuts.vcf.gz file.
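As a rough sketch of how such a VCF could be used for annotation, the Python below collects (chromosome, position, ref, alt) keys from a VCF and intersects them with keys from your own calls. It assumes both files use the same genome build and that indels are represented identically in both; the function names are hypothetical and a dedicated tool may be preferable in production.

```python
import gzip
from typing import Iterable, Set, Tuple

VariantKey = Tuple[str, int, str, str]


def vcf_keys(lines: Iterable[str]) -> Set[VariantKey]:
    """Collect (chrom, pos, ref, alt) keys from VCF text lines."""
    keys: Set[VariantKey] = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        chrom, pos, _id, ref, alts = fields[:5]
        # Normalise the chromosome name so "chr1" and "1" compare equal.
        chrom = chrom[3:] if chrom.lower().startswith("chr") else chrom
        for alt in alts.split(","):  # one key per alternate allele
            keys.add((chrom, int(pos), ref, alt))
    return keys


def load_vcf_keys(path: str) -> Set[VariantKey]:
    """Read a plain or gzip-compressed VCF from disk."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return vcf_keys(handle)


def in_cosmic(own: Set[VariantKey], cosmic: Set[VariantKey]) -> Set[VariantKey]:
    """Return the subset of your own variants that also appear in COSMIC."""
    return own & cosmic
```

For example, `in_cosmic(load_vcf_keys("my_calls.vcf"), load_vcf_keys("CosmicCodingMuts.vcf.gz"))`, where `my_calls.vcf` is a placeholder for your own file.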

2) Thank you also for the suggestions concerning MUTATION_SOMATIC_STATUS and MUTATION_VERIFICATION_STATUS:

so, even in the scenario you described, these columns would still be a basic filter to consider for a putative somatic variant?

For example, would I simply keep only those variants that have a specific value in these 2 columns?

Yes, by including variants with the first 2 categories (Confirmed Somatic; Previously Reported), you are definitively including just somatic mutations. However, some of the third category (Variant of unknown origin) variants may still be somatic - there is no way to know.

3) In conjunction with my previous question:

what is your opinion on the "tissue filtering question"? Should I subset and focus on variants with lung as the primary site, in order to make my analysis more focused?

This question is not really important, from my perspective... but I do not know the exact analysis that you are doing. Certain mutations will have greater importance in certain tissues, such as BRCA1 mutations and germline variants, which are 'felt' more in tissue that responds to oestrogen because such tissues suffer more DNA double-strand breaks than usual.

This appears to be more of a question for yourself and/or your supervisor...

4) Finally, regarding the alternative source I included in my post:

the downloaded txt file with the "List of genomic variants found in Cell lines by whole exome sequencing" states:

This data is provided from the COSMIC database (http://cancer.sanger.ac.uk), reflecting v71 released Sept 2014.

Overall, do you believe that this file could be used for filtering purposes, even though it derives from an older version of COSMIC (v68), given the specific filters it includes (such as the Recurrence Filter mentioned above) and the fact that it contains only WES data? This despite the COSMIC version dating from 2014?

It's not a major issue to use an older version - you just have to state clearly which version you are using by mentioning the version number. If you are worried about this, then just download the most up-to-date data from the COSMIC website.


Thanks Kevin for the updates. Concerning my current project and the specific goal of the analysis:

for an initial set of 3 patients, we have whole exome sequencing data (small cell lung cancer; genomic DNA captured using Agilent in-solution enrichment methodology; paired-end 75-base massively parallel sequencing on an Illumina HiSeq4000) from CTCs (circulating tumor cells) and also exome sequencing data from biopsies of the same patients (so 2 tumor samples for each patient). Moreover, both the biopsies and the circulating tumor cells were isolated at the same timepoint, at diagnosis, when the tumor had already spread due to its "specific nature", so in neither case is it definitely primary tumor. I have both FASTQ files and BAM files for each patient.

Thus, as we have biopsy, circulating tumor cell and normal samples from the same patients, we would like to perform a comparison and identify putative "common mutational patterns" between these 2 types of tumor sample. This would validate the isolation/extraction protocol for the CTCs, as they would capture the important "biology" of the biopsy without being an invasive technique.

That was my reason for asking specifically whether it would be more accurate to focus on lung as the primary tissue in the COSMIC database, to make the resulting variants more specific.
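
To make the intended biopsy-versus-CTC comparison concrete, here is a minimal Python sketch that, per patient, computes the shared and sample-specific somatic calls and a Jaccard index as one possible measure of concordance. The variant keys, the function name and the use of the Jaccard index are assumptions for illustration; they are not part of the pipeline described above.

```python
from typing import Dict, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def compare_calls(biopsy: Set[VariantKey], ctc: Set[VariantKey]) -> Dict[str, object]:
    """Summarise the overlap between two somatic call sets from one patient."""
    shared = biopsy & ctc
    union = biopsy | ctc
    return {
        "shared": shared,
        "biopsy_only": biopsy - ctc,
        "ctc_only": ctc - biopsy,
        # Jaccard index: shared calls divided by all distinct calls.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Made-up coordinates, for demonstration only.
    biopsy = {("1", 100, "A", "T"), ("2", 200, "C", "G"), ("3", 300, "G", "A")}
    ctc = {("1", 100, "A", "T"), ("2", 200, "C", "G"), ("4", 400, "T", "C")}
    summary = compare_calls(biopsy, ctc)
    print(len(summary["shared"]), summary["jaccard"])
```

The shared set could then be intersected with COSMIC entries, with or without a lung primary site restriction, to see whether the restriction changes which shared variants are retained.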