Drilling down to the right samples in COSMIC
6 weeks ago
Vincent Laufer

Hello Biostars - Thank you for all the help lately - just one more question.

If I navigate to the COSMIC page, there are files containing the ~60 SBS loadings for individual cancer patients, organized by cancer type. These can be obtained by downloading, for instance, "SigProfilier_PCAWG_WGS_probabilities_SBS.csv", which is a flat file.

Each row is one patient's mutation rate for a given trinucleotide context, while the columns are the SBS signatures, like this:

Sample    Cancer Type      Mutation Type  Mutation Subtype  SBS1       SBS2        SBS3  SBS4
SP117655  Biliary-AdenoCA  C>A            ACA               0.0045447  2.58E-06    0     0
SP117655  Biliary-AdenoCA  C>A            ACC               0.022974   0.0012906   0     0
SP117655  Biliary-AdenoCA  C>A            ACG               0.0083704  0.002148    0     0
SP117655  Biliary-AdenoCA  C>A            ACT               0.012359   0.00081708  0     0
SP117655  Biliary-AdenoCA  C>G            ACA               0.019838   4.12E-15    0     0
SP117655  Biliary-AdenoCA  C>G            ACC               0.019084   0.0018116   0     0
SP117655  Biliary-AdenoCA  C>G            ACG               0.0069102  0.00079127  0     0
SP117655  Biliary-AdenoCA  C>G            ACT               0.010542   0.00072964  0     0
SP117655  Biliary-AdenoCA  C>T            ACA               0.12931    0.00027331  0     0
SP117655  Biliary-AdenoCA  C>T            ACC               0.059811   0.011297    0     0
SP117655  Biliary-AdenoCA  C>T            ACG               0.97484    7.57E-05    0     0
SP117655  Biliary-AdenoCA  C>T            ACT               0.065793   0.011098    0     0
SP117655  Biliary-AdenoCA  T>A            ATA               0.016473   0.0017114   0     0
SP117655  Biliary-AdenoCA  T>A            ATC               0.056402   0.0192      0     0
SP117655  Biliary-AdenoCA  T>A            ATG               0.020153   0.00039076  0     0
SP117655  Biliary-AdenoCA  T>A            ATT               0.0024127  5.07E-15    0     0
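
For reference, here is a minimal sketch of how I am reading the file and pulling out the samples for one cancer type. It assumes the file is comma-delimited with the four ID columns shown above followed by one column per signature; the helper names (load_probabilities, samples_for_cancer_type) are my own, and the inline text is just the first three rows of the excerpt standing in for the real download.

```python
import io

import pandas as pd

# Assumed layout: four ID columns, then one column per SBS signature.
ID_COLS = ["Sample", "Cancer Type", "Mutation Type", "Mutation Subtype"]


def load_probabilities(source):
    """Read the probabilities flat file (a path or file-like object)."""
    return pd.read_csv(source)


def samples_for_cancer_type(df, cancer_type):
    """Return the sorted unique sample IDs labelled with one cancer type."""
    mask = df["Cancer Type"] == cancer_type
    return sorted(df.loc[mask, "Sample"].unique())


# Stand-in for the real file: the first three rows of the excerpt above.
demo_csv = io.StringIO(
    "Sample,Cancer Type,Mutation Type,Mutation Subtype,SBS1,SBS2,SBS3,SBS4\n"
    "SP117655,Biliary-AdenoCA,C>A,ACA,0.0045447,2.58E-06,0,0\n"
    "SP117655,Biliary-AdenoCA,C>A,ACC,0.022974,0.0012906,0,0\n"
    "SP117655,Biliary-AdenoCA,C>A,ACG,0.0083704,0.002148,0,0\n"
)

demo = load_probabilities(demo_csv)
# Every column that is not an ID column is treated as a signature column.
sbs_cols = [c for c in demo.columns if c not in ID_COLS]
biliary = samples_for_cancer_type(demo, "Biliary-AdenoCA")
print(len(biliary), "sample(s);", len(sbs_cols), "signature column(s)")
```

With the real file, passing its path to load_probabilities in place of demo_csv should give the full list of biliary samples.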


The Sample column identifies the individual patient; one can readily see that this patient has biliary adenocarcinoma. OK, finally, here are the questions:

1) Biliary adenocarcinoma is a good start. But is there any way to drill down into these samples further? For instance, what would be the quickest way to separate the ~35 biliary adenocarcinoma patients in this file into subcategories such as IDH1+, IDH2+, FGFR2-fusion+, etc.? I feel sure this must be possible. I'd prefer an annotated, metadata-like file, but if need be, I could probably download the raw data itself and figure out the drivers from that.

Is anyone familiar enough with this site to know a quick way to do it?
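
To make question 1 concrete, this is roughly what I would do once I had such a file. It is only a sketch: the annotation table, its "Driver" column, the sample IDs, and the values are all made-up placeholders, and I do not know whether COSMIC offers anything in this shape.

```python
import pandas as pd


def add_driver_subtype(probs, annotations):
    """Left-join one driver label per sample onto the probabilities table.

    Samples missing from the annotation table are labelled 'unannotated'.
    """
    merged = probs.merge(
        annotations, on="Sample", how="left", validate="many_to_one"
    )
    merged["Driver"] = merged["Driver"].fillna("unannotated")
    return merged


# Placeholder probabilities table (hypothetical IDs and values).
probs = pd.DataFrame(
    {
        "Sample": ["SAMPLE_A", "SAMPLE_A", "SAMPLE_B", "SAMPLE_C"],
        "Cancer Type": ["Biliary-AdenoCA"] * 4,
        "Mutation Type": ["C>A", "C>T", "C>A", "C>A"],
        "Mutation Subtype": ["ACA", "ACA", "ACA", "ACA"],
        "SBS1": [0.1, 0.2, 0.3, 0.4],
    }
)

# Hypothetical metadata file: one row per sample with a driver label.
annotations = pd.DataFrame(
    {
        "Sample": ["SAMPLE_A", "SAMPLE_B"],
        "Driver": ["IDH1+", "FGFR2-fusion+"],
    }
)

labelled = add_driver_subtype(probs, annotations)
# Number of distinct patients in each driver subcategory.
patients_per_driver = labelled.groupby("Driver")["Sample"].nunique()
print(patients_per_driver)
```

The left join keeps every row of the probabilities table, so patients without a driver call stay visible instead of silently dropping out.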

2) I imagine adjusting for these signature loadings is just like adjusting for loadings of other kinds, e.g. principal components. But I wanted to ask: are there any pitfalls or idiosyncratic differences to be aware of? For example, do I need to match for gender? Alternative splicing differs between sexes in Drosophila, some of these cancers have dysregulated splicing, and so on. I just want to avoid making any mistakes.

