I am looking at the Level 3 CNV files on TCGA. I have a few questions:
I downloaded copy number variation (CNV) data from the TCGA database and mapped the genomic regions to gene symbols using this method (https://www.biostars.org/p/311199/#311746). I now have a matrix whose rows are genes and whose columns are samples.
If I want to use this data to cluster samples, how should I preprocess it? (P.S. For probes mapped to the same gene, I averaged their Segment_mean values. Is that right?)
You are referring to a post that I made. Where did you obtain the original data: from Broad FireBrowse (somatic copy number alterations), or did you download the original files from the GDC?
If you followed the data processing exactly as follows:
Part I - download segmented sCNA data for any TCGA cohort from Broad Institute's FireBrowse server and identify recurrent sCNA regions in these with GAIA
Part II - plot recurrent sCNA gains and losses from GAIA
Part III - annotate the recurrent sCNA regions (this post, just below)
Part IV - generate heatmap of recurrent sCNA regions over your cohort
Then, the statistically significant recurrent somatic copy number alterations (sCNA) are held in the *.igv.gistic files. You can extract statistically significant regions from this file and then pull out the original copy number over these on a per sample basis using GenomicRanges - the copy number that you take is indeed the segment mean from the original copy number program that was used (in the case of TCGA data, likely DNAcopy (R)).
If you do that, then you can build a matrix of:
statistically significant recurrent sCNAs in a group of patients as rows
patients as columns
Segment Mean over each region as the values
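To make the matrix layout concrete, here is a minimal sketch in plain Python (the workflow in the linked posts is in R, where GenomicRanges would do the overlap step). Everything in it is an assumption for illustration: the toy segments, the region names `gain_1` and `loss_2`, and the helper `region_by_sample_matrix` are hypothetical, and weighting each segment mean by its overlap width is one reasonable choice, not necessarily what was done in the original posts.

```python
from collections import defaultdict

# Hypothetical toy segments: (sample, chrom, start, end, segment_mean).
# The values are made up and only mimic a TCGA-style segment file.
segments = [
    ("P1", "1", 100, 500, 0.8),
    ("P1", "1", 501, 900, -0.2),
    ("P2", "1", 100, 900, 0.1),
    ("P1", "2", 50, 400, -1.0),
    ("P2", "2", 50, 400, -0.9),
]

# Hypothetical recurrent regions: (name, chrom, start, end), standing in
# for the statistically significant regions taken from the GAIA output.
regions = [
    ("gain_1", "1", 300, 700),
    ("loss_2", "2", 100, 300),
]


def region_by_sample_matrix(segments, regions):
    """Return (samples, matrix), where matrix[region][sample] is the
    overlap-width-weighted mean of the segment means, or None if the
    sample has no segment overlapping the region."""
    samples = sorted({seg[0] for seg in segments})
    by_key = defaultdict(list)
    for sample, chrom, start, end, seg_mean in segments:
        by_key[(sample, chrom)].append((start, end, seg_mean))

    matrix = {}
    for name, chrom, r_start, r_end in regions:
        row = {}
        for sample in samples:
            weighted_sum = 0.0
            covered = 0
            for start, end, seg_mean in by_key[(sample, chrom)]:
                # Inclusive coordinates: number of overlapping bases.
                width = min(end, r_end) - max(start, r_start) + 1
                if width > 0:
                    weighted_sum += seg_mean * width
                    covered += width
            row[sample] = weighted_sum / covered if covered else None
        matrix[name] = row
    return samples, matrix


samples, matrix = region_by_sample_matrix(segments, regions)
for name, row in matrix.items():
    print(name, [row[s] for s in samples])
```

Regions end up as rows and patients as columns. A sample with no overlapping segment gets `None`, which you would need to handle (drop or impute) before clustering.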
With that, I generated this and identified clusters of patients based on recurrent sCNA via Partitioning Around Medoids (PAM):
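This is not the code behind that result, just a minimal sketch of what PAM does on such a matrix, written in plain Python with made-up values. In R you would more likely call `pam()` from the `cluster` package. The toy matrix, the choice of Euclidean distance, and `k = 2` are all assumptions for illustration.

```python
import math

# Hypothetical region-by-patient matrix of segment means (made-up values):
# rows are recurrent sCNA regions, columns are patients P1..P6.
patients = ["P1", "P2", "P3", "P4", "P5", "P6"]
matrix = [
    [0.8, 0.9, 0.7, -0.1, 0.0, 0.1],      # e.g. a recurrent gain
    [-0.1, 0.0, -0.2, -1.0, -0.9, -1.1],  # e.g. a recurrent loss
]

# Patients are clustered, so transpose: one profile vector per patient.
profiles = list(zip(*matrix))


def pam(points, k):
    """Partitioning Around Medoids (BUILD then SWAP) with Euclidean
    distance. Returns (medoid_indices, labels)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    def cost(medoids):
        # Sum of each point's distance to its nearest medoid.
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: cost(medoids + [c])))

    # SWAP: apply the best medoid/non-medoid exchange until none helps.
    best = cost(medoids)
    while True:
        swaps = [
            medoids[:pos] + [cand] + medoids[pos + 1:]
            for pos in range(k)
            for cand in range(n)
            if cand not in medoids
        ]
        if not swaps:
            break
        trial = min(swaps, key=cost)
        if cost(trial) >= best:
            break
        medoids, best = trial, cost(trial)

    # Assign every point to its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels


medoids, labels = pam(profiles, k=2)
for name, label in zip(patients, labels):
    print(name, "cluster", label)
```

Unlike k-means, the cluster centres here are actual patients (the medoids), which makes the clusters easier to interpret and somewhat less sensitive to outlying segment means. This brute-force version recomputes the cost for every candidate swap, so it is only suitable for small cohorts.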