First of all, you can find the dataset I am referring to by going to http://gdac.broadinstitute.org/, clicking "Browse" under the Data column (for any dataset, but I am using HNSC), and downloading the file genome_wide_snp_6segmented_scna_minus_germline_cnv_hg19 under the heading "SNP6 CopyNum" in the pop-up that appears. This data has a column of segment means, which you can convert to copy numbers by computing 2 * 2^(segment mean), so a segment mean of 0 corresponds to a copy number of 2.
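For concreteness, here is a minimal Python sketch of that conversion. The function name is my own, and the input values are made up rather than taken from the file.

```python
def segment_mean_to_copy_number(segment_mean: float) -> float:
    """Apply the conversion 2 * 2^(segment mean)."""
    return 2 * 2 ** segment_mean


# Made-up segment means, chosen only to illustrate the formula.
for seg_mean in (-1.0, 0.0, 1.0):
    copy_number = segment_mean_to_copy_number(seg_mean)
    print(f"segment mean {seg_mean:+.1f} -> copy number {copy_number:.1f}")
```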
I am looking for regions of the genome that are amplified or deleted in cancer compared with normal tissue. I was under the impression that "minus germline" meant the data had already been standardized against normal tissue, so I averaged the tumour (TP, aka "01A") samples over the entire genome. However, for most of the genome, the averages show few significant departures from a copy number of 2. Have I done something wrong? More precisely:
Has this data already been standardized against normal tissue? That is, if a tumour sample has a copy number greater than 2, does this mean it is an amplification relative to normal tissue?
If not, how would I go about standardizing the values? Most of the tumour samples have matching normal samples, but I am unsure whether to subtract segment means, take a ratio of segment means, subtract copy numbers, or take a ratio of copy numbers. Different sources have done different things.
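To make the four options concrete, here is a small Python sketch with made-up segment means for one tumour/normal pair (not values from the dataset). It only shows how the candidates relate arithmetically: subtracting segment means gives the same number as taking log2 of the copy-number ratio, while the other two options give different values. It does not settle which one is appropriate. The function names are my own.

```python
import math


def to_copy_number(segment_mean: float) -> float:
    """Conversion used in this post: 2 * 2^(segment mean)."""
    return 2 * 2 ** segment_mean


def compare_options(tumour_seg_mean: float, normal_seg_mean: float) -> dict:
    """Compute the four candidate tumour-vs-normal comparisons."""
    tumour_cn = to_copy_number(tumour_seg_mean)
    normal_cn = to_copy_number(normal_seg_mean)
    return {
        "segment_mean_difference": tumour_seg_mean - normal_seg_mean,
        "segment_mean_ratio": tumour_seg_mean / normal_seg_mean,
        "copy_number_difference": tumour_cn - normal_cn,
        "copy_number_ratio": tumour_cn / normal_cn,
    }


# Hypothetical values for a single segment in a matched pair.
options = compare_options(tumour_seg_mean=1.0, normal_seg_mean=0.5)
for name, value in options.items():
    print(f"{name}: {value:.4f}")

# The difference of segment means equals log2 of the copy-number ratio.
print(math.log2(options["copy_number_ratio"]))
```

Note that the ratio of segment means is undefined whenever the normal segment mean is 0, which is the expected value for an unaltered region.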
I realize there is also the question of which amplification/deletion thresholds to use (usually 0.2 and -0.2), and comments and suggestions on that are appreciated as well, but my primary question is about the format of the data.
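For reference, this is the kind of thresholding I have in mind, as a minimal Python sketch. It assumes the 0.2 and -0.2 cut-offs are applied to the segment mean rather than to the copy number; the function name and the labels are hypothetical.

```python
def call_segment(segment_mean: float,
                 amp_threshold: float = 0.2,
                 del_threshold: float = -0.2) -> str:
    """Label a segment by comparing its segment mean to the cut-offs.

    Assumption: thresholds are on the segment-mean scale, not copy number.
    """
    if segment_mean > amp_threshold:
        return "amplified"
    if segment_mean < del_threshold:
        return "deleted"
    return "neutral"


# Made-up segment means, for illustration only.
for seg_mean in (0.5, 0.1, -0.3):
    print(seg_mean, call_segment(seg_mean))
```

Under the 2 * 2^(segment mean) conversion, these cut-offs correspond to copy numbers of about 2.3 and 1.7.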