Question

TCGA CNV data - immensely different results depending on data source

4.7 years ago

JJ ▴ 670

Dear all,

I am looking into genes of interest affected by CNVs using TCGA data. I am very confused about the immensely different results I get depending on the data source I use:

The GDC data portal (also available via the TCGAbiolinks R package) provides a simple data.frame (genes / patients, with -1 for losses, 0 for no change and 1 for gains). This is how GDC CNV data was computed. This is hg19. Here, I tend to get VERY few CNVs.

The Xena browser provides gistic2 thresholded files, which again is a simple table (genes / patients with -2,-1,0,1,2, for homozygous deletion, single copy deletion, diploid normal copy, low-level copy number amplification, or high-level copy number amplification). This is, however, hg18. Here, I get a lot of CNVs.

Finally, when I manually intersect the Masked Copy Number Segment file (GDC data, but at the CNV segment level, downloaded via the TCGAbiolinks R package) with gene annotations and apply the same noise cutoff as suggested in the link above, I tend to get slightly fewer CNVs than from the Xena data but still many more than stated on the GDC portal. This is hg19.

So I am confused. Is the GDC gene level data differently computed? Or are these just homozygous losses / high-level copy number amplification? I very much appreciate input as I do not know which data to use.
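One way to probe the "only homozygous losses / high-level amplifications" idea is to collapse the five-level GISTIC2 calls to three levels in two different ways and see which altered fraction lands closer to the GDC numbers. The sketch below is only an illustration: the function names and the toy calls are made up, and it assumes the thresholded values are exactly -2, -1, 0, 1, 2 as described above.

```python
from collections import Counter


def collapse(call, deep_only=False):
    """Map a GISTIC2 thresholded call (-2..2) to -1 / 0 / 1.

    deep_only=False: any non-zero call counts as a loss or gain.
    deep_only=True:  only -2 and +2 count; -1 and +1 become 0.
    """
    if call not in (-2, -1, 0, 1, 2):
        raise ValueError(f"unexpected thresholded call: {call!r}")
    if deep_only and abs(call) < 2:
        return 0
    return (call > 0) - (call < 0)  # sign of the call


def altered_fraction(calls, deep_only=False):
    """Fraction of patients with a loss or gain for one gene."""
    if not calls:
        raise ValueError("no calls given")
    counts = Counter(collapse(c, deep_only) for c in calls)
    return (counts[-1] + counts[1]) / len(calls)


# Made-up calls for one gene across 10 patients (not real TCGA data).
toy_calls = [-1, -1, 0, 1, -2, 0, -1, 1, 0, -1]
any_level = altered_fraction(toy_calls)
deep = altered_fraction(toy_calls, deep_only=True)
print(f"any level: {any_level:.0%}, deep only: {deep:.0%}")
```

If the deep-only fraction computed from the Xena table is close to the few percent seen in the GDC table, that would support the hypothesis; if not, the difference probably lies elsewhere in the pipeline.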

Thanks so much!

genome CNV TCGA • 3.7k views

score 2 · Answer 1 · 2019-08-19

4.7 years ago

Kevin Blighe 87k

From what I understand, the data on the GDC itself (which TCGAbiolinks uses) is just the segmented calls that have been made via Circular Binary Segmentation (CBS) using the DNAcopy R package. Try to think of it as a pseudo-raw form of copy number (technically, it is just that, because the calls are made by observing the probe intensities from the microarray chip that was used).

GISTIC 2.0 can then be applied to this segmented data in order to produce a more summarised format. Other filtering occurs for, e.g., germline CNVs.

Keep in mind that the TCGA has been re-analysing much of their data in order to 'harmonize' it based on new methods and genome references.

---------------------------

I would not necessarily expect overlap between the data from any of these sources. There is zero / no regulation in bioinformatics, and copy number calling algorithms in particular exhibit much disagreement.
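To put a number on how much two sources disagree, a per-gene concordance over the shared samples is a simple starting point. This is a minimal sketch with hypothetical names and toy data; it assumes both tables have already been collapsed to -1 / 0 / 1 and that sample identifiers have been harmonised between the sources beforehand.

```python
def concordance(calls_a, calls_b):
    """Agreement between two sources for one gene.

    calls_a, calls_b: dicts mapping sample ID -> call in {-1, 0, 1}.
    Samples present in only one source are ignored.
    Returns (fraction of shared samples with identical calls,
             number of shared samples).
    """
    shared = sorted(set(calls_a) & set(calls_b))
    if not shared:
        raise ValueError("no samples in common")
    agree = sum(calls_a[s] == calls_b[s] for s in shared)
    return agree / len(shared), len(shared)


# Made-up calls for one gene from two hypothetical sources.
source_a = {"P1": -1, "P2": 0, "P3": 1, "P4": 0, "P5": -1}
source_b = {"P1": -1, "P2": -1, "P3": 1, "P4": 0, "P6": 1}

fraction, n_shared = concordance(source_a, source_b)
print(f"{fraction:.0%} agreement over {n_shared} shared samples")
```

Reporting the number of shared samples alongside the fraction matters, because a high agreement over a handful of samples says little.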

Personally, for copy number TCGA data, I just take the data from Broad Firebrowse: http://firebrowse.org/

There is also a very rough pipeline for it, here: A: How to extract the list of genes from TCGA CNV data

Finally, some useful DOCs for you:

Kevin

Thanks for your input!!

> From what I understand, the data on the GDC itself (which TCGAbiolinks uses) is just the segmented calls that have been called via Circular Binary Segmentation. Try to think of it as a pseudo-raw form of copy number (technically, it is just that because the calls are made by observing the probe intensities from the microarray chip that was used).
>
> GISTIC 2.0 is then applied to this segmented data in order to produce a more summarised format. Other filtering occurs for, e.g., germline CNVs.

Yes - these are then the Masked Copy Number Segment files. And then they use these files to compute the gene-level data.

> I would not necessarily expect overlap between the data from any of these sources. There is zero / no regulation in bioinformatics, and copy number calling algorithms in particular exhibit much disagreement.

I absolutely agree, but what bugs me is this: if I take the GDC Masked Copy Number Segment files and overlap them with gene annotations, as you have done in PART III (A: How to extract the list of genes from TCGA CNV data), I get completely different results from the GDC gene-level data, even though this is the same data source. I have not applied the other steps you describe, as I just want a vector for each of my genes of interest with a status (loss, none, gain) across the patients. So I downloaded the data via TCGAbiolinks, overlapped it with the annotation, applied a noise cutoff of 0.3 on the absolute segment mean, kicked out all segments with fewer than 300 probes, and computed a status.

These results are actually more similar to the gene-level results from Xena (gistic2 thresholded files) or Firebrowse (CopyNumber_Gistic2.Level_4 - all_data_by_genes.txt files). There are differences, but far fewer: I get an overlap of about 80 %, which I find OK, especially as Xena and Firebrowse report more. For example, depending on the data source, I find 40-60 % losses/gains for a gene of interest, which is also reported in a publication. But the gene-level GDC data says there are only 3 % losses/gains, and that's what I find strange. That's an immense difference.
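For reference, the segment-to-gene procedure described above can be sketched roughly as follows. This is not the GDC's or TCGAbiolinks' implementation, only an illustration in plain Python with made-up gene coordinates and segments. Two choices are assumptions of this sketch and not taken from the thread: when a gene overlaps several segments, the segment with the largest absolute mean wins, and a gene with no surviving segment is reported as 0.

```python
NOISE_CUTOFF = 0.3  # |segment mean| at or below this counts as no change
MIN_PROBES = 300    # segments with fewer probes are discarded


def segment_call(seg_mean, cutoff=NOISE_CUTOFF):
    """Turn a segment mean into -1 (loss), 0 (none) or 1 (gain)."""
    if seg_mean > cutoff:
        return 1
    if seg_mean < -cutoff:
        return -1
    return 0


def gene_status(segments, genes, cutoff=NOISE_CUTOFF, min_probes=MIN_PROBES):
    """Per-gene, per-sample status from segment-level data.

    segments: iterable of (sample, chrom, start, end, num_probes, seg_mean)
    genes:    dict gene -> (chrom, start, end), same genome build as segments
    Returns {gene: {sample: status}}.
    """
    best = {}       # (gene, sample) -> overlapping seg_mean with largest |value|
    samples = set()
    for sample, chrom, start, end, n_probes, seg_mean in segments:
        samples.add(sample)
        if n_probes < min_probes:
            continue  # drop poorly supported segments
        for gene, (g_chrom, g_start, g_end) in genes.items():
            # closed-interval overlap test on the same chromosome
            if chrom == g_chrom and start <= g_end and end >= g_start:
                key = (gene, sample)
                if key not in best or abs(seg_mean) > abs(best[key]):
                    best[key] = seg_mean
    return {
        gene: {s: segment_call(best.get((gene, s), 0.0), cutoff)
               for s in sorted(samples)}
        for gene in genes
    }


# Made-up example: two hypothetical genes, two hypothetical patients.
genes = {"GENE_A": ("1", 1000, 2000), "GENE_B": ("2", 5000, 6000)}
segments = [
    ("P1", "1", 500, 1500, 400, -0.45),
    ("P1", "1", 1501, 3000, 350, 0.05),
    ("P1", "2", 4000, 7000, 500, 0.10),
    ("P2", "1", 900, 2500, 120, 0.80),   # too few probes, dropped
    ("P2", "2", 5500, 9000, 600, 0.62),
]
status = gene_status(segments, genes)
print(status)
```

How multi-segment genes and uncovered genes are handled can change the loss/gain frequencies noticeably, so these rules are worth checking first when two gene-level tables from the same segments disagree.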

Hence my question: Is the GDC gene level data differently computed? Or are these just homozygous losses / high-level copy number amplification? Or can I really expect such high differences?

Thanks so much for your input!

PS: I wanted to stick with TCGAbiolinks, as the rest of the analysis is based on that and I would like to stick to the same data source.

Reply by JJ ▴ 670, 4.7 years ago

Well, the TCGA give the exact GISTIC 2.0 command that they used through the link. The last time that I obtained copy number data direct from the GDC, this extra GISTIC step was not implemented.

So, the steps for the harmonized data appear to be (starting with the raw signal data from the Affymetrix SNP 6.0 chip):

1. Birdsuite (makes a copy number determination from the probe signals)
2. DNAcopy (performs segmentation via CBS)
3. GISTIC 2.0 (summarises the segmented data into a more condensed, gene-level format)

I'm aware that it's frustrating. The consortium (TCGA) got their Nature publications and then moved on to other areas. The data generated runs into petabytes. As funding dried up, there were fewer resources to maintain the data. Some of the open access third level data, though, is 'dangerous', in my opinion, as it contains so many inconsistencies and so much bias. They could have just made the raw data available to everybody.

TCGAbiolinks (and other third party sources) add an extra amount of confusion to this because they utilise this third level data, which itself is constantly evolving, as we can see.

As long as you document your steps and version control everything, you will be fine.
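As one possible way to document a download, a small manifest with file checksums and the filtering parameters can be committed alongside the analysis code. This is a sketch only: the manifest layout, the function names and the demo file are hypothetical, and the parameter values are simply the ones mentioned earlier in this thread.

```python
import hashlib
import json
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files, source, parameters):
    """Collect what is needed to tell later which data a result came from."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "source": source,
        "parameters": parameters,
        "files": {Path(f).name: sha256sum(f) for f in files},
    }


# Demo with a throwaway file standing in for a downloaded segment file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_segments.txt"
    demo.write_bytes(b"abc")
    manifest = build_manifest(
        [demo],
        source="hypothetical download, date and tool version go here",
        parameters={"noise_cutoff": 0.3, "min_probes": 300},
    )

print(json.dumps(manifest, indent=2))
```

Because the third level data keeps changing, a checksum is often the quickest way to notice that a re-download no longer matches the files an earlier result was based on.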

Reply by Kevin Blighe 87k, 4.7 years ago