I am looking into genes of interest affected by CNVs using TCGA data. I am very confused by the vastly different results I get depending on the data source I use:
The GDC data portal (also available via the TCGAbiolinks R package) provides a simple data.frame (genes × patients, with -1 for losses, 0 for no change, and 1 for gains). This is how GDC CNV data was computed. This is hg19. Here, I tend to get VERY few CNVs.
The Xena browser provides GISTIC2 thresholded files, which again are a simple table (genes × patients, with -2, -1, 0, 1, 2 for homozygous deletion, single-copy deletion, diploid normal copy, low-level copy number amplification, and high-level copy number amplification, respectively). This is, however, hg18. Here, I get a lot of CNVs.
Finally, when I manually intersect the Masked Copy Number Segment file (GDC data, but at the CNV segment level, downloaded via the TCGAbiolinks R package) with gene annotations and apply the same noise cutoff as suggested in the link above, I get slightly fewer CNVs than from the Xena data but still many more than reported on the GDC portal. This is hg19.
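A minimal sketch of this kind of segment-to-gene intersection, for illustration only. The segment rows, gene coordinates, and the `cutoff=0.3` default are hypothetical placeholders (use whatever cutoff the linked documentation suggests), and `call_gene_cnv` is a made-up helper name, not part of TCGAbiolinks or the GDC pipeline. It assumes the segment mean is log2(copy number / 2), as in the Masked Copy Number Segment files.

```python
# Hypothetical segment rows: (sample, chrom, start, end, segment_mean).
# segment_mean is assumed to be log2(copy number / 2).
segments = [
    ("P1", "1", 100, 5000, 0.45),
    ("P1", "1", 5001, 9000, -0.05),
    ("P2", "1", 100, 9000, -0.60),
]

# Hypothetical gene annotation: gene name -> (chrom, start, end).
genes = {
    "GENE_A": ("1", 1000, 2000),
    "GENE_B": ("1", 6000, 7000),
}


def call_gene_cnv(segments, genes, cutoff=0.3):
    """Return {(gene, sample): -1 | 0 | 1} from overlapping segments.

    A segment counts as a gain if segment_mean > cutoff and as a loss if
    segment_mean < -cutoff. The cutoff value is an assumption, not a
    confirmed GDC setting.
    """
    calls = {}
    for sample, chrom, seg_start, seg_end, seg_mean in segments:
        for gene, (g_chrom, g_start, g_end) in genes.items():
            # Skip segments that do not overlap the gene at all.
            if chrom != g_chrom or seg_end < g_start or seg_start > g_end:
                continue
            if seg_mean > cutoff:
                state = 1
            elif seg_mean < -cutoff:
                state = -1
            else:
                state = 0
            key = (gene, sample)
            # If several segments overlap one gene, keep a non-zero call
            # over a zero call; on a gain/loss tie the first one seen wins.
            if key not in calls or abs(state) > abs(calls[key]):
                calls[key] = state
    return calls


calls = call_gene_cnv(segments, genes)
for (gene, sample), state in sorted(calls.items()):
    print(gene, sample, state)
```

How partial overlaps and genes spanning several segments are resolved is a design choice here; a different rule (for example, requiring the segment to cover the whole gene) would change the number of calls.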
So I am confused. Is the GDC gene-level data computed differently? Or does it only contain homozygous losses / high-level copy number amplifications? I would very much appreciate input, as I do not know which data to use.
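One way to probe the second possibility is to collapse the five-level GISTIC2 calls to -1/0/1 in two ways (any event vs. only the ±2 events) and see which altered fraction is closer to the GDC table. The sketch below uses made-up values for a single gene; `collapse` and `altered_fraction` are hypothetical helper names, and this only tests the hypothesis rather than describing how GDC actually computes its scores.

```python
# Hypothetical GISTIC2 thresholded calls for one gene across six patients.
xena = {"P1": -2, "P2": -1, "P3": 0, "P4": 1, "P5": 2, "P6": 1}


def collapse(values, min_abs_level):
    """Map -2..2 calls to -1/0/1, keeping events with |level| >= min_abs_level."""
    collapsed = {}
    for sample, level in values.items():
        if abs(level) >= min_abs_level:
            collapsed[sample] = 1 if level > 0 else -1
        else:
            collapsed[sample] = 0
    return collapsed


def altered_fraction(calls):
    """Fraction of samples with a non-zero call."""
    return sum(1 for state in calls.values() if state != 0) / len(calls)


any_event = collapse(xena, 1)   # -2, -1, 1, 2 all count as events
deep_only = collapse(xena, 2)   # only -2 and 2 count as events

print("any event:", altered_fraction(any_event))
print("deep only:", altered_fraction(deep_only))
```

If the GDC fraction for the same gene and patients sits near the "deep only" number, that would be consistent with the GDC table reporting only high-amplitude events; if it matches neither, the difference is more likely in the pipeline or thresholds.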
Thanks so much!