Format GISTIC2 all_data_by_genes.txt and cBioPortal
3 months ago

The documentation for cBioPortal file formats discussing continuous copy number states that the GISTIC2 output file <prefix>_all_data_by_genes.txt can be used directly as the cBioPortal data file (after changing column names.) cBioPortal expects this data to be in LOG2 format.

I have a file all_data_by_genes.txt (NOTE: Not <prefix>_all_data_by_genes.txt) generated by a run of GISTIC2 against an amalgamated segment (*.seg) file. However, when I try to use it according to the documentation, cBioPortal errors out, saying that there are negative numbers in the data fields of the file (and there are). This makes me assume the file is not actually LOG2 data.

Does anyone know ...

  1. What is the data type/format of the data in this file?
  2. Should I be using a different output file instead of all_data_by_genes.txt?
  3. If I SHOULD be using all_data_by_genes.txt, do I need to convert the data?

Thanks! Mike

As it turns out, the problem was that the output file all_data_by_genes.txt often has negative values in the Gene ID (Entrez_Gene_Id) column. This breaks the import into cBioPortal.

The GISTIC2 developers have confirmed that these negative values are normal behavior and can be ignored. The solution is to clean the file by changing all negative values in the Gene ID column to "Na".

From the developer...

The negative integers can be safely ignored. They aren't really valid genes but we've used them internally to represent non-gene entities such as e.g. expression of miRNA.
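
For anyone hitting the same thing, here is a minimal sketch of the cleanup described above, in Python. It is not an official GISTIC2 or cBioPortal tool. It assumes the file is tab-delimited with a single header row, and that the columns have already been renamed so the gene ID column is called Entrez_Gene_Id (pass a different id_column if yours is not). The function name clean_gene_ids and the file names in the usage note are made up for illustration. Only the gene ID column is touched; negative values in the sample columns are left alone.

```python
def _is_negative_number(text):
    """Return True if text parses as a number below zero."""
    try:
        return float(text) < 0
    except ValueError:
        return False


def clean_gene_ids(in_path, out_path, id_column="Entrez_Gene_Id", placeholder="Na"):
    """Copy a tab-delimited file, replacing negative gene IDs with a placeholder.

    id_column and placeholder are assumptions based on the fix described
    in this thread; adjust them to match your own file.
    Returns the number of rows whose gene ID was replaced.
    """
    replaced = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        # Locate the gene ID column by name in the header row.
        header = src.readline().rstrip("\r\n").split("\t")
        idx = header.index(id_column)
        dst.write("\t".join(header) + "\n")

        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            # Only the gene ID column is checked; sample values are untouched.
            if len(fields) > idx and _is_negative_number(fields[idx]):
                fields[idx] = placeholder
                replaced += 1
            dst.write("\t".join(fields) + "\n")
    return replaced
```

Calling clean_gene_ids("all_data_by_genes.txt", "all_data_by_genes.cleaned.txt") writes the cleaned copy and returns how many gene IDs were replaced, which is a quick sanity check before importing.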


