Hi there, I want to use the gene expression data from GDC/TCGA for further analysis, e.g. clustering across multiple cancer types. GDC offers the gene expression data in three versions: counts, FPKM, and FPKM-UQ. I am aware that gene expression data sets need to undergo some preprocessing steps, e.g. filtering for outliers and normalization. I am a newbie to this kind of data processing and analysis.
Now my questions:
- What is the state-of-the-art preprocessing pipeline for the raw counts? I have found so many different sources in my online search that I am unsure what the best practice is.
- Does the data in FPKM-UQ format require any preprocessing? Should I use it at all? The official documentation reads as if the data set was normalized with cross-sample comparison in mind, but I have not yet found any workflow that works with this kind of data. Also, I most often read that filtering should be done before normalization, which would not be possible here because the FPKM-UQ values are already normalized.
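To make the "filter first, then normalize" point concrete, this is roughly what I understand by it for the raw counts. It is only a minimal sketch in Python/NumPy. The function name, the thresholds (`min_count`, `min_samples`), and the choice of log2(CPM + 1) are my own placeholder assumptions, not something taken from the GDC documentation or from an established pipeline.

```python
import numpy as np

def filter_then_normalize(counts, min_count=10, min_samples=2):
    """Filter low-count genes, then compute log2(CPM + 1).

    counts: genes x samples matrix of raw read counts.
    min_count, min_samples: hypothetical thresholds chosen for illustration.
    """
    counts = np.asarray(counts, dtype=float)

    # Step 1 (filtering): keep genes that reach `min_count` reads
    # in at least `min_samples` samples.
    keep = (counts >= min_count).sum(axis=1) >= min_samples
    filtered = counts[keep]

    # Step 2 (normalization): counts per million, using the library
    # size of each sample after filtering. Assumes no sample ends up
    # with a library size of zero.
    lib_sizes = filtered.sum(axis=0)
    cpm = filtered / lib_sizes * 1e6

    # A log transform with a pseudocount of 1 avoids log(0).
    return np.log2(cpm + 1.0), keep

# Toy example: 4 genes x 3 samples
toy_counts = [
    [100, 200, 150],
    [0, 1, 0],
    [50, 60, 40],
    [5, 20, 30],
]
log_cpm, keep = filter_then_normalize(toy_counts)
```

With FPKM-UQ I do not see where step 1 would fit, since only the already normalized values are provided.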
Any suggestion or help would be highly appreciated!