Question: What preprocessing steps are required for the different TCGA/GDC Gene Expression Data (Counts, FPKM-UQ)?
3.0 years ago by
Hasso Plattner Institute, Potsdam, Germany
cindy.perscheid wrote:

Hi there, I want to use the gene expression data from GDC/TCGA for further analysis, e.g. clustering across multiple cancer types. GDC offers the gene expression data in three versions: counts, FPKM, and FPKM-UQ. I am aware that gene expression data sets need to undergo some preprocessing steps, e.g. filtering for outliers and normalization. I am a newbie to this kind of data processing and analysis.

Now my question(s):
- What is the state-of-the-art preprocessing pipeline for the raw counts? I have found so many sources in my online search that I am totally unsure of what is best practice.
- Does the data in FPKM-UQ format require any preprocessing? Should I use it at all? On the official documentation web page, it sounds as if they have normalized the data set with the intent of cross-sample comparison, but I have not yet found any workflow or similar that works with this kind of data. Also, I most often read that data filtering should be conducted before normalization, which would not be possible here.

Any suggestion or help would be highly appreciated!


First of all, you have to decide which files are best suited to your analysis; then you can run a quality control analysis to see whether further processing is required for your approach. Once you have selected your files, you need to find out whether your downstream analysis requires its own internal normalization. For example, I used the HTSeq count files for a DESeq analysis; for that I ran a QC and then normalized the data as the program prescribes. I hope this helps.

written 3.0 years ago by Lila M
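
To make the "normalize according to the program" step in the comment above more concrete, here is a minimal sketch of median-of-ratios size factors, the normalization idea that DESeq applies to raw counts. It is a dependency-free illustration only, not the DESeq implementation: the function names `size_factors` and `normalize` and the toy matrix are assumptions made for this example, and a real analysis would use the Bioconductor package itself.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples count matrix.

    `counts` is a list of rows (one per gene), each a list of raw counts
    (one per sample). Hypothetical helper, written for illustration.
    """
    n_samples = len(counts[0])
    # Only genes with a nonzero count in every sample have a defined log.
    log_rows = [
        [math.log(c) for c in row]
        for row in counts
        if all(c > 0 for c in row)
    ]
    if not log_rows:
        raise ValueError("no gene has a nonzero count in every sample")
    # Log ratio of each count to its gene's geometric mean across samples.
    log_ratios = [[v - sum(row) / len(row) for v in row] for row in log_rows]
    # Per sample, the median log ratio over genes gives the size factor.
    return [
        math.exp(median(r[j] for r in log_ratios)) for j in range(n_samples)
    ]


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Toy data (made up): sample 2 was sequenced twice as deeply as sample 1.
toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
toy_factors = size_factors(toy)
toy_normalized = normalize(toy)
```

On the toy matrix the second size factor comes out twice as large as the first, so the first three genes end up with equal normalized values in both samples. The gene with a zero count is excluded from the factor estimate but is still scaled.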

Hi Lila, thanks for your reply (it seems I do not get notifications about comments...). In the end I decided to use the raw counts and have now built something I can call a preprocessing pipeline by applying some sample workflows I found on Bioconductor. What I understand now is that there seems to be no state of the art at all, only common practices tied to this or that tool. This, unfortunately, makes it kind of hard for newbies to decide on a preprocessing strategy.

written 2.9 years ago by cindy.perscheid
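
For readers wondering what such a raw-count pipeline might look like, the sketch below shows one common pattern from count-based workflows: drop genes with low expression, then move to log2 counts per million (CPM) before clustering. It does not reproduce the pipeline described in the comment above. The thresholds (`min_cpm=1.0`, `min_samples=2`), the prior count, the function names, and the toy data are all assumptions chosen for illustration, and the code assumes every sample has a nonzero library size.

```python
import math


def cpm(counts):
    """Counts per million for a genes x samples matrix (list of rows)."""
    # Library size = total count per sample (column sums).
    lib_sizes = [sum(col) for col in zip(*counts)]
    return [
        [c * 1e6 / lib for c, lib in zip(row, lib_sizes)] for row in counts
    ]


def filter_low_expression(counts, gene_ids, min_cpm=1.0, min_samples=2):
    """Keep genes reaching `min_cpm` in at least `min_samples` samples.

    Both thresholds are illustrative defaults, not recommendations.
    """
    kept_rows, kept_ids = [], []
    for row, cpm_row, gene in zip(counts, cpm(counts), gene_ids):
        if sum(v >= min_cpm for v in cpm_row) >= min_samples:
            kept_rows.append(row)
            kept_ids.append(gene)
    return kept_rows, kept_ids


def log2_cpm(counts, prior=1.0):
    """log2(CPM + prior); the prior avoids taking the log of zero."""
    return [[math.log2(v + prior) for v in row] for row in cpm(counts)]


# Toy data (made up): three genes, three samples of one million reads each.
toy_ids = ["A", "B", "C"]
toy_counts = [
    [500000, 400000, 600000],
    [0, 0, 1],
    [500000, 600000, 399999],
]
kept_counts, kept_ids = filter_low_expression(toy_counts, toy_ids)
toy_log = log2_cpm(kept_counts)
```

Here gene B reaches 1 CPM in only one sample and is filtered out, while A and C are kept. Note that filtering happens on the raw counts before any transformation, which matches the "filter first, then normalize" order mentioned in the question.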