I am interested in doing some survival analysis. Using TCGA data, I want to determine whether high expression of my favorite gene is associated with shorter survival in pancreatic cancer patients, compared with patients who express my favorite gene at low levels.
I have been looking at this for a while now so I realize that there are tutorials and tools, but I am really trying to understand how to do this as a complete workflow starting from the rawest data I can get.
Here are the steps I am taking:
1. I am starting from HTSeq-counts RNA-seq data from xenabrowser.
2. Straight from the source this data is actually partially transformed as log2(count + 1), so I reverse this process (2^x - 1) to get the actual raw counts.
3. I save only the tumor data and remove the “normal” sample data from this dataset.
4. I filter out lowly expressed genes by removing genes that have 0 counts in at least 50% of the samples.
5. I transform the raw counts using the `rlog` transformation in DESeq2.
6. From here I divide the `rlog`-transformed counts into tertiles according to the expression of my favorite gene. I am designating the lower tertile of patients as “low expression of my favorite gene” and the upper tertile as “high expression of my favorite gene”.
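To make the back-transformation, the zero-count filter, and the tertile split described above concrete, here is a minimal Python sketch of what I mean (my real work is in R with DESeq2). The toy matrix, the sample IDs, and the helper names are all made up for illustration. The `rlog` step is not reproduced: the tertile split runs on the log2(count + 1) values purely as a stand-in.

```python
import math
import statistics

# Hypothetical toy input: log2(count + 1) values, one list per gene,
# one entry per tumor sample (normal samples assumed already removed).
samples = ["S1", "S2", "S3", "S4", "S5", "S6"]
log2_matrix = {
    "GENE_A": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [0.0, 0.0, 0.0, 0.0, 3.0, 0.0],
    "MY_GENE": [math.log2(c + 1) for c in (10, 500, 40, 1200, 90, 300)],
}


def to_raw_counts(log2_values):
    """Invert log2(count + 1): count = 2^x - 1, rounded to an integer."""
    return [round(2 ** x - 1) for x in log2_values]


def passes_zero_filter(counts, max_zero_fraction=0.5):
    """Keep a gene only if fewer than 50% of samples have a zero count."""
    zero_fraction = sum(c == 0 for c in counts) / len(counts)
    return zero_fraction < max_zero_fraction


def tertile_groups(values, sample_ids):
    """Label each sample low / middle / high by tertile of `values`."""
    lower_cut, upper_cut = statistics.quantiles(values, n=3)
    groups = {}
    for sample_id, value in zip(sample_ids, values):
        if value <= lower_cut:
            groups[sample_id] = "low"
        elif value > upper_cut:
            groups[sample_id] = "high"
        else:
            groups[sample_id] = "middle"
    return groups


raw_counts = {gene: to_raw_counts(v) for gene, v in log2_matrix.items()}
filtered = {g: c for g, c in raw_counts.items() if passes_zero_filter(c)}

# Stand-in for the rlog step: split on log2(count + 1) of the gene of interest.
groups = tertile_groups(log2_matrix["MY_GENE"], samples)
print(sorted(filtered))
print(groups)
```

In this toy example `GENE_B` is dropped because five of its six samples have zero counts, and only the samples labeled "low" or "high" would go on to the survival comparison.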
I then plan to run a Cox regression, using the coding from step six to bin my samples into low- and high-expressing subpopulations. At this point I am less worried about the Cox regression itself; I am mainly looking for feedback on my first few steps of data processing and transformation.
Are there any glaring flaws?
Thank you in advance to anyone who took the time to read this.