Looking for some feedback on my TCGA survival analysis approach. Any glaring flaws?
Entering edit mode
3.1 years ago
curious ▴ 640


I am interested in doing some survival analysis. Using TCGA data I want to determine if high levels of expression of my favorite gene results in lower survival times in pancreatic cancer patients compared to patients that lowly express my favorite gene.

I have been looking at this for a while now so I realize that there are tutorials and tools, but I am really trying to understand how to do this as a complete workflow starting from the rawest data I can get.

Here are the steps I am taking:

  1. I am starting from HTSeq-counts RNA-seq data from xenabrowser:


  1. Straight from the source this data is actually partially transformed log2(count + 1), so I reverse this process (x^2 – 1) to get the actual raw counts.

  2. I save only the tumor data and remove the “normal” sample data from this dataset.

  3. I filter out lowly expressed genes by removing genes that have 0 counts in at least 50% of the sample.

  4. I transform the raw counts using the rlog transformation in DESeq2.

  5. From here I divide the rlog transformed counts by tertiles according to the expression of my favorite gene. I am designating the lower tertile of patients as “low expression of my favorite gene” and the upper tertile as “high expression of my favorite gene”.

  6. I then plan to do cox regression using the coding from step six to bin my samples into low and high gene expressing subpopulations. At this point I am not as worried about the cox regression, I am more looking for feedback on my first few steps of data processing/transformation.

Are there any glaring flaws?

Thank you in advance for anyone that took the time to read this.

RNA-Seq tcga the cancer genome atlas survival cox • 1.4k views
Entering edit mode

Everything is good expect the "HTSeq-counts", which may not be the best way to quantify. Unless you have bam files, you pretty much can't do anything. Also make sure the data was not normalised in any which way other than log2 transformation. Usually normalised to sequence depth or some other normalisation like FPKM.

Entering edit mode

Thank you for responding. I am new to this so please forgive any ignorance. I am trying to learn.

Everything is good expect the "HTSeq-counts", which may not be the best way to quantify. Unless you have bam files, you pretty much can't do anything.

Why is this the case? As far as I understand DEseq2 is compatible with working from raw HTSeq-counts via importing with DESeqDataSetFromHTSeqCount before applying rlog. Where is the BAM information needed?


Also make sure the data was not normalised in any which way other than log2 transformation. Usually normalised to sequence depth or some other normalisation like FPKM

Other than the log2 transformation, I believe they are the raw counts without any other modification. Xenabrowser has separate files for FPKM normalized HTseq-counts. I will reach out to them to verify this though. One of Kevin Blighe's posts suggests these are raw counts as well (other than the log2 transformation): Normalisation of RNAseq data from UCSC Xena Browser

If it is really not recommended to use the "HTSeq-counts" from the link, what would you suggest starting from? Applying for tier1 and getting the BAMs? I would rather work with tier3, but have a PI that bother for a tier1 request if I absolutely need to.

Thank you so much.

Entering edit mode

I personally think HTseq-count is fine. FeatureCounts is slightly better (especially for paired-end reads) but it's not really that significant that it's worth your effort to get the raw .bam files.

I suppose an issue could be batch effects (but those are kinda hard to deal with in TCGA, and most people generally don't deal with them).

Entering edit mode

Thank you.

What do you think about what geek_y said about needing BAM files?

Blockquote "HTSeq-counts", which may not be the best way to quantify. Unless you have bam files, you pretty much can't do anything

I can't seem to see where I would need the BAM files to do my analysis:

  1. I import "HTSeq-counts" into DEseq2 with DESeqDataSetFromHTSeqCount

  2. I transform the imported counts with rlog

  3. I use the rlog transformed counts to set my upper and lower tertiles.

I might be missing something because I don't see a need for the BAMs in any of these steps.

Entering edit mode

You don't need BAM files to normalize count data with DESeq2.

Entering edit mode

Hey brett, I am just catching up and saw you refer to one of my previous posts. I neither see anything wrong with HTseq count - did you normalise them in DESeq2 prior to transforming to rlog? For the survival part and tertiles, there's no real standard way of doing this. You could also Z-transform the rlog counts and derive upper / lower intervals based on Z-scores. Recently, I did some survival analysis and went through each variable manually - I found that some only achieved interesting results after division into binary, tertiary, quartiles, or even sextiles for one.

Entering edit mode


First off, thank you for finding this post. Second, I have been researching how to do this for just about two weeks and I'll tell you that I read multiple posts by you on this topic everyday. I am sure you have some idea of how many lurkers you are helping but I just wanted to thank you again for the content you are generating.

So I was originally under the impression that rlog does a library size type normalization internally before the log-like transformation, based on the documentation:



"... transformations produce transformed data on the log2 scale which has been normalized with respect to library size or other normalization factors..."

and here:


"... For the log2 method, we need estimate size factors to account for sequencing depth (this is done automatically for the rlog method)..."

So originally I just had two steps and went straight from reading in the counts to rlog transformation and was assuming rlog was doing normalization for sizefactors internally:

dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                   directory = rna_data_directory,
                                   design= ~ 1)

transformed_dds <- rlog(dds, blind=TRUE)

The transformed_dds object above is literally what I was planning on setting my quartiles, tertiles, etc on.

Am I missing a normalization step before transformation?

Thank you,


Entering edit mode

Hey Brett, thanks for your words. Biostars now represents my own way of getting correct information to people in what is a field (bioinformatics) that lacks standardisation and in which much conflicting information exists. Regarding rlog, you are indeed correct: if you supply to rlog a DESeq object that has not yet been normalised, then the rlog function will detect this and do the normalisation 'on the fly', as I prove here:

airway$dex %<>% relevel('untrt')

dds <- DESeqDataSet(airway, design = ~ cell + dex)

first way

x <- assay(rlog(dds))

second way

dds <- DESeq(dds, betaPrior=FALSE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

y <-assay(rlog(dds))

check x and y



Login before adding your answer.

Traffic: 1434 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6