Question: Differential Expression of a normalized RNAseq expression dataset
8 days ago by
US/Boston/Boston University
Arko wrote:

I have an RNA-seq expression matrix from TCGA (TCGA-GBM) that has already been normalized. The columns are sample IDs and the rows are genes. I want to do a differential expression analysis. I also have the copy number matrix for the same dataset from cBioPortal, but I'm not sure how to use it to my advantage or how to divide the dataset based on amplification. I am also completely clueless about how to use DESeq2 (creating the DESeq2 object without the required information). If someone could elaborate on dividing the samples by copy number amplification and on the differential expression analysis, that would help a lot.

rna-seq deseq tcga R glioblastoma
modified 7 days ago by Kevin Blighe • written 8 days ago by Arko
7 days ago by
Kevin Blighe
University College London Cancer Institute
Kevin Blighe wrote:

If you have downloaded an expression matrix, then you have most likely obtained FPKM-UQ normalised counts or (possibly, and mistakenly) the microarray normalised log base 2 expression ratios. You must not use either of these for differential expression analysis via DESeq2 or any other tool. If you want to conduct your own differential expression analysis, then obtain the RSEM or HTSeq raw counts for each sample and merge them into a single data matrix of raw counts. That matrix then becomes your input to DESeq2.
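As a rough illustration of the merging step, the sketch below builds one gene-by-sample table from several per-sample RSEM files. It assumes each file is tab-separated with a header line, the gene ID in column 1 and the raw (estimated) count in column 2; the function name `merge_counts` and the file names are hypothetical, so check those assumptions against your own files first. RSEM estimated counts can be fractional, so they would typically be rounded to integers before being passed to DESeq2.

```shell
# Hypothetical helper: merge per-sample count files into one matrix.
# Assumes: tab-separated input, one header line, gene ID in column 1,
# raw count in column 2 (verify this for your own files).
merge_counts() {
    awk -F'\t' -v OFS='\t' '
        # First line of each file: register the file as a new column, skip header
        FNR == 1 { nfile++; name[nfile] = FILENAME; next }
        {
            # Remember genes in the order they are first seen
            if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
            count[$1, nfile] = $2
        }
        END {
            header = "gene_id"
            for (f = 1; f <= nfile; f++) header = header OFS name[f]
            print header
            for (g = 1; g <= ngenes; g++) {
                row = order[g]
                for (f = 1; f <= nfile; f++)
                    # Genes missing from a file are written as NA
                    row = row OFS (((order[g], f) in count) ? count[order[g], f] : "NA")
                print row
            }
        }
    ' "$@"
}

# Example usage (placeholder file names):
#   merge_counts sample1.rsem.genes.results sample2.rsem.genes.results > raw_counts.tsv
```

The column headers here are simply the input file names, so you would still need to map them to TCGA sample barcodes yourself.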

For GBM, I can see that RNA-seq was done with RSEM, which is fine. Here is the sample listing on the GDC Legacy Archive:

So, here's the plan:

  1. Download those raw count TXT files by obtaining the file manifest and using the GDC Data Transfer Tool
  2. Import the data as described here

Then, proceed from there by following the tutorial...
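Step 1 might look something like the sketch below. It assumes the GDC Data Transfer Tool is installed as `gdc-client` on your PATH and uses its `download` subcommand with `-m` (manifest) and `-d` (output directory); the wrapper function name and the file names are placeholders.

```shell
# Hypothetical wrapper around the GDC Data Transfer Tool.
# $1 = manifest file, $2 = output directory (defaults to ./gdc_download)
download_from_manifest() {
    manifest="$1"
    outdir="${2:-gdc_download}"

    # Fail early if the manifest is missing
    if [ ! -f "$manifest" ]; then
        echo "manifest not found: $manifest" >&2
        return 1
    fi
    # Fail early if the client is not installed
    if ! command -v gdc-client >/dev/null 2>&1; then
        echo "gdc-client not found on PATH" >&2
        return 127
    fi

    mkdir -p "$outdir"
    gdc-client download -m "$manifest" -d "$outdir"
}

# Example usage (placeholder names):
#   download_from_manifest Manifest.txt gbm_rsem
```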


For the copy number data, you may have to clarify the data format that you have obtained. It is most likely GISTIC-produced data. You could instead just follow the TCGAbiolinks workflow in order to identify recurrent copy number alterations in your samples (go to Listing 8):

You may also consider taking a look at the methods that I used for doing this in a very recent publication using the TCGA-UCEC data:
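On dividing the samples by amplification: if the copy number matrix does turn out to be GISTIC thresholded calls (where, in the cBioPortal convention, 2 denotes amplification), one simple option is to label each sample by the call for a gene of interest. The sketch below assumes a tab-separated file with a header row, gene symbols in column 1, a second annotation column, and one column per sample from column 3 onwards; the function name, the file name, and the choice of EGFR are purely illustrative.

```shell
# Hypothetical helper: label samples by amplification status of one gene.
# $1 = gene symbol, $2 = copy number matrix (tab-separated)
# Assumes: header row; column 1 = gene symbol; column 2 = annotation;
# columns 3+ = samples holding thresholded calls (-2, -1, 0, 1, 2).
split_by_amplification() {
    awk -F'\t' -v OFS='\t' -v gene="$1" '
        # Header row: remember the sample ID for each column
        NR == 1 { for (i = 3; i <= NF; i++) sample[i] = $i; ncol = NF; next }
        # Row for the requested gene: a call of 2 is treated as amplified
        $1 == gene {
            for (i = 3; i <= ncol; i++)
                print sample[i], ($i == 2 ? "amplified" : "not_amplified")
        }
    ' "$2"
}

# Example usage (placeholder file name):
#   split_by_amplification EGFR data_CNA.txt > egfr_groups.tsv
```

The resulting two-column table (sample, group) could then serve as the sample grouping for the differential expression design, provided the sample IDs are matched to the columns of the count matrix.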


written 7 days ago by Kevin Blighe

Thanks Kevin, I shall follow up on it and let you know how it works out!

written 7 days ago by Arko

Good luck - stay in touch.

written 7 days ago by Kevin Blighe

A rather silly question, but how many of the files in the manifest do I need to download? From what I can see, each .results file contains genes and their corresponding counts, and there are also annotation.txt files. There is both normalized and raw data, so if I only want to download the raw data, how would I filter the manifest?

Apart from that, the CNV link seems to be broken.

I did read your publication and the methodology that was implemented; it is quite helpful, and the workflow seems similar to what I'm trying to do.

written 7 days ago by Arko

I see what you mean. You can just remove files that you don't need from the manifest. It is just a plain text file. Are you using Mac or Linux? The command you'd need would be:

grep -F ".rsem.genes.results" Manifest.txt

If you are using Windows, you could possibly edit the file in Excel.

I say this assuming that the rsem.genes.results files contain the estimated 'raw' counts.
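One caveat with a plain grep is that it also drops the manifest's header line, which the download tool may expect to be present. A small variation, assuming the first line of the manifest is a column header, keeps that line and then selects the matching rows by fixed-string match; the function and output file names are placeholders.

```shell
# Hypothetical helper: keep the header plus the gene-level RSEM result rows.
# $1 = manifest file (assumed to have a header on line 1)
filter_manifest() {
    # index() does a literal substring search, so the dots are not
    # treated as regex wildcards and ".rsem.genes.normalized_results"
    # rows are not selected.
    awk 'NR == 1 || index($0, ".rsem.genes.results") > 0' "$1"
}

# Example usage (placeholder names):
#   filter_manifest Manifest.txt > Manifest.rsem_genes.txt
```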

You may additionally want to look at this very old thread: Interpreting TCGA .rsem.genes.results and .rsem.genes.normalized_results files.

modified 7 days ago • written 7 days ago by Kevin Blighe