I have an RNA-seq expression matrix obtained from TCGA (TCGA-GBM) that has already been normalized. The columns are sample IDs and the rows are genes. I want to do a differential expression analysis. I also have the copy number matrix for the same dataset from cBioPortal, but I'm not sure how to use it to my advantage or how to divide the dataset based on amplification. I am also completely clueless about how to use DESeq2 (creating the DESeq2 object without the required information). If someone could elaborate on dividing the samples based on copy number amplification and on the differential expression analysis, that would help a lot.
If you have downloaded an expression matrix, then you have most likely obtained FPKM-UQ normalised counts, or (possibly, and mistakenly) downloaded the microarray normalised expression log base 2 ratios. You must not use either of these for differential expression analysis via DESeq2 or any other tool. If you want to conduct your own differential expression analysis, then obtain the RSEM or HTSeq raw counts for each sample and merge them into a single data matrix of raw counts. That matrix then becomes your input to DESeq2.
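As a rough illustration of the merging step, here is a minimal Python sketch (standard library only). It assumes each per-sample file is tab-separated with the gene ID in the first column and the count in the second; the sample names, gene names and values are made up, and the real GDC files may use different columns or headers, so check them first. Because DESeq2 expects integer counts and RSEM expected counts can be fractional, the sketch rounds them, which is a simplification on my part and not the only way to handle RSEM output.

```python
import csv
import io


def read_counts(handle):
    """Read a tab-separated (gene, count) table into a dict of integer counts."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        try:
            value = float(row[1])
        except ValueError:
            continue  # skip header lines such as "gene_id<TAB>raw_count"
        # Round fractional (RSEM-style) expected counts to integers.
        counts[row[0]] = int(round(value))
    return counts


def merge_counts(per_sample):
    """Merge {sample: {gene: count}} into (genes, samples, rows-by-gene matrix)."""
    samples = sorted(per_sample)
    # Keep only genes that are present in every sample.
    genes = sorted(set.intersection(*(set(c) for c in per_sample.values())))
    matrix = [[per_sample[s][g] for s in samples] for g in genes]
    return genes, samples, matrix


# Toy stand-ins for the downloaded per-sample files (hypothetical content).
sample_files = {
    "SAMPLE_A": "gene_id\traw_count\nGENE1\t10.00\nGENE2\t0.00\nGENE3\t7.60\n",
    "SAMPLE_B": "gene_id\traw_count\nGENE1\t3.20\nGENE2\t5.00\nGENE3\t1.00\n",
}
per_sample = {
    sample: read_counts(io.StringIO(text)) for sample, text in sample_files.items()
}
genes, samples, matrix = merge_counts(per_sample)

# Print the merged matrix: one row per gene, one column per sample.
print("\t".join(["gene"] + samples))
for gene, row in zip(genes, matrix):
    print("\t".join([gene] + [str(v) for v in row]))
```

The printed table has genes as rows and samples as columns, which is the orientation DESeq2 expects for its count matrix.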
For GBM, I can see that RNA-seq was done with RSEM, which is fine. Here is the sample listing on the GDC Legacy Archive:
So, here's the plan:
- Download those raw count TXT files by obtaining the file manifest and using the GDC Data Transfer Tool
- Input the data into DESeq2 by following the instructions Here
Then, proceed from there by following the tutorial...
For the copy number data, you may have to clarify the format of the data that you have obtained. It is most likely GISTIC-produced data. You could instead just follow the TCGAbiolinks workflow in order to identify recurrent copy number alterations in your samples (go to Listing 8):
You may also consider taking a look at the methods that I used for doing this in a very recent publication using the TCGA-UCEC data:
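Separately from the workflows linked above, here is a minimal Python sketch of the sample split asked about in the question. It assumes the copy number matrix holds GISTIC thresholded calls as distributed by cBioPortal (-2, -1, 0, 1, 2, where 2 means high-level amplification) and that the two matrices can be matched on the first 15 characters of the TCGA barcode. The barcodes and calls below are invented, and the choice of 2 as the cut-off is an assumption that you may want to change (for example, to include low-level gains).

```python
AMPLIFIED = 2  # assumed GISTIC thresholded value for high-level amplification


def split_by_amplification(cna_calls, expression_barcodes):
    """Label each expression sample by the copy-number call of one gene.

    cna_calls: {15-character sample ID: GISTIC thresholded call} for the gene
    of interest. Samples without a copy-number call are dropped.
    """
    groups = {}
    for barcode in expression_barcodes:
        call = cna_calls.get(barcode[:15])  # match on the sample-level barcode
        if call is None:
            continue  # no copy-number data for this sample
        groups[barcode] = "amplified" if call >= AMPLIFIED else "not_amplified"
    return groups


# Hypothetical calls for a single gene of interest.
cna_calls = {
    "TCGA-XX-0001-01": 2,
    "TCGA-XX-0002-01": 0,
    "TCGA-XX-0003-01": -1,
}
# Hypothetical full-length barcodes from the expression matrix columns.
expression_barcodes = [
    "TCGA-XX-0001-01A-01R-0000-01",
    "TCGA-XX-0002-01A-01R-0000-01",
    "TCGA-XX-0004-01A-01R-0000-01",
]
condition = split_by_amplification(cna_calls, expression_barcodes)

for barcode, label in condition.items():
    print(barcode, label)
```

The resulting labels can then serve as the condition column of the sample table that you pass to DESeq2 alongside the raw count matrix, with the columns of the count matrix restricted to the same samples and kept in the same order.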