7.9 years ago
David_emir ▴ 460
I have downloaded the TCGA breast cancer normalised data sets (RNA-seq V2) for about 1000 samples. Each counts file has only two columns: gene_id and normalized_count.
gene_id      normalized_count
100130426    11.691
10357        114.6254
My goal is to do differential expression analysis across these samples, grouping them by clinical variables such as age, treated/untreated, etc.
Please let me know the best way to do this. Is it possible to do DGE analysis with several clinical parameters?
Your suggestions are highly valuable. Thanks a lot for your help.
If you have the normalized data and the clinical variables, then it will be possible to perform differential expression, yes. Could you clarify what you are asking? Do you have software that you are going to use? Have you ever done differential expression analysis before?
Right now I don't have any particular software in mind for DGE. I have done DGE before from BAM files using the Tuxedo protocol (TopHat --> Cufflinks --> Cuffdiff --> CummeRbund), but I can't see how to continue with this type of data (TCGA normalised counts). I don't have enough space to store the raw data files, so I thought of continuing with the matrix files from TCGA, which are smaller. Right now I am clueless about how to proceed. Please help.
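As a starting point for working from the per-sample files rather than raw data, here is a minimal sketch of combining two-column files into one gene-by-sample matrix. It assumes the files are tab-separated with the header shown in the question; the sample names and the second file's values are made up for illustration, and `read_counts` and `build_matrix` are hypothetical helper names, not part of any TCGA tool.

```python
import csv
import io


def read_counts(handle):
    """Read a two-column, tab-separated file into {gene_id: value}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: float(row["normalized_count"]) for row in reader}


def build_matrix(per_sample):
    """Combine {sample: {gene_id: value}} into (genes, samples, rows).

    rows[i][j] is the value of genes[i] in samples[j].
    Only genes present in every sample are kept.
    """
    samples = sorted(per_sample)
    genes = sorted(set.intersection(*(set(d) for d in per_sample.values())))
    rows = [[per_sample[s][g] for s in samples] for g in genes]
    return genes, samples, rows


# Hypothetical file contents; the first mirrors the excerpt in the question,
# the second is invented.
file_a = "gene_id\tnormalized_count\n100130426\t11.691\n10357\t114.6254\n"
file_b = "gene_id\tnormalized_count\n100130426\t0.0\n10357\t98.5\n"

per_sample = {
    "sample_A": read_counts(io.StringIO(file_a)),
    "sample_B": read_counts(io.StringIO(file_b)),
}
genes, samples, rows = build_matrix(per_sample)

for gene, row in zip(genes, rows):
    print(gene, *row, sep="\t")
```

With real data you would open each downloaded file instead of using `io.StringIO`, and key the dictionary by the sample barcode.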
If you have count data, you could try edgeR.
DESeq2 would also be applicable.
I am doing the same type of analysis. I used the TCGA-Assembler R package to get the actual data, then matched the clinical data with my RNA-seq data (I am dealing with only one gene, so it is easier, I guess). I wrote a bit of code to make sure things are matched properly, then used SPSS to correlate expression with clinical factors.
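The matching step might look roughly like the sketch below. It relies on the TCGA barcode layout, where the first three dash-separated fields identify the patient and the first two digits of the fourth field give the sample type (01-09 tumor, 10-19 normal). The barcodes, clinical fields, and function names are invented for illustration; this is not the code I actually used.

```python
def patient_id(barcode):
    """Return the patient part of a TCGA barcode (first three fields)."""
    return "-".join(barcode.split("-")[:3])


def sample_type(barcode):
    """Classify a sample from the two-digit code in the fourth field."""
    code = int(barcode.split("-")[3][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"


def match_clinical(expression, clinical):
    """Join {barcode: value} to {patient_id: record}.

    Returns (matched rows, barcodes with no clinical record).
    """
    matched, unmatched = [], []
    for barcode, value in expression.items():
        record = clinical.get(patient_id(barcode))
        if record is None:
            unmatched.append(barcode)
        else:
            matched.append({"barcode": barcode,
                            "type": sample_type(barcode),
                            "expression": value,
                            **record})
    return matched, unmatched


# Hypothetical data: one gene's values per sample, plus clinical records.
expression = {
    "TCGA-XX-0001-01A-11R-0000-07": 114.6254,
    "TCGA-XX-0001-11A-11R-0000-07": 98.5,
    "TCGA-XX-0002-01A-11R-0000-07": 11.691,
}
clinical = {"TCGA-XX-0001": {"age": 55, "treated": True}}

matched, unmatched = match_clinical(expression, clinical)
print(len(matched), "matched;", "unmatched:", unmatched)
```

Listing the unmatched barcodes explicitly is the "make sure things are matched properly" part: any sample without a clinical record gets reported instead of being silently dropped.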
I am also interested in gene expression alterations between normal and tumor. Here is where I am confused. Should I use the normalized_count by itself and compare the two groups, or do a log2 transformation first? Some resources, including cBioPortal, call up- or down-regulation based on a Z-score.
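To make the two options concrete, here is a small sketch of what I mean by a log2 transformation and a Z-score for a single gene. The pseudocount of 1, the use of the normal samples as the reference group, and all the values are my own assumptions for illustration; this is not necessarily how any particular portal computes its Z-scores, and it is only descriptive, not a statistical test.

```python
import math
import statistics


def log2_transform(values, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def z_scores(values, reference):
    """Z-score of each value relative to the reference group's mean and SD."""
    mean = statistics.mean(reference)
    sd = statistics.stdev(reference)  # sample standard deviation
    return [(v - mean) / sd for v in values]


# Hypothetical normalized counts for one gene.
normal = [3.0, 7.0, 15.0]
tumor = [31.0, 63.0]

log_normal = log2_transform(normal)
log_tumor = log2_transform(tumor)

# Difference of mean log2 values, i.e. a log2 fold change (tumor vs normal).
log2_fc = statistics.mean(log_tumor) - statistics.mean(log_normal)

# How many normal-group SDs each tumor sample lies from the normal mean.
tumor_z = z_scores(log_tumor, log_normal)

print("log2 fold change:", log2_fc)
print("tumor z-scores:", tumor_z)
```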
DESeq2 and edgeR are great choices. Limma voom is another possibility. All of these take counts as input.
Thanks for your post. I am wondering whether these tools take normalized counts or raw counts as input?
The answer depends on which tool you decide to use going forward. Most count-based analysis software, including the packages mentioned above, expects raw counts.
Could you also comment on my previous post? Should I use normalized_count by itself or do a log2 transformation?!
In general, you'll want to read the documentation for the software you are going to apply. They are often pretty clear about what to use. In particular, edgeR, DESeq2, and limma voom() all ask specifically for raw counts.