Question: How to work with Level 3 data (RPKM values) from TCGA database
5.0 years ago by
United States
pbio130 wrote:

How should I analyse Level 3 data (RPKM values) from TCGA? What kind of analysis can be performed on Level 3 data to find differentially expressed genes between normal and diseased samples?

rna-seq tcga • 9.4k views
modified 5.0 years ago by alolex910 • written 5.0 years ago by pbio130
5.0 years ago by
United States
alolex910 wrote:

For RNA-seq data you can use DESeq2 or edgeR to perform a differential expression analysis. Both tools are part of Bioconductor in R. However, both perform their own internal normalization, and they recommend that you input the raw counts, not the RPKM or RSEM scaled estimates (for RNAseqV2 TCGA data).

If you are downloading from the TCGA Portal each patient's data is in a separate file, but if you go to and click on the Open link beside the cancer of interest you will find tar files that contain merged text files with all patients in one file.

Once you have all the raw counts or normalized counts in a single matrix you can analyze the data with any program that accepts a matrix of data, BUT make sure it is meant to be used on RNA-seq data: this type of data has different properties than microarray data and needs to be treated slightly differently when genes have low or zero read counts. In my own comparisons between DESeq2 and edgeR, I have found I prefer the DESeq2 results because it compensates for low read counts, which can artificially inflate fold changes.
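As a rough illustration of the "single matrix" step, here is a minimal Python sketch that merges per-patient raw counts into one genes-by-samples matrix. The function name `merge_counts`, the input layout (one dictionary of gene counts per sample) and the choice to fill missing genes with 0 are all assumptions made for the example; they are not part of any TCGA tool, and the sample and gene names are only placeholders.

```python
def merge_counts(per_sample):
    """Merge {sample_id: {gene: raw_count}} into (genes, samples, matrix).

    The matrix has one row per gene and one column per sample.
    Genes missing from a sample are filled with 0 (an assumption made
    for this sketch; check how your own files treat absent genes).
    """
    samples = sorted(per_sample)
    genes = sorted(set().union(*(per_sample[s].keys() for s in samples)))
    matrix = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix


if __name__ == "__main__":
    # Hypothetical raw counts for two samples.
    counts = {
        "S2": {"TP53": 5, "EGFR": 0},
        "S1": {"TP53": 7, "BRCA1": 3},
    }
    genes, samples, matrix = merge_counts(counts)
    print("gene", *samples, sep="\t")
    for gene, row in zip(genes, matrix):
        print(gene, *row, sep="\t")
```

A matrix like this, written out as a tab-separated file, can then be read into R and passed to DESeq2 or edgeR as raw counts.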

written 5.0 years ago by alolex910

Thanks for pointing me toward RNA-seq analysis using Level 3 TCGA data. I would be very grateful if you could tell me two things: once I have the Level 3 data in a single matrix, how do I go on to process it with DESeq2, and which data serves as the control to compare against?

I am sorry for the trouble, since I am a complete newbie I am totally lost. It would be really helpful if you could answer only these two doubts of mine.

Thanks in advance 

modified 5.0 years ago • written 5.0 years ago by pbio130

I would be glad to point you to some helpful resources. Can you tell me a bit about your background? Do you have computational/programming experience, or have you used R before? Have you taken any bioinformatics courses? If you search BioStar for your questions, a lot of helpful answers from others come up. Specifically, check the resources on this post: Bioinformatics courses, workshops or training.

written 5.0 years ago by alolex910

I am a biotech student and have working experience with R. I understood the process you explained, but I have one doubt: where can I get the normal values? The files I have downloaded give a number of samples with genes and raw counts. How do I figure out which of these samples are normal and which are diseased?

written 5.0 years ago by pbio130

Thank you for the additional information. I think I understand your question now. To identify tumor/normal samples you have to look at the TCGA barcode. Some barcodes are long and others are truncated, but unless you are looking at the clinical data a barcode will always have at least 4 fields, in the format TCGA-xx-xxxx-xxx. The 4th field is what tells you the sample type (normal, tumor etc). Check out site for a detailed explanation of TCGA barcodes. The full table for the 4th field (and all the others!) can be found here. Hope that helps!
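The barcode check can be sketched in a few lines of Python. This is only an illustration: the function name `sample_type` is made up for the example, the barcodes are placeholders, and the code ranges used (01 to 09 tumor, 10 to 19 normal, 20 to 29 control) are my reading of the TCGA sample type table, so verify them against the full table before relying on them.

```python
def sample_type(barcode):
    """Classify a TCGA barcode by the first two digits of its 4th field.

    Assumed ranges (check against the TCGA code table):
    01-09 tumor, 10-19 normal, 20-29 control.
    """
    fields = barcode.split("-")
    if len(fields) < 4 or fields[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    # The 4th field looks like "01A": two-digit sample type plus a vial letter.
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


if __name__ == "__main__":
    # Placeholder barcodes, one truncated and one long.
    for bc in ["TCGA-XX-0001-01A", "TCGA-XX-0001-11A-01R-0000-07"]:
        print(bc, sample_type(bc))
```

Applied to the column names of the count matrix, this gives the tumor/normal labels needed as the condition for a differential expression design.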

modified 5.0 years ago • written 5.0 years ago by alolex910

Thanks a lot, @alolex. I also wanted to know which clinical data should be downloaded.

modified 5.0 years ago • written 5.0 years ago by pbio130


Sorry, I was mistaken when I said you could get the tumor/normal designations from the clinical data.  I had forgotten that I wrote a script to parse the barcodes manually.  In that script I parsed the barcodes and added the tumor/normal field based on the 4th field.  I edited my answer above to reflect this.  The script is an R script that is on my GitHub account at

If you're just looking for other clinical information about each patient, the Level_1 file should have most of it. The aux file will have additional information that is tumor specific. For example, aux information for HNSC patients includes HPV testing results.

written 5.0 years ago by alolex910

Thank you, alolex.

written 5.0 years ago by pbio130

Thank you for the link you provided. It's very useful. I have downloaded the level3 files, such as "". How do I align the reads to the genome? Galaxy requires BED files. I have experience in R and Python but am new to bioinformatics. Thanks in advance.

written 4.9 years ago by will0

What question are you trying to answer? The file you reference contains counts of observed junctions in the RNAseq data for each TCGA patient sample. If all you want to do is view the junctions in the genome as a whole, you can do so by creating a BED file following the instructions in this post. This won't contain any of the patient information though.

If you would rather convert the coordinates in this file directly to view them in Galaxy or IGB, you will need to convert the first column to BED format. The BED format is described here. I took the first few lines of the file you mentioned and converted them to BED format using Excel and the "text to column" functionality. You could also write a parser in Python or R to do this if you plan to do it often. However, this won't show the patient information either.

If you just want to look at the differentially expressed junctions, you can identify those rows and convert the first column to BED format as below to view in a genome browser, but it depends on what you are trying to accomplish. I don't work much with Galaxy, so instead I opened IGB, loaded the most recent human genome, drag-dropped the BED file I made into the viewer, zoomed to my region of interest, selected the BED track and clicked "Load Data". I was able to see the gene annotations as well as the specific junctions in my BED file. Hopefully, some of this information will help :)

Here are the first few entries in the file (first column only):


And here they are formatted as a BED file:

chr1    12227    12595    .    0    +
chr1    12227    12613    .    0    +
chr1    12227    12646    .    0    +
chr1    12697    13221    .    0    +
chr1    12721    13221    .    0    +
chr1    12721    13403    .    0    +
chr1    14829    14970    .    0    -
chr1    14829    15796    .    0    -
chr1    15038    15796    .    0    -
chr1    15942    16607    .    0    -
chr1    15947    16607    .    0    -
chr1    16765    16854    .    0    -
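
For anyone who would rather script this than use Excel, rows like the ones above could be produced with a small Python parser. This is only a sketch: the original entries did not survive in this post, so the input layout assumed here (`chr1:12227:+,chr1:12595:+`, i.e. two `chrom:position:strand` ends separated by a comma) and the function name `junction_to_bed` are assumptions. The coordinates are copied through unchanged, as in the rows above; BED start positions are 0-based, so an offset may be needed depending on how the junction file numbers its positions.

```python
def junction_to_bed(junction, name=".", score=0):
    """Convert one junction string to a 6-column, tab-separated BED line.

    Assumed input layout: "chrom:start:strand,chrom:end:strand".
    Coordinates are passed through as-is (no 0-based/1-based adjustment).
    """
    left, right = junction.strip().split(",")
    chrom, start, strand = left.split(":")
    chrom2, end, strand2 = right.split(":")
    if chrom != chrom2 or strand != strand2:
        raise ValueError(f"junction ends disagree: {junction!r}")
    return "\t".join([chrom, start, end, name, str(score), strand])


if __name__ == "__main__":
    # Inputs reconstructed from the BED rows above, in the assumed layout.
    for j in ["chr1:12227:+,chr1:12595:+", "chr1:14829:-,chr1:14970:-"]:
        print(junction_to_bed(j))
```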
written 4.9 years ago by alolex910

Thanks, alolex. I appreciate it. I think I need to read more about bioinformatics.

written 4.9 years ago by will0