Question: TCGA RNASeq Data
luisa10 wrote:

I'm fairly new to bioinformatics, so please excuse my basic questions. I am trying to analyse RNA-Seq data from TCGA and came across the tutorial "Survival analysis of TCGA patients integrating gene expression (RNASeq) data". It advises removing genes whose expression is 0 in more than 50% of the samples. Since these data have already been through some level of preprocessing, I was wondering: does an expression level of 0 mean that the gene has no expression (and so should not be considered in further analysis), or does it mean that the expression level was so low that it was set to 0?
Thanks!

rna-seq R
ADD COMMENTlink modified 5 weeks ago by Kevin Blighe17k • written 5 weeks ago by luisa10

Just some general advice on TCGA data: I would recommend analysing the data from scratch, starting from FASTQ files (which you can obtain using Picard Tools), rather than relying on the provided BAM files. This is the only way to be sure that proper QC has been performed.

ADD REPLYlink written 5 weeks ago by lshepard130

The BAM and FASTQ files are access-controlled, of course. I recently re-analysed TCGA RNA-seq data, but starting from the HTSeq raw counts (open access). I never use pre-normalised counts from Broad, TCGAbiolinks, or other sources.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe17k
University College London Cancer Institute
Kevin Blighe17k wrote:

Hello, just a couple of points that are key to note:

  • the data used in that tutorial are RNA-seq v2 RSEM-normalised counts
  • only genes with a value of 0 in >50% of samples are excluded; you will therefore still find 0 values after filtering, but only in genes where zeros make up no more than 50% of samples
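
To make the filtering rule concrete, here is a minimal sketch in plain Python. It is not the tutorial's own code: the function name `filter_mostly_zero_genes`, the gene names, and the expression values are all made up for illustration, and the data layout (one list of per-sample values per gene) is an assumption.

```python
def filter_mostly_zero_genes(expr, max_zero_fraction=0.5):
    """Keep genes whose fraction of zero values does not exceed max_zero_fraction.

    expr maps a gene name to a list of expression values, one per sample
    (a hypothetical layout, chosen only for this illustration).
    """
    kept = {}
    for gene, values in expr.items():
        if not values:
            # No samples for this gene: nothing to evaluate, so drop it.
            continue
        zero_fraction = sum(1 for v in values if v == 0) / len(values)
        # Exclude only genes with zeros in MORE than the allowed fraction.
        if zero_fraction <= max_zero_fraction:
            kept[gene] = values
    return kept


# Toy data: 3 hypothetical genes across 4 samples.
expr = {
    "GENE_A": [0.0, 0.0, 0.0, 12.5],  # 75% zeros: excluded
    "GENE_B": [0.0, 3.2, 0.0, 8.1],   # exactly 50% zeros: kept
    "GENE_C": [5.0, 7.5, 2.2, 9.9],   # no zeros: kept
}

filtered = filter_mostly_zero_genes(expr)
print(sorted(filtered))
```

Note that GENE_B survives the filter while still containing zeros, which is why 0 values remain in the filtered data.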

Hope that this clarifies

Kevin

ADD COMMENTlink written 5 weeks ago by Kevin Blighe17k

Hi! Thank you for your response! I am also using that type of data, also from the TCGA database. Maybe I wasn't clear about what my actual question was: my problem is that I don't know how to interpret an expression value of 0.

ADD REPLYlink written 5 weeks ago by luisa10

It means no expression or expression below the detectable limit of the employed technology. The majority of genes that have 0 expression values will be non-coding RNAs, 'predicted' genes, pseudogenes, etc, and others that may not be expressed in the tissue of study.

Note that the RSEM authors recognised the difficulty of transcripts with 0 read abundances:

However, these results suggest that further work is needed to develop prior distributions that can better handle the large numbers of transcripts with zero abundance that are typical of RNA-Seq data sets.

[source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3163565/]

ADD REPLYlink written 5 weeks ago by Kevin Blighe17k

I understand! Thank you very much!

ADD REPLYlink written 5 weeks ago by luisa10