Question: Differential gene expression analysis of publicly available TCGA data - best practices?
0
gravatar for MieszkoL
9 weeks ago by
MieszkoL10
MieszkoL10 wrote:

Hi,

I'm a PhD student very new to bioinformatics and I'm getting really confused about the best way to do differential gene analysis on TCGA data.

First, I planned to use htseq-counts downloaded from xenabrowser by transforming them back and rounding to integers round(((2^x) - 1), 0). Is rounding the counts acceptable practice for the input to DESeq2? I've read this thread A: Normalisation of RNAseq data from UCSC Xena Browser, and I guess it should be ok, but I remember I stumbled on another thread where the conclusion was different (sorry, but I can't find it now), hence I started wondering if it's acceptable after all.

Another option I considered was to use tximport to read the transcript RSEM expected counts from the TOIL project, but there is no information about the transcript/effective length and I don't know how I can get it. There are also RSEM expected counts at the gene level, but I still can't use it without knowledge about the transcript length

tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE) :
all(c(abundanceCol, countsCol, lengthCol) %in% names(raw)) is not TRUE In addition: Warning message: Unnamed col_types should have the same length as col_names. Using smaller of the two.

Is it possible to obtain the transcript/effective lengths based on ENST ids? Or can you only do it with raw data? If it's not possible, then is it acceptable to use htseq-counts as described above? What's the best practice for DEG analysis of the publicly available TCGA data?

I'm sorry for perhaps stupid questions but I've read numerous threads and couldn't come to any conclusion. Thank you for help!

tcga rna-seq deseq2 R • 300 views
ADD COMMENTlink modified 9 weeks ago by Kevin Blighe69k • written 9 weeks ago by MieszkoL10
1

I cannot find the thread from support.bioconductor.org right now but if memory serves the DESeq2 developer at some point stated that rounding floats to integers is ok since the difference from a float to its nearest integer is tiny while true biological changes between groups are expected to be much larger, so rounding has basically no influence. Therefore converting back to normal scale, rounding and putting into DESeq2 should be ok.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by ATpoint44k

Thank you! I believe it's this one: https://support.bioconductor.org/p/105964/ I wasn't sure though if it's also fine for htseq-counts

ADD REPLYlink written 9 weeks ago by MieszkoL10
2
gravatar for Kevin Blighe
9 weeks ago by
Kevin Blighe69k
Republic of Ireland
Kevin Blighe69k wrote:

Hello,

I wrote the answer that you linked.

Taking the HTseq data from Xena and transforming it back to raw integer counts (as I described) is absolutely fine.

It is also absolutely fine to obtain the RSEM data from the GDC Data Portal and import those via tximport.

Kevin

ADD COMMENTlink written 9 weeks ago by Kevin Blighe69k

Thank you Kevin! If I decided to try again with the RSEM counts anyway, how can I obtain the transcript length/effective length needed for tximport for TCGA data? I couldn't find it with Xena or GDC Data Portal. Do I need access to lower-level data? Could you point me in the right direction?

Mieszko

ADD REPLYlink written 9 weeks ago by MieszkoL10

Is the length not encoded in the files?

ADD REPLYlink written 9 weeks ago by Kevin Blighe69k

Unfortunately, it's not. The only other data provided is the gencode.v23.annotation.transcript.probemap, which has chromStart and chromEnd variables, but I guess it includes the introns thus their difference can't be used as a substitute to transcript length.

Maybe I'm looking in the wrong place? I've used this particular URL: https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

ADD REPLYlink written 9 weeks ago by MieszkoL10
1

I seem to also recall that gene lengths were not available. What I did (3 years ago) was retrieve the GENCODE FASTA transcriptome references and calculate the lengths from those sequences.

I since identified two other potentially easier approaches:

ADD REPLYlink written 9 weeks ago by Kevin Blighe69k

Thank you! So as far as I understand, in order to get the transcript lengths without raw data, I can use the GENCODE FASTA transcriptome references. However, getting only gene length is close enough to go with it? I'm curious how the gene length is generated. Since the predominant transcript isoform is cell type and tissue-specific, is the length we can obtain with EDASeq::getGeneLengthAndGCContent() mean/median of the known transcript lengths of the same gene?

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by MieszkoL10
1

Since the predominant transcript isoform is cell type and tissue-specific

This is an issue that is outside of my control, and, indeed, outside of everybody's control. Technically, I would say, as a result of this issue, 99.999% of RNA-seq studies are biased.

I am not sure about EDASeq and I have only used it once for that function, getGeneLengthAndGCContent().

Using biomaRt, it is possible to get the length per every Ensembl transcript ID (ENST), which is perhaps what you want. Here, we are referring to the CDS length, so, cds_start minus cds_end

ADD REPLYlink written 9 weeks ago by Kevin Blighe69k
1

Thank you so much for all the clarifications Kevin, I really appreciate it!

ADD REPLYlink written 9 weeks ago by MieszkoL10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1502 users visited in the last hour
_