Question

Differential gene expression analysis of publicly available TCGA data - best practices?

0

Entering edit mode

4.6 years ago

MieszkoL ▴ 10

Hi,

I'm a PhD student very new to bioinformatics and I'm getting really confused about the best way to do differential gene analysis on TCGA data.

First, I planned to use htseq-counts downloaded from xenabrowser by transforming them back and rounding to integers round(((2^x) - 1), 0). Is rounding the counts acceptable practice for the input to DESeq2? I've read this thread A: Normalisation of RNAseq data from UCSC Xena Browser, and I guess it should be ok, but I remember I stumbled on another thread where the conclusion was different (sorry, but I can't find it now), hence I started wondering if it's acceptable after all.

Another option I considered was to use tximport to read the transcript RSEM expected counts from the TOIL project, but there is no information about the transcript/effective length and I don't know how I can get it. There are also RSEM expected counts at the gene level, but I still can't use it without knowledge about the transcript length

tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE) :
all(c(abundanceCol, countsCol, lengthCol) %in% names(raw)) is not TRUE In addition: Warning message: Unnamed col_types should have the same length as col_names. Using smaller of the two.

Is it possible to obtain the transcript/effective lengths based on ENST ids? Or can you only do it with raw data? If it's not possible, then is it acceptable to use htseq-counts as described above? What's the best practice for DEG analysis of the publicly available TCGA data?

I'm sorry for perhaps stupid questions but I've read numerous threads and couldn't come to any conclusion. Thank you for help!

rna-seq R deseq2 tcga • 2.9k views

ADD COMMENT • link updated 4.6 years ago by Kevin Blighe 89k • written 4.6 years ago by MieszkoL ▴ 10

1

Entering edit mode

I cannot find the thread from support.bioconductor.org right now but if memory serves the DESeq2 developer at some point stated that rounding floats to integers is ok since the difference from a float to its nearest integer is tiny while true biological changes between groups are expected to be much larger, so rounding has basically no influence. Therefore converting back to normal scale, rounding and putting into DESeq2 should be ok.

ADD REPLY • link 4.6 years ago by ATpoint 88k

0

Entering edit mode

Thank you! I believe it's this one: https://support.bioconductor.org/p/105964/ I wasn't sure though if it's also fine for htseq-counts

ADD REPLY • link 4.6 years ago by MieszkoL ▴ 10

score 2 · Accepted Answer · 2020-11-22

2

Entering edit mode

4.6 years ago

Kevin Blighe 89k

Hello,

I wrote the answer that you linked.

Taking the HTseq data from Xena and transforming it back to raw integer counts (as I described) is absolutely fine.

It is also absolutely fine to obtain the RSEM data from the GDC Data Portal and import those via tximport.

Kevin

ADD COMMENT • link 4.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Thank you Kevin! If I decided to try again with the RSEM counts anyway, how can I obtain the transcript length/effective length needed for tximport for TCGA data? I couldn't find it with Xena or GDC Data Portal. Do I need access to lower-level data? Could you point me in the right direction?

Mieszko

ADD REPLY • link 4.6 years ago by MieszkoL ▴ 10

0

Entering edit mode

Is the length not encoded in the files?

ADD REPLY • link 4.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Unfortunately, it's not. The only other data provided is the gencode.v23.annotation.transcript.probemap, which has chromStart and chromEnd variables, but I guess it includes the introns thus their difference can't be used as a substitute to transcript length.

Maybe I'm looking in the wrong place? I've used this particular URL: https://xenabrowser.net/datapages/?cohort=TCGA%20Pan-Cancer%20(PANCAN)&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

ADD REPLY • link 4.6 years ago by MieszkoL ▴ 10

1

Entering edit mode

I seem to also recall that gene lengths were not available. What I did (3 years ago) was retrieve the GENCODE FASTA transcriptome references and calculate the lengths from those sequences.

I since identified two other potentially easier approaches:

ADD REPLY • link 4.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Thank you! So as far as I understand, in order to get the transcript lengths without raw data, I can use the GENCODE FASTA transcriptome references. However, getting only gene length is close enough to go with it? I'm curious how the gene length is generated. Since the predominant transcript isoform is cell type and tissue-specific, is the length we can obtain with EDASeq::getGeneLengthAndGCContent() mean/median of the known transcript lengths of the same gene?

ADD REPLY • link 4.6 years ago by MieszkoL ▴ 10

1

Entering edit mode

Since the predominant transcript isoform is cell type and tissue-specific

This is an issue that is outside of my control, and, indeed, outside of everybody's control. Technically, I would say, as a result of this issue, 99.999% of RNA-seq studies are biased.

I am not sure about EDASeq and I have only used it once for that function, getGeneLengthAndGCContent().

Using biomaRt, it is possible to get the length per every Ensembl transcript ID (ENST), which is perhaps what you want. Here, we are referring to the CDS length, so, cds_start minus cds_end