I'm a PhD student very new to bioinformatics and I'm getting really confused about the best way to do differential gene expression analysis on TCGA data.
First, I planned to use the htseq-counts downloaded from xenabrowser, transforming them back and rounding to integers with round(((2^x) - 1), 0). Is rounding the counts acceptable practice for the input to DESeq2? I've read the thread "A: Normalisation of RNAseq data from UCSC Xena Browser" and I guess it should be OK, but I remember stumbling on another thread where the conclusion was different (sorry, I can't find it now), so I started wondering whether it's acceptable after all.
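To make the back-transformation concrete, here is a minimal sketch on a made-up matrix. The gene and sample names and the values are purely illustrative, and it assumes the Xena values are stored as log2(count + 1), which is what the formula above implies. It also checks how far rounding moves each value, since the rounding should only be removing floating-point noise.

```r
# Toy stand-in for a Xena HTSeq-counts matrix (genes x samples).
# ASSUMPTION: values are stored as log2(count + 1); names and numbers are made up.
raw_counts <- matrix(
  c(0, 5, 120, 33, 1, 2048),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)
xena_log2 <- log2(raw_counts + 1)

# Undo the transform. Floating-point error can leave values slightly
# off an integer, which is why the result is rounded.
back <- 2^xena_log2 - 1
counts <- round(back, 0)

# Sanity check: rounding should shift values only by floating-point noise.
max_shift <- max(abs(back - counts))
stopifnot(max_shift < 1e-6)

# Store as integers, the form a count-based tool expects.
storage.mode(counts) <- "integer"
```

If max_shift came out large on the real data, that would suggest the values are not plain log2(count + 1) counts and rounding would be changing the data rather than just cleaning it up.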
Another option I considered was to use tximport to read the transcript-level RSEM expected counts from the TOIL project, but they come with no transcript/effective length information and I don't know how to get it. There are also RSEM expected counts at the gene level, but I still can't use them without the transcript lengths:
tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE) :
  all(c(abundanceCol, countsCol, lengthCol) %in% names(raw)) is not TRUE
In addition: Warning message:
Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.
Is it possible to obtain the transcript/effective lengths based on ENST ids? Or can you only do it with raw data? If it's not possible, then is it acceptable to use htseq-counts as described above? What's the best practice for DEG analysis of the publicly available TCGA data?
I'm sorry if these are stupid questions, but I've read numerous threads and couldn't come to any conclusion. Thank you for your help!