Why doesn't TCGA .rsem.isoforms.results have a length column?
15 months ago


I'm analyzing TCGA data and need the RSEM output files. I assumed these were the files I downloaded from GDC, whose names end with .rsem.isoforms.results. However, this is their format (first 5 lines of one of them):

isoform_id  raw_count   scaled_estimate
uc011lsn.1  0.00    0
uc010unu.1  16.09   4.35780241451249e-07
uc010uoa.1  4.00    1.08327337941115e-07
uc002bgz.2  21.91   4.34097970946371e-07

First, I'm not sure why the files carry the RSEM output name (.rsem.isoforms.results) when, according to this post, they are not actual RSEM outputs because they lack the length column. Is there a way to download the correct RSEM outputs from the GDC legacy data?

Second, can I fix this manually? As I understand it, the length column is just the length of the transcript, so could I look up the length of each isoform ID (e.g. from UCSC) and add a corresponding column?
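To make the second question concrete, here is a minimal sketch of the manual fix I have in mind. It assumes the files are tab-delimited and that I have already built a two-column table of isoform IDs and transcript lengths (e.g. from a UCSC annotation). That table, its column names, and the function name `add_length_column` are all hypothetical, and the length value in the toy input is a made-up placeholder, not a real transcript length.

```python
import csv
import io

def add_length_column(isoforms_tsv, lengths_tsv):
    """Return the isoform rows with a 'length' field added to each.

    isoforms_tsv: text of a TCGA-style file (isoform_id, raw_count, scaled_estimate)
    lengths_tsv:  text of a hypothetical two-column table (isoform_id, length)
    """
    # Build an ID -> length lookup from the annotation-derived table
    lengths = {
        row["isoform_id"]: int(row["length"])
        for row in csv.DictReader(io.StringIO(lengths_tsv), delimiter="\t")
    }
    out = []
    for row in csv.DictReader(io.StringIO(isoforms_tsv), delimiter="\t"):
        # None marks IDs missing from the length table so they are easy to spot
        row["length"] = lengths.get(row["isoform_id"])
        out.append(row)
    return out

# Toy input: IDs and counts from the file above; the length is a placeholder
isoforms = (
    "isoform_id\traw_count\tscaled_estimate\n"
    "uc011lsn.1\t0.00\t0\n"
    "uc010unu.1\t16.09\t4.35780241451249e-07\n"
)
lengths = "isoform_id\tlength\nuc010unu.1\t1000\n"

rows = add_length_column(isoforms, lengths)
for r in rows:
    print(r["isoform_id"], r["raw_count"], r["length"])
```

This would only add the annotated transcript length. It would not reproduce the other columns of a genuine RSEM isoforms.results file, such as effective_length, so I'm unsure whether it is enough for downstream tools that expect real RSEM output.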

These files were downloaded through the GDC API using Python.

I would really appreciate the help of someone more experienced.

TCGA RNA-Seq alternative splicing RSEM R • 385 views

