My aim is to compare the expression patterns of two genes in TCGA BRCA and then run a survival analysis downstream. I want to compare this data with our in-house data. Our in-house data are transcript-level estimates from the .ctab files produced by the StringTie package. I converted those files into a matrix with the tximport function, first as FPKM and then as TPM values, which I transformed to log2(TPM+1).
My question is: which data should I download from TCGA BRCA to get transcript-level estimates?
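For reference, the FPKM to TPM to log2(TPM+1) step on the in-house data looks roughly like the sketch below. The helper name `fpkm_to_tpm` and the toy matrix are made up for illustration; the real input would be the transcript-by-sample matrix obtained from the .ctab files.

```r
# Hypothetical helper: rescale each sample (column) of an FPKM matrix so
# that it sums to one million, which gives TPM.
fpkm_to_tpm <- function(fpkm) {
  sweep(fpkm, 2, colSums(fpkm), "/") * 1e6
}

# Toy matrix: 3 transcripts x 2 samples (made-up values, not real data)
fpkm <- matrix(c(10, 20, 70,
                 5,  5, 40),
               nrow = 3,
               dimnames = list(c("tx1", "tx2", "tx3"), c("s1", "s2")))

tpm <- fpkm_to_tpm(fpkm)

# Log transform used for the comparison
log_tpm <- log2(tpm + 1)
```

Because TPM is only a per-sample rescaling of FPKM, this step needs no length information, unlike the conversion from raw counts.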
I have noted that HTSeq-counts and HTSeq-FPKM are both gene-level estimates.
There is also another file type (data.category = "Gene expression", data.type = "Gene expression quantification") which contains "raw counts" and "scaled estimates". Are these gene-level or transcript-level estimates? The data from RSEM normalized_results also appear to be gene-level estimates.
I have also noticed that if I download raw counts and try to convert them into TPM (using the counttoTPM R script), I am missing the gene length information. The function counts_to_tpm <- function(counts, featureLength, meanFragmentLength) requires both featureLength and meanFragmentLength. Even if I assume meanFragmentLength = 75, I still need featureLength for every gene/transcript.
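To make the problem concrete, here is a minimal sketch of what I understand such a conversion to do. It is my own simplified version with the same three arguments, not the original script, and the name `counts_to_tpm_sketch`, the effective-length formula (feature length minus mean fragment length plus one) and the toy numbers are assumptions for illustration.

```r
# Simplified sketch (not the original script): counts -> TPM using an
# effective length of featureLength - meanFragmentLength + 1.
counts_to_tpm_sketch <- function(counts, featureLength, meanFragmentLength) {
  stopifnot(length(featureLength) == nrow(counts),
            length(meanFragmentLength) == ncol(counts))

  # Effective length for every feature (row) in every sample (column)
  eff_len <- outer(featureLength, meanFragmentLength, "-") + 1
  stopifnot(all(eff_len > 0))

  # Reads per base of effective length, then rescale each sample to 1e6
  rate <- counts / eff_len
  sweep(rate, 2, colSums(rate), "/") * 1e6
}

# Toy example: 3 features x 2 samples (made-up values, not TCGA data)
counts <- matrix(c(100, 100, 100,
                   50, 200, 10),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
feature_length <- c(1000, 2000, 500)   # one length per gene/transcript
mean_frag_len  <- c(200, 200)          # one value per sample

tpm_from_counts <- counts_to_tpm_sketch(counts, feature_length, mean_frag_len)
```

The sketch shows where I am stuck: `feature_length` needs one value per gene/transcript, and that is exactly the information I do not have for the TCGA counts.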
I can also get the library size information with:
q = files() %>%
    filter(~ cases.project.project_id == 'TCGA-BRCA' &
             data_type == 'Aligned Reads' &
             experimental_strategy == 'WXS' &
             data_format == 'BAM') %>%
    select('file_id') %>%
    expand('analysis.metadata.read_groups')
file_ids = ids(q)
z = results_all(q)
read_length_list = sapply(z$analysis$metadata$read_groups, '[[', 'read_length')
z$analysis$metadata$read_groups %>% bind_rows() %>% as_tibble() %>% View()
But I have no idea how to use read_length to calculate TPM from raw counts.
So my question is: how can I get transcript-level estimates in TPM from TCGA BRCA data? What is the best way to do it?
Looking forward to your input.