Revisiting where to find CCLE RNA-seq data in FPKM or RPKM (computed from RSEM values) for re-normalization, as this was never answered usefully
9 months ago

I would like to find a CCLE RNA expression file that contains either effective gene lengths or FPKM/RPKM values (calculated from RSEM expected counts), so that we can do our own upper-quartile normalization of CCLE gene expression.

I don't like the way the protein-coding TPM files have been generated: it looks like the values were taken from the larger TPM files covering 53,000+ analytes and simply extracted as-is for the protein-coding subset. The RSEM expected counts should first have been filtered to protein-coding genes only, and TPM should then have been recalculated on that subset. That would give a different result, in which the protein-coding TPMs of each sample add up to the same total of 1 million. To me it looks like this was not done properly, so I would like to perform my own normalization using only protein-coding genes.

I can see a gene count file and an RPKM file under CCLE 2019, but the gene counts are not RSEM expected counts (I think they are raw counts), and it is unclear whether RPKM was calculated with effective or constant gene lengths, and whether it was based on RSEM estimates or on the raw gene counts file.
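To make the request concrete, here is a minimal sketch of the recalculation described above: filter to protein-coding genes first, then compute TPM from expected counts and effective lengths on that subset. The function names, the `coding` flags, and the toy numbers are all made up for illustration; nothing here reflects how the CCLE files were actually produced.

```python
def tpm_from_counts(expected_counts, effective_lengths):
    """TPM for one sample from expected counts and effective lengths (in bases)."""
    # Per-base read rate for each gene; non-positive lengths contribute a rate of 0.
    rates = [c / l if l > 0 else 0.0
             for c, l in zip(expected_counts, effective_lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in rates]
    # Scale so that the values in this gene set sum to one million.
    return [r / total * 1e6 for r in rates]


def protein_coding_tpm(expected_counts, effective_lengths, is_protein_coding):
    """Filter to protein-coding genes first, then compute TPM on that subset only."""
    keep = [i for i, flag in enumerate(is_protein_coding) if flag]
    sub_counts = [expected_counts[i] for i in keep]
    sub_lengths = [effective_lengths[i] for i in keep]
    return tpm_from_counts(sub_counts, sub_lengths)


# Hypothetical toy sample: four genes, the third one is non-coding.
counts = [100.0, 300.0, 50.0, 550.0]
lengths = [1000.0, 1500.0, 500.0, 1100.0]
coding = [True, True, False, True]

all_tpm = tpm_from_counts(counts, lengths)           # TPM over all genes
extracted = [t for t, flag in zip(all_tpm, coding) if flag]  # subset taken "as is"
pc_tpm = protein_coding_tpm(counts, lengths, coding)  # TPM recomputed on subset

print(sum(extracted))  # less than 1e6, because the non-coding gene took a share
print(sum(pc_tpm))     # sums to one million, up to floating-point rounding
```

The point of the comparison is that the extracted subset no longer sums to 1 million per sample, whereas the recomputed subset does.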

RNA CCLE • 1.4k views

Original fastq data is available: https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=1&WebEnv=MCID_64d12e6c7645fa11d55e0f0f&o=acc_s%3Aa

It would be a significant amount of work (looks like 32 TB of raw data) but you can generate counts/data in any format you need.

9 months ago
LauferVA 4.2k

A quick Google search does the trick:

https://www.cbioportal.org/study/summary?id=ccle_broad_2019


Thanks LauferVA. Sadly, it doesn't solve the problem. The older files contain what appear to be raw counts, not RSEM estimated counts. The updated count data does come from RSEM, but according to the DepMap forum the gene lengths have not been released yet. That means there is no way to re-normalize the data, and the TPM-normalized data appear to have included non-protein-coding genes in the normalization. I suppose there is some argument to be made for including non-coding RNA in the TPM normalization, but that is not how TCGA did it, and as a biologist I don't like that approach, since non-coding RNAs do not compete for protein synthesis in quite the same way within a cell. I have replied on the DepMap forum and requested that they expedite the release of the gene lengths that match the RSEM data files. Best, Mitch
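One observation that may be relevant while the gene lengths are unavailable: if the published values are standard TPM, i.e. TPM_i = (c_i / l_i) / sum_j(c_j / l_j) * 1e6 over all genes, then rescaling the protein-coding subset so it sums to 1 million is algebraically the same as recomputing TPM on that subset, because the lengths cancel. The sketch below only checks this identity on made-up numbers. Whether the CCLE/DepMap TPM files satisfy that assumption is not confirmed in this thread, and if a file stores log-transformed values they would have to be back-transformed first. The function name and data are hypothetical.

```python
def rescale_tpm_subset(tpm_values, keep_flags):
    """Keep the flagged genes and rescale their TPMs to sum to one million."""
    kept = [t for t, flag in zip(tpm_values, keep_flags) if flag]
    total = sum(kept)
    if total == 0:
        return [0.0 for _ in kept]
    return [t / total * 1e6 for t in kept]


# Hypothetical per-base rates (count / effective length) for four genes;
# the third gene is treated as non-coding.
rates = [0.1, 0.2, 0.1, 0.5]
coding = [True, True, False, True]

# TPM over all genes, as a published all-gene file would contain it.
tpm_all = [r / sum(rates) * 1e6 for r in rates]

# Route 1: rescale the published TPMs of the coding subset (no lengths needed).
rescaled = rescale_tpm_subset(tpm_all, coding)

# Route 2: recompute TPM directly from the rates of the coding subset.
coding_rates = [r for r, flag in zip(rates, coding) if flag]
direct = [r / sum(coding_rates) * 1e6 for r in coding_rates]

# Largest difference between the two routes; only rounding error is expected.
print(max(abs(a - b) for a, b in zip(rescaled, direct)))
```

This covers TPM re-normalization only; an upper-quartile normalization based on expected counts or FPKM would still need the counts and, for FPKM, the matching lengths.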


Sounds like you may need to approach CCLE directly, or accept the limitations of the datasets provided.


Thanks, I am awaiting a response in their forum. This is a very important, widely used dataset, so I hope they resolve the issue soon or at least tell me I am wrong. Best, Mitch
