Question: T-test on RNA-seq data
11 days ago by
shivangi.agarwal800 wrote:


I have TPM expression values for transcripts in cancer and normal samples across all stages of BRCA. The data are RNA-seq from TCGA. I want to calculate a p-value, i.e. the significance of the difference between normal and cancer samples. Can I apply a Student's t-test for this purpose? Total cancer samples = 1000. Total normal samples = 1100.

With regards

rna-seq ttest • 175 views
modified 10 days ago • written 11 days ago by shivangi.agarwal800

Related thread: A: Normalization of TCGA RNA-seq data (TPM) in R

written 11 days ago by WouterDeCoster

OK, thanks for your reply. If I have expression values for different transcripts and have to assign a gene ID/symbol to each transcript, what will the expression level of the gene be? Will it be the average over the transcripts belonging to that gene, or the expression of the transcript with the highest value? Is there any reference or paper on this?

written 10 days ago by shivangi.agarwal800

Do you mean different transcript isoforms? You can paste some examples of your data, if you want (?)

written 10 days ago by Kevin Blighe

Yes, sure. The example data is:

uc010dfa.1 0.1056 2.1825 ABCA10
uc002jhz.2 0.251 4.4502 ABCA10
uc010wqs.1 0.3977 5.294 ABCA10
uc010dfb.1 0.0484 0.5487 ABCA10
uc010dfc.1 0.0976 1.1013 ABCA10
uc010wqt.1 0.0743 0.386 ABCA10

The columns are the transcript ID, expression in cancer cells (TPM), expression in normal cells (TPM), and the associated gene. If I have to assign one expression value per gene, what will the expression of ABCA10 be? Should I average the expression of all its transcripts, or take the transcript with the highest expression? Please suggest.

modified 9 days ago • written 9 days ago by shivangi.agarwal800

If you have no other information apart from this, such as gene length, then you may obtain the average expression for each gene. It is not ideal, but may be the best option given the data that you've got.
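For illustration only, here is a minimal Python sketch of that summarisation, using the six ABCA10 rows posted above. The function name `summarise_by_gene` is made up for this example. The mean is what is suggested here; median, max and sum are shown only for comparison. Summing is, as far as I know, the convention tximport uses when it collapses transcript abundances to genes, but check its documentation before relying on that.

```python
from statistics import mean, median

# Rows copied from the example above: (transcript id, cancer TPM, normal TPM, gene).
rows = [
    ("uc010dfa.1", 0.1056, 2.1825, "ABCA10"),
    ("uc002jhz.2", 0.251, 4.4502, "ABCA10"),
    ("uc010wqs.1", 0.3977, 5.294, "ABCA10"),
    ("uc010dfb.1", 0.0484, 0.5487, "ABCA10"),
    ("uc010dfc.1", 0.0976, 1.1013, "ABCA10"),
    ("uc010wqt.1", 0.0743, 0.386, "ABCA10"),
]

def summarise_by_gene(rows, how=mean):
    """Collapse transcript-level TPM to one (cancer, normal) pair per gene."""
    grouped = {}
    for _tx, cancer, normal, gene in rows:
        cancer_vals, normal_vals = grouped.setdefault(gene, ([], []))
        cancer_vals.append(cancer)
        normal_vals.append(normal)
    # Apply the chosen summary function to each column separately.
    return {g: (how(c), how(n)) for g, (c, n) in grouped.items()}

# Compare a few possible summaries for the same gene.
for name, how in [("mean", mean), ("median", median), ("max", max), ("sum", sum)]:
    cancer, normal = summarise_by_gene(rows, how)["ABCA10"]
    print(f"{name:>6}: cancer={cancer:.4f} normal={normal:.4f}")
```

Whichever summary you pick, state it in your methods, because the gene-level value changes quite a bit between them.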

written 8 days ago by Kevin Blighe

OK, if gene length is also available, then what would be the method to calculate gene expression?

written 7 days ago by shivangi.agarwal800

Wait, actually, your data is already normalised, so, it may already have been adjusted for gene lengths. If you want to look further into it, tximport and DESeq2 take gene lengths into account for the purposes of normalisation. In your situation, it may very well be sufficient to just obtain the mean expression over the transcript isoforms. Otherwise, you could download the raw counts from TCGA and re-process them, but this could waste a lot of time.
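To show what "already adjusted for length" means for TPM, here is a small Python sketch of the standard TPM calculation: divide each count by the transcript length, then rescale so that the sample sums to one million. The counts and lengths are invented, and the function name `tpm` is just for this example. It is not the exact pipeline used to produce your file.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from read counts and transcript lengths (in kb)."""
    # Step 1: reads per kilobase removes the length bias.
    rpk = [c / l for c, l in zip(counts, lengths_kb)]
    # Step 2: rescale so that the values in one sample sum to one million.
    scale = sum(rpk) / 1e6
    return [r / scale for r in rpk]

# Hypothetical counts and lengths for three transcripts in a single sample.
counts = [100, 200, 300]
lengths_kb = [1.0, 2.0, 6.0]

values = tpm(counts, lengths_kb)
print([round(v, 1) for v in values])
```

Note that the third transcript has the most reads but the lowest TPM, because it is also the longest.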

written 7 days ago by Kevin Blighe

Is there any reference quoting this?

written 7 days ago by shivangi.agarwal800

Kevin Blighe is a reference.

written 7 days ago by WouterDeCoster

Actually, I want to know for citation purposes.

written 7 days ago by shivangi.agarwal800

Devon Ryan is a reference, too (I asked him).

Your data has already been heavily processed and is merely a summarisation (likely by mean) of the expression of tumour and normal samples. So, now your entire numerical data is just 2 columns: one for tumour; one for normal.

Unless you have the original data from which these were generated, there's not much else you can do other than summarise over the transcript isoforms by median or mean. If you wanted to go back and ensure that transcript lengths were taken into account, then you'd have to obtain the original data, which would encompass raw or estimated counts over all samples.

Chances are that you didn't obtain this data directly from the TCGA, as the TCGA does not output data in that highly summarised format. Third parties that use and re-process the TCGA data do output data in this format that you have, though. Check the exact data processing methods at the source from which you obtained this data.

If you have absolutely nothing else other than this data and don't know how to obtain the original raw data and re-process it faithfully, then perhaps the best that you can do is indeed a Mann-Whitney test (a non-parametric alternative to the t-test). It's really not ideal, though, and it's annoying that third parties would output data like this for end-users.
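As a rough sketch of what that test does, here is a pure-Python Mann-Whitney U test with the normal approximation and a tie correction (no continuity correction). It needs per-sample values for each group, so the two vectors below are invented purely for illustration. The function name `mann_whitney_u` is my own. In practice you would more likely use an established implementation such as `wilcox.test` in R.

```python
import math

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test using the normal approximation.

    Returns (U statistic for x, approximate p-value). Ties get average ranks.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t**3 - t  # used for the tie-corrected variance
        i = j
    r1 = sum(r for r, (_, grp) in zip(ranks, pooled) if grp == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    n = n1 + n2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))
    if sigma == 0:  # every value identical: no evidence of a difference
        return u1, 1.0
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, p

# Hypothetical per-sample TPM values for one transcript (not real TCGA data).
cancer = [0.11, 0.25, 0.40, 0.05, 0.10, 0.07, 0.30, 0.15]
normal = [2.18, 4.45, 5.29, 0.55, 1.10, 0.39, 3.20, 0.90]

u, p = mann_whitney_u(cancer, normal)
print(f"U = {u}, approximate two-sided p = {p:.4g}")
```

With only one summarised value per group, as in the two-column file, there is nothing for a test like this to rank, which is why the original per-sample data matters.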

modified 6 days ago • written 6 days ago by Kevin Blighe
11 days ago by
Kevin Blighe (Republic of Ireland) wrote:

Best not to, considering that the distribution of your data is likely not that expected by a Student's t-test.
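One quick way to see the problem is to look at how skewed the values are. The Python sketch below uses ten made-up TPM values for a single transcript and compares the sample skewness before and after a log2(TPM + 1) transform. The helper `skewness` and the data are assumptions for illustration, not taken from the BRCA data.

```python
import math
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness (third standardised moment); about 0 for symmetric data."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical TPM values for one transcript across ten samples.
tpm_values = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.8, 1.5, 6.0, 40.0]
log_values = [math.log2(v + 1) for v in tpm_values]

for label, vals in [("raw TPM", tpm_values), ("log2(TPM+1)", log_values)]:
    print(f"{label:>12}: mean={mean(vals):.2f} median={median(vals):.2f} "
          f"skewness={skewness(vals):.2f}")
```

In this toy example the mean sits far above the median, and the log transform reduces the skew without removing it. Expression data often behaves like this, which is part of why dedicated count-based methods are preferred.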

Why not use a program dedicated to differential expression analysis? Obtain the RSEM raw counts for BRCA and process them through edgeR or DESeq2. Then, you can obtain more 'credible' statistics.

I have already computed the stats for BRCA for both matched T-N and unmatched T-N, if you want.


written 11 days ago by Kevin Blighe