Question

Reference counts for RNA-seq

1

20 months ago

Kermit ▴ 90

I have TPM values from 36 participants, but all of them are diseased.

Aligned to GRCh37.75
Ensembl genes

Are there any good sources of "healthy" control TPM counts?

expression rna rna-seq tpm transcripts • 1.0k views

ADD COMMENT • link updated 20 months ago by ATpoint 82k • written 20 months ago by Kermit ▴ 90

1

The only good source would have been matched control tissue included in your own study. RNA-seq is a relative assay, and batch effects between unrelated studies make it close to impossible to compare samples meaningfully unless they were prepared in the same batch, with the same kit, the same procedures, the same everything. Are you aware of that? What information exactly are you looking for?

ADD REPLY • link 20 months ago by ATpoint 82k

1

That's right, you cannot do differential expression (DE) analysis without samples from the same experiment. At most, you can compare your top expressed genes against the GTEx data to get an idea of whether they are also normally expressed in the same tissue. There are a few methods for meta-analysis of RNA-seq data, for example metaseq. Could the user apply one of these?
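The top-expressed-gene comparison mentioned above could be sketched roughly as follows. This is only an illustration: the gene IDs and TPM values are made up, `top_genes` is a hypothetical helper, and the "reference" dictionary stands in for whatever per-tissue summary one would export from a resource such as GTEx. Only ranks are compared, not TPM values.

```python
def top_genes(tpm_by_gene, n):
    """Return the set of the n genes with the highest TPM."""
    ranked = sorted(tpm_by_gene, key=tpm_by_gene.get, reverse=True)
    return set(ranked[:n])

# Made-up values for illustration only (not real expression data).
own = {"GENE_A": 900.0, "GENE_B": 500.0, "GENE_C": 120.0,
       "GENE_D": 3.0, "GENE_E": 0.0}
reference = {"GENE_A": 700.0, "GENE_B": 40.0, "GENE_C": 300.0,
             "GENE_D": 1.0, "GENE_E": 650.0}

own_top = top_genes(own, 3)
ref_top = top_genes(reference, 3)

shared = own_top & ref_top                       # top genes in both lists
jaccard = len(shared) / len(own_top | ref_top)   # overlap as a fraction of the union

print(sorted(shared), jaccard)
```

An overlap like this gives a rough sanity check at best; it says nothing about differential expression.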

ADD REPLY • link 20 months ago by Giovanni M Dall'Olio 28k

1

That said, how TPM is calculated is completely unstandardized. The correct approach is to use length information based on the actual length of the transcripts being expressed, as provided by e.g. the salmon-tximport pipeline. But some people instead use the entire length of the annotated gene, the average length of all its transcripts, the union of its exons, etc. Hence, on top of the experimental batch effect, TPMs from different sources might carry in silico batch effects, which makes even comparisons of top-expressed genes difficult. Sure, if something is zero in one dataset and skyrocketing in the other, the difference could be real, but everything else should be treated with great care (or not at all).
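To make the length issue concrete, here is a minimal sketch of the usual TPM formula (counts divided by length in kilobases, then rescaled so the sample sums to one million). The counts and both sets of lengths are invented for illustration, and the `tpm` function is a hypothetical helper, not the salmon or tximport implementation.

```python
def tpm(counts, lengths_bp):
    """TPM from read counts and per-gene lengths in base pairs."""
    # Reads per kilobase for each gene.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    # Rescale so the values in one sample sum to one million.
    return [r / total * 1e6 for r in rates]

counts = [100, 200, 300]             # made-up read counts for three genes
expressed_len = [1000, 2000, 1500]   # assumed length of the transcripts actually expressed
full_gene_len = [5000, 2000, 1500]   # assumed full annotated gene length

tpm_expressed = tpm(counts, expressed_len)
tpm_full = tpm(counts, full_gene_len)

print(tpm_expressed)
print(tpm_full)
```

Only the length of the first gene differs between the two runs, yet the TPM of every gene changes, because TPM is normalised within the sample. The same counts can therefore yield different TPMs depending on the length definition a pipeline uses.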

ADD REPLY • link 20 months ago by ATpoint 82k

0

Thank you. I am constructing a diagnostic algorithm that uses gene TPMs to predict the presence of the disease. Then I want to permute the TPMs to figure out the most important genes. So I need TPMs from both cases and controls.
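The permutation step described here is commonly done as permutation importance: shuffle one gene's values across samples and measure how much the classifier's accuracy drops. Below is a minimal sketch under stated assumptions: the toy data, the threshold rule standing in for a trained model, and the function names are all hypothetical.

```python
import random

def accuracy(predict, X, y):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the labels
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data: two "genes" per sample; only the first separates cases from controls.
X = [[5.0, 1.0], [6.0, 9.0], [7.0, 2.0], [50.0, 8.0], [60.0, 1.5], [70.0, 9.5]]
y = [0, 0, 0, 1, 1, 1]

def predict(row):
    # Stand-in for a trained model: a fixed threshold on the first gene.
    return int(row[0] > 20.0)

importances = permutation_importance(predict, X, y)
print(importances)
```

The scores only show what the model relies on. If cases and controls come from different studies, a gene can rank highly because it tracks the batch rather than the disease.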

ADD REPLY • link 20 months ago by Kermit ▴ 90

0

Without knowing the details, it sounds like you are using a "ground truth" that is confounded, so in all likelihood the performance of that algorithm will suffer. The arguments have been made above; it is up to you whether to follow them. Personally, I would find a collaborator and generate cases and controls in a matched study to have a solid ground truth.

ADD REPLY • link 20 months ago by ATpoint 82k