Question

RNA-seq analysis of multiple patients with single sample each

1


3.2 years ago

Joshua Soon ▴ 10

Dear Biostars community,

I am hoping to get some advice on an analysis I intend to perform. I have RNA-seq FASTQ data from a set of patient samples, made available through a collaboration with a hospital. Unfortunately, each patient in the study had only a single tumour tissue sample sent for RNA-seq, giving one set of FASTQ files per patient. As for healthy tissue we have one sample with three replicates.

My main question is: how should I approach downstream analysis once I have obtained read counts from the aligned FASTQ files? I am familiar with cell line and xenograft-based RNA-seq analysis with 3 or more samples, where the reads go through a STAR, HTSeq-count, DESeq2 pipeline to find differentially expressed (DE) genes. The DE genes can then be subjected to gene ontology or GSEA analysis, etc. But when I have many patients with a single sample each, how do I perform meaningful downstream analysis even when true statistical significance cannot be determined?

On a side note, I am aware that edgeR can handle single-sample analysis if the dispersion is specified manually, but I've seen many other forum posts suggesting that this method is subject to user bias and other limitations.

Some questions I am thinking about are:

Should I just look at gene expression level on a gene-by-gene basis and not focus so much on statistical differential expression?
Can I group patients by age, cancer stage, other demographics… and look at variation in gene expression across these groups?
What kind of figures should I be looking towards from such an analysis? (probably no volcano plots since no DE genes).

Thank you in advance!

RNA-Seq R rna-seq • 1.6k views

updated 3.2 years ago by i.sudbery 19k • written 3.2 years ago by Joshua Soon ▴ 10

0


As for healthy tissue we have one sample with three replicates.

Is that from an independent person (not one of the patients)? Or is there a "healthy" sample from each patient (desirable but highly unlikely)?

3.2 years ago by GenoMax 141k

0


It's from an independent donor (not one of the patients).

3.2 years ago by Joshua Soon ▴ 10

0


What question are you trying to answer with this dataset?

3.2 years ago by swbarnes2 14k

score 3 · Answer 1 · 2021-02-04

My view on this sort of data is that there are basically two different ways you can think about it:

1. As population-type data rather than two-condition data: you can cluster it, run PCA, or look for outliers.
2. Treat each patient as a biological replicate, and any samples that come from the same patient as technical replicates (and thus to be collapsed into a single sample). You can then do DE between normal and cancer.
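To make the "collapse technical replicates" step concrete, here is a minimal sketch in plain Python. It assumes raw counts are held as nested dictionaries and that you know which library came from which individual; the function name, sample names and gene names are all made up for illustration. In R, DESeq2's `collapseReplicates` does the same kind of summing on a real dataset.

```python
from collections import defaultdict


def collapse_technical_replicates(counts, library_to_individual):
    """Sum raw read counts over libraries that come from the same individual.

    counts: {library_id: {gene: raw_count}}
    library_to_individual: {library_id: individual_id}
    Returns {individual_id: {gene: summed_count}}.
    """
    collapsed = defaultdict(lambda: defaultdict(int))
    for library_id, gene_counts in counts.items():
        individual = library_to_individual[library_id]
        for gene, n in gene_counts.items():
            collapsed[individual][gene] += n
    return {ind: dict(genes) for ind, genes in collapsed.items()}


# Hypothetical toy data: three replicate libraries from one healthy donor,
# plus one library per patient.
counts = {
    "normal_rep1": {"GENE_A": 10, "GENE_B": 0},
    "normal_rep2": {"GENE_A": 12, "GENE_B": 1},
    "normal_rep3": {"GENE_A": 8, "GENE_B": 2},
    "patient1": {"GENE_A": 30, "GENE_B": 5},
    "patient2": {"GENE_A": 25, "GENE_B": 7},
}
library_to_individual = {
    "normal_rep1": "donor1",
    "normal_rep2": "donor1",
    "normal_rep3": "donor1",
    "patient1": "patient1",
    "patient2": "patient2",
}

collapsed = collapse_technical_replicates(counts, library_to_individual)
print(collapsed["donor1"])  # {'GENE_A': 30, 'GENE_B': 3}
```

Note what happens to the sample sizes: after collapsing, the three normal libraries become a single individual, so the number of biological replicates in each group is the number of people, not the number of libraries.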

You can read more of my thoughts on biological vs technical replication here: A: Replicates for RNA-seq from 1 cell line undergoing different treatments

Basically, if you want to say something about one individual compared to another, then you need multiple samples from each of the individuals. If you want to say something about two groups of individuals (say cancer and normal), then you need single samples from multiple individuals.

One thing that worries me is this:

As for healthy tissue we have one sample with three replicates.

Does this mean you only have healthy tissue from one person? If so, that's going to stop you from doing the second approach above. Your replication needs to be on the same level in each group.

score 2 · Answer 2 · 2021-02-04

Should I just look at gene expression level on a gene-by-gene basis and not focus so much on statistical differential expression?

Technically you can, but this is both laborious and unreliable: analysis without statistics is basically guessing, and it carries the danger that you cherry-pick what you find interesting without actual support from the data.

Can I group patients by age, cancer stage, other demographics… and look at variation in gene expression across these groups?

Yes, that is commonly done in such a setup. Many cancer studies have no normal samples and focus on analysis within the cohort. I hope you have a large sample size, otherwise this is probably going to be cumbersome.
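As a toy illustration of a within-cohort comparison, the sketch below runs an exact permutation test on one gene's log-scale expression between two patient groups (here "early" vs "late" stage). Everything in it is an assumption for illustration: the function name, the stage labels and the expression values are invented, and a real analysis would more likely fit a count-based model on all genes (for example DESeq2 with the grouping variable in the design) and correct for multiple testing.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_pvalue(values, labels, group):
    """Two-sided exact permutation p-value for the difference in means
    between samples labelled `group` and all remaining samples."""
    n = len(values)
    in_group = [i for i, lab in enumerate(labels) if lab == group]

    def mean_diff(indices):
        chosen = set(indices)
        a = [values[i] for i in chosen]
        b = [values[i] for i in range(n) if i not in chosen]
        return mean(a) - mean(b)

    observed = abs(mean_diff(in_group))
    hits = total = 0
    # Enumerate every way of assigning the same number of samples to `group`.
    for indices in combinations(range(n), len(in_group)):
        total += 1
        if abs(mean_diff(indices)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Hypothetical log2 expression of one gene in eight patients.
expression = [5.1, 4.8, 5.3, 5.0, 7.9, 8.2, 7.7, 8.4]
stage = ["early", "early", "early", "early", "late", "late", "late", "late"]

p = exact_permutation_pvalue(expression, stage, "early")
print(round(p, 4))  # 0.0286, i.e. 2 of the 70 possible splits
```

The smallest p-value this test can return with four patients per group is 2/70, which shows why small groups limit what a within-cohort comparison can detect once thousands of genes are tested.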

What kind of figures should I be looking towards from such an analysis? (probably no volcano plots since no DE genes).

That entirely depends on the question you want to answer. Volcano plots are just one of many ways to visualize the relationship between effect size and significance. MA-plots are a different type that does not strictly need a significance value, but all of this is visualization of results rather than generation of results. Without knowing the details, I would probably focus on a within-cohort analysis, since it seems you have no reliable normal source. Please do not use the approach of feeding dispersion values to edgeR: while that is technically possible, it amounts to making up a critical parameter of the analysis without data-driven support (imho).
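For reference, the two axes of an MA-plot can be computed without any significance value. The sketch below is a minimal version assuming simple counts-per-million scaling and a pseudocount of 1; the function names and the toy counts are hypothetical, and dedicated packages use more careful normalization than this.

```python
import math


def cpm(counts):
    """Scale raw counts to counts per million for one sample."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def ma_values(counts_x, counts_y, pseudocount=1.0):
    """Return {gene: (M, A)} where M is the log2 ratio of x over y
    and A is the mean log2 expression of the two samples."""
    cx, cy = cpm(counts_x), cpm(counts_y)
    result = {}
    for gene in cx.keys() & cy.keys():
        lx = math.log2(cx[gene] + pseudocount)
        ly = math.log2(cy[gene] + pseudocount)
        result[gene] = (lx - ly, (lx + ly) / 2)
    return result


# Hypothetical toy counts for one tumour sample and one normal sample.
tumour = {"GENE_A": 300, "GENE_B": 100, "GENE_C": 600}
normal = {"GENE_A": 100, "GENE_B": 100, "GENE_C": 800}

ma = ma_values(tumour, normal)
for gene in sorted(ma):
    m, a = ma[gene]
    print(f"{gene}\tM={m:.2f}\tA={a:.2f}")
```

Plotting A on the x-axis against M on the y-axis gives the MA-plot. It shows how large the differences are at each expression level, but on its own it says nothing about whether they are reproducible.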