gene expression design
Entering edit mode
10 days ago


I have 5 RNA-seq samples (with 3 repetition for each sample) from different cell lines that are not related to each other. My goal is to analyze gene expression, but I am unsure about the best approach to design the analysis. Since I don't have control samples as a reference for the comparison (e.g., healthy vs. cancer cells), I am focusing on the expression levels of each gene in each sample individually.

If I only have one sample, how should I interpret the results? How can I determine if a gene is highly expressed? Additionally, if I have only one condition (one sample without any reference sample), how should I create the design matrix?

If I had experiments with a few conditions that I wanted to compare to each other, I would do something like this:

group = factor(sampleinfo$condition)
y <- DGEList(count_data, group=group)
condition <- factor(sampleinfo$condition)
design <- model.matrix(~ 0 + condition)
yNorm <- calcNormFactors(y)

But I can't do model.matrix with only one sample.

Would it make sense to perform gene expression analysis on all samples together and normalize them with 'between samples' normalization methods, even though they are completely unrelated? This way, I could compile them into one table and identify the top 20 genes in each sample.

I would greatly appreciate any help!!!

R gene-expression RNA-seq • 350 views
Entering edit mode
9 days ago
rfran010 ▴ 990

I would need to reference my own workflows to remember/give detailed steps, but I believe you can leave out the model.matrix since you don't have any reps, same with dispersion estimates.

I would keep the calcNormFactors and then perform rpkm.

You can then represent these values in a heatmap to compare gene expression between the cell lines.

As a note, in this case I think it would make sense to only keep reads found in ALL samples. Unless you're confident these libraries were prepared and sequenced in a comparable manner, but it's hard to make that determination in general, and especially without reps (since any one sample could be an outlier, but you cannot detect that).

For extra rigor, maybe you can find published RNA-gene expression of related cell lines to see if your data makes sense and to give a reference point to establish how similar the 5 cell lines are.

Entering edit mode

Thank you for your response! I failed to mention that I have 3 replicates per sample. If I want to analyze just one sample (with 3 replicates) at a time, how can I determine if a gene is highly or lowly expressed without any baseline (no control)?

The 5 samples were simply sequenced together, and I don't need to compare them. Maybe mentioning them initially added unnecessary complexity to my question.

I'm more confused about designing the analysis than the technical aspects.

Entering edit mode

Are you interested in genes enriched specifically in each cell type? Or do you really just want a ranked list for each sample?

If you really only care about gene expression within the sample (ranked list), it can be more simple analysis. Generate transcript-length-normalized counts (RPK, RPKM or TPM). This will give you read density per gene. Then put all values in a table to look at genes with highest density as most expressed genes.

Between samples normalization is not necessary, but if you do RPKM or TPM, that is a form of between samples normalization anyways.

I'm not great with stats, but you can probably get fancy with tests to call "statistically enriched" genes based on the variance from your replicates. You are would be looking for genes with higher than average expression, so significantly different from the average.

You can also consider Z-score. Using length-normalized counts, you subtract the average count of all genes from the counts for each gene and divide by the standard deviation. This will give you a stat of how many far from the mean a gene is (e.g. a value of 1 = 1 standard deviation away from the mean). Then plot a heatmap to find genes with consistently high zscore.

(Xgene - MEANallgenes) / SDallgenes

Many of the most highly expressed genes will be basic housekeeping genes that many people don't find interesting, so you can consider filtering these out, maybe by determining the common genes in each cell sample.

Entering edit mode

Thank you for your answer!!! I think ranked list might work. Do I need to use calcNormFactors(y) to normalize the library size between the replicas, and then do RPKM(y)? I think RPKM and TPM scales the count data by gene length, so it doesn't adjust the library size.


Login before adding your answer.

Traffic: 2158 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6