Question

DGE from tumor-adjacent normal pair RNA-seq data. For an individual, no replicate

0

Entering edit mode

2.6 years ago

field654 ▴ 30

Update. problem solved. Let me re-instate my question. I have tumor-normal paired RNA-seq data from multiple patients. However, for each tissue, I got no replicate. I'd like to perform tumor-normal for each individual and have some credibility indication, since I have multiple healthy tissue samples. I turned to Dr. Smyth, the author of EdgeR package. It turned out the my application can be performed.

The preparation work has been discribed in the UserGuide Section 4.1 where a overall comparison was made between the tumor and healthy samples from 3 patients. The BCV generated from the population would be applied. So after the global BCV's calculated, one could simply build a new design matrix by

design.patientspecific <- model.matrix(~Patient+Patient:Tissue)

Later on the fit and comparison can be made by

fit.patientspecific <- glmFit(y, design.patientspecific)
lrt.patient1 <- glmLRT(fit.patientspecific, coef=4)
lrt.patient2 <- glmLRT(fit.patientspecific, coef=5)
lrt.patient3 <- glmLRT(fit.patientspecific, coef=6)

The original question:

Hi dear folks. I apologize if I asked a repeated question. I have RNA-seq data from individuals that were acquired from paired tumor/adjacent tissues. I thought the DGE analysis should be straight-forward but it's not. Almost all pipelines I learned about require a biological repeat which I don't have. Of course I have data from other individuals but my focus is not on cross-comparing. I'm interested in the DGE for one individual only and that's to validate the MSMS data acquired from the same person.

I found only a few publications with similar applications. I wonder if you can recommend a pipeline for data analysis. I'm not sure, at this point, if I have enough data for a student T test and a p value.

Thank you so much. Field

Update, I wonder if I can take advantage of the data from multiple patients and apply the combined tagwise dispersion to one single patient. I will post this query on Bioconductor as well. If a clear solution's obtained, I will update this thread.

RNA-seq DGE statistics tumor-normal • 2.6k views

ADD COMMENT • link 2.6 years ago by field654 ▴ 30

0

Entering edit mode

@ ATpoint @dariober

I'm not sure why the comments from you are now invisible. Thank you both for your kindly reply. It took me a while to familiarize edgeR. I went through part of the manual, particularly the 2.12 (no replicate) and the 4.1 (example on carcinoma patients) sections. I understand the parameters verbally but not always mathmatically.

Let me know if my thought is valid. I'd like to take advantage that I have samples from multiple patients, so I can estimate dispersions more accurately. The more samples pairs I throw in, the bettter tagwise dispersion I would get. I wonder if I can keep the tagwise table (don't see why I can't) and apply that to each patient's data.

Also I don't fully understand the mechanisms of qCML and Glm modes. Are they functional alternatives? Should I prefer one over another for my application?

Thank you so much. Field

ADD REPLY • link 2.6 years ago by field654 ▴ 30

score 1 · Answer 1 · 2021-09-13

1

Entering edit mode

2.6 years ago

SUMIT ▴ 30

Single sample analysis can be done in edgeR, while it is deprecated in DNASeq2 in 2018 probably (if you can use old DESeq version then it will work in single sample also).

For egdeR you just do

library(edgeR)
setwd()
rawdata <- read.delim("filename", check.names=FALSE, stringsAsFactors=FALSE)
ngenes <- 10000 #no. of genes
nsamples <- 2 #no. of sample
Counts <- matrix(rnbinom(ngenes*nsamples,mu=5,size=2),ngenes,nsamples) 
rownames(Counts) <- rawdata[,1]
y <- DGEList(counts=Counts, group=c("Normal", "Cancer"))
y$samples
y<-calcNormFactors(y)
y$design <- model.matrix(~group,y$samples)
y<-estimateDisp(y, design=y$design)
y$fit<-glmQLFit(y, design=y$design)
topTags(glmQLFTest(y$fit,coef=2),n=Inf)
d <- topTags(glmQLFTest(y$fit,coef=2),n=Inf)
d
write.table(d, file = "Result.txt", row.names=TRUE, sep="\t")

Remember edgeR comparison is alphabetically like Cancer vs. Normal (C come before N). you can also use the design file and put it as design$condition. But for simplicity, I would recommend the log FC would work better than using any library. Just divide (take mean if more than one sample in a group) the tumor sample count from the normal sample and change it into log scale since all libraries are worthless due to the absence of outlier calling or p-value or p.adst etc.

ADD COMMENT • link 2.6 years ago by SUMIT ▴ 30

0

Entering edit mode

Hi Sumit, Thank you so much for posting your code. Since the work was assigned to a team member before, I would need some time to educate myself with edgeR. I tried to run the code but some error showed up that I need to debug.

The error I ran into says y<-estimateDisp(y, design=y$design) Warning message: In estimateDisp.default(y = y$counts, design = design, group = group, : No residual df: setting dispersion to NA

Not sure if I can debug by myself.

I have noticed that you imported data into "rawdata" but only used the gene names afterwards. I wonder if that's intended.

Best, Field

ADD REPLY • link 2.6 years ago by field654 ▴ 30

0

Entering edit mode

If you want to use edgeR and make up a dispersion estimate then look at its manual for the "no replicate" section which explains some concepts as suggested in the other answer. This is exploratory though and not reliable or suitable for publication. If you want information per patient then you should have created multiple libraries from the same patient, e.g. biopsies taken on different days. Your experimental design is not suitable to allow per-patient results. Frankly, one should consult an experienced analyst before running an experiment to ensure that the experimental layout is suitable to answer the question at hand.

ADD REPLY • link 2.6 years ago by ATpoint 81k

0

Entering edit mode

This is exploratory though and not reliable or suitable for publication

@ATpoint I can't remember if I had a similar exchange with you about this... Anyway... If you are careful with interpretation and analysis I think you can still get reasonable results. If most genes stay the same AND you focus on those with large differences (say logFC > 1 or 2) AND you can find a good explanation from background knowledge for those genes, then it's unlikely that they are an artifact. After all, tumour-normal variant detection is rarely done in replicates; in fact, most programs are not even designed to handle replicates (ok, it's not quite the same...).

ADD REPLY • link 2.6 years ago by dariober 14k

1

Entering edit mode

Variant calling is an entirely different process, the lab procedure alone (starting from DNA) is much less prone to technical bias as DNA is much more resistant to degradation whereas RNA is very susceptible to uniquiteous RNases, so I doubt this is a valid argument.

You can get any number of DEGs with FDR < 0.05 and FC beyond 2-fold depending on the bcv that the edgeR user guide suggests, there is no way to tell what the real value is.

library(airway)
library(edgeR)
data(airway)

#/ DGEList with two untreated samples, each forms a group:
y <- DGEList(counts=assay(airway)[,c(1,3)],
             group=c("A", "B"))
y <- calcNormFactors(y)

bcv <- seq(0.005, 0.5, 0.005)

res <- mclapply(bcv, mc.cores = 8, function(x){
  et <- exactTest(y, dispersion=x^2)$table
  et$fdr <- p.adjust(et$PValue, "BH")
  nrow(et[et$fdr < 0.05 & abs(et$logFC) > 1,])
})

plot(x=bcv, y=unlist(res), pch=20, main="DEGs as function of bcv")

enter image description here

Sure, if results make sense and can be validated, that is great. All I say is that this (imho) is not reliable. If you take the top 10 genes and make a story out of them (which might be great biology eventually) then the entire RNA-seq analysis as a whole is is still unreliable as the majority of calls can still be false. Does not mean one cannot get anything done with the data, it is just (as a whole) not reliable and should only serve for exploratory purposes, that is all I am saying.

ADD REPLY • link 2.6 years ago by ATpoint 81k

1

Entering edit mode

@ATpoint (+1) - if I sound obnoxious or repetitive it's because I'm confused and it helps me to put my thoughts in writing...

You can get any number of DEGs with FDR < 0.05 and FC beyond 2-fold depending on the bcv

This is not entirely correct, FC won't change with BCV, only the estimated variation and therefore the p-value/FDR will change. In your plot, genes on the right should be a subset of those on the left. In practice, one would have to look at the MAplot and choose conservative cutoffs for logFC, FDR, BCV, etc.

there is no way to tell what the real value is.

True, but this applies regardless of whether you have a single replicate or a handful; you need to make assumptions and have the background knowledge to interpret the results from such small experiments. In my opinion, assumption-free or data-driven discovery is untenable in most projects because the number of samples is far too small.

We agree that we need replicates - that's for sure. My point is that if you are happy with two or three replicates, then you should (?) be happy also with just one. In fact, I'm not condoning single-replicate experiments, I'm curbing the overinterpretation of handful-of-replicate ones.

At the bare minimum, the edgeR machinery can work with two replicates in one condition and a single replicate in the another condition. Intuitively, the design 2vs1 cannot be too much better than 1vs1 design, right? Even three replicates per condition give you very little power to capture the variability in the population. In other words, if you think of randomization as a strategy to break links between condition of interest and confounders, three replicates have very little power to break those links.

Imagine you sample 6 people entirely at random and no information is given about them. You assign three people to treatment and three to control; edgeR is going to tell you which genes are different. However, if you want to generalize to the effect of treatment-vs-control you need to make a number of assumptions because there are good chances that you are comparing 3 males vs three females or 3 youngs vs 3 olds or 3 good vs 3 bad preps. You need to assume that the variability in the source population is already very low (same sex, age, etc) and you probably need some knowledge to tell whether the detected genes are consistent with the treatment or they are due to confounders. If you can make those assumptions, then even a 1vs1 experiment is still a bit informative and, qualitatively, not too much worse.

In my opinion and to some extent experience, a single replicate experiment with good background for interpretation is better than an assumption-free experiment with a handful of replicates. Especially with more flimsy techniques like ATACseq and ChIPseq it's really difficult to draw conclusions without some backup. My 2p...

ADD REPLY • link 2.6 years ago by dariober 14k

0

Entering edit mode

No residual df: setting dispersion to NA

It is not a problem in single sample analysis. It is just a warning message. You can simply ignore it since you have only a single sample in the group.

I have noticed that you imported data into "raw data" but only used the gene names afterward. I wonder if that's intended.

Yes, it is required, you need to assign the gene no. Can you share your sample data table?

ADD REPLY • link 2.6 years ago by SUMIT ▴ 30

0

Entering edit mode

Hi Sumit,

Thank you so much for your reply. I did assigne the gene number to the column number of the data, so the assignment of rownames went smoothly. I encountered the error at the estimation dispersion step.

As I posted to reply ATpoint and Dariober, I'm trying to throw in data from more patients to get a more reliable array of tagwise dispersions. Working on it, not sure if that works.

The data table was shared below. Thank you. Field

https://www.dropbox.com/s/kdtx2dwazd9uz83/008_result.txt?dl=0

ADD REPLY • link 2.6 years ago by field654 ▴ 30

score 0 · Answer 2 · 2021-09-13

0

Entering edit mode

2.6 years ago

dariober 14k

See What to do if you have no replicates in the edgeR guide. Googling "edgeR no replicates" should also give you some useful pointers on how to proceed and interpret the results.

ADD COMMENT • link 2.6 years ago by dariober 14k

0

Entering edit mode

Thank you very much for your suggestion. As a nonprofessional bioinformatic performer, it took me a while to read the edgeR guide. Would you please check out the comment under the question? For some unknown reaosn the reply form you and ATpoint were gone so I can't reply to neither of you. Best. Field