Power of differential expression for a single gene analysis
3.5 years ago

Dear community members,

I have one patient and X controls (all human), with RNA-seq data from the same cell type for both the patient and the controls. All samples were sequenced uniformly at the same sequencing center, to modern standards of coverage.

I know that one gene in the patient is deregulated (up or down) quite radically (expression is halved or increased by a factor of 1.5). I do not know which gene, so in the best case I have to check the ~4K genes with a disease-causing phenotype, and in the worst case all ~20K genes. I do know which sample is the patient (so I am testing 1-vs-all).

How many controls do I need to have high power to detect expression deregulation in 1 patient? The gene is assumed to be quite well expressed in the dataset (middle to high).

In other words: how many samples do I need to include when estimating the expression distribution so that, given the natural variability of expression, I can nominate several candidate genes whose deregulated expression survives FDR correction?

I know that tens of additional factors would need to be known beyond what I've said, so I am asking not for a strict estimate but for your gut feeling.

RNA-Seq • 1.1k views

Hi German, maybe you can do a PCA plot with all the samples, discard the outlier controls, and then use as many controls as are left after removing the outliers (I think if you have more than 3 samples it is already OK, though of course having just 1 patient is a problem). How is life, btw?


Hi Grant, not great, not terrible. How is it in Barcelona? :) Yep, I do plan to perform the PCA, but I need to know how many controls to select... I am not sure whether I have enough, or whether I need to ask for more to be sequenced, or maybe download some from GTEx and somehow normalize them... RNA-seq is quite a terra incognita for me...


"Not great, not terrible" sounds worrying :) All is fine here :) If I got it right, your goal is to find a specific gene which is up/down-regulated in the patient, right? I think a bigger problem for you is having only 1 patient sample, and the number of controls doesn't matter that much if you have at least 3-4 (assuming no batch/confounding effects). So if it were possible to get data from at least 1 more patient, that would be much more helpful for you, I think. Also, if you request sequencing of new samples, make sure to include your initial samples in the new run so you can properly account for a possible confounding batch effect. I am also just curious: what do you know about that gene? Because you are probably going to get a hundred or even hundreds of differentially expressed genes, and then how will you choose that specific one?


Yep, that's exactly the point =) I have a patient with a very rare Mendelian monogenic disease - I simply don't have any more cases, it is just too rare. So there should be 0 differences in expression between this patient and the controls, except for this particular gene (or maybe plus a couple of directly interacting genes). So I would like to recruit enough controls to be able to say confidently that 19,995 out of 20,000 genes are not differentially expressed (the patient's expression value lies somewhere within the distribution of the same gene in controls), but 5 genes look very deregulated. This is why it is different from a traditional group-vs-group comparison - I can't just run DESeq2 and check which genes differ between groups. I am hunting for one particular gene, so I can't even remove the batch effects... I just assume that having many controls will let me end up with only 5 or fewer differentially expressed genes to investigate manually (then it becomes a standard clinical approach, e.g. checking how this gene may cause this disease, etc).
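A minimal sketch of the 1-vs-all screen described here: score the patient's value for each gene against the spread in controls, then apply Benjamini-Hochberg across genes. The function names and the dict-based input layout are hypothetical, the inputs are assumed to be log2-scale expression values, and the normal approximation for the p-values is an assumption that is only reasonable with many controls (with a handful of controls a t-distribution would be needed instead).

```python
from statistics import NormalDist, mean, stdev


def patient_p_values(controls, patient):
    """Two-sided p-value per gene for one patient value vs. the control spread.

    controls: dict gene -> list of log2 expression values in controls (>= 2 each)
    patient:  dict gene -> log2 expression value in the patient
    Assumption: normal approximation, i.e. many controls per gene.
    """
    std_normal = NormalDist()
    p_values = {}
    for gene, values in controls.items():
        n = len(values)
        # sqrt(1 + 1/n) adds the uncertainty of the estimated control mean.
        scale = stdev(values) * (1 + 1 / n) ** 0.5
        if scale == 0:
            # No spread in controls: the score is undefined, so do not flag it.
            p_values[gene] = 1.0
            continue
        z = (patient[gene] - mean(values)) / scale
        p_values[gene] = 2 * (1 - std_normal.cdf(abs(z)))
    return p_values


def benjamini_hochberg(p_values):
    """Return dict gene -> Benjamini-Hochberg adjusted p-value."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    adjusted = {}
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        gene, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[gene] = running_min
    return adjusted
```

Genes whose adjusted value falls below the chosen FDR level (for example 0.05) would be the short list to inspect manually.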


I see, sounds like a fun and challenging project :) The thing is that you might see some variability in your controls caused by biological (gender, population, etc) + technical (batch and other crap) effects, so in any case you might need to (at least try to) make the controls as homogeneous as possible. After that, in my understanding, the question narrows down to finding whether a single expression value belongs to the distribution of expression values in controls. I am not a statistician, but I think you will need to calculate the prediction interval for a number of control samples that is feasible for you to have. This post might be helpful: https://stats.stackexchange.com/questions/62634/does-this-single-value-match-that-distribution/62653
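For reference, the textbook prediction interval for one new observation, assuming the log-scale expression of a gene is roughly normal across controls (that normality is an assumption, not something established in this thread). Here n is the number of controls, \bar{x} and s are the control mean and sample standard deviation for the gene, and \alpha is the per-gene error level:

```latex
\bar{x} \;\pm\; t_{1-\alpha/2,\;n-1}\; s\,\sqrt{1 + \frac{1}{n}}
```

As n grows, \sqrt{1 + 1/n} tends to 1 and the t quantile tends to the normal quantile, so the interval approaches \bar{x} \pm z_{1-\alpha/2}\, s. More controls tighten the estimate, but the interval can never become narrower than the biological spread s itself allows.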


Yep, everything is correct =) I will maybe use GTEx samples as controls - I think only this dataset has enough material from different tissues, so I'll actually try to sequence the case the way GTEx did =) I'll do the prediction intervals, but I don't know how many samples I need to make the prediction interval narrow enough to keep my truly deregulated gene among the True Positives without having too many False Positives... I don't have an intuition for the variability of RNA-seq at the single-gene level =)
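One rough way to build that intuition is to simulate the power of the prediction-interval test for different numbers of controls. The sketch below makes several assumptions: log2 expression is normal across controls, the between-individual SD of 0.25 log2 units is a made-up placeholder that should be replaced with values estimated from real data, and the per-gene threshold is a Bonferroni-style 0.05 / 20000, which is stricter than FDR control. All function names are hypothetical.

```python
import math
import random


def t_two_sided_p(t, df):
    """P(|T| >= t) for Student's t with integer df (finite trigonometric series)."""
    theta = math.atan(abs(t) / math.sqrt(df))
    s, c = math.sin(theta), math.cos(theta)
    if df % 2 == 0:
        # Even df: sin(theta) * [1 + 1/2 cos^2 + (1*3)/(2*4) cos^4 + ...]
        term, total = 1.0, 1.0
        for k in range(2, df, 2):
            term *= (k - 1) / k * c * c
            total += term
        inside = s * total
    else:
        # Odd df: (2/pi) * {theta + sin(theta) * [cos + 2/3 cos^3 + ...]}
        inside = theta
        if df > 1:
            term, total = c, c
            for k in range(3, df, 2):
                term *= (k - 1) / k * c * c
                total += term
            inside += s * total
        inside *= 2 / math.pi
    return 1.0 - inside


def t_critical(alpha, df):
    """Two-sided critical value found by bisection on t_two_sided_p."""
    lo, hi = 0.0, 1e7
    for _ in range(200):
        mid = (lo + hi) / 2
        if t_two_sided_p(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


def detection_power(n_controls, log2_shift, sd_log2, alpha, draws=20000, seed=0):
    """Monte Carlo power to flag one patient shifted by log2_shift (n_controls >= 2)."""
    rng = random.Random(seed)
    df = n_controls - 1
    crit = t_critical(alpha, df)
    widen = math.sqrt(1 + 1 / n_controls)
    hits = 0
    for _ in range(draws):
        # Patient value minus control mean ~ Normal(shift, sd^2 * (1 + 1/n)).
        diff = rng.gauss(log2_shift, sd_log2 * widen)
        # Sample SD of the controls: sd * sqrt(chi2(df) / df).
        s = sd_log2 * math.sqrt(rng.gammavariate(df / 2, 2) / df)
        if abs(diff) / (s * widen) >= crit:
            hits += 1
    return hits / draws


if __name__ == "__main__":
    alpha = 0.05 / 20000  # assumed Bonferroni-style per-gene threshold
    for n in (5, 20, 50, 200, 700):
        # log2(0.5) = -1 for halved expression; log2(1.5) is about 0.585.
        print(n, detection_power(n, log2_shift=-1.0, sd_log2=0.25, alpha=alpha))
```

Rerunning with `log2_shift=math.log2(1.5)` and with SD values measured in the actual tissue should show where adding controls stops helping and the gene's own variability becomes the limit.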


Then maybe let's work backwards: ask your PI how much he/she is ready to spend on sequencing :))


GTEx is free =) It has 700 samples, but 700 may be overkill.


True, and I guess if you narrow it down by the corresponding metadata variables it should be doable, so good luck man!


Many thanks, my friend =)
