Question: n.sv number for batch effects in RNA-seq
3
sophialovechan • 70 wrote:
Hi everyone, I am analyzing different RNA seq data from different labs so I want to adjust the batch effects. I am using sva+DESeq2 to plot PCA. But I have a question about the sva settings in RNA seq data analysis. 1. What is the number for n.sv when you conduct this step ( svseq <- svaseq(dat, mod, mod0, n.sv = 2)? 2. People generally use 2. Is it the number of your samples? 3. In sva tutorial, there is a step to estimate the n.sv. But that step is for microarray. Can I just apply the same scripts to get n.sv for my RNA seq data? Thanks in advance.
No it is the number of surrogate variables that you are applying to ad as confounding factors that will be modelled to create the new model matrix for DE. Probably take a look at this Biostar post where I have explained about the use of SVA in detail. Comes with some handy code snippet and other SVA post links as well. If you have a problem using SVA with DESeq2 hit me up will try to help.
Hi Thank you very much. I read that post. So basically, you can try to use different surrogate variables to adjust your data and then decide which one make more sense. Right? But I have a question about one thing you mentioned, as quoted below, in that post.
"If you do a plot and do not see variability due to batches in data in PCA plot based on batches then no point doing it even if you know there were batches. This will over fit"
How can you tell whether you have batches just based on PCA plot?
If you say you have data incoming from different labs, you can plot the PCA with colors according to the labs and see if the PC1 and PC2 project the variability of the labs or not. If the batch variability is more over your biological variability it can be seen in your PCA and the most likely it will be captured in the first 2 components. Batches can due to several factors and that is the reason one performs the SVA analysis but the PCA pattern does not give an understanding of your biological variation that you are trying to address.
ok. Thanks for your suggestion. But I don't think it apples to my samples. Those samples are from different lab and what I want to know is whether the cells generated from different labs are identical and the cells we generated are closed to human primary cells. Batch effects are one variables when I compare different samples, I guess. I don't know how much the batch effects are going to affect my final conclusion about the clustering patterns though. Do you think I can transform my RNAseq data into microarray data and use the algorithm in SVA to estimate n.sv?
Wait a moment. There are too many things in your reply. First of all PCA should account for most of the variability. If it doesn't as you might feel you have multi dimensions then go for MDS plot. Whatever you do if the batch due to different labs have higher variability than your biological variability then that will be captured in PCA. Then you can use SVA to model around those batch effects . However that can be done with combat as well where you know which is the major source of batch effect. If your PCA or MDS also has other variability or patterns that might be technical then SVA is an option to find total number of SVs and try to make a trade off at which SVA you can find the biological difference that is likely to arise. If your data is RNASeq then you can see post normalization of your and other lab data through PCA. It will give you idea about batch correction if they are showing. Then use any standard batch correction method. If your data is microarray it is not a good way to put microarray data with RNASeq. There are some other methods but not this. Completely different technology. Expression estimates are better in RNAseq and a linear comparison between them is actually not the best way.
Hi vchris_ngs,
Maybe you can comment on this suggestion: if the batches (different labs) are known wouldn't it be better to specify those as covariates in the model for differential expression analysis? As you mentioned overfitting is a potential danger of using batch correction tools.
The fact is batch correction is something which is always having a chance of over fitting. If one can see the batches they already know then either you can use combat to reduce that effect or use SVA to use them as covariates to model the DE testing. Now you have to test both in order to see which gives a better biological variability post implementation of either of the two. I prefer SVA since am always pretty sure that one cannot be knowing every confounders a priori and I don't like to use combat. Preferably I use them as covariates taking the SVs in the model as a way to reduce their impact and do my DE testing . Batches is not always between labs. It can be also between machines, different people and so on. So you might encounter multiple patterns and the biggest variability will be captured. But right now am concerned if the OP has RNASeq data or not. If no then normalization of public data is fine. Using of batch corrected normalization is also OK. But OP can't simply put microarray samples with the public or other lab RNASeq data .
Thanks for elaborating. I also meant to specify the covariates in the design formula of e.g. DESeq2 as in (hypothetically):
@WouterDeCoster This is a classic example of the multifactorial design which at times models the covariates but actually not using the batch correction. It might not always work but if you find the SVs you can use their information in the design and then create your new model for DESeq2 analysis.
Usually the designs are as well (simplest) : You do not specficy any batches.
Let me know if its clear or not.
Thanks! Certainly clear enough.
Hi vcrhis_ngs,
In your comment above, which package is the "plPCA" function from, I can't seem to find it anywhere. Or if it was a typo, what's the actual function/package name you used to draw your corrected PCA?
Thanks,
hi stan, Did you find the answer? i could not find the function "plPCA" either.
Hi both,
Apologies for the late reply. I have replied to it here.
This tutorial has the necessary details. It uses both SVA and DESeq2. Also links earlier I have posted have the code snippet that this tutorial misses which can give you an idea how model the covariates into your design. Sorry am on mobile phone so only can scribble this much. If you have problems reach out I will redirect to a proper code later if you cannot still figure it out. Just detail where you get the error and what you cannot achieve. Good luck!
Thanks, but my question was theoretical concerning the best approach to handle a covariate. So no errors for now ;)
Here is my scripts. I rund num.sv but it gave me n.sv =0. Everything run normally but I don't know whether I should trust n.sv number here and skip the batch effect step. Can you give me some hints? That is how I named my files, cell_type_counts.
you are looking for n.sn there and not n.sv which you have used. However, the code is fine unless you check if the value of n.sv=0 as well. If so then congratulations no confounding factors. It will be now important to see your PCA for me to really tell if you have batches or not and if the PCA shows segregations or clusters based on lab then SVA leek method is not able to find them.
sorry, I miss type it. n.sv is 0. What other methods are available? I didn't see it in the tutorial. I still don't understand how you can tell whether there are batch effects based on PCA. What I want to know from PCA is that whether the cell types generated from different labs are similar to each other and whether our cells are closed to primary human cells. We used stem cells and have our own protocol to generate certain cell types. So I don't know whether they should be clustered together or not. That's the question I want to know. If I know they should be same, I understand that the big distances among my samples result from batch effects. But the problem is I don't know they are similar or not. Here is one plot I have. Thank you
unavailable, PCA plot
Saw your PCA,
I can only see 4 samples in the plot. Does that include your sample as well? What does the color indicate , different labs? If so then you can see clearly data are separated by labs. Just one thing why is just 4 samples? Is it that other lab data is not having replicates and also your data? In that case nothing holds. You will not be able to achieve much from such analysis or PCA as well. Usually for any DE analysis you need 3*3 (3 samples or replicates per condition to have effective statistically viable results)
Plot link has been fixed.
Ok, in that case, SVA does not find any batch variables in the RNA-Seq data from different labs (public datasets). Confirm me certain things:
Post this all batch variable analysis and everything can be thought of. Let me know if its clear or not.
The plot link has been fixed. Those four cells are all from different lab. I generated the count matrix which contained all the data from different labs. I put all the counting files in one folder and then use sub("*_counts") to include all counting data I think what I did when I plotted my PCA is what you basically suggested here.
Are there any other methods, except for "leek"?
actually no. Palecek_Epi and hEpi are from one lab and MLC_old and MLC_new are from one lab. But they are claimed to be the same cell type.
Coming to your reply, PC1 is able to show the variability based on 2 different labs.
However, it is just 4 samples so it will not really be very informative. Since PCA is better for when you have multiple samples.
Clearly same labs also have variability that is in your PC2.
Now if you want to understand how your RNA-Seq behaves your count matrix should have your samples listed as well in that data set and then make the PCA.
TBH, these are very few samples to make bigger biological claims but as of now what I can say, PC1 captures lab variability and PC2 variability of the sample within the same lab(same samples have 25% variance).
I hope it is clear now.
Yes. Thank you for your explanation. I think the interpretation of PC1 and PC2 depends on how to make the interpretation reasonable based on your understanding about samples. So what I should do now is to minimize the batch effects from different labs to make the first variance (PC1) stands for the intrinsic biological variance instead. Am I right? Like this example, I need to decrease the 64% PC1, since it results from batch effects. What I know about my data is that, regardless of technical variance (sequencing depth), there are two main variances: different labs and different cell types. So what I want to get is that variance from cell types should be on PC1 axis and account for the majority of variances but batch effects variance should be on PC2 axis and that's the second variance among my sample. Am I right?
First of all, you have very few samples to make such claims, you have to put your RNA-Seq data along with these samples and make the plot. Then only you can derive a hypothesis. You cannot simply make a batch correction without seeing how your data behaves with these 4 samples. This is not a very trustworthy claim which you are saying based on just 4 samples. Points to note: These are only 4 samples so nothing much to say.
It is actually very tricky to get a biological conclusion from PCA plot. If I have more samples which are completely distinct from these four samples and then I plot them together, you will see the variance among them will largely decrease relative to what they have now. I am a biologiest, so I want to truly understand what is going on in the PCA plot and be confident about my conclusion here.
I have to tell you something which you might not like having said what your current PCA is stating it is pretty clear but not trustworthy owing to less power of the data it is not possible to draw such conclusions as you are stating:
You are just using features of DESeq2 not the actual tool. PCA makes you understand which is the variance and if that variance is compatible with a biological hypothesis or not. If not then one needs to understand what other factors are drawing such a variance and model their hypothesis based on that, this is a data driven hypothesis.
If you do not plot your RNA-Seq data you will not be able to make any conclusion not biologically nor mathematically, any reviewer would trash it if you want to make claims on 4 samples since it will not be a robust claim and also you cannot claim anything on your cell types unless you project them in the orthogonal space with the other cell types coming from different labs.
I revisited DESeq2 manual again. I was doubt whether it is appropriate for me to use DESeq2 to get the information I need. It seems to me that the PCA plot in DESeq2 is just to give us an idea about what kind of variance might influence the final differential gene analysis. What I want to do is the clustering analysis. I want to see how close or how distant all of my samples are.
Understand what is clustering (unsupervised and supervised) and why it is such a use in RNA-Seq and why is PCA needed in RNASeq
Some comments:
Let's close this thread.
I recommend you to read something on lines of below:
Batch, no-batch will all make sense once you understand the real utility of the RNA-Seq method, will help you in designing better sequencing designs for future.
Good Luck!
Thank you very much for your explanation. I will dig into the points you gave here. But I have one technical question, which might be easy to answer. When we start to perform SVA, the manual shows that we assign dat as count(dds, normalized = TRUE). I am wondering which dds is here. I saw different people use different things. There are two different ones. dds <- DESeq(dds) or dds <- estimateSizeFactors(dds). I know estimateSizeFactors function is in DESeq. But there are also other functions in DESeq(). Just want to maker sure.
This query should be available in biostars search and also in google, if you cannot find it then raise a separate question. It is better to not go ahead with this thread. Also if you follow my code and the links in this post you will be able to figure it out. Good luck!
I am sorry that I am new to this and trying to understand the meaning of my results. Let's pose a second about the discussion. I have some basic questions, maybe stupid and naive. When I plot PCA for RNA seq data, it is a two dimensional plot, which means it plots two variables (PC1,PC2). The PCA plot tells me that based on these two variables, how different my cell types are. I don't understand which two variables PC1 and PC2 indicate. Followed by PC1 and PC2, there are two numbers, 24% variance and 42% variance. Are these numbers indicating how much weight these two variables for my sample differences.
Mathematically it is pretty difficult how PCA works, but the following intuition is helpful: Roughly: in a multidimensional dataset (one dimension for every gene!) PCA tries to look for the 2D projection which captures the most variability in your dataset. As such it reduces the complexity of your dataset to a 2D picture containing as much information as possible about the dataset.
Indeed, the percentages indicate the amount of variation captured by the PCs. So PC1 captures 42% of all variability in your data!
For a more detailed explanation, I refer to this StatQuest youtube movie.
*
* This is the worrisome part. The cross platform normalization is tricky as well. You have to tell me clearly if you are using microarray and comparing them to other lab RNA-Seq or not by trying to put them in the same matrix. It will be a bit of a problem, there is one paper but I have not tested. There are claims but not very robust ones. I do not recall any method as of now that does that or any package.
As I understood OP has RNA-seq and would like to use a method to estimate n.sv for sva.
No, I am not using any microarray data, all my data are RNA sequencing data. I am just thinking if I can transform them into microarray data to get an idea about n.sv.
You do not have to do any major transformation.
If it's all RNAseq then it should not be of any problem. You can use SVA for RNASeq. Check the usage of SVA on RNAseq. Jtleek blog. Also I posted a biostar link. That should give you enough information how to use batch correction via SVA for RNAseq.
Yeah. That's what I actually did previously. I just want to know what number of n.sv is. I know for microarray, you can use the num.sv() function to estimate the n.sv. But how can I apply this for RNA-seq or do I need to estimate n.sv before I fun SVA?
I plotted PCA before I did SVA and after I did SVA. I didn't see much difference. before SVA, PC1 variance 41%, PC2 variance 24% after SVA, PC1 variance 42%, PC2 variance 24%
Hi sophialovechan,
Don't forget to follow up on your previous questions.
If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Interesting guidelines for posting can be found in the following posts: