n.sv number for batch effects in RNA-seq
1
3
Entering edit mode
5.4 years ago

Hi everyone, I am analyzing different RNA seq data from different labs so I want to adjust the batch effects. I am using sva+DESeq2 to plot PCA. But I have a question about the sva settings in RNA seq data analysis. 1. What is the number for n.sv when you conduct this step ( svseq <- svaseq(dat, mod, mod0, n.sv = 2)? 2. People generally use 2. Is it the number of your samples? 3. In sva tutorial, there is a step to estimate the n.sv. But that step is for microarray. Can I just apply the same scripts to get n.sv for my RNA seq data? Thanks in advance.

RNA-Seq • 6.9k views
1
Entering edit mode

Hi sophialovechan,

Interesting guidelines for posting can be found in the following posts:

2
Entering edit mode
5.4 years ago
ivivek_ngs ★ 5.2k

No it is the number of surrogate variables that you are applying to ad as confounding factors that will be modelled to create the new model matrix for DE. Probably take a look at this Biostar post where I have explained about the use of SVA in detail. Comes with some handy code snippet and other SVA post links as well. If you have a problem using SVA with DESeq2 hit me up will try to help.

0
Entering edit mode

Hi Thank you very much. I read that post. So basically, you can try to use different surrogate variables to adjust your data and then decide which one make more sense. Right? But I have a question about one thing you mentioned, as quoted below, in that post.

"If you do a plot and do not see variability due to batches in data in PCA plot based on batches then no point doing it even if you know there were batches. This will over fit"

How can you tell whether you have batches just based on PCA plot?

2
Entering edit mode

If you say you have data incoming from different labs, you can plot the PCA with colors according to the labs and see if the PC1 and PC2 project the variability of the labs or not. If the batch variability is more over your biological variability it can be seen in your PCA and the most likely it will be captured in the first 2 components. Batches can due to several factors and that is the reason one performs the SVA analysis but the PCA pattern does not give an understanding of your biological variation that you are trying to address.

0
Entering edit mode

ok. Thanks for your suggestion. But I don't think it apples to my samples. Those samples are from different lab and what I want to know is whether the cells generated from different labs are identical and the cells we generated are closed to human primary cells. Batch effects are one variables when I compare different samples, I guess. I don't know how much the batch effects are going to affect my final conclusion about the clustering patterns though. Do you think I can transform my RNAseq data into microarray data and use the algorithm in SVA to estimate n.sv?

1
Entering edit mode

Wait a moment. There are too many things in your reply. First of all PCA should account for most of the variability. If it doesn't as you might feel you have multi dimensions then go for MDS plot. Whatever you do if the batch due to different labs have higher variability than your biological variability then that will be captured in PCA. Then you can use SVA to model around those batch effects . However that can be done with combat as well where you know which is the major source of batch effect. If your PCA or MDS also has other variability or patterns that might be technical then SVA is an option to find total number of SVs and try to make a trade off at which SVA you can find the biological difference that is likely to arise. If your data is RNASeq then you can see post normalization of your and other lab data through PCA. It will give you idea about batch correction if they are showing. Then use any standard batch correction method. If your data is microarray it is not a good way to put microarray data with RNASeq. There are some other methods but not this. Completely different technology. Expression estimates are better in RNAseq and a linear comparison between them is actually not the best way.

1
Entering edit mode

Hi vchris_ngs,

Maybe you can comment on this suggestion: if the batches (different labs) are known wouldn't it be better to specify those as covariates in the model for differential expression analysis? As you mentioned overfitting is a potential danger of using batch correction tools.

1
Entering edit mode

The fact is batch correction is something which is always having a chance of over fitting. If one can see the batches they already know then either you can use combat to reduce that effect or use SVA to use them as covariates to model the DE testing. Now you have to test both in order to see which gives a better biological variability post implementation of either of the two. I prefer SVA since am always pretty sure that one cannot be knowing every confounders a priori and I don't like to use combat. Preferably I use them as covariates taking the SVs in the model as a way to reduce their impact and do my DE testing . Batches is not always between labs. It can be also between machines, different people and so on. So you might encounter multiple patterns and the biggest variability will be captured. But right now am concerned if the OP has RNASeq data or not. If no then normalization of public data is fine. Using of batch corrected normalization is also OK. But OP can't simply put microarray samples with the public or other lab RNASeq data .

1
Entering edit mode

Thanks for elaborating. I also meant to specify the covariates in the design formula of e.g. DESeq2 as in (hypothetically):

~ lab + gender + treatment

1
Entering edit mode

@WouterDeCoster This is a classic example of the multifactorial design which at times models the covariates but actually not using the batch correction. It might not always work but if you find the SVs you can use their information in the design and then create your new model for DESeq2 analysis.

Usually the designs are as well (simplest) : You do not specficy any batches.

mod <- model.matrix(~ dex, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))

n.sv <- num.sv(dat,mod,method="leek") ##gives 13

## using 2 SV for the start
lnj.corr <- svaBatchCor(dat,mod,mod0,n.sv=2) # svaBatchCor this function i have given earlier a link
co <- lnj.corr$corrected plPCA(co) ## gives you the corrected PCA with SVA correction ## DEA #new design design <- model.matrix(~sv$sv+dex)
design
#pushing the above in deseq2
ddssva <- dds
ddssva$SV1 <- sv$sv[,1]
ddssva$SV2 <- sv$sv[,2]
design(ddssva) <- ~ SV1 + SV2 + dex ## here you are using them as covariates and adding them in the design
ddssva <- DESeq(ddssva)
res <- results(ddssva)
res


Let me know if its clear or not.

0
Entering edit mode

Thanks! Certainly clear enough.

0
Entering edit mode

Hi vcrhis_ngs,

In your comment above, which package is the "plPCA" function from, I can't seem to find it anywhere. Or if it was a typo, what's the actual function/package name you used to draw your corrected PCA?

Thanks,

0
Entering edit mode

hi stan, Did you find the answer? i could not find the function "plPCA" either.

0
Entering edit mode

Hi both,

Apologies for the late reply. I have replied to it here.

0
Entering edit mode

This tutorial has the necessary details. It uses both SVA and DESeq2. Also links earlier I have posted have the code snippet that this tutorial misses which can give you an idea how model the covariates into your design. Sorry am on mobile phone so only can scribble this much. If you have problems reach out I will redirect to a proper code later if you cannot still figure it out. Just detail where you get the error and what you cannot achieve. Good luck!

0
Entering edit mode

Thanks, but my question was theoretical concerning the best approach to handle a covariate. So no errors for now ;)

0
Entering edit mode

Here is my scripts. I rund num.sv but it gave me n.sv =0. Everything run normally but I don't know whether I should trust n.sv number here and skip the batch effect step. Can you give me some hints? That is how I named my files, cell_type_counts.

>library(DESeq2)
>dir <- getwd()
>sampleFiles <- grep("counts", list.files(dir), value = TRUE)
>sampleCondition <- sub("(.*)_.*_counts","\\1", sampleFiles)
> sampleCondition
[1] "CM"        "DE"        "ESC"       "hBM_MSC"   "hPericyte"
[6] "MSC"
> condition <- sampleCondition
>sampleTable <-data.frame(sampleName=sampleFiles,fileName=sampleFiles,condition=sampleCondition)
condition <- sampleCondition
>ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory = dir,design = ~ condition)
>dds <- ddsHTSeq [rowSums (counts(ddsHTSeq))>1,]
>dds <- DESeq(dds)
library(sva)
>dat <- counts(dds, normalized = TRUE)
>idx <- rowMeans(dat) >1
>dat <- dat[idx,]
>mod <- model.matrix(~condition,colData(dds))
>mod0 <- model.matrix(~1, colData(dds))
>n.sv <- num.sv(dat,mod,method="leek")
>n.sn
[1] 0

0
Entering edit mode

you are looking for n.sn there and not n.sv which you have used. However, the code is fine unless you check if the value of n.sv=0 as well. If so then congratulations no confounding factors. It will be now important to see your PCA for me to really tell if you have batches or not and if the PCA shows segregations or clusters based on lab then SVA leek method is not able to find them.

0
Entering edit mode

sorry, I miss type it. n.sv is 0. What other methods are available? I didn't see it in the tutorial. I still don't understand how you can tell whether there are batch effects based on PCA. What I want to know from PCA is that whether the cell types generated from different labs are similar to each other and whether our cells are closed to primary human cells. We used stem cells and have our own protocol to generate certain cell types. So I don't know whether they should be clustered together or not. That's the question I want to know. If I know they should be same, I understand that the big distances among my samples result from batch effects. But the problem is I don't know they are similar or not. Here is one plot I have. Thank you

0
Entering edit mode

unavailable, PCA plot

0
Entering edit mode

I can only see 4 samples in the plot. Does that include your sample as well? What does the color indicate , different labs? If so then you can see clearly data are separated by labs. Just one thing why is just 4 samples? Is it that other lab data is not having replicates and also your data? In that case nothing holds. You will not be able to achieve much from such analysis or PCA as well. Usually for any DE analysis you need 3*3 (3 samples or replicates per condition to have effective statistically viable results)

2
Entering edit mode

0
Entering edit mode

Ok, in that case, SVA does not find any batch variables in the RNA-Seq data from different labs (public datasets). Confirm me certain things:

• Does you RNA-Seq count matrix on has the data from only other labs or also your labs?
• If you do PCA only on other labs you will not be able to understand if your data is separated by batch or not. You can only see if there is a batch between the public datasets (since they are multiple datasets)
• If you want to how your lab RNA-Seq data juxtapose and behave with the public RNA-Seq data then you have to create a count matrix that will have both public RNASeq and your datasets. The code you showed earlier does not have your RNA-Seq datasets I reckon so you will not be able to understand how your data is behaving.
• If you want to do that then take the fastq files of public and your data, run a quantification and create a count matrix with genes rows and samples on columns (samples= public RNAseq from different labs + your lab RNASeq)
• Now make PCA on this both at count data and normalized data and label the points with labs. You will be able to see clusters and pretty much identify if there is a batch effect or not.

Post this all batch variable analysis and everything can be thought of. Let me know if its clear or not.

0
Entering edit mode

The plot link has been fixed. Those four cells are all from different lab. I generated the count matrix which contained all the data from different labs. I put all the counting files in one folder and then use sub("*_counts") to include all counting data I think what I did when I plotted my PCA is what you basically suggested here.

Are there any other methods, except for "leek"?

0
Entering edit mode

actually no. Palecek_Epi and hEpi are from one lab and MLC_old and MLC_new are from one lab. But they are claimed to be the same cell type.

1
Entering edit mode

Coming to your reply, PC1 is able to show the variability based on 2 different labs.

64% vairbility between lab1= Palecek_Epi + hEpi and lab2 = MLC_old + MLC_new

Now the PC2=25% internally captures the variability of samples also generated in the same lab as well. That is why they are far apart.


However, it is just 4 samples so it will not really be very informative. Since PCA is better for when you have multiple samples.

Clearly same labs also have variability that is in your PC2.

Now if you want to understand how your RNA-Seq behaves your count matrix should have your samples listed as well in that data set and then make the PCA.

TBH, these are very few samples to make bigger biological claims but as of now what I can say, PC1 captures lab variability and PC2 variability of the sample within the same lab(same samples have 25% variance).

I hope it is clear now.

0
Entering edit mode

Yes. Thank you for your explanation. I think the interpretation of PC1 and PC2 depends on how to make the interpretation reasonable based on your understanding about samples. So what I should do now is to minimize the batch effects from different labs to make the first variance (PC1) stands for the intrinsic biological variance instead. Am I right? Like this example, I need to decrease the 64% PC1, since it results from batch effects. What I know about my data is that, regardless of technical variance (sequencing depth), there are two main variances: different labs and different cell types. So what I want to get is that variance from cell types should be on PC1 axis and account for the majority of variances but batch effects variance should be on PC2 axis and that's the second variance among my sample. Am I right?

0
Entering edit mode

First of all, you have very few samples to make such claims, you have to put your RNA-Seq data along with these samples and make the plot. Then only you can derive a hypothesis. You cannot simply make a batch correction without seeing how your data behaves with these 4 samples. This is not a very trustworthy claim which you are saying based on just 4 samples. Points to note: These are only 4 samples so nothing much to say.

• Depends on how your samples behave so you have to put these 4 samples + your samples and make the PCA
• Then if the variance is still high you have to see to what extent it is high. If its still in the order of 50 or 60% I doubt if the batch correction will help.
• You cannot simply enforce a mathematical method to see your own hypothesis, this will manipulate and overfit the data and it is not biologically correct nor even mathematically. Biologists might beg to differ but this is the truth, 60% variance is difficult to let go.
• You cannot make claims without putting your data in the same orthogonal space. So do that first and see the PCA plot then only anything can be hypothesized.
0
Entering edit mode

It is actually very tricky to get a biological conclusion from PCA plot. If I have more samples which are completely distinct from these four samples and then I plot them together, you will see the variance among them will largely decrease relative to what they have now. I am a biologiest, so I want to truly understand what is going on in the PCA plot and be confident about my conclusion here.

0
Entering edit mode

I have to tell you something which you might not like having said what your current PCA is stating it is pretty clear but not trustworthy owing to less power of the data it is not possible to draw such conclusions as you are stating:

• Sequencing technology is digital read outs. They have assumption and errors while the more samples you have the robust are your data since one cannot deny that while preparing the library different person making it, efficiency and error rates have to be taken into consideration. More the number of samples less is the error rate.
• You cannot claim anything based on one sample, it is not viable especially when your transcriptomics data unless you are using a single cell that will be more personalized for each sample. So you need replicates for your condition to hold a biological hypothesis over a population.
• You are just using features of DESeq2 not the actual tool. PCA makes you understand which is the variance and if that variance is compatible with a biological hypothesis or not. If not then one needs to understand what other factors are drawing such a variance and model their hypothesis based on that, this is a data driven hypothesis.

If you do not plot your RNA-Seq data you will not be able to make any conclusion not biologically nor mathematically, any reviewer would trash it if you want to make claims on 4 samples since it will not be a robust claim and also you cannot claim anything on your cell types unless you project them in the orthogonal space with the other cell types coming from different labs.

• So sit with a bioinformatician who is experienced in statistics and knows how to effectively use RNASeq data. You will need to understand the concept of replicates, robustness in a method and why power is calculated apriori sequencing. One should not sequence without consulting a bioinformatician. Good luck!
0
Entering edit mode

I revisited DESeq2 manual again. I was doubt whether it is appropriate for me to use DESeq2 to get the information I need. It seems to me that the PCA plot in DESeq2 is just to give us an idea about what kind of variance might influence the final differential gene analysis. What I want to do is the clustering analysis. I want to see how close or how distant all of my samples are.

1
Entering edit mode

Understand what is clustering (unsupervised and supervised) and why it is such a use in RNA-Seq and why is PCA needed in RNASeq

• PCA is giving an understanding of factors that capture the maximum variance to make you a hypothesis and then if it fits your biological one you will use DE analysis either with DESeq2/edgeR/limma-voom.
• Now your data is not robust in terms of replicates and so whatever claims you make are not trustworthy.
• You are not using DESeq2 yet since you have not done any differential expression analysis.
• You are just using PCA here. If clustering is what you want to see then PCA, MDS, unsupervised dendrogram, anything can do the job.
• DE analysis is a consequence of what one does if they know that having 2 conditions shown capturing variances in PCA or any clustering, what are the actual factors (here genes) that are giving rise to such differences. Find those factors and see to what extent they are driving the differences and are they functionally relevant for the kind of data one is interrogating. There are a lot more to DE analysis motivation and conclusions that can be worked upon and done. I am just stating a handful.

I recommend you to read something on lines of below:

• RNA-Seq data analysis workflow and why one needs replicates
• Understanding of PCA in context of RNA-Seq
• Why we do differential expression analysis and what methods are currently in use?

Batch, no-batch will all make sense once you understand the real utility of the RNA-Seq method, will help you in designing better sequencing designs for future.

Good Luck!

0
Entering edit mode

Thank you very much for your explanation. I will dig into the points you gave here. But I have one technical question, which might be easy to answer. When we start to perform SVA, the manual shows that we assign dat as count(dds, normalized = TRUE). I am wondering which dds is here. I saw different people use different things. There are two different ones. dds <- DESeq(dds) or dds <- estimateSizeFactors(dds). I know estimateSizeFactors function is in DESeq. But there are also other functions in DESeq(). Just want to maker sure.

0
Entering edit mode

This query should be available in biostars search and also in google, if you cannot find it then raise a separate question. It is better to not go ahead with this thread. Also if you follow my code and the links in this post you will be able to figure it out. Good luck!

0
Entering edit mode

I am sorry that I am new to this and trying to understand the meaning of my results. Let's pose a second about the discussion. I have some basic questions, maybe stupid and naive. When I plot PCA for RNA seq data, it is a two dimensional plot, which means it plots two variables (PC1,PC2). The PCA plot tells me that based on these two variables, how different my cell types are. I don't understand which two variables PC1 and PC2 indicate. Followed by PC1 and PC2, there are two numbers, 24% variance and 42% variance. Are these numbers indicating how much weight these two variables for my sample differences.

2
Entering edit mode

Mathematically it is pretty difficult how PCA works, but the following intuition is helpful: Roughly: in a multidimensional dataset (one dimension for every gene!) PCA tries to look for the 2D projection which captures the most variability in your dataset. As such it reduces the complexity of your dataset to a 2D picture containing as much information as possible about the dataset.

Indeed, the percentages indicate the amount of variation captured by the PCs. So PC1 captures 42% of all variability in your data!

For a more detailed explanation, I refer to this StatQuest youtube movie.

0
Entering edit mode

*

Do you think I can transform my RNAseq data into microarray data and use the algorithm in SVA to estimate n.sv?

* This is the worrisome part. The cross platform normalization is tricky as well. You have to tell me clearly if you are using microarray and comparing them to other lab RNA-Seq or not by trying to put them in the same matrix. It will be a bit of a problem, there is one paper but I have not tested. There are claims but not very robust ones. I do not recall any method as of now that does that or any package.

0
Entering edit mode

As I understood OP has RNA-seq and would like to use a method to estimate n.sv for sva.

In sva tutorial, there is a step to estimate the n.sv. But that step is for microarray. Can I just apply the same scripts to get n.sv for my RNA seq data?

0
Entering edit mode

No, I am not using any microarray data, all my data are RNA sequencing data. I am just thinking if I can transform them into microarray data to get an idea about n.sv.

0
Entering edit mode

You do not have to do any major transformation.

If it's all RNAseq then it should not be of any problem. You can use SVA for RNASeq. Check the usage of SVA on RNAseq. Jtleek blog. Also I posted a biostar link. That should give you enough information how to use batch correction via SVA for RNAseq.

0
Entering edit mode

Yeah. That's what I actually did previously. I just want to know what number of n.sv is. I know for microarray, you can use the num.sv() function to estimate the n.sv. But how can I apply this for RNA-seq or do I need to estimate n.sv before I fun SVA?

I plotted PCA before I did SVA and after I did SVA. I didn't see much difference. before SVA, PC1 variance 41%, PC2 variance 24% after SVA, PC1 variance 42%, PC2 variance 24%