Question: Order of operations in RNA-seq analysis
crouch.k wrote:

Hi all,

I have some RNA-seq data and I would like to take the results forward to build a network based on expression correlation (WGCNA or similar).

I have several steps that I need to carry out:

  • normalise data
  • transform data
  • subset data (not everything in the dataset is useful for the intended outcome so I need to extract only the useful samples)
  • remove a known batch effect (I know I can model this for differential expression, but for building a network I think I need to remove it - correct me if I am wrong)

I am using DESeq2 for normalisation and transforming using the variance stabilising transform from the same package (as recommended in the WGCNA manual).

I have noticed that the outputs of my exploratory analyses, particularly PCAs, change depending on the order in which I carry out these steps. In most cases the gross patterns in the data remain intact, but in some cases this is not true. My question is: what is the correct order in which to carry out these steps, and why?

Currently, I am loading all of the available samples, normalising, transforming the normalised counts, removing the batch effect from the transformed data, then extracting the samples of interest. General discussion about the order in which these processes should be carried out is welcome, but specifically:

  1. Would it be more sensible to extract the samples I am interested in first, and then run the downstream steps only on those samples? I imagine this would affect the output, as the geometric mean across the samples would change (see the toy sketch after this list).
  2. I am currently removing the batch effect from the vst-transformed data. Would I be better to remove the batch effect first from the normalised counts and then transform the batch-corrected data?
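
To make question 1 concrete, here is a toy sketch of the median-of-ratios normalisation (as I understand DESeq2 computes it), showing that the size factors depend on which samples are included; the numbers are simulated:

    set.seed(1)
    counts <- matrix(rpois(40, lambda = 50), nrow = 10)  # 10 genes x 4 samples

    # Per-gene geometric means, then per-sample medians of ratios
    geo_mean <- exp(rowMeans(log(counts)))
    sf_all   <- apply(counts / geo_mean, 2, median)

    sub          <- counts[, 1:2]                        # keep only two samples
    geo_mean_sub <- exp(rowMeans(log(sub)))
    sf_sub       <- apply(sub / geo_mean_sub, 2, median)

    sf_all[1:2]  # generally not equal to sf_sub
    sf_sub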

Thanks for your help

Tags: rna-seq, deseq2, wgcna
Kevin Blighe wrote (accepted answer):

If your batch effect is known, then you should just model it by including it in the DESeq2 design formula. Checking the PCA bi-plots after normalisation and transformation (Edit: transformation with rlog() or vst() with blind=FALSE) will then reveal whether the modelling of batch was successful, because the effect should vanish. The other option is to modify your raw counts directly, i.e. prior to normalisation, in an attempt to adjust for batch that way, but I believe most of those with experience in this area would recommend just modelling the batch effect, as above.

If your aim is to do WGCNA, then my recommendation is to do the following (a minimal sketch follows the list):

  1. normalise your data in DESeq2, including batch as a covariate in the design formula
  2. transform the normalised counts via the rlog() or vst() functions, with blind=FALSE
  3. check via PCA bi-plots that the batch effect has been removed
  4. do whatever filtering you want on the final transformed data for the purposes of conducting WGCNA.
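
In code, steps 1 to 3 might look like this, assuming a count matrix counts and a sample table coldata with batch and condition columns (these names are placeholders, not from the original post):

    library(DESeq2)

    # Step 1: normalise, with batch as a covariate in the design formula
    dds <- DESeqDataSetFromMatrix(countData = counts,
                                  colData   = coldata,
                                  design    = ~ batch + condition)
    dds <- DESeq(dds)

    # Step 2: variance-stabilising transformation, informed by the design
    vsd <- vst(dds, blind = FALSE)

    # Step 3: check whether batch still separates the samples
    plotPCA(vsd, intgroup = "batch")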

It's not good practice to remove batch directly from the variance stabilised counts.

Edit: it has come to my attention that batch correction on variance stabilised counts was performed in this study, https://www.nature.com/articles/s41591-018-0028-4#Sec9

Users may additionally be interested in this post by Michael Love, whereby batch is removed from variance stabilised counts via limma's removeBatchEffect() function: Batch correction in DESeq2
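
A sketch of that approach, continuing from the vsd object in the steps above (whether to also pass your biological design to removeBatchEffect(), so that it is protected during correction, is worth checking against the limma documentation):

    library(limma)

    mat <- assay(vsd)                                # variance-stabilised values
    mat <- removeBatchEffect(mat, batch = vsd$batch) # regress out the batch term
    assay(vsd) <- mat                                # write the corrected values back

    plotPCA(vsd, intgroup = "batch")                 # batch should no longer dominate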

Kevin


Further information on this:

When transforming via rlog() or vst(), a key parameter to set is 'blind'.

When blind=TRUE, the transformation does not take into account the design formula. When blind=FALSE, the transformation will take into account the design formula and, theoretically, adjust for things like batch in that way.

In the past, including batch in the design formula and then setting blind=FALSE was sufficient for removing batch and other confounding factors in my RNA-seq studies.
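
For illustration, on the same dds object as in my answer above:

    vsd_blind  <- vst(dds, blind = TRUE)   # design ignored; fully unsupervised
    vsd_design <- vst(dds, blind = FALSE)  # design used when estimating dispersions
    # (Per the DESeq2 vignette, blind = FALSE informs the dispersion estimation;
    #  covariate effects are not themselves subtracted from the returned values.)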

-- Kevin Blighe

Thanks for your responses. Your suggestion to include the batch in the design formula is what I usually do for differential expression. I had been led to believe that I had to remove it rather than model it for WGCNA but, thinking about it, as long as I can't see it in the PCA it probably doesn't matter how I do it, so I'll give that approach a shot; it makes a lot more sense to me.

Just for interest, I think I am going to remove the samples I don't want at the beginning. These are samples that were prepared using a different protocol which the person who did the lab work thinks didn't work correctly. I included them in my preliminary exploratory analysis to see if they were useful, but they obviously aren't, so I think I'll just get rid of them to avoid confusing everything else.

-- crouch.k

Yes, one should be conservative and first assume that there may be no batch effect; each study is different, of course. I find, in conversations, that many people worry so much about batch adjustment that they inadvertently over-adjust and end up introducing more bias than they started with. A good study design obviously safeguards against these issues in the first place.

If it's shown that some samples have failed for whatever reason, then it's also perfectly reasonable to exclude them from the outset.

Also, as you can see, technically you were not incorrect in your initial approach: there are various ways of, and points at which, batch can be dealt with.

-- Kevin Blighe

Yeah, I was assuming there wasn't a "correct" answer, but I was concerned because the way the samples clustered in the exploratory analysis was changing depending on the order I did things in. Input from outside is useful because I was getting to the point of overthinking and confusing myself, so thanks for your help.

Your point about over-correcting for batch effects is noted - thanks.

-- crouch.k

Indeed, and I edited my original answer after having had a brief discussion with other moderators, including Devon and Venu.

-- Kevin Blighe

Just to close this, removing the samples that I had decided not to include prior to building the model solved my issues. After that, the signal remained fairly consistent no matter what I did. I suspect these samples that were prepared differently were creating noise that was masking the signal we are interested in.

Regarding the batch effect issue, I found that I did have to remove it in order to take the data into WGCNA, even when I had built it into the model. While building it into the model accounted for it in the differential expression analysis, it was still evident in the vst-transformed data, even with blind=FALSE. I am not sure if I did something else wrong, but in any case, for WGCNA I ran limma::removeBatchEffect() as suggested in the post from Michael Love that you linked to. This seems to be working well and I am happy with this solution. Thanks again for your help. Marking this answer as accepted.
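
Roughly, the WGCNA hand-off looked like this (object names follow the sketches above; the soft-threshold candidates are just the usual defaults from the WGCNA tutorials):

    library(WGCNA)

    datExpr <- t(assay(vsd))          # WGCNA expects samples in rows, genes in columns

    gsg <- goodSamplesGenes(datExpr)  # flag genes/samples with too many missing values
    datExpr <- datExpr[gsg$goodSamples, gsg$goodGenes]

    # Choose a soft-thresholding power before building the network
    sft <- pickSoftThreshold(datExpr, powerVector = c(1:10, seq(12, 20, 2)))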

-- crouch.k

Great, thanks for coming back to update us, crouch.k.

-- Kevin Blighe