Question

Strand specific and unstranded dataset

0

Entering edit mode

6.6 years ago

lirongrossmann ▴ 40

Hey everybody,

I am working with an RNA-seq dataset that combines samples from strand-specific and unstranded technology. Ideally I would like to use all the samples to be able to detect DE genes between biological groups.

I am trying to deal with the problem at the gene expression level and treating the difference in technologies as batch effect, however my dataset is no very balanced (the biologic groups are not evenly distributed between the two technologies).

Has anyone dealt with a similar situation before? Is there a way to account for the difference in technologies in upstream stages (like counting)?

Thanks a lot,

Liron

RNA-Seq • 2.3k views

ADD COMMENT • link 6.6 years ago by lirongrossmann ▴ 40

1

Entering edit mode

Thank you very much! I will do that and see what happens! I appreciate your help, Liron

ADD REPLY • link 6.6 years ago by lirongrossmann ▴ 40

0

Entering edit mode

Hi Again! I tried DEseq2 and I like it! My question is how did you know that the batch effect was removed? I give the batch as a variable in the design to deseq2, and I get a set of differentially expressed genes, but how do I know that the batch was removed? Is there a way to perform pca or clustering on the data without the batch?

ADD REPLY • link 6.6 years ago by lirongrossmann ▴ 40

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. These comments should have gone under @Kevin's answer.

ADD REPLY • link 6.6 years ago by GenoMax 141k

0

Entering edit mode

Derive transformed expression levels via rlog(..., blind = FALSE) or vst(..., blind = FALSE), and then check the samples on a PCA bi-plot.

ADD REPLY • link 4.0 years ago by Kevin Blighe 87k

GenoMax · Answer 1 · 2017-09-14

4

Entering edit mode

6.6 years ago

Kevin Blighe 87k

[edited April 25, 2020]

Hi Liron,

Yes, I have dealt with this exact situation in the past. I had 4 batches of RNA-seq (total sample size ~400). Three of the 4 were unstranded and were sequenced at low depth, whilst the other was stranded and was sequenced to high depth.

I processed them all in exactly the same way using Kallisto and counting abundances over the GENCODE transcripts in FASTA format (upward of 200,000 transcripts and isoforms). Loading the estimated counts into DESeq2 thereafter, all I needed to do was include batch as a likely confounder of the counts, and DESeq2 was able to perfectly adjust the derived transformed expression levels via blind = FALSE when executing rlog() or vst(), i.e., the assumed batch effect vanished.

I suggest that you try the same and see what you get. There are programs that are specifically designed to correct for batch effects in RNA-seq data, though.

Kevin

PS - my dataset was neither balanced: the stranded, high depth dataset totaled around just 30 samples.

ADD COMMENT • link 4.0 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Again! I tried DEseq2 and I like it! My question is how did you know that the batch effect was removed? I give the batch as a variable in the design to deseq2, and I get a set of differentially expressed genes, but how do I know that the batch was removed? Is there a way to perform pca or clustering on the data without the batch? Thanks a lot!

ADD REPLY • link 6.6 years ago by lirongrossmann ▴ 40

0

Entering edit mode

If you colour the samples by batch in PCA, you can visually inspect the bi-plot to see how different (or not) the batches are. The code would be something iike this:

DESeq2::plotPCA(assay(rld), intgroup=c("Batch"))

For PCA, also remember that the percent variation explained by each principal component is important. If your PC1 only explains 20% variation, then there's not a major difference between your groups as separated by PC!, and so on for PC2, PC3, ..., PCn

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks! What I am a little confused about is that rld is obtained using the model with batch as a confounder for the differential expression but it does not remove the batch, so how can I tell that the batch was removed? (btw, it's a different data set from the one we are talking about in the other threads). My understanding is that the confounders are taken into account in the Deseq function not in the rlog (or vsd for that matter). Here my code for that: (treat is the vector with the biologic effect I am looking at, batch is stranded vs unstranded)

dds <-DESeqDataSetFromMatrix(countData = counts,colData = meta,design = ~batch+treat) 
dds <- estimateSizeFactors(dds)
rld <- rlog(dds)
plotPCA(rld, intgroup="batch")
dds <- DESeq(dds)
res <- results(dds)

Maybe I am missing something here. Hopefully, you can help me nail it. Thanks again!

ADD REPLY • link updated 6.6 years ago by GenoMax 141k • written 6.6 years ago by lirongrossmann ▴ 40

0

Entering edit mode

Okay, and you would expect the samples from the different batches to overlap perfectly on the PCA? If DESeq2 could not correct for this batch effect, then I would separate the samples into different datasets and, possibly, use one for training and the other for validation. If the batch effect is great, then they should not be analysed together. You could take the risk of continuing to include batch as a confounder in your statistical tests, but I would not recommend that.

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k