Question: Strand specific and unstranded dataset
0
gravatar for lirongrossmann
5 weeks ago by
lirongrossmann20 wrote:

Hey everybody,

I am working with an RNA-seq dataset that combines samples from strand-specific and unstranded technology. Ideally I would like to use all the samples to be able to detect DE genes between biological groups.

I am trying to deal with the problem at the gene expression level and treating the difference in technologies as batch effect, however my dataset is no very balanced (the biologic groups are not evenly distributed between the two technologies).

Has anyone dealt with a similar situation before? Is there a way to account for the difference in technologies in upstream stages (like counting)?

Thanks a lot,

Liron

rna-seq • 183 views
ADD COMMENTlink modified 4 weeks ago • written 5 weeks ago by lirongrossmann20
1

Thank you very much! I will do that and see what happens! I appreciate your help, Liron

ADD REPLYlink written 5 weeks ago by lirongrossmann20

Hi Again! I tried DEseq2 and I like it! My question is how did you know that the batch effect was removed? I give the batch as a variable in the design to deseq2, and I get a set of differentially expressed genes, but how do I know that the batch was removed? Is there a way to perform pca or clustering on the data without the batch?

ADD REPLYlink written 4 weeks ago by lirongrossmann20

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. These comments should have gone under @Kevin's answer.

ADD REPLYlink written 4 weeks ago by genomax34k
1
gravatar for Kevin Blighe
5 weeks ago by
Kevin Blighe3.3k
Republic of Ireland (√Čire)
Kevin Blighe3.3k wrote:

Hi Liron,

Yes, I have dealt with this exact situation in the past. I had 4 batches of RNA-seq (total sample size ~400). Three of the 4 were unstranded and were sequenced at low depth, whilst the other was stranded and was sequenced to high depth.

I processed them all in exactly the same way using Kallisto and counting abundances over the GENCODE transcripts in FASTA format (upward of 200,000 transcripts). Loading the counts into DESeq2 thereafter, all I needed to do was include batch as a likely confounder of the counts and DESeq2 was able to perfectly adjust them (the counts). The assumed batch effect vanished.

I suggest that you try the same and see what you get. There are programs that are designed to specifically correct for batch effects in RNA-seq data, but I have not had much confidence in them after having used them quite substantially.

Kevin

PS - my dataset was neither balanced: the stranded, high depth dataset totaled around just 30 samples.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe3.3k

Hi Again! I tried DEseq2 and I like it! My question is how did you know that the batch effect was removed? I give the batch as a variable in the design to deseq2, and I get a set of differentially expressed genes, but how do I know that the batch was removed? Is there a way to perform pca or clustering on the data without the batch? Thanks a lot!

ADD REPLYlink written 4 weeks ago by lirongrossmann20

If you colour the samples by batch in PCA, you can visually inspect the bi-plot to see how different (or not) the batches are. The code would be something iike this:

DESeq2::plotPCA(assay(rld), intgroup=c("Batch"))

For PCA, also remember that the percent variation explained by each principal component is important. If your PC1 only explains 20% variation, then there's not a major difference between your groups as separated by PC!, and so on for PC2, PC3, ..., PCn

ADD REPLYlink modified 29 days ago • written 29 days ago by Kevin Blighe3.3k

Thanks! What I am a little confused about is that rld is obtained using the model with batch as a confounder for the differential expression but it does not remove the batch, so how can I tell that the batch was removed? (btw, it's a different data set from the one we are talking about in the other threads). My understanding is that the confounders are taken into account in the Deseq function not in the rlog (or vsd for that matter). Here my code for that: (treat is the vector with the biologic effect I am looking at, batch is stranded vs unstranded)

dds <-DESeqDataSetFromMatrix(countData = counts,colData = meta,design = ~batch+treat) 
dds <- estimateSizeFactors(dds)
rld <- rlog(dds)
plotPCA(rld, intgroup="batch")
dds <- DESeq(dds)
res <- results(dds)

Maybe I am missing something here. Hopefully, you can help me nail it. Thanks again!

ADD REPLYlink modified 27 days ago by genomax34k • written 28 days ago by lirongrossmann20

Okay, and you would expect the samples from the different batches to overlap perfectly on the PCA? If DESeq2 could not correct for this batch effect, then I would separate the samples into different datasets and, possibly, use one for training and the other for validation. If the batch effect is great, then they should not be analysed together. You could take the risk of continuing to include batch as a confounder in your statistical tests, but I would not recommend that.

ADD REPLYlink written 27 days ago by Kevin Blighe3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 962 users visited in the last hour