Question

Batch Effect

0

5 days ago

Umair • 0

Is it really considered a batch effect when I read the first lines of a FASTQ file and extract the run_number and flowcell_ID from them? Or am I unintentionally over-interpreting this information for my RNA-seq dataset, when it does not actually indicate a batch effect?
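For reference, the kind of extraction described here can be sketched as follows. This is a minimal sketch, assuming the reads carry Illumina Casava 1.8+ style headers (`@instrument:run:flowcell:lane:tile:x:y ...`); the function names `parse_illumina_header` and `summarise_batches` are hypothetical, and headers from other platforms or older pipelines will not match this layout.

```python
import gzip
from collections import Counter
from itertools import islice


def parse_illumina_header(header):
    """Split a Casava 1.8+ style FASTQ header into its leading fields.

    Assumed layout: @instrument:run_number:flowcell_id:lane:tile:x:y [comment]
    """
    if not header.startswith("@"):
        raise ValueError("not a FASTQ header line: %r" % header)
    # Everything before the first space is the colon-separated read ID.
    read_id = header[1:].partition(" ")[0]
    fields = read_id.split(":")
    if len(fields) < 7:
        raise ValueError("unexpected header layout: %r" % header)
    instrument, run_number, flowcell_id, lane = fields[:4]
    return {
        "instrument": instrument,
        "run_number": run_number,
        "flowcell_id": flowcell_id,
        "lane": lane,
    }


def summarise_batches(fastq_path, max_reads=1000):
    """Count (run_number, flowcell_id, lane) combinations in the first reads."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    seen = Counter()
    with opener(fastq_path, "rt") as handle:
        # Header lines are every 4th line, starting with the first line.
        for header in islice(handle, 0, max_reads * 4, 4):
            info = parse_illumina_header(header.strip())
            seen[(info["run_number"], info["flowcell_id"], info["lane"])] += 1
    return seen
```

Looking at more than the first read is a deliberate choice: a FASTQ file merged from several lanes or runs will show more than one combination, which is worth knowing before assigning a single batch label to the sample.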

batch effect • 717 views

ADD COMMENT • link updated 5 hours ago by ATpoint 90k • written 5 days ago by Umair • 0

score 1 · Answer 1 · 2025-11-06

1

5 days ago

Kevin Blighe 89k

Hey,

It is not 'over-interpreting': the information that you have extracted can indeed be used to identify potential batches. In RNA-seq, the sequencing run and flowcell are well-known sources of technical variation (batch effects) and, for this reason, are sometimes explicitly included in the statistical model. The flowcell ID, in particular, can be important.

To check whether these are actually driving a batch effect in your data, I would advise generating a PCA bi-plot (or heatmap) of your normalised counts and colouring the samples by flowcell / run. If you see a clear separation by flowcell or run, then, yes, there is a batch effect. In that case, you can include this information as a covariate in your model, e.g., in DESeq2's design formula.

Kevin

ADD COMMENT • link 5 days ago by Kevin Blighe 89k

0

Thank you, Kevin, for your reply. Could you kindly take a look at the PCA plots of my dataset and advise?

[Image: PCA plots of the dataset; not available]

ADD REPLY • link 4 days ago by Umair • 0

2

Based on your non-batch-corrected plots, I would not consider the flow cell to be meaningfully or clearly impacting things. Personally, I'd leave it out of your model design. Note that checking additional PCs (PC3/4) can also be helpful, but it can quickly become a ghost hunt. True batch effects are typically pretty obvious.

ADD REPLY • link 4 days ago by jared.andrews07 ★ 19k

0

Thank you for the reply. Would you suggest keeping all 3 replicates for each treatment, or dropping some of them? From the PCA, it appears that at least two replicates cluster closely together for most of my treatments. My PCA confuses me.

ADD REPLY • link 4 days ago by Umair • 0

1

Looking at your PCA bi-plot, which was generated by my Bioc package, PCAtools, I would remove the sample at the top-right, then re-do everything, and then re-assess.

ADD REPLY • link 2 days ago by Kevin Blighe 89k

1

The thing with low sample sizes like n=3 is that it is hard to tell what really is an outlier. Removing samples reduces power even further (and at n=3 it is already low). Overall, there is no strong group separation, suggesting modest numbers of DEGs.

What you can do is use the limma framework for DE testing and either use arrayWeights() together with limma-trend, or use voomWithQualityWeights() or voomLmFit() (the latter with sample.weights = TRUE; see the user guide for details). These automatically down-weight samples that show outlier behaviour for many genes according to the design. That is a relatively simple workaround that avoids manually removing samples while dampening the effect of "data-driven-defined" outliers. While this is not necessarily "better" than removing samples, it is at least automated and reproducible, whereas PCA interpretation is not, unless you write code that identifies outliers by some cutoff such as SD or MAD.

Look at the number of DEGs after using these weighting strategies, and check MA and volcano plots to get an idea of whether there is any evidence for a treatment effect. Display the DEGs in a heatmap and see whether some samples obviously do not match the group pattern. Together, that will give you a picture of the complete dataset, which is more intuitive than relying on PCA alone.
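The cutoff-based alternative mentioned above (flagging outliers by MAD rather than by eye) could look roughly like this. It is only a sketch under stated assumptions: the PC1 scores, the sample names, the function name `mad_outliers`, and the cutoff of 3 scaled MADs are all illustrative and not taken from the dataset in this thread.

```python
from statistics import median


def mad_outliers(scores, cutoff=3.0):
    """Return sample names whose score lies more than `cutoff` scaled MADs
    from the median of all scores.

    `scores` maps sample name -> score on one principal component.
    """
    values = list(scores.values())
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        # No spread around the median, so the rule cannot flag anything.
        return []
    # 1.4826 makes the MAD comparable to an SD for normally distributed data.
    scaled_mad = 1.4826 * mad
    return [
        name
        for name, value in scores.items()
        if abs(value - centre) / scaled_mad > cutoff
    ]


# Hypothetical PC1 scores for six samples (illustrative values only).
pc1 = {
    "ctrl_1": -2.1, "ctrl_2": -1.8, "ctrl_3": -2.4,
    "trt_1": 1.9, "trt_2": 2.2, "trt_3": 14.0,
}
flagged = mad_outliers(pc1)
```

A rule like this makes the decision reproducible, but the cutoff is still arbitrary and a single PC can mix group differences with outlier behaviour, so it complements rather than replaces the weighting approach.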

ADD REPLY • link 2 days ago by ATpoint 90k

0

That is definitely true.

ADD REPLY • link 1 day ago by Kevin Blighe 89k

0

Thank you, ATpoint, for the detailed advice. I was considering applying EBSeq for the DEG analysis. Now I think I should rely on the limma framework instead. What would your advice be regarding my initial plan of using EBSeq?

ADD REPLY • link 17 hours ago by Umair • 0

0

I do not know EBSeq, but generally, any testing framework suffers when n is low and the data are noisy. I use limma with the weighting a lot since, as said, it is automated and hence reproducible, whereas manual outlier removal is (sort of) not.

ADD REPLY • link 5 hours ago by ATpoint 90k