Question

Batch effect normalization

0

Entering edit mode

7 months ago

Smriti ▴ 20

hello people!

My RNA-seq data has been sequenced in 5 batches. Upon applying MW stats, I found significance for primary aligned read counts between batches. Since the data is for severity, I compared mild of batch 1 with mild of another batch and likewise for mod and severe categories to assure that the significance between batches (for primary aligned reads for all mild, mod and sev) is coming because of only technical variability as I can't risk of losing actual biological impact between any of those groups.

I found significance in primary aligned read count when I compared mild of batch 1 with another batch and likewise but this wasn't consistent for all.

Considering this, I need to remove the batch effect. since I am at downstream considering coverage breadth (as the question is "how much of the gene body is covered by my seq reads?"), ... for this no tool is available. I read about DSeq2, using limma and other packages in back end for same, so thought to apply that.

Can anyone please help me understand how limma or other more suitable tool performs this batch correction (data normalisation), and how can I do this?

Thanks in advance!

RNA-seq batch-effect • 1.1k views

ADD COMMENT • link updated 7 months ago by Ram 43k • written 7 months ago by Smriti ▴ 20

0

Entering edit mode

Can you post your experimental layout, so which samples belong to which group and batch?

ADD REPLY • link 7 months ago by ATpoint 82k

0

Entering edit mode

sure, so i am trying to figure out how much by coverage breadth, the gene body (reference) got covered by my sequencing reads. in 5 batches, illumina rna seq was performed (paired end), for 112 samples (45, 46, 21 in mild, mod and severe category).. for coverage breadth i did bedtools

to be assure that coverage has got no influence of read count, i did calculate sample wise read count by samtools and qualimap... got exact same figure.

now, to proceed, i observed MW high significance for primary reads percentages (taken without classification...) not to lose any biological impact, i did calculate MW significance this time between batches (mild of one batch with another ...likewise and then likewise for mod and sev categories) .

i still see high significance in most of the groups, though not in all. now looking to remove these batch effects or simply anyhow if i can normalise my read count data.

below is the classification.

Distribution batch-wise 1   2   3   4   5
mild                    6       12  8   11  8
mod                 13      12  1   12  8
severe                  4   0   8   0   9

Primary-aligned read count (%) across batches (severity wise)

mild_1  mild_2  mild_3  mild_4  mild_5  mod_1   mod_2   mod_4   mod_5   sev_1   sev_3   sev_5
84          61         16            80          16        72             23              15              17           71            77           16
77         72          17           78           17       16              84              80               19          84              29             17
79          31         16            53           23          18              78              86              29           87           28            16
78          25          16           80           16           16              82             87              27           52             22              16
50         83           80           61           26          16               87              72             16                       85             16
49         77           65           84           27           80              84             78               16                      77             20
           86           70           72           21           78           77            82               17                       76    18
           85           17           84           32          77               84      86               70              91     17
           16                        86                   81                82     83                                              88
           51                       87                    81              79              80                    
           75                       72                   78           76               84                   
            75                                            76              77               80                   
                                                           89

ADD REPLY • link updated 7 months ago by ATpoint 82k • written 7 months ago by Smriti ▴ 20

0

Entering edit mode

how did you even compare read count can you tell? " i found signifcance for orimary aligned readcounts between batches. since the data is for severity, i compared mild of batch 1 with mild of another batch and likewise for mod and severe categorie" this part?

"i found significance in primary aligned read count when comapred mild of batch 1 with another batch and likewise... but this wan't consistent for all." please give you multiqc result that would help others to validate what you are saying

ADD REPLY • link 7 months ago by 1769mkc ★ 1.2k

0

Entering edit mode

Mann whitney

ADD REPLY • link 7 months ago by Smriti ▴ 20

0

Entering edit mode

can you give your multiqc report upload it or just the screenshot that is helpful for all.

Meanwhile what EDA(Exploratory data analysis ) have you done on your data ? if yes please let know

"can anyone please make me understand how limma or other more suitable tool performs this batch correction (data normalisation), and how can i do this ..." there are many solved issue which you can find in biostar or biconductor, what i understand from your approach is complicated and convoluted, either you are not given the right protocol or you are doing it for the first time. If I were you I wont compare the read count first. I would do some transformation using deseq2 or edger or limma any of them and go ahead with EDA such as start with PCA, clustering, correlation just to get the idea of underlying data

ADD REPLY • link 7 months ago by 1769mkc ★ 1.2k

0

Entering edit mode

![multibamqc_ss

thank you for your response yes, doing it for the first time

uploading screenshots of my multiqc file

ADD REPLY • link 7 months ago by Smriti ▴ 20

score 5 · Accepted Answer · 2023-09-18

5

Entering edit mode

7 months ago

ATpoint 82k

I find this quite difficult to follow since it is poorly formatted.

Anyway:

1) Don't do any analysis on raw read counts or the percent of reads aligned. This is a relatively meaningless statistic. Do analysis on the count matrix, more specifically on normalized counts. For example, start with exploratory analysis such as PCA. See the DESeq2 vignette on how to do this and how to prepare data for it. See if there is evidence for batch effects. After all, only the normalized counts have a meaning. Different raw counts can usually be solved by normalization.

2) If you have batches then include it into the design for DE analysis, like ~batch+group. You can use limma::removeBatchEffect() to remove it from the normalized counts to see batch effect removal on a PCA, for example remove it from the vst-corrected counts. See DESeq2 manual on vst, and ?removeBatchEffects for more details. There is also lots of previous questions addressing this.

3) Don't do MW on these values from above, it's meaningless. See 1) and 2).

ADD COMMENT • link 7 months ago by ATpoint 82k

0

Entering edit mode

thanks for your suggestion. yes, doing this for the very first tiime. but the question in general is different (where i am taking account into the "coverage breadth" and not depth)... so how accurate or meaningfult to follow dseq2 as we do in other rna seq exp to get degs?

understood. i should check batch effect on norm read counts.

thanks indeed

ADD REPLY • link 7 months ago by Smriti ▴ 20

2

Entering edit mode

yes, doing this for the very first tiime.

So follow standard vignettes instead of coming up with custom things like "coverage breadth"...no clue what this means :) Keep it simple, do what everyone does untilyou have expert knowledge to do fancy other things.

ADD REPLY • link 7 months ago by ATpoint 82k

0

Entering edit mode

" so how accurate or meaningfult to follow dseq2 as we do in other rna seq exp to get degs?" have you gone through enough literature with regards to basic analysis pipeline from count to diffrential gene?

ADD REPLY • link 7 months ago by 1769mkc ★ 1.2k