Question

Collapsing biological replicates for co-expression analysis using WGCNA

0

3.9 years ago

ab4232 • 0

First, my sincere thanks to all community members. Posts here really help people like us who are new to the field.

Recently I completed an RNA-Seq project consisting of 25 samples with 3 biological replicates each. In brief, because there is no reference genome for the organism of interest, I performed de novo transcriptome assembly followed by redundancy removal, estimation of raw read counts, and differential expression analysis using DESeq2. Now I need to perform co-expression analysis using the WGCNA package, which I have done only once before, and that was with output from the Tuxedo pipeline.

From DESeq2 I now have the normalized, rlog, and variance-stabilized counts, but the count matrix has 75 columns (25 samples x 3 replicates). Earlier, the FPKM values I got from the Tuxedo pipeline were already collapsed across biological replicates. So I am looking for help or suggestions on how to handle the biological replicates for WGCNA analysis of the DESeq2 output mentioned above, or on whether the biological replicates can be collapsed somehow so that co-expression is performed on a final set of 25 samples. Everywhere it is suggested not to use DESeq2's collapseReplicates for biological replicates.

I have been searching for a solution for some time and would really appreciate any help or suggestions on how to proceed. Thanks.

RNA-Seq WGCNA DeSeq2 • 2.3k views

updated 3.9 years ago by Kevin Blighe 87k • written 3.9 years ago by ab4232 • 0

score 4 · Accepted Answer · 2020-05-29

4

3.9 years ago

Kevin Blighe 87k

Hi,

I would proceed to WGCNA with the variance stabilised expression levels and without any collapsing of replicates.
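To illustrate what "without any collapsing" means for the correlation step: every replicate stays its own column, so gene-gene correlations are computed across all 75 columns. The sketch below is plain Python, not WGCNA (which is an R package), and only mimics an unsigned soft-thresholded adjacency, |cor|^beta. The gene names, the toy values, the function names and `beta=6` are all assumptions made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def unsigned_adjacency(expr, beta=6):
    """expr maps gene -> list of values, one per replicate column.
    Returns a dict mapping (gene_i, gene_j) -> |cor|^beta."""
    genes = sorted(expr)
    adj = {}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            adj[(g, h)] = abs(pearson(expr[g], expr[h])) ** beta
    return adj

# Toy variance-stabilised values: 2 time points x 3 replicates = 6 columns,
# each replicate kept as its own column (hypothetical numbers).
vst = {
    "geneA": [5.0, 5.2, 4.9, 8.1, 7.9, 8.0],
    "geneB": [6.1, 6.3, 6.0, 9.0, 8.8, 9.1],
    "geneC": [7.0, 6.8, 7.1, 4.0, 4.2, 3.9],
}
adj = unsigned_adjacency(vst)
for pair, weight in adj.items():
    print(pair, round(weight, 3))
```

With real data the same idea applies to the 75-column matrix; the replicate-to-replicate variation simply becomes part of what the correlations are computed from.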

Kevin

3.9 years ago by Kevin Blighe 87k

0

Thanks. Really appreciate it. I have two small doubts.

Won't WGCNA treat each biological replicate as an individual sample?
If I would like to perform clustering such as k-means to see expression patterns across the samples, what should I use? That would again run across the biological replicates (25 samples x 3 replicates = 75) rather than across the 25 samples.

Sorry if I am asking a silly question. With a tool like cuffdiff, this was not a problem.

3.9 years ago by ab4232 • 0

0

Are you sure that these are not technical replicates? Could you explain the source of the samples further? Normally we do not collapse biological replicates, but we may collapse technical replicates.

3.9 years ago by Kevin Blighe 87k

0

Yes, I am sure they are biological replicates. The samples are plant tissue samples collected at 25 different time points from 3 individuals (same genotype) grown in separate pots under identical conditions.

While getting familiar with packages like DESeq2, I learned not to collapse biological replicates. My confusion arose because cuffdiff's fpkm.tracking file used to have a single FPKM value per sample for a given gene, in spite of the biological replicates given as input, and that made it easy to use for tasks like clustering and WGCNA.

Hope I am making some sense.

3.9 years ago by ab4232 • 0

1

I see - thank you for elaborating. The problem there is that, in my opinion, FPKM expression units should not even be used for clustering, or for any analysis in which samples are compared with one another. The TopHat2 / Cufflinks pipeline (and, these days, HISAT2 / StringTie) is good for performing de novo transcriptome assembly and discovering new transcripts and / or splice isoforms - in this way, it provides a summary metric, a single value, across replicates.

If you have already used DESeq2 with your data, then, for WGCNA, I would use [as input to WGCNA] the regularised log or variance-stabilised expression values that are produced by DESeq2.

3.9 years ago by Kevin Blighe 87k

0

I really appreciate you taking the time to reply and help out. Earlier, with FPKM, I used log2(fpkm+1) for clustering etc.

Just one last question. While searching further online, I came across this link. Do you think it is a good approach? I understood it up to the calculation of Spearman's correlation, but I didn't understand how they calculated the weights, especially this line:

"The weighting of each replicate is then calculated as the normalized sum of associations between each replicate with the others."
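One possible reading of that sentence, sketched in plain Python: compute Spearman's correlation between every pair of replicates, add up each replicate's correlations with the other replicates, and divide by the grand total so the weights sum to 1. This is only an interpretation of the quoted line, not a confirmed description of the linked method; the function names and toy profiles are made up, and the sketch assumes the summed correlations are positive.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman's correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def replicate_weights(replicates):
    """Each weight is the replicate's summed correlation with the other
    replicates, normalized so that all weights add up to 1."""
    n = len(replicates)
    sums = [
        sum(spearman(replicates[i], replicates[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    total = sum(sums)
    return [s / total for s in sums]

def weighted_profile(replicates, weights):
    """Collapse replicates into one profile using the weights."""
    n_genes = len(replicates[0])
    return [
        sum(w * rep[g] for w, rep in zip(weights, replicates))
        for g in range(n_genes)
    ]

# Three hypothetical replicate profiles over five genes; the third disagrees.
rep1 = [1.0, 2.0, 3.0, 4.0, 5.0]
rep2 = [1.1, 2.1, 2.9, 4.2, 5.1]
rep3 = [3.0, 1.0, 5.0, 2.0, 4.0]
weights = replicate_weights([rep1, rep2, rep3])
collapsed = weighted_profile([rep1, rep2, rep3], weights)
print(weights)  # the discordant replicate gets the smallest weight
```

Under this reading, a replicate that agrees poorly with the others contributes less to the collapsed profile than it would in a plain average.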

Thanks again.

3.9 years ago by ab4232 • 0

1

Yes, but FPKM units are produced in such a way that absolutely no cross-sample normalisation occurs. So, even logging or Z-scaling these values will still leave bias in the data. You essentially cannot faithfully compare the FPKM value, either logged or unlogged, in one sample versus another.
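A small worked example may make the point concrete. FPKM scales each sample only by its own total read count and by gene length, so if one gene takes up a large share of the reads in one sample, every other gene in that sample looks lower. The sketch below contrasts this with median-of-ratios size factors, the kind of cross-sample scaling DESeq2 applies by default. The counts, gene lengths and function names are invented for illustration, and the code is a simplified stand-in rather than DESeq2's implementation.

```python
import math
from statistics import median

def fpkm(counts, lengths_bp):
    """FPKM per gene within ONE sample:
    count * 1e9 / (gene length in bp * total counts in that sample)."""
    total = sum(counts)
    return [c * 1e9 / (length * total) for c, length in zip(counts, lengths_bp)]

def size_factors(samples):
    """Median-of-ratios size factors across samples.
    samples: list of per-sample count lists in the same gene order.
    Genes with a zero count in any sample are skipped."""
    n_genes = len(samples[0])
    geo_means = []
    for g in range(n_genes):
        vals = [s[g] for s in samples]
        if all(v > 0 for v in vals):
            geo_means.append(math.exp(sum(math.log(v) for v in vals) / len(vals)))
        else:
            geo_means.append(None)
    return [
        median(s[g] / gm for g, gm in enumerate(geo_means) if gm is not None)
        for s in samples
    ]

# Hypothetical counts: genes 1-3 are unchanged, gene 4 is strongly induced
# in sample B and takes reads away from the others at equal sequencing depth.
lengths = [1000, 1000, 1000, 1000]
sample_a = [100, 100, 100, 100]
sample_b = [50, 50, 50, 250]

fpkm_a = fpkm(sample_a, lengths)
fpkm_b = fpkm(sample_b, lengths)

sf_a, sf_b = size_factors([sample_a, sample_b])
norm_a = [c / sf_a for c in sample_a]
norm_b = [c / sf_b for c in sample_b]

print(fpkm_a[0], fpkm_b[0])  # gene 1 looks halved in B by FPKM
print(norm_a[0], norm_b[0])  # gene 1 matches after size-factor scaling
```

In this toy case the within-sample FPKM scaling makes the unchanged genes look two-fold lower in sample B, while the size-factor scaling brings them back to the same level and leaves only gene 4 different.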

The CMAP analysis shown via the link is a specific use-case. For WGCNA, I would still favour not collapsing the biological replicates. The inquisitive nature within me would do it with and without collapsing, just out of interest. Biology has no rules, and neither therefore does bioinformatics.
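For anyone who wants to try the "with and without collapsing" comparison, a minimal sketch in plain Python follows. It averages the replicate columns within each time point and compares one gene-gene correlation before and after. The group labels, values and function names are hypothetical, and a simple mean is just one way to collapse.

```python
import math
from collections import defaultdict

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collapse_by_group(values, groups):
    """Mean of one gene's values within each group (here: time point).
    values and groups are parallel lists, one entry per replicate column."""
    buckets = defaultdict(list)
    for v, g in zip(values, groups):
        buckets[g].append(v)
    return {g: sum(vs) / len(vs) for g, vs in buckets.items()}

# 3 time points x 3 replicates, made-up variance-stabilised values.
groups = ["T1"] * 3 + ["T2"] * 3 + ["T3"] * 3
gene_a = [5.0, 5.4, 4.6, 7.0, 7.3, 6.7, 9.0, 8.6, 9.4]
gene_b = [6.0, 5.5, 6.5, 8.0, 7.8, 8.2, 10.0, 10.5, 9.5]

# Without collapsing: correlation across all 9 replicate columns.
r_full = pearson(gene_a, gene_b)

# With collapsing: correlation across the 3 time-point means.
mean_a = collapse_by_group(gene_a, groups)
mean_b = collapse_by_group(gene_b, groups)
order = sorted(mean_a)
r_collapsed = pearson([mean_a[g] for g in order], [mean_b[g] for g in order])

print(round(r_full, 3), round(r_collapsed, 3))
```

In this toy data the collapsed correlation comes out higher, because averaging removes the replicate-level disagreement between the two genes and leaves only three points to correlate.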

3.9 years ago by Kevin Blighe 87k