Pre-processing of RNA-seq data for WGCNA
1
0
8 weeks ago
txtbookir ▴ 10

Hi everyone, I want to create an expression matrix as input for WGCNA. However, I have been told to use RPKM/FPKM data instead of CPM. How can I convert my TCGA data to RPKM/FPKM with GDCquery? Also, how can I filter the expression set by FDR down to fewer than 5000 genes, which is said to be ideal for WGCNA? I have 17000 genes in the expression set, but I cannot add p-values without losing the expression set.

TCGA WGCNA • 680 views
2
8 weeks ago

Hi, you do not have to use FPKM/RPKM. If you have CPM, then please use that, but preferably log-transform it. Please read part 4 of the FAQ: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html

how to filter expression set of genes by FDR to less than 5000

Please just filter them based on minimum mean or sum (total) CPM.
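As a rough illustration of that suggestion, here is a minimal base-R sketch that computes CPM, filters on mean CPM, and log-transforms the result. The toy counts, the cutoff `min_mean_cpm`, and the pseudocount of 1 are assumptions made for the example, not recommended values; in practice `edgeR::cpm()` would normally replace the manual CPM calculation.

```r
# Toy raw-count matrix (genes in rows, samples in columns); values are made up.
counts <- matrix(
  c(600000, 610000, 590000, 605000,
         4,      6,      5,      5,
    400000, 390000, 410000, 395000,
         0,      1,      0,      2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4))
)

# Counts per million: divide each sample (column) by its library size.
lib_sizes <- colSums(counts)
cpm_mat <- t(t(counts) / lib_sizes) * 1e6

# Keep genes whose mean CPM reaches the cutoff (hypothetical threshold).
min_mean_cpm <- 1
keep <- rowMeans(cpm_mat) >= min_mean_cpm

# Log-transform the retained genes; the pseudocount avoids log2(0).
log_cpm <- log2(cpm_mat[keep, , drop = FALSE] + 1)

# WGCNA expects samples in rows and genes in columns.
dat_expr <- t(log_cpm)
print(dim(dat_expr))
```

Filtering on total CPM instead would only change `rowMeans()` to `rowSums()`, with a correspondingly larger cutoff.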

Kevin

0

Dear Kevin, thank you for your helpful comment. I wanted to take the DEGs identified with edgeR/limma-voom (TMM-normalized) and extract their expression data from norm_counts (below) for WGCNA, an approach used in several good papers (e.g. PMC6660050). However, part 2 of the FAQ says "We do not recommend filtering genes by differential expression". Is that approach wrong? Also, can I use "norm" from the edgeR code below, which is log CPM, even though it has 20000 genes? Isn't that too many for WGCNA? Can I just apply another filter based on adjusted p-value, since the top 5000 genes are said to be good for WGCNA?

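library(edgeR) # also attaches limma, which provides voom()
library(SummarizedExperiment) # assay(), colData(), rowData()

# Build the DGEList first; 'tcga_data' is assumed to be a SummarizedExperiment
dge = DGEList(
counts=assay(tcga_data),
samples=colData(tcga_data),
genes=as.data.frame(rowData(tcga_data)))

# 'design' must already exist (e.g. built with model.matrix())
keep = filterByExpr(dge,design) # defining which genes to keep
dge = dge[keep,,keep.lib.sizes=FALSE] # filtering the dge object
rm(keep)

dge = calcNormFactors(dge,method="TMM")

norm_counts <- cpm(dge, log=TRUE) # log2 CPM

v = voom(dge,design,plot=TRUE)
norm <- v$E # log2 CPM from voom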


Thank you so much, I'm confused by these.

1

Hi, well, what do you hope to achieve by doing WGCNA? There is a stark difference between running WGCNA on DEGs compared to running WGCNA on all genes. It depends on what you are hoping to achieve by using WGCNA.

0

Actually, I want to find modules significantly related to TNM stage (I-IV) and then use their genes for Venn diagrams and downstream analysis. For this purpose, are 20000 genes fine, or do I have to take the top 5000 based on ANOVA? Also, I think the papers that use DEGs for WGCNA have done the analysis incorrectly, since this is not recommended by the WGCNA authors. What do you think?

1

I see, then you should do WGCNA unfiltered, and then correlate the module eigengenes to TNM stage. If, for example, the green module statistically significantly correlates to TNM stage, then you would explore those genes comprising the green module.
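To make the module-trait step concrete, here is a small base-R sketch on simulated data. Everything in it is an assumption for illustration: the data are random, the module labels are assigned by hand rather than by WGCNA, TNM stage is coded as the numbers 1 to 4, and the eigengene is computed as the first principal component of the scaled module expression. In a real analysis the eigengenes would come from WGCNA (`moduleEigengenes()`), and p-values are often obtained with `corPvalueStudent()`.

```r
set.seed(1)
n_samples <- 40
stage <- rep(1:4, each = 10)  # TNM stage I-IV coded as 1-4 (assumption)

# Simulated expression: "green" genes track stage, "blue" genes are pure noise.
green <- sapply(1:20, function(i) stage + rnorm(n_samples, sd = 0.5))
blue  <- sapply(1:20, function(i) rnorm(n_samples))
dat_expr <- cbind(green, blue)  # samples in rows, genes in columns
colnames(dat_expr) <- paste0("gene", seq_len(ncol(dat_expr)))
module_colors <- rep(c("green", "blue"), each = 20)  # hand-assigned labels

# Eigengene stand-in: first principal component of the scaled module genes,
# with its sign aligned to the module's average expression.
eigengene <- function(expr) {
  scaled <- scale(expr)
  pc1 <- prcomp(scaled, center = FALSE, scale. = FALSE)$x[, 1]
  if (cor(pc1, rowMeans(scaled)) < 0) pc1 <- -pc1
  pc1
}

MEs <- sapply(unique(module_colors), function(col) {
  eigengene(dat_expr[, module_colors == col, drop = FALSE])
})

# Pearson correlation of each module eigengene with stage, plus its p-value.
module_trait <- t(sapply(colnames(MEs), function(m) {
  ct <- cor.test(MEs[, m], stage)
  c(cor = unname(ct$estimate), p = ct$p.value)
}))
print(round(module_trait, 3))
```

Here the simulated green module is built to follow stage, so it is the one whose member genes would be taken forward, as described above.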

0

Dear Kevin, thank you kindly for your valuable help. May I ask your opinion on this recent post related to WGCNA too?

WGCNA for diferent stages (I-IV)

Thank you so much

0

Dear Kevin, after running WGCNA (step-by-step method) on 17000 genes I got this dendrogram, which has 10035 genes in module 0 (grey). Do I have to reduce the number of genes before WGCNA with another filtering method? I think this is not a good figure for a paper, is it?

1

Hi, it is more important what the data means after you do the module-trait correlations / relationships. However, the fact that ~10000 genes are assigned to grey tells me that you should do more rigorous filtering of your input data.

0

Dear Kevin, which method is better for such filtering? Is my input data file okay? My data.csv was generated with the edgeR/limma workflow, after TMM normalization and applying voom for the log transformation.

1

I would filter at the raw count stage. For example, filter out genes with mean raw count < 20.
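A minimal base-R sketch of that kind of filter is shown here, using a made-up count matrix. The cutoff of 20 is just the example value from this reply and the object names are hypothetical.

```r
# Toy raw-count matrix (genes in rows, samples in columns); values are made up.
counts <- matrix(
  c(120, 95, 140, 110,
      3,  0,   5,   2,
     25, 18,  22,  19,
      0,  0,   1,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", LETTERS[1:4]), paste0("sample", 1:4))
)

# Keep only genes whose mean raw count across samples is at least the cutoff.
min_mean_count <- 20
keep <- rowMeans(counts) >= min_mean_count
counts_filtered <- counts[keep, , drop = FALSE]

print(rownames(counts_filtered))
```

With a DGEList, the same logical vector could be applied as `dge[keep, , keep.lib.sizes=FALSE]` before `calcNormFactors()`, mirroring the code posted earlier in the thread.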