Question

data pre-processing for WGCNA Algorithm

0

Entering edit mode

6.0 years ago

modarzi ▴ 170

Hi, My RNA-seq data set belongs to TCGA data bank. I download this data set by TCGAbiolinks package.for using this package, I define a special query that one argument of this query is workflow.type = "HTSeq - FPKM-UQ". I mean, my data is Normal. So, for using this data as WGCNA input, I transform it to log2 value. But, in bioconductor support froum, I see some below comment from the author of model:

"you can certainly use WGCNA for RNA-seq data. Two recommendations: 1. Filter out genes whose count is less than say 5 in more than say 80% of the samples. This gets rid of a lot of noise and gets rid of expression profiles for which correlation makes little sense. 2. Use a variance-stabilizing transformation, such as the one implemented in varianceStabilizingTransformation or rlogTransformation in the DESeq2. I have analyzed a few RNA-seq data sets and have had great results."

Now, I don't know my data set needs any additional pre-processing or not? I appreciate if anybody share his/her comment with me. Best regards, Mohammad

RNA-Seq WGCNA Pre-processing FPKM TCGA • 4.9k views

ADD COMMENT • link updated 6.0 years ago by Kevin Blighe 87k • written 6.0 years ago by modarzi ▴ 170

score 1 · Answer 1 · 2018-05-11

1

Entering edit mode

6.0 years ago

Kevin Blighe 87k

Hello Mohammad,

Well, one thing is for sure: you should not input the FPKM-UQ counts into DESeq2 in order to do the VST or RLD transformation.

My recommendation would be to filter out genes with low FPKM-UQ counts. I would choose something like 5 or 10. As per the WGCNA developer, you can filter out genes that are <5 or <10 in a certain proportion of samples, or you can filter out genes whose mean FPKM-UQ expression is <5 or <10 across all samples. You can also filter out genes that have low variance, if you wish and if the initial filtering does not work that well.

You can then technically just use the filtered counts on the FPKM-UQ scale for WGCNA. WGCNA is based on correlation and the most important thing is therefore that your sample are just processed and normalised in the same fashion.

Kevin

ADD COMMENT • link 5.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello Dr. Kevin

Thanks. I need more explanation. my detail query is :

query <- GDCquery(project = "TCGA-BRCA",sample.type = "Primary solid Tumor", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification",workflow.type = "HTSeq - FPKM-UQ"); So, for workflow.type argument, I can select "HTSeq-Counts" or "HTSeq-FPKM" or "HTSeq-FPKM-UQ". based on your comment, really I can't decide to select which argument.It's better to say that which data is better for me? I appreciate if you share your comment with me. Best Regards, Mohammad

ADD REPLY • link 6.0 years ago by modarzi ▴ 170

1

Entering edit mode

If you select HTseq raw counts, then you will have to process them (i.e. normalise) yourself. The FPKM and FPKM-UQ counts are already normalised but are absolutely not suitable for cross-sample comparisons, i.e., not suitable for differential expression analysis.

If you are just going to use WGCNA, which is fundamentally based on correlation, then the FPKM-UQ counts would be 'okay'.

If you want to see this as a learning exercise and to pick up skills in processing raw counts RNA-seq data, however, then start with the HTseq counts and take a look here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseq-count-input

ADD REPLY • link 6.0 years ago by Kevin Blighe 87k

0

Entering edit mode

OK, for now I need to use this data in WGCNA. let me I came back to your first comment. The value of expression genes in FPKM-UQ file that I downloaded are different from 0 to 770296.2446.So, first of all, I transformed these data to new space by log2. is it true action? then if is it true, you mean I have to filter out genes expression less than 5? I appreciate if you give me more explanation about pre-process of my data set in this step. Best Regards, Mohammad

ADD REPLY • link 6.0 years ago by modarzi ▴ 170

1

Entering edit mode

That is not ideal but it should be fine for WGCNA and, again, it is fine because WGCNA is based on correlation and not differential expression comparisons.

You could also transform the FPKM data to the Z scale using the zFPKM package in R.

ADD REPLY • link 6.0 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks,now I have anothe problem.

Dear Dr. Kevin Blighe I have another problem. As you know, in WGCNA tutorial is based on Microarray data set and my data set is RNA-seq expression. So I have to prepare annotation file same as Annotation.csv. Based on my data set, I downloaded gencode.v22.annotation.gtf file but I don’t know what process must do on that file to would be similar tutorial annotation file. I appreciate if you share your comment with me.

Best Regards,

Mohammad

ADD REPLY • link 5.9 years ago by modarzi ▴ 170

1

Entering edit mode

Hello Mohammad. In which format are your IDs, currently? Are they ENSEMBL IDs?

GENCODE provides 'translational' tables for annotation in different formats. See the bottom of the following web-page: https://www.gencodegenes.org/releases/current.html

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks,

My data set is belong to TCGA. So, I found gencode.v22.annotation.gtf.gz as annotation file from below link:

https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files

Best Regards,

Mohammad

ADD REPLY • link 5.9 years ago by modarzi ▴ 170

0

Entering edit mode

But what is the problem, exactly? In which format are your gene IDs, currently? They should already be HGNC symbols, or ENSEMBL.

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Dear Dr. Blighe

Thanks for your comment. I have 2 problem. As you know for preparing output file for gene ontology analysis in WGCNA, I need GeneAnnotation.csv file for my Gene Expression profile. My Gene Expression data set is RNA-seq data from The Genome Cancer Atlas (TCGA). For that purpose I downloaded "gencode.v22.genes.csv" but the attributes name of "gencode.v22.genes" is not similar to GeneAnnotaion.csv file in Tutorials for the WGCNA package.

I don't know how should I handle Annotation file for my dataset?

And my second problem is that below code is based on MicroArray and my data set is RNA-seq.

# Read in the probe annotation
GeneAnnotation=read.csv(file="GeneAnnotation.csv")
# Match probes in the data set to those of the annotation file
probes = names(datExprFemale)
probes2annot = match(probes,GeneAnnotation$substanceBXH)
# data frame with gene significances (cor with the traits)
datGS.Traits=data.frame(cor(datExprFemale,datTraits,use="p"))
names(datGS.Traits)=paste("cor",names(datGS.Traits),sep=".")
datOutput=data.frame(ProbeID=names(datExprFemale),
                     GeneAnnotation[probes2annot,],moduleColorsFemale,datKME,datGS.Traits)
# save the results in a comma delimited file
write.table(datOutput,"FemaleMouseResults.csv",row.names=F,sep=",")

How can I customize that code for RNA-seq?

I appreciate if you share your comment with me.

ADD REPLY • link 5.9 years ago by modarzi ▴ 170

1

Entering edit mode

So, you have ENSEMBL gene names (begin with 'ENSG') and you need to convert these to HGNC / Official gene symbols?

If that is the case, then please take a look here:

C: RNA-seq, R, DESeq2,

[your ENSEMBL gene list would go where the rownames(matrixRLD) is mentioned]

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Dear Dr. Blighe

Thanks for your comment. As you know I run WGCNA for my study.Now, I want to import my network in Cytoscape for visualization. based on WGCNA tutorial, for that purpose I have to run below code:

# select modules modules = c("blue","brown") 
# Select module probes 
inModule=is.finite(match(moduleColorsFemale,modules)) 
modProbes=probes[inModule] 
match1=match(modProbes,GeneAnnotation$substanceBXH) 
modGenes=GeneAnnotation$gene_symbol[match1] 
# Select the corresponding Topological Overlap 
 modTOM = TOM[inModule, inModule] 
dimnames(modTOM) = list(modProbes, modProbes) 
# Export the network into edge and node list files for Cytoscape 
cyt = exportNetworkToCytoscape(modTOM, 
edgeFile=paste("CytoEdge",paste(modules,collapse="-"),".txt",sep=""), nodeFile=paste("CytoNode",paste(modules,collapse="-"),".txt",sep=""), 
weighted = TRUE, threshold = 0.02,nodeNames=modProbes, 
altNodeNames = modGenes, nodeAttr = moduleColorsFemale[inModule])

when I want to run:

modTOM = TOM[inModule, inModule]

I got below error:

Error: object 'TOM' not found.

So, my question is what is TOM.should I calculate TOM via below code:

> TOM = TOMsimilarityFromExpr(datExpr, power=7)

I appreciate if you share your comment with me.

Best Regrds,

Mohammad

ADD REPLY • link 5.8 years ago by modarzi ▴ 170