Hi, my RNA-seq data set comes from the TCGA data bank. I downloaded it with the TCGAbiolinks package. To use this package, I defined a query in which one argument is workflow.type = "HTSeq - FPKM-UQ", so my data are already normalised. To use these data as WGCNA input, I transformed them to log2 values. However, on the Bioconductor support forum, I saw the comment below from the author of the method:
"you can certainly use WGCNA for RNA-seq data. Two recommendations: 1. Filter out genes whose count is less than say 5 in more than say 80% of the samples. This gets rid of a lot of noise and gets rid of expression profiles for which correlation makes little sense. 2. Use a variance-stabilizing transformation, such as the one implemented in varianceStabilizingTransformation or rlogTransformation in the DESeq2. I have analyzed a few RNA-seq data sets and have had great results."
Now I don't know whether my data set needs any additional pre-processing. I would appreciate it if anybody could share their comments with me. Best regards, Mohammad
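The first recommendation in the quoted comment (drop genes whose count is below 5 in more than 80% of the samples) can be sketched in base R as follows. The matrix `counts`, its gene names and its values are made up for illustration, and the thresholds are simply the "say 5" and "say 80%" from the quote, not fixed rules.

```r
# Toy count matrix: genes in rows, samples in columns (made-up values).
counts <- matrix(
  c( 0,  1,  2,  0,  3,  0,  1,  0,  2,  1,   # geneA: low in every sample
    50, 60,  0, 80, 75, 90,  4, 66, 71, 58,   # geneB: low in 2 of 10 samples
     0,  0,  0,  0,  0,  0,  0,  0,  0, 30,   # geneC: low in 9 of 10 samples
    10,  0, 12,  3, 40,  7,  9,  2, 15, 20),  # geneD: low in 3 of 10 samples
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", c("A", "B", "C", "D")),
                  paste0("sample", 1:10))
)

min_count    <- 5    # "count is less than say 5"
max_low_frac <- 0.8  # "in more than say 80% of the samples"

# Fraction of samples in which each gene falls below the count threshold.
low_frac <- rowMeans(counts < min_count)

# Keep genes that are NOT low in more than 80% of the samples.
filtered <- counts[low_frac <= max_low_frac, , drop = FALSE]

print(rownames(filtered))
```

Note that this filter is stated in terms of counts; applying the same numeric cutoff to FPKM-UQ values would be a separate judgment call.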
Hello Dr. Kevin
Thanks. I need more explanation. My full query is:
query <- GDCquery(project = "TCGA-BRCA", sample.type = "Primary solid Tumor", data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - FPKM-UQ")
So, for the workflow.type argument, I can select "HTSeq - Counts", "HTSeq - FPKM" or "HTSeq - FPKM-UQ". Based on your comment, I really can't decide which one to select. In other words, which data type is better for me? I would appreciate it if you could share your comments with me. Best Regards, Mohammad
If you select HTseq raw counts, then you will have to process them (i.e. normalise) yourself. The FPKM and FPKM-UQ counts are already normalised but are absolutely not suitable for cross-sample comparisons, i.e., not suitable for differential expression analysis.
If you are just going to use WGCNA, which is fundamentally based on correlation, then the FPKM-UQ counts would be 'okay'.
If you want to see this as a learning exercise and to pick up skills in processing raw counts RNA-seq data, however, then start with the HTseq counts and take a look here: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#htseq-count-input
OK, for now I need to use these data in WGCNA. Let me come back to your first comment. The expression values in the FPKM-UQ file that I downloaded range from 0 to 770296.2446. So, first of all, I transformed these data with log2. Is that the right thing to do? And if it is, do you mean that I then have to filter out genes with expression below 5? I would appreciate more explanation about pre-processing my data set at this step. Best Regards, Mohammad
That is not ideal but it should be fine for WGCNA and, again, it is fine because WGCNA is based on correlation and not differential expression comparisons.
You could also transform the FPKM data to the Z scale using the zFPKM package in R.
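Since the FPKM-UQ values start at 0, a plain log2 would produce -Inf for those entries. A minimal sketch of one common workaround, a pseudocount of 1 before the log, is shown below on made-up values; the object names `fpkm_uq` and `datExpr` and the pseudocount are assumptions for illustration. The matrix is transposed at the end because WGCNA expects samples in rows and genes in columns.

```r
# Toy FPKM-UQ matrix: genes in rows, samples in columns (made-up values,
# apart from the 0 and 770296.2446 extremes mentioned in the question).
fpkm_uq <- matrix(
  c(0,    770296.2446,
    15.5, 1023,
    0,    7,
    3,    255),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB"))
)

# log2(x + 1): the pseudocount keeps zeros finite (log2(0) would be -Inf).
log_expr <- log2(fpkm_uq + 1)

# WGCNA convention: rows = samples, columns = genes.
datExpr <- t(log_expr)

print(dim(datExpr))
print(all(is.finite(datExpr)))
```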
Thanks, now I have another problem.
Dear Dr. Kevin Blighe, the WGCNA tutorial is based on a microarray data set, whereas my data set is RNA-seq expression data. So I have to prepare an annotation file similar to Annotation.csv. For my data set, I downloaded the gencode.v22.annotation.gtf file, but I don't know how to process that file so that it resembles the tutorial's annotation file. I would appreciate it if you could share your comments with me.
Best Regards,
Mohammad
Hello Mohammad. In which format are your IDs, currently? Are they ENSEMBL IDs?
GENCODE provides 'translational' tables for annotation in different formats. See the bottom of the following web-page: https://www.gencodegenes.org/releases/current.html
Thanks,
My data set belongs to TCGA, so I found gencode.v22.annotation.gtf.gz as the annotation file at the link below:
https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files
Best Regards,
Mohammad
But what is the problem, exactly? In which format are your gene IDs, currently? They should already be HGNC symbols, or ENSEMBL.
Dear Dr. Blighe
Thanks for your comment. I have 2 problems. To prepare the output file for gene ontology analysis in WGCNA, I need a GeneAnnotation.csv file for my gene expression profile. My gene expression data set is RNA-seq data from The Cancer Genome Atlas (TCGA). For that purpose I downloaded "gencode.v22.genes.csv", but the column names of "gencode.v22.genes" are not the same as those of the GeneAnnotation.csv file in the tutorials for the WGCNA package.
I don't know how I should handle the annotation file for my data set.
My second problem is that the code below is based on microarray data, whereas my data set is RNA-seq.
How can I customize that code for RNA-seq?
I would appreciate it if you could share your comments with me.
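For the annotation question, one possible approach is to build a small lookup table directly from the GENCODE GTF by pulling gene_id and gene_name out of the attribute column. The sketch below uses base R on a few toy GTF lines; the coordinates and ID version suffixes are illustrative, and the output column names (`gene_id`, `ensembl_id`, `gene_symbol`) are hypothetical, chosen here rather than taken from the WGCNA tutorial file. For the real file, the `text =` argument would be replaced by the path to the GTF.

```r
# Toy GTF lines (tab-separated; attribute column last). Values are illustrative.
gtf_lines <- c(
  'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";',
  'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";',
  'chr17\tHAVANA\tgene\t7661779\t7687550\t.\t-\t.\tgene_id "ENSG00000141510.15"; gene_type "protein_coding"; gene_name "TP53";'
)

# quote = "" is needed because the attribute column contains double quotes.
gtf <- read.delim(text = gtf_lines, header = FALSE, comment.char = "#",
                  quote = "", stringsAsFactors = FALSE)

# Keep one row per gene (GTF column 3 is the feature type).
genes <- gtf[gtf$V3 == "gene", ]

# Extract the quoted value that follows a given attribute key.
get_attr <- function(x, key) {
  sub(paste0('.*', key, ' "([^"]+)".*'), "\\1", x)
}

annot <- data.frame(
  gene_id     = get_attr(genes$V9, "gene_id"),    # versioned ID as in the GTF
  gene_symbol = get_attr(genes$V9, "gene_name"),
  stringsAsFactors = FALSE
)
# Version-stripped ID, useful when other resources omit the ".N" suffix.
annot$ensembl_id <- sub("\\.[0-9]+$", "", annot$gene_id)

print(annot)
```

The rows of `annot` can then be aligned with the expression data using `match()` on whichever ID form the expression matrix carries.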
So, you have ENSEMBL gene names (begin with 'ENSG') and you need to convert these to HGNC / Official gene symbols?
If that is the case, then please take a look here:
[your ENSEMBL gene list would go where rownames(matrixRLD) is mentioned]
Dear Dr. Blighe
Thanks for your comment. As you know, I ran WGCNA for my study. Now I want to import my network into Cytoscape for visualization. Based on the WGCNA tutorial, for that purpose I have to run the code below:
When I try to run:
I get the error below:
Error: object 'TOM' not found.
So, my question is: what is TOM? Should I calculate TOM via the code below:
I would appreciate it if you could share your comments with me.
Best Regards,
Mohammad
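On the TOM question: TOM is the topological overlap matrix that WGCNA derives from the adjacency matrix, and the error means that no object with that name exists in the current R session. In WGCNA it can be computed with TOMsimilarityFromExpr(datExpr, power = softPower). The base-R sketch below only illustrates what an unsigned TOM is, on random made-up data; `datExpr`, `softPower = 6` and the gene names are assumptions for illustration, and for real data the WGCNA function is the one to use.

```r
set.seed(1)

# Toy expression data in WGCNA layout: rows = samples, columns = genes.
datExpr <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6,
                  dimnames = list(paste0("S", 1:20), paste0("gene", 1:6)))

softPower <- 6  # assumed soft-thresholding power, for illustration only

# Unsigned adjacency: absolute correlation raised to the soft power.
adj <- abs(cor(datExpr))^softPower
diag(adj) <- 0  # exclude self-connections from the sums below

# Shared-neighbour term: l_ij = sum over u of a_iu * a_uj.
shared <- adj %*% adj

# Connectivity of each gene and the pairwise minimum connectivity.
k    <- rowSums(adj)
kmin <- outer(k, k, pmin)

# Topological overlap: (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij).
TOM <- (shared + adj) / (kmin + 1 - adj)
diag(TOM) <- 1
dimnames(TOM) <- dimnames(adj)

print(round(TOM[1:3, 1:3], 3))
```

Once an object called TOM exists in the session, the Cytoscape export step from the tutorial has something to work with.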