Problems with the input (from TPM) to run the WGCNA
1
0
Entering edit mode
12 months ago
OrtegaC • 0

Hello everyone,

Initially I express that I am not very expert in bioinformatics analysis.

I have the TMP from RNAseq data. These data come from Arabidopsis seeds infected with a fungal inoculum. I select the data by calculating the zscore. Thanks to the tutorials and the forums I have managed to run the code. However, I suspect that there is a problem that I need to solve and I don't know how to do it.

datExpr <- read.csv("TPM_Zscore_Ready.csv", header=TRUE,  sep=',')
colnames(datExpr)
row.names(datExpr) = datExpr$Genes
datExpr$Genes = NULL
datExpr = as.data.frame(t(datExpr)) 
dim(datExpr) 

my data (imput) looks like this

Genes
AT1G04990.6
AT3G24180.1
AT4G35510.1
AT2G24420.1
AT5G08450.2
AT5G07970.1
AT5G11530.1
AT4G29380.1
AT5G41350.1
AT3G14270.2

So, I don't know how to properly remove the numbers after the 'dot'. I understand that this is a product of splicing and I assume that it is not just removing them arbitrarily. I was reading about gsub function but I am don't understand very well...

For do the enrichGO analysis, the developers of the clusterprofiler package helped me,

# creating universe: Universe are all expressed genes from datExpr. They will be the background
df <- read.csv("TPM_Zscore_Ready.csv", header=TRUE,  sep=',')
colnames(df)
universe = df [1]
write.csv(universe, file = "GO_enrichment/universe.csv")
# My universe is ready to use in ORA
universe<-as.character(universe[,1])
universe <- gsub("\\..*", "", universe)
universe <- sort(universe, decreasing = TRUE)

# loading DEGs
#After,  in excel I will create a file with all the genes of the selected modules and then I will load them for the GO analysis
DEGs  <- read.delim("GO_enrichment/DEGs.txt", header = TRUE)
deg_list <- vector("list", ncol(DEGs))
names(deg_list) <- colnames(DEGs)
for (i in 1:ncol(DEGs)) {
  gene <- unique(DEGs[, i])
  gene <- gsub("\\..*", "", gene)
  deg_list[[i]] <- gene
}

however for the networks to export in cytoescape and for the gene filter the problem persists. I really appreciate if someone can guide me towards the correction.

Thank you very much

gsub WGCNA function • 1.7k views
ADD COMMENT
1
Entering edit mode

you have the gsub line that removes the dot, so what problem still persists?

ADD REPLY
0
Entering edit mode

So sorry Istvan Albert I did not clearly describe the problem. The gsub in 'univero' had no problems, I was just trying to give an example but it caused confusion.

Thank you very much!

ADD REPLY
1
Entering edit mode
12 months ago
Cera ▴ 20

It's difficult to know without understanding a bit more about the problem you are seeing with cytoscape and with filtering. One thing I will point out is that in your code here:

write.csv(universe, file = "GO_enrichment/universe.csv")
# My universe is ready to use in ORA
universe<-as.character(universe[,1])
universe <- gsub("\\..*", "", universe)
universe <- sort(universe, decreasing = TRUE)

You are writing the "universe" object to file before you make the changes to the items in universe that include stripping off the period character and the charaters that come after it. Therefore, if you try to use "GO_enrichment/universe.csv"; that file will still have periods in the gene IDs that are a problem for your workflow.

Make sure that the file that you're using in cytoscape really does have the gene IDs you expect it to have. R executes as a script, one line at a time, so modifying an object after you've already written it to a file just means that you've modified the object in your R environment, but not that you're modifying what you wrote to a file. I hope that that helps!

If not, please try to describe a little bit more about what incorrect or unexpected behavior you are seeing.

P.S. your gsub command is correct, and does strip the dots and isoforms off the gene names. I tried it in this code:

Genes <- c("AT1G04990.6",
"AT3G24180.1",
"AT4G35510.1",
"AT2G24420.1",
"AT5G08450.2",
"AT5G07970.1",
"AT5G11530.1",
"AT4G29380.1",
"AT5G41350.1",
"AT3G14270.2")

genes <- data.frame(Genes)

genes$Genes

genes$fixed <- gsub("\\..*", "", genes$Genes)

genes
ADD COMMENT
0
Entering edit mode

So, sorry, I didn't adequately describe the problem. I describe it below

I have the TMP from RNAseq data. These data come from Arabidopsis seeds infected with a fungal inoculum. Sequencing reads were aligned to the Arabidopsis transcriptome using Salmon, version 0.14.1 and I get isoforms separately.

My code WGCNA work very well, However, I suspect that there is a problem that I need to solve and I don't know how to do it.

datExpr <- read.csv("TPM_Zscore_Ready.csv", header=TRUE, sep=',') colnames(datExpr) row.names(datExpr) = datExpr$Genes datExpr$Genes = NULL datExpr = as.data.frame(t(datExpr)) dim(datExpr)

my input have 19 columns the first one is the genes. The remaining 18 correspond to three samples with three replicates each.

enter image description here

Commn genes looks like this (with isoforms) Genes AT1G04990.6 AT5G08450.2 AT5G07970.1 AT1G76970.1 AT5G17510.2 AT5G08335.1 AT1G08800.2 AT2G46800.5 AT1G05090.1 AT5G65510.4 AT2G15530.7 AT1G30000.1 AT2G32660.2 AT3G46920.2 AT3G07790.1 AT3G24870.1 AT3G10490.2

So, I don't know how to properly remove isoforms for to have unique genes. networks on cytoescape and genes selected look as isoforms.

enter image description here

Thank you very much

ADD REPLY
0
Entering edit mode

I used your line Cera with my data and it work!

datExpr$Genes 
datExpr$Genes <- gsub("..*", "", datExpr$Genes) datExpr
dim(datExpr)

But I have an error

> row.names(datExpr) = datExpr$Genes
Error in `.rowNamesDF<-`(x, value = value) : 
  les duplications dans 'row.names' ne sont pas autorisées
In addition: Warning message:
non-unique values when setting 'row.names': ‘AT1G01080’, ‘AT1G01100’, ‘AT1G02520’, ‘AT1G02850’, ‘AT1G03040’, ‘AT1G03850’, ‘AT1G04200’, ‘AT1G04590’, ‘AT1G04990’, ‘AT1G07200’, ‘AT1G07350’, ‘AT1G07890’, ‘AT1G08110’, ‘AT1G08900’, ‘AT1G08970’, ‘AT1G09570’, ‘AT1G10522’, ‘AT1G10910’, ‘AT1G11840’, ‘AT1G11850’, ‘AT1G13090’, ‘AT1G14580’, ‘AT1G14700’, ‘AT1G14760’, ‘AT1G15170’, ‘AT1G16560’, ‘AT1G17145’, ‘AT1G17280’, ‘AT1G17940’, ‘AT1G18070’, ‘AT1G19230’, ‘AT1G20920’, ‘AT1G21840’, ‘AT1G23080’, ‘AT1G23460’, ‘AT1G23540’, ‘AT1G26150’, ‘AT1G28090’, ‘AT1G29470’, ‘AT1G30400’, ‘AT1G31860’, ‘AT1G33050’, ‘AT1G34180’, ‘AT1G34550’, ‘AT1G34760’, ‘AT1G35320’, ‘AT1G35580’, ‘AT1G36940’, ‘AT1G44446’, ‘AT1G45249’, ‘AT1G45688’, ‘AT1G48600’, ‘AT1G51160’, ‘AT1G53380’, ‘AT1G53570’, ‘AT1G53780’, ‘A [... truncated] 

Because I haved duplicated

#count unique IDs
length(unique(datExpr$Genes))
#there are 6322 unique genesID

#check how many duplicates there are
sum(duplicated(datExpr$Genes))
# there are 335 duplicated

# Look duplicate rows
datExpr[duplicated(datExpr$Genes),]

Do you know what is the proper process to handle these duplicates?

Thank you very much

ADD REPLY
1
Entering edit mode

OrtegaC

You can't convert an isoform expression matrix into a gene expression matrix like that.

You must quantify the expression at the gene level.

ADD REPLY
0
Entering edit mode

please, could you explain more about it?

Thanks,

ADD REPLY
0
Entering edit mode

Right now you have and isoform co-expression network because datExpr contains isoform expression data.

You can't convert an isoform co-expression network into a gene co-expression network by simply converting the isoform id into gene id with gsub.

You must build an entirely new network with gene expression data.

ADD REPLY
0
Entering edit mode

Ok, thank for your explanation.

I would like to ask you, if in this case it would be appropiate used gsub to remove only the ending (. 1, .2 etc... of the single gene) After, to go from TMP data, sum the equal isoforms and divided as correspond for to have one unique transcripts (geneID). so I would have a new table (where there is only one genID) as indicated here: link1:link2

datExpr <- read.csv("TPM_Zscore_Ready.csv", header=TRUE,  sep=',')

str(datExpr)

colnames(datExpr)

Sequencing reads were aligned to the Arabidopsis transcriptome using Salmon, version 0.14.1 and I get isoforms separately.

use gsub function to fix the problem with isoforms

    datExpr$Genes

    datExpr$Genes <- gsub("\\..*", "", datExpr$Genes)

    datExpr

    dim(datExpr)

#6657 genes

count unique IDs

    length(unique(datExpr$Genes))

#there are 6322 unique genesID

check how many duplicates there are

    sum(duplicated(datExpr$Genes))

# there are 335 duplicated

Look duplicate rows

datExpr[duplicated(datExpr$Genes),]

x <- datExpr[duplicated(datExpr$Genes),] 

table(x$Genes)

There are multiples isoforms for the same gene. So, there is necessary sum the TPM values of individual transcripts to get a gene-level TPM turn data.frame into a data.table

datExpr <- data.table(datExpr)

datExpr <-(datExpr)[,lapply(.SD, sum) ,.(Genes)]

saving as data frame

    datExpr <- as.data.frame(datExpr)

    dim(datExpr)

#look: there are 6322 data and 19 variables, to correspond to line 75

Thank you very much

ADD REPLY
1
Entering edit mode

That is not the proper way to proceed. If Salmon was used to generate an isoform expression matrix, use tximport to aggregate transcript abundance data to the gene level.

ADD REPLY
0
Entering edit mode

thank you, I really appreciate your comments. I will try to do your recomendations.

thank you

ADD REPLY
2
Entering edit mode

Before jumping on your data, follow the tutorial first, and keep particular attention to the creation of the tx2gene file.

Have fun!

ADD REPLY

Login before adding your answer.

Traffic: 1632 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6