Hi! I'm a beginner in bioinformatics and I'm trying to replicate a result from the paper "TAZ Expression as a Prognostic Indicator in Colorectal Cancer" (https://www.researchgate.net/publication/235393359_TAZ_Expression_as_a_Prognostic_Indicator_in_Colorectal_Cancer).
Currently, I'm working with the GSE14333 dataset from GEO.
To make Figure 1, I looked up the genes "AXL", "WWTR1", "YAP1" and "CTGF" by their Entrez IDs in data@featureData@data$ENTREZ_GENE_ID. For some genes, several rows of the expression matrix match the same Entrez gene ID. For example:
ID // GB_ACC // ... // Gene Symbol
213342_at // AI745185 // ... // YAP1
224894_at // BF247906 // ... // YAP1
224895_at // AA557632 // ... // YAP1
YAP1 matched with 3 rows, WWTR1 with 3 rows, AXL with 2 rows, and CTGF with 1 row.
Each row for YAP1 seems to be distinct, and each has a different expression level in the expression matrix. So how can I make the scatter plot from Figure 1? Should I pick only one row when there are several? Or can I just take the average expression level across all of them?
I hope the Target Description below helps to identify each of the rows in the case of YAP1.
[1] "gb:AI745185 /DB_XREF=gi:5113473 /DB_XREF=wg10a05.x1 /CLONE=IMAGE:2364656 /FEA=FLmRNA /CNT=46 /TID=Hs.8939.0 /TIER=Stack /STK=13 /UG=Hs.8939 /LL=10413 /UG_GENE=YAP65 /UG_TITLE=yes-associated protein 65 kDa /FL=gb:NM_006106.1"
[2] "gb:BF247906 /DB_XREF=gi:11163848 /DB_XREF=601858274F1 /CLONE=IMAGE:4068810 /FEA=EST /CNT=137 /TID=Hs.84520.0 /TIER=Stack /STK=51 /UG=Hs.84520 /UG_TITLE=ESTs"
[3] "gb:AA557632 /DB_XREF=gi:2328109 /DB_XREF=nl11g07.s1 /CLONE=IMAGE:1030044 /FEA=EST /CNT=137 /TID=Hs.84520.0 /TIER=Stack /STK=9 /UG=Hs.84520 /UG_TITLE=ESTs"
I'm stuck here. Please give me a hand.
IDs ending in "_at" are Probe IDs from a microarray experiment, not Entrez IDs. You typically summarize the Probe IDs that map to a gene into a single value per gene; please read up on microarray analysis. How did you process these data?
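To make "a single value per gene" concrete, here is a minimal base R sketch of two common ways to collapse probe-level rows. It is only an illustration: the expression values, the extra probe "probe_X_at", and the gene "GENE_X" are made up, and nothing in this thread says which approach (if either) the paper used.

```r
# Toy probe-level matrix: rows are Probe IDs, columns are samples.
# The expression values are invented for illustration only.
expr <- matrix(
  c(7.1,  7.4, 6.9,
    9.8, 10.1, 9.5,
    5.2,  5.0, 5.6,
    8.0,  8.3, 7.7),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("213342_at", "224894_at", "224895_at", "probe_X_at"),
    c("S1", "S2", "S3")
  )
)

# Probe-to-gene mapping, in the same order as the rows of expr.
# "GENE_X" is a hypothetical second gene with a single probe.
symbol <- c("YAP1", "YAP1", "YAP1", "GENE_X")

# Option 1: average all probes that map to the same gene.
sums <- rowsum(expr, group = symbol)            # per-gene column sums
n_probes <- as.vector(table(symbol)[rownames(sums)])
gene_mean <- sums / n_probes                    # per-gene column means

# Option 2: keep, for each gene, the probe with the highest mean expression.
probe_avg <- rowMeans(expr)
best_probe <- vapply(
  split(names(probe_avg), symbol),
  function(p) p[which.max(probe_avg[p])],
  character(1)
)
gene_best <- expr[best_probe, , drop = FALSE]
rownames(gene_best) <- names(best_probe)
```

Either `gene_mean` or `gene_best` then has one row per gene, so two genes can be plotted against each other across samples. Which option is appropriate depends on how well each probe actually targets the gene, which is worth checking in the annotation before averaging.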
First, I obtained a gene expression matrix (rows: "_at" Probe IDs, columns: samples). To replicate the paper, I tried to find out which Probe IDs correspond to "AXL", "WWTR1", "YAP1" and "CTGF". In the raw data, data@featureData@data contains a table of descriptions for each row, and that table includes Entrez IDs. I also looked up the genes one by one on Wikipedia to get the Entrez ID for each gene name. Then I selected the rows that have those Entrez IDs. But each Entrez ID matched one or more Probe IDs.
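The lookup described above can be sketched in base R roughly as follows. The data frame `annot` is a toy stand-in for the annotation table in data@featureData@data; the row "probe_X_at" and the ID "99999" are invented, and the Entrez ID 10413 is taken from the /LL= field of the YAP1 Target Description.

```r
# Toy stand-in for the feature annotation table.
# The three YAP1 rows come from the thread; the last row is invented.
annot <- data.frame(
  ID = c("213342_at", "224894_at", "224895_at", "probe_X_at"),
  ENTREZ_GENE_ID = c("10413", "10413", "10413", "99999"),
  stringsAsFactors = FALSE
)

# Entrez IDs of interest, named by gene symbol.
wanted <- c(YAP1 = "10413")

# For each gene, collect every Probe ID annotated with its Entrez ID.
probes_by_gene <- lapply(
  wanted,
  function(id) annot$ID[annot$ENTREZ_GENE_ID == id]
)

# Number of probes found per gene.
n_per_gene <- lengths(probes_by_gene)
```

With the real annotation table, `probes_by_gene` would hold the Probe IDs for each of the four genes, which can then be used to index the rows of the expression matrix.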
The first sentence is the key. How did you do that?
With this code: "exprs(data)". The raw data contain a (normalized) gene expression matrix, gene information (descriptions for the rows), and clinical data (descriptions for the columns).