Question: Extracting gene symbols from gene assignments in exon array data
Kim wrote:

Hello everyone

I'm working on gene expression data from a human exon array. I want a column of gene symbols, but the only column that carries this information is "gene assignment", and its entries look like this:

NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000445632 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000354755 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// BC126412 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000278487 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494

I would like to extract gene symbols from this (CCDC81 in this case). Does anyone know how I can do that in R?

Thank you very much

modified 6 days ago by Pierre Lindenbaum • written 6 days ago by Kim

Have you tried the strsplit function in R?

written 6 days ago by Russ

Yes, I'm trying to use strsplit, but it expects a character vector and the "gene assignment" column is a factor, so it isn't straightforward.
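To make the factor issue concrete, here is a minimal sketch using a shortened, single-record version of the string from the question. The variable names (x, split_on_factor, fields, symbol) are made up for illustration. It only shows that strsplit refuses a factor and accepts the same value once it is coerced with as.character.

```r
# One record from the question, stored as a factor (as the poster describes).
x <- factor("NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494")

# strsplit() needs a character vector; on a factor it raises an error,
# which is caught here so the script keeps running.
split_on_factor <- tryCatch(
  strsplit(x, " // ", fixed = TRUE),
  error = function(e) "error"
)

# After coercion the split works; the gene symbol is the second field.
fields <- strsplit(as.character(x), " // ", fixed = TRUE)[[1]]
symbol <- fields[2]
```

Using fixed = TRUE treats the separator as a literal string rather than a regular expression, which is enough for this delimiter.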

written 6 days ago by Kim

It's hard to offer help when the problem is not fully described in the original question. The following works for me; could it be adapted to your data?

    > a <- as.factor("NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000445632 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000354755 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// BC126412 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// ENST00000278487 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494")
    > strsplit(as.character(a), " // ")[[1]][2]
    [1] "CCDC81"
modified 6 days ago • written 6 days ago by Russ

Hi Russ

I tried this command and it works. Thank you :)

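    Gene_symbol <- character(nrow(full_table))  # one slot per row, no hard-coded row count
    for (i in seq_along(Gene_symbol)) { Gene_symbol[i] <- strsplit(as.character(full_table$gene_assignment[i]), " // ")[[1]][2] }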
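The command above keeps only the first gene symbol of each row. As a hedged alternative, the sketch below splits the whole column in one pass and collects every distinct symbol per row. The helper name extract_symbols, the example rows (including the made-up GENEA and GENEB records), the ";" separator, and the treatment of "---" as a missing assignment are all assumptions for illustration, not part of the original data.

```r
# Hypothetical example column: row 2 maps to two genes, row 3 has no assignment.
gene_assignment <- factor(c(
  "NM_001156474 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494 /// NM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494",
  "NM_000001 // GENEA // gene A // 1p1 // 1 /// NM_000002 // GENEB // gene B // 1p1 // 2",
  "---"
))

# Hypothetical helper: returns one string per row with the unique symbols
# joined by ";" and NA where no record has a symbol field.
extract_symbols <- function(x) {
  # Records are separated by " /// "; coerce first so factors are accepted.
  records <- strsplit(as.character(x), " /// ", fixed = TRUE)
  vapply(records, function(rec) {
    # Fields within a record are separated by " // "; the symbol is field 2.
    fields <- strsplit(rec, " // ", fixed = TRUE)
    symbols <- vapply(fields, function(f) {
      if (length(f) >= 2) f[2] else NA_character_
    }, character(1))
    symbols <- unique(symbols[!is.na(symbols)])
    if (length(symbols) == 0) NA_character_ else paste(symbols, collapse = ";")
  }, character(1))
}

Gene_symbol <- extract_symbols(gene_assignment)
```

With a real table this would be called as extract_symbols(full_table$gene_assignment). Whether multi-gene rows should be joined, dropped, or reduced to the first symbol depends on the downstream analysis.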

written 4 days ago by Kim

You can avoid confusion due to factors by reading the file with read.table(..., stringsAsFactors = FALSE) or with data.table's fread (stringsAsFactors = FALSE by default). In case you hear otherwise: overriding R's default globally so that it is FALSE in every session will only cause you pain in the future, but setting it when reading a file in is fine.

EDIT: if this doesn't work for you because you're dealing with another data type, try coercing to a character vector first with as.character().
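A small sketch of both suggestions, assuming a tab-separated annotation file. The inline text, the probeset_id column and its value are made up so the example runs without a file on disk; with real data, the text = argument would be replaced by the file path. Note that since R 4.0.0 read.table already defaults to stringsAsFactors = FALSE, so the explicit argument mainly matters on older versions.

```r
# Hypothetical two-column annotation table, written inline for self-containment.
tsv <- paste(
  "probeset_id\tgene_assignment",
  "2345\tNM_021827 // CCDC81 // coiled-coil domain containing 81 // 11q14.2 // 60494",
  sep = "\n"
)

# Read the column as character rather than factor.
full_table <- read.table(
  text = tsv, header = TRUE, sep = "\t",
  quote = "", comment.char = "",
  stringsAsFactors = FALSE
)
column_is_character <- is.character(full_table$gene_assignment)

# If a column has already been read as a factor, coerce it before splitting.
as_factor <- factor(full_table$gene_assignment)
parts <- strsplit(as.character(as_factor), " // ", fixed = TRUE)
symbols <- vapply(parts, `[`, character(1), 2)
```

Setting quote = "" and comment.char = "" keeps apostrophes and "#" characters in gene descriptions from being interpreted by read.table.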

modified 6 days ago • written 6 days ago by Brice Sarver