Challenges of Affymetrix probe IDs for grouping similar genes to calculate their mean
25 days ago
Maryam • 0

Hello everyone, could you please help me?

I have an expression matrix with probes in rows and GEO samples in columns, and an annotation file that maps these probes to gene symbols. I want to average the probes that belong to the same gene, but there are several complications:

1. Some symbols are duplicated, i.e. one symbol may have more than one probe. I want to treat all of them as one group and average them.
2. Some entries are combined symbols separated by slashes, while the individual symbols also appear on their own in other rows. For example, one row is CARD16///CASP1, another row is CARD16 and another is CASP1. I want to treat all of them as one group and calculate their average.
3. Some combined entries differ by just one symbol, e.g. LOC101930400///AKR1C2 and LOC101930400///AKR1C2///AKR1C1. I also want to treat these two as one group and calculate the average.
4. Some combined entries overlap only partly and carry extra symbols, e.g. LOC101930343///CATSPER2P1///CATSPER2 and LOC101930343///STRC///CATSPER2, or (DUX4L24///DBET///LOC100291626///DUX4///LOC100288289///DUX4L2///DUX4L3///DUX4L5///DUX4L6///DUX4L7///LOC652301///DUX4L4///DUX4L8///DUX4L1) and (DUX4L24///DBET///LOC100291626///DUX4///LOC100288289///LOC100287823///LOC100133400///DUX4L2///DUX4L3///DUX4L5///DUX4L6///DUX4L7///LOC652301///DUX4L4///DUX4L8///DUX3///DUX4L1). I also want to treat these as one group and calculate the average.

I have run the following code in R, but it is incomplete: it only handles case 2 (the CARD16///CASP1 example) and none of the other cases:

   library(dplyr)

   data_no_batch <- read.delim("D:/GEO/data no batch.txt")
   probe_symbols <- read.delim("D:/GEO/54675probes,genes-GDCquery-feature data.txt")
   # Merge the expression matrix with the annotation file to align the gene symbols
   merged_data <- merge(probe_symbols, data_no_batch, by.x = "ID", by.y = "row.names", all.y = TRUE)

   # Create a new column 'Combined.Symbol' to group symbols that should be averaged together
   merged_data$Combined.Symbol <- sapply(merged_data$Gene.symbol, function(symbol) {
     # Probes without a symbol get no group label
     if (is.na(symbol) || symbol == "") return(NA_character_)
     # If the symbol contains '///', keep the combined form as the group label
     if (grepl("///", symbol, fixed = TRUE)) {
       return(symbol)
     }
     # If the symbol does not contain '///', check if it is part of any combined symbol
     combined <- grep(paste0("(^|///)", symbol, "(///|$)"), merged_data$Gene.symbol)
     # Find the entries that include the current symbol
     combined_symbol <- merged_data$Gene.symbol[combined]
     # Keep the ones that contain '///', which indicates the combined form
     combined_form <- combined_symbol[grepl("///", combined_symbol, fixed = TRUE)]
     # If there's a combined form, return the first one, otherwise return the symbol itself
     if (length(combined_form) > 0) {
       combined_form[1]
     } else {
       symbol
     }
   })

   # Group by 'Combined.Symbol' and calculate the mean expression for each group
   mean_expression <- merged_data %>%
     group_by(Combined.Symbol) %>%
     summarise(across(starts_with("GSM"), function(x) mean(x, na.rm = TRUE)))

Could you please suggest code that handles all of the issues I mentioned? I appreciate your help. Thanks in advance.
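One way to frame all of the cases above is as a connected-components problem: two annotation entries belong to the same group whenever they share at least one symbol, directly or through a chain of other entries. The following is only a minimal sketch of that idea in base R, using a small union-find over symbols. The probe IDs (`p1` to `p6`), the `GAPDH` row and the expression values are invented for illustration; the column names `ID` and `Gene.symbol` and the `GSM` prefix are assumed to match the files in the question.

```r
# Toy annotation: probe IDs are hypothetical, most symbols come from the question
annot <- data.frame(
  ID = c("p1", "p2", "p3", "p4", "p5", "p6"),
  Gene.symbol = c("CARD16///CASP1", "CARD16", "CASP1",
                  "LOC101930400///AKR1C2",
                  "LOC101930400///AKR1C2///AKR1C1",
                  "GAPDH"),
  stringsAsFactors = FALSE
)

# Toy expression matrix (invented values); rows are probes, columns are samples
expr <- matrix(c(1, 2, 3, 4, 6, 10,
                 3, 4, 5, 6, 8, 20),
               ncol = 2,
               dimnames = list(annot$ID, c("GSM1", "GSM2")))

# Drop probes without any symbol before grouping
keep <- !is.na(annot$Gene.symbol) & annot$Gene.symbol != ""
annot <- annot[keep, ]
expr <- expr[annot$ID, , drop = FALSE]

# Split every annotation entry into its individual symbols
symbol_sets <- strsplit(annot$Gene.symbol, "///", fixed = TRUE)

# Union-find over symbols: parent[[s]] points towards the representative of s
all_symbols <- unique(unlist(symbol_sets))
parent <- setNames(all_symbols, all_symbols)

find_root <- function(s) {
  while (parent[[s]] != s) s <- parent[[s]]
  s
}

# Symbols that occur together on one probe are linked into one set
for (syms in symbol_sets) {
  root <- find_root(syms[1])
  for (s in syms[-1]) {
    r <- find_root(s)
    if (r != root) parent[[r]] <- root
  }
}

# Each probe belongs to the set of its first symbol
probe_root <- vapply(symbol_sets, function(syms) find_root(syms[1]), character(1))

# Label each set with all of its symbols, sorted and joined by '///'
root_of_symbol <- vapply(all_symbols, find_root, character(1))
group_label <- tapply(all_symbols, root_of_symbol,
                      function(x) paste(sort(x), collapse = "///"))
probe_group <- unname(group_label[probe_root])

# Average all probes within each group, per sample
mean_expression <- aggregate(as.data.frame(expr),
                             by = list(Symbol.group = probe_group),
                             FUN = mean, na.rm = TRUE)
print(mean_expression)
```

In this toy example the three CARD16/CASP1 probes end up in one group and the two AKR1C probes in another, while GAPDH stays on its own. Note that the linking is transitive, so a single promiscuous symbol (for example a LOC entry shared by many probes) can pull otherwise unrelated genes into one large group; it is worth checking the largest groups before averaging. On a full array, building a graph and taking its components (for instance with the igraph package) may be faster than this loop.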

R Mean Affymetrix Probes Grouping • 125 views
