Question

Very basic R question: How do I combine this dataframe with a "value"?

3.1 years ago

cdeantoneo31 ▴ 20

I'm a noob, so I apologize for what is probably a very basic question, but I can't quite figure out how to do this correctly. I also don't think I have the vocabulary to accurately explain what I'm confused about, so I apologize in advance.

I have successfully replaced Ensembl IDs with MGI gene symbols numerous times using biomaRt. However, I am struggling with this count file, which has versioned Ensembl IDs.

I can remove the version numbers easily using the following, and then I can use biomaRt to successfully convert the Ensembl IDs into symbols:

df <- read.csv("Tuveson_counts_LRT.csv", sep=",")
head(df)
                      X    baseMean log2FoldChange     lfcSE      stat    pvalue      padj significant
1  ENSMUSG00000000486.7   1.3283025    -0.78624588 1.5531561 0.4214789 0.9806809        NA        <NA>
2  ENSMUSG00000079557.4  31.1085926     0.08715468 0.3561105 2.7204579 0.6056395 0.9999994        <NA>
3 ENSMUSG00000026276.10 118.3799877    -0.02395615 0.1968759 0.5095415 0.9725655 0.9999994        <NA>
4  ENSMUSG00000032656.8   5.8821849    -0.15815182 0.7890379 0.2655061 0.9919307 0.9999994        <NA>
5  ENSMUSG00000022456.9   0.9019521    -1.93237167 2.0918497 1.4395258 0.8372970        NA        <NA>
6 ENSMUSG00000020486.11   5.8367904     0.12988447 0.7918816 0.6535026 0.9569368 0.9999994        <NA>

genes <- df$X
genes <- gsub("\\..*","", genes)
head(genes)
[1] "ENSMUSG00000000486" "ENSMUSG00000079557" "ENSMUSG00000026276" "ENSMUSG00000032656" "ENSMUSG00000022456"
[6] "ENSMUSG00000020486"

library(biomaRt)
mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
G_list <- getBM(filters = "ensembl_gene_id",
                attributes = c("ensembl_gene_id", "mgi_symbol"),
                values = genes,
                mart = mart)
head(G_list)
     ensembl_gene_id mgi_symbol
1 ENSMUSG00000000028      Cdc45
2 ENSMUSG00000000058       Cav2
3 ENSMUSG00000000088      Cox5a
4 ENSMUSG00000000127        Fer
5 ENSMUSG00000000142      Axin2
6 ENSMUSG00000000148      Brat1

Usually I would use the following to merge the output from G_list with the original df, but that won't work now: the version-stripped IDs are in the separate vector genes, while the column df$X still contains the versioned IDs.

counts_symbol <- merge(df, G_list, by.x ="X", by.y="ensembl_gene_id")
head(counts_symbol)
[1] X              baseMean       log2FoldChange lfcSE          stat           pvalue         padj          
[8] significant    mgi_symbol    
<0 rows> (or 0-length row.names)

So how do I change the actual column X in df so that the version numbers are removed, and so the merge works correctly?

TIA!

rstudio ensembl R biomart • 1.2k views

ADD COMMENT • link updated 2.6 years ago by Ram 45k • written 3.1 years ago by cdeantoneo31 ▴ 20

score 0 · Answer 1 · 2022-09-19

3.1 years ago

rpolicastro 13k

df$X <- gsub("\\.\\d+$", "", df$X)
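As a rough illustration of what this pattern does, here is a minimal sketch on a toy vector. The vector name ids is hypothetical, and the third entry (an ID with no version suffix) is made up to show that unversioned IDs pass through unchanged. The pattern matches a literal dot followed by one or more digits at the end of the string.

```r
# Toy vector of IDs; the last entry has no version suffix (made up for illustration).
ids <- c("ENSMUSG00000000486.7", "ENSMUSG00000026276.10", "ENSMUSG00000032656")

# "\\." is a literal dot, "\\d+" is one or more digits, "$" anchors to the end.
stripped <- gsub("\\.\\d+$", "", ids)
stripped

# For IDs shaped like these, the looser pattern from the question gives the same result.
identical(stripped, gsub("\\..*", "", ids))
```

The key part is assigning the result back to df$X, so the column itself is updated.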

ADD COMMENT • link 3.1 years ago by rpolicastro 13k

yup, that'll do it! tysm

but can you explain why this didn't work

genes <- df$X
genes <- gsub("\\..*","", genes)

but this did?

ADD REPLY • link 3.1 years ago by cdeantoneo31 ▴ 20

genes <- df$X copies the data from the X column and assigns that copy to the genes variable. Because gsub() then operated on the copy, and its result was assigned back to genes rather than to df$X, the original data.frame remained unchanged.
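A minimal sketch of this behavior, using a two-row toy data.frame in place of the real results table (the IDs come from the question; the baseMean values are rounded placeholders):

```r
# Toy stand-in for the real table; baseMean values are placeholders.
df <- data.frame(X = c("ENSMUSG00000000486.7", "ENSMUSG00000079557.4"),
                 baseMean = c(1.3, 31.1),
                 stringsAsFactors = FALSE)

genes <- df$X                      # genes is now a separate vector
genes <- gsub("\\..*", "", genes)  # only genes is reassigned here

genes  # version suffixes removed
df$X   # still has the version suffixes
```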

You could have gone back and modified the original data.frame by adding a third line to the code above, df$X <- genes, which overwrites the old X column with the modified data stored in the genes variable.
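Putting it together, here is a small sketch of the three-line version followed by the merge. To keep it runnable offline, G_list is a hand-built stand-in for the getBM() output (the two ID/symbol pairs are taken from the head(G_list) output in the question), and the version suffixes and baseMean values in the toy df are made up.

```r
# Toy table; the ".15" and ".6" suffixes and baseMean values are made up.
df <- data.frame(X = c("ENSMUSG00000000028.15", "ENSMUSG00000000058.6"),
                 baseMean = c(10, 20),
                 stringsAsFactors = FALSE)

# Hand-built stand-in for the getBM() result (no biomaRt query is made here).
G_list <- data.frame(ensembl_gene_id = c("ENSMUSG00000000028", "ENSMUSG00000000058"),
                     mgi_symbol = c("Cdc45", "Cav2"),
                     stringsAsFactors = FALSE)

genes <- df$X
genes <- gsub("\\..*", "", genes)
df$X <- genes  # write the cleaned IDs back into the data.frame

# The join keys now match, so the merge returns rows.
counts_symbol <- merge(df, G_list, by.x = "X", by.y = "ensembl_gene_id")
counts_symbol
```

If you would rather keep the versioned IDs as well, one option is to store the cleaned IDs in a new column (for example df$gene_id, a hypothetical name) and merge on that instead.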

ADD REPLY • link 3.1 years ago by rpolicastro 13k