I'm a noob, so I apologize for what is probably a very basic question, but I can't quite figure out how to do what I'm trying to do correctly. I also don't think I have the vocabulary to accurately explain what I'm confused about, so I apologize in advance.
I have successfully replaced Ensembl IDs with MGI gene symbols numerous times with biomaRt. However, I am struggling with this count file, which has versioned Ensembl IDs.
I can remove the version numbers easily using the following, and then I can use biomaRt to successfully convert the Ensembl IDs into symbols:
df <- read.csv("Tuveson_counts_LRT.csv", sep=",")
head(df)
                      X    baseMean log2FoldChange     lfcSE      stat    pvalue      padj significant
1  ENSMUSG00000000486.7   1.3283025    -0.78624588 1.5531561 0.4214789 0.9806809        NA        <NA>
2  ENSMUSG00000079557.4  31.1085926     0.08715468 0.3561105 2.7204579 0.6056395 0.9999994        <NA>
3 ENSMUSG00000026276.10 118.3799877    -0.02395615 0.1968759 0.5095415 0.9725655 0.9999994        <NA>
4  ENSMUSG00000032656.8   5.8821849    -0.15815182 0.7890379 0.2655061 0.9919307 0.9999994        <NA>
5  ENSMUSG00000022456.9   0.9019521    -1.93237167 2.0918497 1.4395258 0.8372970        NA        <NA>
6 ENSMUSG00000020486.11   5.8367904     0.12988447 0.7918816 0.6535026 0.9569368 0.9999994        <NA>
genes <- df$X
genes <- gsub("\\..*","", genes)
head(genes)
[1] "ENSMUSG00000000486" "ENSMUSG00000079557" "ENSMUSG00000026276" "ENSMUSG00000032656" "ENSMUSG00000022456"
[6] "ENSMUSG00000020486"
library(biomaRt)
mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
G_list <- getBM(filters = "ensembl_gene_id",
                attributes = c("ensembl_gene_id", "mgi_symbol"),
                values = genes,
                mart = mart)
head(G_list)
     ensembl_gene_id mgi_symbol
1 ENSMUSG00000000028      Cdc45
2 ENSMUSG00000000058       Cav2
3 ENSMUSG00000000088      Cox5a
4 ENSMUSG00000000127        Fer
5 ENSMUSG00000000142      Axin2
6 ENSMUSG00000000148      Brat1
Usually I would use the following to merge the output from G_list with the original df, but that won't work now, since the version-stripped IDs are in the separate genes vector while df$X still holds the versioned IDs.
counts_symbol <- merge(df, G_list, by.x ="X", by.y="ensembl_gene_id")
head(counts_symbol)
[1] X              baseMean       log2FoldChange lfcSE          stat           pvalue         padj          
[8] significant    mgi_symbol    
<0 rows> (or 0-length row.names)
So how do I change the actual column X in df so that the version numbers are removed, and so the merge works correctly?
TIA!
yup, that'll do it! tysm
but can you explain the difference between why this didn't work
but this did?
genes <- df$X copies the data from the X column and assigns this copy to the genes variable. Since you were operating on a copy of part of the original data.frame, and not the original data.frame itself, the original data.frame remained unchanged. You could have gone back and modified the original data.frame by adding this third line of code to what you have above:
df$X <- genes
which overwrites the old X column with the modified X column data saved to the genes variable.
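To make the copy-versus-original point concrete, here is a minimal sketch that needs no biomaRt connection. It assumes a two-row toy data.frame in place of the real counts file and a hand-built G_list in place of the getBM() output. The two gene IDs and symbols are taken from the output above, but the version suffixes and baseMean values are made up for illustration.

```r
# Toy stand-in for the counts table (version suffixes and values are made up)
df <- data.frame(
  X = c("ENSMUSG00000000028.15", "ENSMUSG00000000058.6"),
  baseMean = c(10.5, 20.1),
  stringsAsFactors = FALSE
)

# Toy stand-in for the getBM() result
G_list <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000000028", "ENSMUSG00000000058"),
  mgi_symbol = c("Cdc45", "Cav2"),
  stringsAsFactors = FALSE
)

# Stripping versions on a copy: df$X itself still holds the versioned IDs,
# so the merge finds no matching keys
genes <- df$X
genes <- gsub("\\..*", "", genes)
rows_before <- nrow(merge(df, G_list, by.x = "X", by.y = "ensembl_gene_id"))

# Writing the cleaned vector back into the data.frame makes the keys match
df$X <- genes
counts_symbol <- merge(df, G_list, by.x = "X", by.y = "ensembl_gene_id")
rows_after <- nrow(counts_symbol)
```

Here rows_before is 0 and rows_after is 2. The same thing can be done in one step with df$X <- gsub("\\..*", "", df$X), which skips the intermediate genes variable entirely.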