Unintended behaviour when trying to remove gene version from ENSG
8 months ago

Hi all,

When I remove the gene version numbers from the ENSG IDs, the IDs with the "_PAR_" suffix, e.g.

"ENSG00000002586.20_PAR_Y" "ENSG00000124333.16_PAR_Y" "ENSG00000124334.17_PAR_Y" "ENSG00000167393.18_PAR_Y" "ENSG00000169084.15_PAR_Y"

still keep their version numbers. I have tried the following (obtained from stack) with no success:

str_replace(rownames(data), pattern = ".[0-9]+$", replacement = "")

gsub("\\..*","", rownames(data))

tools::file_path_sans_ext(rownames(data))

But these troublesome gene versions remain. Any thoughts?

R grep regex ENSG • 673 views
8 months ago
LChart 3.9k

This pattern: str_replace(rownames(data), pattern = ".[0-9]+$", replacement = "") removes a run of digits at the end of a string, plus the one character immediately before that run (the unescaped "." matches any character, not just a literal period). It fails here because none of these IDs end in a digit: they end in "_PAR_Y".
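As a rough illustration of the point above, here is a small sketch using base R's sub() as a stand-in for str_replace (both replace only the first match). The two IDs are made-up test inputs, and the escaped pattern in the second call is only one possible alternative, not something the original poster tried.

```r
ids <- c("ENSG00000002586.20", "ENSG00000002586.20_PAR_Y")

# Unescaped "." matches any character, and "$" requires the digits
# to sit at the very end of the string, so the _PAR_Y ID never matches.
naive <- sub(".[0-9]+$", "", ids)
print(naive)

# Escaping the period and dropping the "$" anchor strips ".<digits>"
# wherever it occurs, which keeps the _PAR_Y suffix in place.
escaped <- sub("\\.[0-9]+", "", ids)
print(escaped)
```

Whether the "_PAR_Y" suffix should be kept or dropped depends on the downstream analysis; the second call keeps it, while the gsub pattern discussed next drops it.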

This pattern: gsub("\\..*","", rownames(data)) removes the first period and everything following it, and in fact it should work fine:

> gsub('\\..*', '', 'ENSG00000213123.16_PAR_Y')
[1] "ENSG00000213123"

Could it be that you forgot to reassign the rownames? That is:

rownames(data) <- gsub("\\..*","", rownames(data))
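To make the reassignment point concrete, here is a minimal sketch on a toy matrix. The object `data` and its values are hypothetical stand-ins for the poster's real object, which is not shown in the thread.

```r
# Toy matrix standing in for the poster's `data` (hypothetical values)
data <- matrix(1:4, nrow = 2,
               dimnames = list(c("ENSG00000002586.20",
                                 "ENSG00000124333.16_PAR_Y"),
                               c("s1", "s2")))

# gsub() only returns the cleaned names; `data` itself is untouched
cleaned <- gsub("\\..*", "", rownames(data))
before <- rownames(data)   # still carries the version numbers

# Assigning the result back is what actually changes the row names
rownames(data) <- cleaned
print(rownames(data))
```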


Thanks for the explanations LChart. Regrettably, reassigning is not missing from my script.


It may behoove you to paste the relevant 5-10 lines of code from your script if you want more specific help. All I can say is that the bad behavior you observe is due to something other than the gsub command that you posted.
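In the absence of the script, a few checks along these lines might help narrow things down. This is only a sketch on made-up IDs. One possibility, which is an assumption and not confirmed anywhere in this thread, is that stripping everything after the period makes each "_PAR_Y" ID identical to its X-chromosome counterpart, and a data.frame refuses duplicated row names (a matrix accepts them).

```r
# Made-up IDs: one gene present both with and without the _PAR_Y suffix
ids <- c("ENSG00000002586.20", "ENSG00000002586.20_PAR_Y", "ENSG00000124333.16")
stripped <- gsub("\\..*", "", ids)

# Check 1: does anything still carry a version or PAR suffix?
leftover <- grep("\\.|_PAR_", stripped, value = TRUE)

# Check 2: which stripped IDs now collide?
dups <- unique(stripped[duplicated(stripped)])

# Check 3: a data.frame rejects duplicated row names, so the
# assignment fails and the old names stay in place.
df <- data.frame(x = 1:3, row.names = ids)
res <- tryCatch(
  suppressWarnings({ rownames(df) <- stripped; "ok" }),
  error = function(e) "error"
)
print(list(leftover = leftover, dups = dups, assignment = res))
```

If a check like the third one is what is happening, the error message would normally be visible when the script runs, which is another reason to post the surrounding lines and any output.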
