Question

Unintended behaviour when trying to remove gene version from ENSG


8 months ago

BioInfoBeginner ▴ 50

Hi all,

When I remove the gene version numbers from the ENSG IDs, the versions on those genes with the "_PAR_" suffix, e.g.

"ENSG00000002586.20_PAR_Y" "ENSG00000124333.16_PAR_Y" "ENSG00000124334.17_PAR_Y" "ENSG00000167393.18_PAR_Y" "ENSG00000169084.15_PAR_Y"

aren't being removed. I have tried the following (obtained from Stack Overflow), with no success:

str_replace(rownames(data), pattern = ".[0-9]+$", replacement = "")

gsub("\\..*","", rownames(data))

tools::file_path_sans_ext(rownames(data))

But these troublesome gene versions remain. Any thoughts?


ADD COMMENT • link updated 8 months ago by LChart 3.9k • written 8 months ago by BioInfoBeginner ▴ 50

score 2 · Answer 1 · 2023-08-06


8 months ago

LChart 3.9k

This pattern: str_replace(rownames(data), pattern = ".[0-9]+$", replacement = "") removes a run of digits at the very end of a string, together with the single character immediately before that run (the unescaped "." matches any character, not just a period). It fails here because none of these IDs ends in a digit.
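To see the difference concretely, here is a minimal sketch using base R's sub() on one of the IDs from the question, with and without the "_PAR_Y" suffix (the unsuffixed form is added only for comparison). The second call is an alternative that escapes the period and drops the "$" anchor, which strips only the version and keeps the suffix; whether keeping the suffix is what you want is an assumption.

```r
ids <- c("ENSG00000002586.20", "ENSG00000002586.20_PAR_Y")

# Unescaped "." matches any single character, and "$" requires the digits
# to sit at the very end, so the _PAR_Y ID is left untouched.
trailing <- sub(".[0-9]+$", "", ids)
print(trailing)
# [1] "ENSG00000002586"          "ENSG00000002586.20_PAR_Y"

# Escaped period, no end anchor: removes ".<digits>" wherever it first occurs
# and keeps whatever follows it.
escaped <- sub("\\.[0-9]+", "", ids)
print(escaped)
# [1] "ENSG00000002586"       "ENSG00000002586_PAR_Y"
```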

This pattern: gsub("\\..*","", rownames(data)) removes the first period and everything following it, and in fact it should work fine:

> gsub('\\..*', '', 'ENSG00000213123.16_PAR_Y')
[1] "ENSG00000213123"

Could it be that you forgot to reassign the rownames, i.e.:

rownames(data) <- gsub("\\..*","", rownames(data))

?
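As a quick way to check the reassignment end to end, here is a small sketch on a toy matrix. The object name counts, the sample names and the values are made up for illustration; only the row names come from the question.

```r
# Toy count matrix; values and column names are hypothetical.
counts <- matrix(
  1:6, nrow = 3,
  dimnames = list(
    c("ENSG00000124333.16",
      "ENSG00000124334.17_PAR_Y",
      "ENSG00000167393.18_PAR_Y"),
    c("s1", "s2")
  )
)

# gsub() returns a new character vector; counts is unchanged
# until the result is assigned back to its row names.
stripped <- gsub("\\..*", "", rownames(counts))
rownames(counts) <- stripped

# Confirm that no row name still contains a literal period.
has_period <- any(grepl(".", rownames(counts), fixed = TRUE))
print(rownames(counts))
print(has_period)
# [1] FALSE
```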

ADD COMMENT • link 8 months ago by LChart 3.9k


Thanks for the explanations LChart. Regrettably, reassigning is not missing from my script.

ADD REPLY • link 8 months ago by BioInfoBeginner ▴ 50


It may behoove you to paste the relevant 5-10 lines of code from your script if you want more specific help. All I can say is that the bad behavior you observe is due to something other than the gsub command that you posted.

ADD REPLY • link 8 months ago by LChart 3.9k
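One possibility worth ruling out, offered as an assumption since the script itself is not shown in the thread: if data is a data frame and contains both the X-linked and the "_PAR_Y" copy of a gene, stripping everything after the period makes the two IDs identical, and a data frame refuses duplicated row names. In that case the assignment raises an error and the original names stay in place. The sketch below uses a hypothetical two-row stand-in for data.

```r
# Hypothetical stand-in for the poster's object; the real one is not shown.
data <- data.frame(
  s1 = c(5, 0),
  row.names = c("ENSG00000002586.20", "ENSG00000002586.20_PAR_Y")
)

stripped <- gsub("\\..*", "", rownames(data))

# IDs that collide once the version and suffix are removed.
collisions <- unique(stripped[duplicated(stripped)])
print(collisions)

# Data frames reject duplicated row names, so this assignment errors
# and data keeps its original row names.
result <- tryCatch({
  rownames(data) <- stripped
  "assigned"
}, error = function(e) conditionMessage(e))

print(result)
print(rownames(data))
```

If collisions is non-empty, the options are to keep the "_PAR_Y" suffix in the ID, drop the PAR_Y rows before renaming, or work with a matrix, which does allow duplicated row names.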