Extracting Ensembl gene IDs from messy data frames
3.3 years ago

Hi everyone!

I have an issue extracting Ensembl gene IDs from a messy data frame. First, I loaded the CSV file in R (the file was not actually separated by commas), and it looks like this:

> my_csv_file
               ensembl_gene_id.entrezgene_id.hgnc_symbol.gene_biotype
1                           1 ENSG00000174365 128439 SNHG11 lncRNA
2 2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene
3                                     3 ENSG00000183562 NA  lncRNA
4  4 ENSG00000205266 NA KRT17P5 transcribed_unprocessed_pseudogene
5                            5 ENSG00000206585 26864 RNVU1-7 snRNA
6                              6 ENSG00000206588 NA RNU1-28P snRNA

Then, I tried to extract the Ensembl gene ID from each row using the sub() function. For example, for row number 1:

> sub("^\\d", "", my_csv_file[1, ])
[1] " ENSG00000174365 128439 SNHG11 lncRNA"

However, I'm stuck because I don't know how to remove the alphanumeric characters after the Ensembl ID using regular expressions and then put it inside a for loop.

I appreciate your help.

Best regards.

R RNA-Seq ChIP-Seq • 720 views
3.3 years ago
ATpoint 81k

So the question with this example would be how to keep only ENSG00000174365 when there are whitespaces all over the place?

foo <- "  ENSG00000174365 128439 SNHG11 lncRNA"
gsub("\\ .*", "", trimws(x = foo, which = "left"))

Please give a reproducible example using dput().
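
Since sub(), gsub() and regexpr() are all vectorized in R, the same idea can be applied to a whole column at once, so no for loop is needed. The following is only a minimal sketch: the vector `messy` is a hypothetical stand-in for the single column of the data frame in the question, and the pattern assumes human Ensembl gene IDs (the prefix ENSG followed by digits).

```r
# Hypothetical stand-in for the single messy column shown in the question;
# with the real data this might be something like my_csv_file[[1]] (assumption).
messy <- c(
  "1 ENSG00000174365 128439 SNHG11 lncRNA",
  "2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene",
  "3 ENSG00000183562 NA  lncRNA"
)

# regexpr() finds the first match in every element (-1 where there is none).
# "ENSG[0-9]+" assumes human Ensembl gene IDs.
hits <- regexpr("ENSG[0-9]+", messy)

# regmatches() returns only the matched substrings, so fill a vector of NAs
# to keep the result aligned with the input rows.
ensembl_ids <- rep(NA_character_, length(messy))
ensembl_ids[hits > 0] <- regmatches(messy, hits)

print(ensembl_ids)
```

Matching the ID directly, rather than deleting what surrounds it, should not depend on how much whitespace or which leading row index each row happens to have.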

Exactly, I want to keep the Ensembl IDs from the original data frame.

3.3 years ago
Ram 43k

Your "csv" file is space-separated. It might be easiest to just re-import it with sep=" ".
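
A minimal sketch of such a re-import is shown here. The file written to `path` is a hypothetical reconstruction of the original (a header with four names, and data rows that start with a row index), so the real file may differ in details such as quoting.

```r
# Recreate a small space-separated file resembling the one in the question
# (assumed layout: four header names, data rows with a leading row index).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "ensembl_gene_id entrezgene_id hgnc_symbol gene_biotype",
  "1 ENSG00000174365 128439 SNHG11 lncRNA",
  "2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene",
  "3 ENSG00000183562 NA  lncRNA"
), path)

# sep = " " splits on every single space, so the empty hgnc_symbol in the
# third row is kept as an empty field instead of shifting the columns.
# The header has one field fewer than the data rows, so read.table()
# uses the first column as row names.
genes <- read.table(path, header = TRUE, sep = " ", stringsAsFactors = FALSE)

print(genes$ensembl_gene_id)
```

Once the columns are parsed correctly, the IDs are simply the `ensembl_gene_id` column and no regular expression is required.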

Problem solved, thanks for both approaches!

I've moved both comments to answers. Please accept both if they worked for you.
