Question: (Closed) Extracting rows based on specific string in a column
16 months ago, jivarajivaraj wrote:

Hi,

I have this file

> head(dat)
          GENE              ID        Gene products
1 DDB_G0267364         Skipper     GAG-PRO         
2 DDB_G0267372          TRE5-A        ORF1         
3 DDB_G0267380 acetylornithine deacetylase         
4 DDB_G0267304           DIRS1        ORF3 fragment
5 DDB_G0267338           DIRS1        putative myb transcription factor         
6 DDB_G0267356         Skipper GAG-PRO-POL         
>

In R, I want to extract the genes in the GENE column whose products column contains "transcription factor". How can I do that, please?

genome R gene • 446 views
modified 16 months ago • written 16 months ago by jivarajivaraj

Please share reproducible data using dput(head(dat)), and avoid spaces in column names. Most likely you need:

dat[ grepl("transcription factor", dat$GeneProducts, fixed = TRUE), "GENE" ]
written 16 months ago by zx8754
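
Note that grepl() takes the pattern first and the vector to search second. Below is a minimal sketch on toy data; the column name GeneProducts is an assumption standing in for the real "Gene products" column, and the three rows are only illustrative.

```r
# Toy data frame standing in for `dat`.
# ASSUMPTION: the products column is named GeneProducts (no space).
dat <- data.frame(
  GENE = c("DDB_G0267364", "DDB_G0267338", "DDB_G0267380"),
  GeneProducts = c("GAG-PRO",
                   "putative myb transcription factor",
                   "acetylornithine deacetylase"),
  stringsAsFactors = FALSE
)

# grepl(pattern, x): pattern first, vector second.
# fixed = TRUE treats the pattern as a literal string, not a regex.
hits <- grepl("transcription factor", dat$GeneProducts, fixed = TRUE)

# Keep only the GENE values of the matching rows.
tf_genes <- dat[hits, "GENE"]
print(tf_genes)
```

If the real column name contains a space, dat[["Gene products"]] can be used in place of dat$GeneProducts.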

R has no way of knowing which name represents a transcription factor. Is the last column expected to contain transcription factor phrase/string?

modified 16 months ago • written 16 months ago by genomax

The products column is missing from the original post. @jivarajivaraj

written 16 months ago by cpad0112

In the products column, the entries for some genes contain the phrase/string "transcription factor".

written 16 months ago by jivarajivaraj

Hello jivarajivaraj!

This is a pure R question. Please search Stack Overflow

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

PS: OP, you've done a good job in framing the title of the post, and I see that you're good at extracting key phrases such as "extracting rows" "specific string in column". This is good work, and if you use these to google, you'll get to your solution faster and also enhance and refine your key-phrase building skill, which will help you search a lot more efficiently.

I know it sounds upside down (you learning Google instead of Google learning Natural Language Processing) but the approach works well for now :-)

modified 16 months ago • written 16 months ago by RamRS
16 months ago, Joe (United Kingdom) wrote:

This is a basic R programming question and is off-topic here.

Besides, this problem is well answered elsewhere:

https://stackoverflow.com/questions/13043928/selecting-rows-where-a-column-has-a-string-like-hsa-partial-string-match

https://favorableoutcomes.wordpress.com/2012/04/17/selecting-rows-by-partial-name-match-in-r/

ad infinitum

written 16 months ago by Joe

Sorry, but I have already tried some of those links:

> dat[dat$products %like% "transcription factor",]
[1] GENE     ID       Gene     products
<0 rows> (or 0-length row.names)

even though I know there are 170 "transcription factor" strings in the products column.

Also, however I try to read my file, column names that contain a space get split. For example, the "Gene products" column is split into separate Gene and products columns:

dat <- read.table("gene_information-5.txt",fill=TRUE,header = TRUE)

modified 16 months ago • written 16 months ago by jivarajivaraj
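
The splitting described above is what read.table() does by default: it separates fields on any whitespace, so "Gene products" becomes two columns and the rows no longer line up with the header. A hedged sketch of one way around this follows, assuming the real file is tab-delimited (not confirmed in the thread). The inline `txt` string is a small stand-in for gene_information-5.txt, and its field layout is an assumption.

```r
# ASSUMPTION: fields are separated by tabs; `txt` stands in for the real file.
txt <- paste(
  "GENE ID\tGene Name\tSynonyms\tGene products",
  "DDB_G0267380\targE\tP52D\tacetylornithine deacetylase",
  "DDB_G0267990\typel\t\tyippee-like protein",
  "DDB_G0269812\tDDB_G0269812\t\t",
  sep = "\n"
)

# sep = "\t" splits on tabs only, so spaces inside a field are preserved.
# check.names = FALSE keeps header names such as "Gene products" unchanged.
# quote = "" and comment.char = "" stop quotes or '#' in the text from
# being interpreted specially.
dat <- read.table(text = txt, header = TRUE, sep = "\t",
                  quote = "", comment.char = "", fill = TRUE,
                  check.names = FALSE, stringsAsFactors = FALSE)

print(names(dat))
print(dat[["Gene products"]])
```

With a file on disk, the same arguments would be passed along with the file name in place of text = txt.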

See examples here: https://www.rdocumentation.org/packages/DescTools/versions/0.99.19/topics/%25like%25

What country are you based out of? Do you have access to StackOverflow and Google? The answers to your questions are one google search away, so I'm curious why you're having to ask us in the first place.

written 16 months ago by RamRS

I also wonder why you always rush to close posts more than the other moderators do. Maybe somebody will want to help later. I have already searched StackOverflow and Google, and if my answer had been there, I would certainly not have created a question here.

modified 16 months ago • written 16 months ago by jivarajivaraj

I have already searched StackOverflow and Google ...

Sorry, jrj.healey found the SO post in a couple of minutes of searching, so I do not believe due diligence was done on your part.

As for the moderators' internal distribution of duties, you'll be privy to that once you're one of us someday (which I genuinely look forward to)

written 16 months ago by RamRS

But it all comes from the heart. For example, I have never seen genomax close a post abruptly; he tries to help, and if he cannot, he leaves me to the other members.

It does not matter which country I am in, but believe me, I have been searching Google for hours.

GENE ID Gene Name   Synonyms    Gene products
DDB_G0267364    DDB_G0267364_RTE        Skipper GAG-PRO
DDB_G0267372    DDB_G0267372_RTE        TRE5-A ORF1
DDB_G0267380    argE    P52D    acetylornithine deacetylase
DDB_G0267304    DDB_G0267304_RTE        DIRS1 ORF3 fragment
DDB_G0267338    DDB_G0267338_RTE        DIRS1 ORF3
DDB_G0267356    DDB_G0267356_RTE        Skipper GAG-PRO-POL
DDB_G0269812    DDB_G0269812        
DDB_G0269818    DDB_G0269818    sigN162 hssA/2C/7E family protein
DDB_G0267990    ypel        yippee-like protein
DDB_G0267992    DDB_G0267992        dual-specificity protein phosphatase
DDB_G0267994    DDB_G0267994        DNAJ heat shock N-terminal domain-containing protein
DDB_G0269192    tifA    eIF4AIII    DEAD/DEAH box helicase domain-containing protein, eukaryotic translation initiation factor 4A
DDB_G0268020    bkdB    BCKDHB, 2-oxoisovalerate dehydrogenase subunit beta, mitochondrial  branched-chain alpha-keto acid dehydrogenase E1 beta chain
DDB_G0295197    tRNA-Tyr-GUA-8  tRNA    tyrosine transfer RNA
DDB_G0269222    gefB    RasGEFB, RasGEF Ras guanine nucleotide exchange factor
DDB_G0269826    DDB_G0269826        
DDB_G0269852    psaB_ps     pseudogene
DDB_G0269856    ddcA    DAPDC   group IV decarboxylase, Orn/DAP/Arg decarboxylase 2 domain-containing protein, putative diaminopimelate decarboxylase
DDB_G0269864    DDB_G0269864        Small conductance calcium-activated potassium channel protein 3
DDB_G0270464    DDB_G0270464_RTE        TRE5-A ORF2
DDB_G0267432    abcG15      ABC transporter G family protein
DDB_G0267280    DDB_G0267280_TE     DDT-A
DDB_G0267964    DDB_G0267964        unknown
DDB_G0267966    pyd1        dihydropyrimidine dehydrogenase
DDB_G0268588    DDB_G0268588        
DDB_G0269178    racG        Rho GTPase
DDB_G0269800    DDB_G0269800        putative chromatin assembly factor 1 subunit B
DDB_G0269082    DDB_G0269082        unknown
DDB_G0269636    DDB_G0269636        EF-hand domain-containing protein, EPS15 homology (EH) domain-containing protein
DDB_G0269642    mak16l      MAK16-like protein
DDB_G0269658    DDB_G0269658        Methyltransferase-like protein 11A
DDB_G0269664    DDB_G0269664        
DDB_G0269984    DDB_G0269984        
DDB_G0269988    DDB_G0269988        
DDB_G0269546    DDB_G0269546        C2 calcium-dependent membrane targeting domain-containing protein
DDB_G0268318    DDB_G0268318        
DDB_G0267858    DDB_G0267858        
DDB_G0269370    DDB_G0269370        NUDIX hydrolase family protein, cleavage and polyadenylation specificity factor 5-like protein
DDB_G0268344    DDB_G0268344        
DDB_G0268934    DDB_G0268934        LYR motif-containing protein
DDB_G0268942    DDB_G0268942_RTE        TRE3-B ORF1
DDB_G0268948    DDB_G0268948        putative SAM dependent methyltransferase
DDB_G0269090    DDB_G0269090        
DDB_G0269840    DDB_G0269840        transmembrane protein
DDB_G0267542    DDB_G0267542        
DDB_G0267552    DDB_G0267552        putative NADH dehydrogenase (ubiquinone), putative NADH-ubiquinone oxidoreductase 13 kDa subunit
DDB_G0267560    uprt        uracil phosphoribosyltransferase
DDB_G0268208    DDB_G0268208        
DDB_G0269412    paf1        RNA polymerase II-associated factor 1
DDB_G0268220    DDB_G0268220        leucine-rich repeat-containing protein
DDB_G0269538    DDB_G0269538        
DDB_G0267698    DDB_G0267698        rhodanese-like domain-containing protein
DDB_G0267700    DDB_G0267700        
DDB_G0269100    abpC    gelA, ddFLN, FLN, ABP120, ABP-120, gelation factor, filamin actin binding protein C
DDB_G0270430    wipA    WIPa    WH2 domain-containing protein
DDB_G0270988    samkB_ps1       pseudogene
> dat[dat$products %like% "%transcription factor%",]
[1] GENE     ID       Gene     products
<0 rows> (or 0-length row.names)
> dat[dat$products %like% "%transcription%",]
[1] GENE     ID       Gene     products
<0 rows> (or 0-length row.names)
> dat[dat$products %like% "%factor%",]
[1] GENE     ID       Gene     products
<0 rows> (or 0-length row.names)
>
modified 16 months ago • written 16 months ago by jivarajivaraj

The country does matter - a lot of websites are blocked in Iran for example, and we'd understand if an Iranian user had difficulties finding resources easily available to others. I'm not racially profiling you, I'm trying to relate to your circumstances.

You seem to know the key phrases, so I wonder why Google won't give you the same links it gives us.

written 16 months ago by RamRS

The data you posted above has only 3 columns, this one has 4. Can you update your post above with the actual head(dat) please?

written 16 months ago by RamRS

Why are you putting % around the string?

written 16 months ago by Joe

The examples here do: https://www.rdocumentation.org/packages/DescTools/versions/0.99.19/topics/%25like%25

written 16 months ago by RamRS

Around %like% yes, but not around the string to search for. See my comments below. OP should be trying

> dat[dat$products %like% "transcription",]

not

> dat[dat$products %like% "%transcription%",]
written 16 months ago by Joe
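
A short sketch of why the percent signs matter. data.table's %like% is documented as a wrapper around grepl(), so the pattern is a regular expression and "%" is just a literal character. The DescTools operator of the same name is a different function that, as far as I know, does use "%" as a wildcard. Base grepl() is used below as a stand-in so that no extra package is needed; the `products` vector is toy data.

```r
# Toy vector standing in for dat$products (including a missing value).
products <- c("putative myb transcription factor", "GAG-PRO", NA)

# A plain substring works: regex matching finds it anywhere in the string.
plain <- grepl("transcription", products)

# With SQL-style wildcards the literal characters "%" must be present
# in the text, so nothing matches.
wild <- grepl("%transcription%", products)

print(plain)
print(wild)
```

grepl() returns FALSE for NA entries, so missing products are silently dropped rather than causing an error.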

OP tried that too. Maybe they need library(data.table)?

written 16 months ago by RamRS

Then there is something wrong with your dataframe. The columns must not line up like you expect. Your 3rd row looks wrong to me already.

Are you handling cases where you have no data correctly?

The example from that SO post works perfectly with some toy data:

> install.packages("data.table")
> library(data.table)
> mtcars[mtcars$mpg %like% ".2", ]
                  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710       22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 230         22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280         19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 450SLC      15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Fiat 128         32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
AMC Javelin      15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Pontiac Firebird 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

> mtcars[mtcars$cyl %like% "8", ]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
modified 16 months ago • written 16 months ago by Joe
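
One way to check whether the columns line up is to count the fields on each line before building the data frame. The sketch below uses base R's count.fields() on a few toy lines (stand-ins for the real file, which is an assumption here) split on whitespace, which is how read.table() splits by default.

```r
# Toy lines standing in for the real file: a 4-word header and two rows.
txt <- c("GENE ID Gene products",
         "DDB_G0267364 Skipper GAG-PRO",
         "DDB_G0267338 DIRS1 putative myb transcription factor")

# count.fields() reports the number of fields per line for a given separator
# (default: any whitespace).
con <- textConnection(txt)
n_fields <- count.fields(con)
close(con)

print(n_fields)

# More than one distinct count means rows will be padded or shifted.
consistent <- length(unique(n_fields)) == 1
print(consistent)
```

If the counts differ from line to line, as they do here, values end up in the wrong columns and a search on dat$products can legitimately return zero rows.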

Try on a character column, or a factor column. You'll probably need wildcards around the query string.

written 16 months ago by RamRS
The thread is closed. No new answers may be added.