Question: Split columns and keep only the first coordinate from Start and End
3 months ago, krushnach80 wrote:

I have a data file that I am trying to format for a Circos plot. So far I have built the data files, and my data frame has the following structure:

Symbol  Chr Start   End
RBM11   hs21    14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880 14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372

What I need is the Symbol and Chr columns, plus only the first coordinate from each of Start and End. I have tried various approaches but have not been able to do it.

Something like this:

Symbol  Chr Start   End
RBM11   hs21 14216130 14216282

I tried this library:

library(splitstackshape)

but I couldn't get it to work.

Is there a simple way to do this?

R • 207 views
3 months ago, cpad0112 wrote:

With sed, assuming the columns are tab-separated:

$ sed 's/\(^.*\t[0-9]\+\);.*\(\t[0-9]\+\);.*/\1\2/g' test.txt

Symbol  Chr Start   End
RBM11   hs21    14216130    14216282

With sed -r (--regexp-extended), the expression becomes a lot simpler:

sed -r 's/(^.*\t[0-9]+);.*(\t[0-9]+);.*/\1\2/g' test.txt

OP is looking for a solution in R, though, so maybe gsub() would work better?
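
As a rough illustration of that suggestion, here is a minimal R sketch that applies the same kind of pattern to whole lines with sub(). The `lines` vector is a hypothetical in-memory stand-in for the contents of test.txt (shortened to the first three coordinates per column), and tab-separated columns are assumed, as in the sed answer.

```r
# Hypothetical stand-in for the lines of test.txt (tab-separated, shortened)
lines <- c(
  "Symbol\tChr\tStart\tEnd",
  "RBM11\ths21\t14216130;14216145;14216153\t14216282;14216282;14216282"
)

# Group 1: everything up to and including the first Start coordinate.
# Group 2: the tab plus the first End coordinate.
# Everything else on the line is dropped; lines without a match (the header)
# are returned unchanged.
trimmed <- sub("^(.*\t[0-9]+);.*(\t[0-9]+);.*$", "\\1\\2", lines)

writeLines(trimmed)
```

Since only one substitution per line is needed, sub() is enough here; gsub() would give the same result with this anchored pattern.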

written 3 months ago by RamRS

Well, I mostly use R now, so I was looking for an R-based solution, but sed is absolutely fine as well. I need to learn sed to make my life a bit easier. Thanks for the clear-cut solution.

written 3 months ago by krushnach80
3 months ago, t.kuilman wrote:

It is usually helpful to provide an example. This can be done by using the dput() function on the variable that contains your data. In this case, I have used a data.frame called test:

> dput(test)
structure(list(Symbol = "RBM11", Chr = "hs21", Start = "14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880", 
    End = "14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372"), row.names = 2L, class = "data.frame")

In this case you can get what you want using the following code:

test[, c("Start", "End")] <- lapply(test[, c("Start", "End")], function(x) {gsub(";.*", "", x)})

Resulting in

> test
  Symbol  Chr    Start      End
2  RBM11 hs21 14216130 14216282

lapply applies a function to each element of the list provided as its first argument; a data.frame is a list of columns, so here the function is applied to the columns named "Start" and "End". The second argument is the function you would like to apply, in this case function(x) {gsub(";.*", "", x)}, which replaces the first semicolon and everything after it with nothing (effectively clipping each value after its first coordinate).
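
For comparison, a minimal sketch of an alternative that splits on the semicolon instead of using a regular expression, and also converts the result to integers. The two-row data.frame is a shortened, partly invented example (the "GENE2" row is hypothetical), and `first_coord` is a helper name introduced only for this illustration.

```r
# Hypothetical example data; the second row is invented for illustration
test <- data.frame(
  Symbol = c("RBM11", "GENE2"),
  Chr    = c("hs21", "hs21"),
  Start  = c("14216130;14216145;14216153", "100;200"),
  End    = c("14216282;14216282;14216282", "150;250"),
  stringsAsFactors = FALSE
)

# Assumed helper: split each string on ";" and keep the first piece as an integer
first_coord <- function(x) {
  as.integer(vapply(strsplit(x, ";", fixed = TRUE), `[`, character(1), 1))
}

# Same lapply pattern as above, applied to both coordinate columns
test[c("Start", "End")] <- lapply(test[c("Start", "End")], first_coord)
print(test)
```

The gsub() version leaves Start and End as character columns; converting to integer, as sketched here, may be convenient if the coordinates are used numerically later on.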


I was thinking of providing the dput() output; sorry about that, I will do so next time. Let me try it and let you know. Wonderful, it worked! I had been struggling with this for quite a while.

written 3 months ago by krushnach80