Question: Split columns keep the first coordinate from start and end
0
6 weeks ago by
krushnach80420
krushnach80420 wrote:

I have a data file that I am trying to format for a Circos plot. So far I have built the data files; the structure of my data frame is as follows:

Symbol  Chr Start   End
RBM11   hs21    14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880 14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372

What I need is the Symbol and Chr columns, plus only the first coordinate from each of Start and End. I have tried various approaches on the data frame but have not been able to do it.

Something like this:

Symbol  Chr Start   End
RBM11   hs21 14216130 14216282

I tried this library

library(splitstackshape)

but I could not get it to work.

Is there a simple way to do this?

R • 165 views
ADD COMMENTlink modified 6 weeks ago by zx87545.7k • written 6 weeks ago by krushnach80420
4
6 weeks ago by
t.kuilman700
Netherlands
t.kuilman700 wrote:

It is usually helpful to provide an example. This can be done by using the dput() function on the variable that contains your data. In this case, I have used a data.frame called test:

> dput(test)
structure(list(Symbol = "RBM11", Chr = "hs21", Start = "14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880", 
    End = "14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372"), row.names = 2L, class = "data.frame")

In this case you can get what you want using the following code:

test[, c("Start", "End")] <- lapply(test[, c("Start", "End")], function(x) {gsub(";.*", "", x)})

Resulting in

> test
  Symbol  Chr    Start      End
2  RBM11 hs21 14216130 14216282

lapply applies a function to each element of the list provided as its first argument (a data.frame is a list of columns; here, the columns named "Start" and "End"). The second argument is the function to apply, in this case function(x) {gsub(";.*", "", x)}, which replaces the first semicolon and everything after it with nothing (effectively clipping each value after the first coordinate). Note that the resulting columns are still character vectors.

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by t.kuilman700

I was thinking of providing dput() output, sorry about that; next time I will do so. Let me try it and let you know. Wonderful, it worked! I had been struggling with this for quite a while.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by krushnach80420
4
6 weeks ago by
cpad011210k
India
cpad011210k wrote:

With sed, assuming the columns are tab-separated:

$ sed 's/\(^.*\t[0-9]\+\);.*\(\t[0-9]\+\);.*/\1\2/g' test.txt

Symbol  Chr Start   End
RBM11   hs21    14216130    14216282
ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by cpad011210k
1

With sed -r (--regexp-extended), the expression becomes a lot simpler:

sed -r 's/(^.*\t[0-9]+);.*(\t[0-9]+);.*/\1\2/g' test.txt
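
If the regular expression feels fragile, a field-based approach is another option. The sketch below uses awk, which nobody in this thread suggested, so treat it as an assumption rather than the posters' method. The function name first_coords and the shortened sample row are hypothetical. It also handles rows where Start or End holds a single value with no semicolon.

```shell
# Hypothetical sample file mirroring the layout in the question (tab-separated,
# with the coordinate lists shortened for readability).
printf 'Symbol\tChr\tStart\tEnd\n' > test.txt
printf 'RBM11\ths21\t14216130;14216145;14216153\t14216282;14216282;14216282\n' >> test.txt

# Keep only the text before the first ';' in columns 3 (Start) and 4 (End).
# Reads the named file(s), or standard input when no file is given.
first_coords() {
  awk 'BEGIN { FS = OFS = "\t" }
       { sub(/;.*/, "", $3); sub(/;.*/, "", $4); print }' "$@"
}

first_coords test.txt
```

Because awk splits on tabs first and only then trims each field, the header line passes through unchanged and the digits in the coordinates do not matter.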

OP is looking for a solution in R though, so maybe gsub() works better?

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by RamRS19k

Well, since I mostly use R now, I was looking for an R-based solution, but sed is absolutely fine as well. I need to learn sed to make my life a bit easier. Thanks for the clear-cut solution.

ADD REPLYlink written 6 weeks ago by krushnach80420