Question

Split columns and keep the first coordinate from Start and End


5.5 years ago

1769mkc ★ 1.2k

I have a data file that I am trying to format for a Circos plot. So far, my data frame has the following structure:

Symbol  Chr Start   End
RBM11   hs21    14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880 14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372

What I need is Symbol and Chr, followed by only the first coordinate from each of Start and End. I have tried various approaches but have not been able to do it.

Something like this:

Symbol  Chr Start   End
RBM11   hs21 14216130 14216282

I tried the splitstackshape library:

library(splitstackshape)

but I could not get it to work.

Is there a simple way to do this?

R • 1.1k views

updated 5.5 years ago by zx8754 11k • written 5.5 years ago by 1769mkc ★ 1.2k

score 4 · Accepted Answer · 2018-10-02

It is usually helpful to provide an example. This can be done by using the dput() function on the variable that contains your data. In this case, I have used a data.frame called test:

> dput(test)
structure(list(Symbol = "RBM11", Chr = "hs21", Start = "14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880", 
    End = "14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372"), row.names = 2L, class = "data.frame")

In this case you can get what you want using the following code:

test[, c("Start", "End")] <- lapply(test[, c("Start", "End")], function(x) {gsub(";.*", "", x)})

Resulting in

> test
  Symbol  Chr    Start      End
2  RBM11 hs21 14216130 14216282

lapply applies a function to each element of a list (here, each column of the data.frame) given as its first argument, in this case the columns named "Start" and "End". The second argument is the function to apply, in this case function(x) {gsub(";.*", "", x)}, which replaces the first semicolon and everything after it with nothing (effectively clipping each value after the first coordinate).
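Note that gsub() returns character strings, so the trimmed columns are still text. As a minimal, self-contained sketch of the same idea with an integer conversion added (the data frame `df`, its made-up second row, and the name `coord_cols` are assumptions for illustration, not part of the original post):

```r
# Toy data in the same layout as the question. The second row is invented
# to show that coordinates containing zeros are handled as well.
df <- data.frame(
  Symbol = c("RBM11", "GENE2"),
  Chr    = c("hs21", "hs1"),
  Start  = c("14216130;14216145;14216153", "100200;100300"),
  End    = c("14216282;14216282;14216282", "100900;101000"),
  stringsAsFactors = FALSE
)

coord_cols <- c("Start", "End")

# sub() replaces only the first match. Because the pattern runs from the
# first semicolon to the end of the string, one replacement is enough to
# keep just the first coordinate. as.integer() then converts the text.
df[coord_cols] <- lapply(df[coord_cols], function(x) {
  as.integer(sub(";.*", "", x))
})

str(df)
```

Values without any semicolon are left unchanged by sub(), so rows that already hold a single coordinate should pass through as they are.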



5.5 years ago

cpad0112 21k

With sed, assuming the columns are tab-separated:

$ sed 's/\(^.*\t[0-9]\+\);.*\(\t[0-9]\+\);.*/\1\2/g' test.txt

Symbol  Chr Start   End
RBM11   hs21    14216130    14216282



With sed -r (--regexp-extended), the expression becomes a lot simpler:

sed -r 's/(^.*\t[0-9]+);.*(\t[0-9]+);.*/\1\2/g' test.txt

OP is looking for a solution in R though, so maybe gsub() works better?

5.5 years ago by Ram 43k


Well, since I mostly use R now, I was looking for an R-based solution, but sed is absolutely fine as well. I need to learn sed to make my life a bit easier. Thanks for the clear-cut solution.

5.5 years ago by 1769mkc ★ 1.2k