Split columns: keep the first coordinate from Start and End
3.2 years ago
krushnach80 ▴ 1000

I have a data file that I am trying to format for a Circos plot. So far I have loaded it into a data frame with the following structure:

Symbol  Chr Start   End
RBM11   hs21    14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880 14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372


What I need is Symbol and Chr, plus only the first coordinate from each of the Start and End columns. I have tried various approaches but have not been able to get it to work.

Something like this:

Symbol  Chr Start   End
RBM11   hs21 14216130 14216282


I tried the splitstackshape library:

library(splitstackshape)


but I could not get it to work.

Is there a simple way to do this?

3.2 years ago

It is usually helpful to provide a reproducible example. You can do this by calling dput() on the variable that contains your data and pasting the output. Here I have used a data.frame called test:

> dput(test)
structure(list(Symbol = "RBM11", Chr = "hs21", Start = "14216130;14216145;14216153;14216154;14216178;14219553;14219553;14219563;14219563;14221097;14221097;14221097;14221097;14221097;14224375;14224438;14224438;14224438;14224438;14226859;14226880;14226880;14226880;14226880",
End = "14216282;14216282;14216282;14216282;14216282;14219725;14219725;14219725;14219725;14221453;14221169;14221169;14221169;14221169;14224537;14224537;14224537;14224537;14224537;14227054;14228372;14227384;14228173;14228372"), row.names = 2L, class = "data.frame")


In this case you can get what you want using the following code:

test[, c("Start", "End")] <- lapply(test[, c("Start", "End")], function(x) {gsub(";.*", "", x)})


Resulting in

> test
  Symbol  Chr    Start      End
2  RBM11 hs21 14216130 14216282


lapply applies a function to each element of a list. A data.frame is a list of columns, so here the function is applied to each of the columns passed as the first argument ("Start" and "End"). The second argument is the function to apply, in this case function(x) {gsub(";.*", "", x)}, which replaces the first semicolon and everything after it with nothing (effectively clipping each value after the first coordinate). Note that the resulting columns are still character strings.
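As a cross-check on the explanation above, here is a minimal sketch that runs the same clipping in two ways: with sub() and the same regular expression, and by splitting on semicolons and keeping the first piece. The data frame df, the symbol "GENE2" and its coordinates are made up for illustration, and the final conversion to numeric is an assumption about what the downstream tool wants, not something from the original answer.

```r
# Hypothetical data frame in the same shape as the question's data
df <- data.frame(
  Symbol = c("RBM11", "GENE2"),            # "GENE2" is a made-up symbol
  Chr    = c("hs21", "hs1"),
  Start  = c("14216130;14216145;14216153", "100;200"),
  End    = c("14216282;14216282", "150"),  # "150" contains no semicolon
  stringsAsFactors = FALSE
)

coord_cols <- c("Start", "End")

# Option 1: remove the first semicolon and everything after it
by_regex <- lapply(df[coord_cols], function(x) sub(";.*", "", x))

# Option 2: split on semicolons and keep the first piece of each value
by_split <- lapply(df[coord_cols], function(x) {
  vapply(strsplit(x, ";", fixed = TRUE), `[`, character(1), 1)
})

# Both approaches should give the same result on this data
stopifnot(identical(by_regex, by_split))

# Write the clipped values back, converted from character to numeric
df[coord_cols] <- lapply(by_regex, as.numeric)
print(df)
```

Values without any semicolon pass through unchanged in both variants, so rows that hold a single coordinate need no special handling.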


I was thinking of providing dput() output, sorry about that; I will do so next time. Let me try it and let you know. Wonderful, it worked! I had been struggling with this for quite a while.

3.2 years ago

With sed, assuming the columns are tab-separated:

$ sed 's/\(^.*\t[0-9]\+\);.*\(\t[0-9]\+\);.*/\1\2/g' test.txt

Symbol  Chr Start   End
RBM11   hs21    14216130    14216282


With sed -r (--regexp-extended), the expression becomes a lot simpler:

sed -r 's/(^.*\t[0-9]+);.*(\t[0-9]+);.*/\1\2/g' test.txt


OP is looking for a solution in R though, so maybe gsub() works better?
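For comparison, a file-based version of the same clipping in R might look like the sketch below. It assumes a tab-separated input with a header row, as in the sed examples; the temporary files stand in for test.txt, and the shortened coordinate lists are only there to keep the example compact.

```r
# Write a small tab-separated example to a temporary file
# (a stand-in for the test.txt used with sed)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "Symbol\tChr\tStart\tEnd",
  "RBM11\ths21\t14216130;14216145;14216153\t14216282;14216282;14216282"
), path)

# Read every column as text so the semicolon-separated fields stay intact
tab <- read.delim(path, stringsAsFactors = FALSE, colClasses = "character")

# Keep only the first coordinate in each column
tab$Start <- sub(";.*", "", tab$Start)
tab$End   <- sub(";.*", "", tab$End)

# Write the result as tab-separated text without quotes or row names
out <- tempfile(fileext = ".txt")
write.table(tab, out, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the output file back to inspect it
result <- readLines(out)
print(result)
```

sub() is enough here because only the first match per value needs replacing; gsub() gives the same result since the pattern already consumes the rest of the string.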


Well, since I mostly use R now, I was looking for an R-based solution, but sed is absolutely fine as well. I need to learn sed to make my life a bit easier. Thanks for the clear-cut solution.