Question: Splitting string in a column using character
sofie_carolina (0), Hyderabad, wrote 8 days ago:

I'm trying to split the values in a column into two parts, using a specific string as the separator. I haven't managed it: most examples available online split on a single delimiter character, but I want a short string (two letters) to act as the separator. Is awk or sed recommended for this? Example:

Col1
BOT-rs10136766
BOT-rs104894363
BOT-rs10774624
BOT-rs111647200
GSA-rs117306900
GSA-rs117306950
GSA-rs117306954
GSA-rs117306975
GSA-rs117306989
BOT-seq-rs532891158.1
BOT-seq-rs794728599
DUP-rs121913344
DUP-rs12979860
DUP-seq-rs397518008
DUP-seq-rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
seq-rs794727444.1
seq-rs794727773.1
seq-rs794728252.1
seq-rs794728252.2

Here, I want only the rsID ("rs" followed by a numeric ID) to be separated from the prefixes.

awk snp sed regex • 101 views
modified 8 days ago by zx8754 (6.8k) • written 8 days ago by sofie_carolina (0)
sed 's/.*\(rs\w\+\).*/\1/g' test.txt
Col1
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252
written 8 days ago by cpad0112 (11k)

Maybe move to answer?

written 8 days ago by zx8754 (6.8k)

Guessing from .1, .2 suffixes, is this an output from an R script?

written 8 days ago by zx8754 (6.8k)
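
For the original request of splitting each value into two parts (prefix and rsID), here is a minimal awk sketch. It is not from any of the posters; the function name split_rsid and the tab-separated output are assumptions made for illustration, and it assumes the value sits in the first whitespace-delimited column and that the rsID is the first "rs" followed by digits.

```shell
# Hypothetical helper (not from the thread): print "prefix<TAB>rsID" for each line.
# Usage: split_rsid test.txt
split_rsid() {
    awk '
        # match() sets RSTART and RLENGTH for the first "rs<digits>" run in column 1;
        # lines without an rsID (such as the "Col1" header) are skipped.
        match($1, /rs[0-9]+/) {
            prefix = substr($1, 1, RSTART - 1)   # text before the rsID, may be empty
            sub(/-$/, "", prefix)                # drop the dash that joined prefix and rsID
            rsid = substr($1, RSTART, RLENGTH)   # the rsID itself, without any ".N" suffix
            print prefix "\t" rsid
        }
    ' "$1"
}
```

Values without a prefix, such as rs6837175, come out with an empty first field, so the rsID always lands in the second tab-separated column.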
bioExplorer (3.7k) wrote, 8 days ago:
grep -oP 'rs\d+(\.\d+)?' test.txt

where test.txt is the file containing the IDs you listed above.

Output:

rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158.1
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444.1
rs794727773.1
rs794728252.1
rs794728252.2

How do I specify the column here, say column 1? Also, I don't need the digits after the decimal point: some rsIDs end in .1, .2 and so on, and I don't want those. Can both of these be handled in your script?

written 8 days ago by sofie_carolina (0)

Can you paste an example of how your result should look?

written 8 days ago by bioExplorer (3.7k)

I think they just want rsXXX: drop the prefix (everything up to and including the dash) and the suffix (everything from the dot onward).

written 8 days ago by zx8754 (6.8k)
$ grep -Po '(?<=^|-)rs\w*' test.txt
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252
written 8 days ago by cpad0112 (11k)

Try this:

grep -P 'rs\d+' test.txt -o
written 8 days ago by bioExplorer (3.7k)
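
The grep commands in this thread scan whole lines, so they do not address the follow-up about choosing a column. One possible way to restrict the search to a given column is awk. The sketch below is an assumption-laden illustration, not a poster's solution: the function name extract_rsid is hypothetical, the file is assumed to be whitespace-delimited, and the column number is passed as the second argument (defaulting to 1).

```shell
# Hypothetical helper (not from the thread): print the rsID found in one column.
# Usage: extract_rsid test.txt 1
extract_rsid() {
    awk -v col="${2:-1}" '
        # Look for the first "rs<digits>" run in the chosen column only.
        # Anything before it (BOT-, DUP-seq-, ...) and after it (.1, .2) is left out;
        # lines with no rsID in that column, such as a header, print nothing.
        match($col, /rs[0-9]+/) { print substr($col, RSTART, RLENGTH) }
    ' "$1"
}
```

For a tab- or comma-delimited file, adding -F '\t' or -F ',' to the awk call should be enough; the rest of the sketch stays the same.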