Splitting string in a column using character
2.7 years ago

I'm trying to split the values in one column into two parts, using a specific string as the split point. I haven't managed it: most of the examples available online split on a single delimiter character, but I want a short string (two letters) to act as the separator. Is awk or sed recommended for this? Example:

Col1
BOT-rs10136766
BOT-rs104894363
BOT-rs10774624
BOT-rs111647200
GSA-rs117306900
GSA-rs117306950
GSA-rs117306954
GSA-rs117306975
GSA-rs117306989
BOT-seq-rs532891158.1
BOT-seq-rs794728599
DUP-rs121913344
DUP-rs12979860
DUP-seq-rs397518008
DUP-seq-rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
seq-rs794727444.1
seq-rs794727773.1
seq-rs794728252.1
seq-rs794728252.2


Here, I want only the rsID (rs followed by the numeric ID) extracted, separately from the prefixes.

SNP regex awk sed • 627 views
sed 's/.*\(rs\w\+\).*/\1/g' test.txt
Col1
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252


Guessing from the .1 and .2 suffixes, is this output from an R script?

2.7 years ago
grep -P 'rs\d+\.?\d+?' test.txt -o


Here test.txt is the file containing the IDs you listed above.

Output:

rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158.1
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444.1
rs794727773.1
rs794728252.1
rs794728252.2


How do I specify the column here, if I want to use column 1? Also, I don't need the digits after the decimal point: some rsIDs end in .1, .2, and so on, and I don't need those. Can both of these be handled in your script?


Can you paste an example of how your result should look?


I think they just want rsXXX: drop the prefix (everything up to and including the dash) and the suffix (everything from the dot (.) onward).
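If that reading is right, a minimal sed sketch of exactly those two steps could look like the following. The sample lines are copied from the question; the variable names `ids` and `stripped` are only for illustration, and the header line is assumed to pass through unchanged because it contains neither a dash nor a dot.

```shell
# Sample input; the IDs are a subset of those in the question.
ids='Col1
BOT-rs10136766
BOT-seq-rs532891158.1
DUP-seq-rs397518008
rs6837175
seq-rs794728252.2'

# 1st expression: delete everything up to and including the last dash.
# 2nd expression: delete the first dot and everything after it.
stripped=$(printf '%s\n' "$ids" | sed -e 's/^.*-//' -e 's/\..*$//')
printf '%s\n' "$stripped"
```

This relies on the rsID always being the last dash-separated piece, which holds for the sample but may not hold for every ID scheme.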

$ grep -Po '(?<=^|-)rs\w*' test.txt
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252


Try this:

grep -P 'rs\d+' test.txt -o


I have the rsIDs in column 2. Where do I specify the column name in this script?
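grep works on whole lines and has no notion of columns, so one way to target a specific column is awk. The sketch below assumes a tab-separated file with a header row and the IDs in column 2; the other columns (`Chr`, `Pos`) and the variable names are made up for illustration. Change `col=2` to point at a different column.

```shell
# Hypothetical tab-separated input: a header row, then IDs in column 2.
input=$(printf 'Chr\tCol2\tPos\n1\tBOT-rs10136766\t100\n2\tBOT-seq-rs532891158.1\t200\n3\tseq-rs794728252.2\t300\n')

# col=2 selects the field to rewrite. match() sets RSTART/RLENGTH to the
# first "rs<digits>" in that field, and substr() keeps only that part.
# The header (NR == 1) and rows without a match are printed unchanged.
col2_clean=$(printf '%s\n' "$input" |
  awk -F'\t' -v OFS='\t' -v col=2 '
    NR > 1 && match($col, /rs[0-9]+/) { $col = substr($col, RSTART, RLENGTH) }
    { print }')
printf '%s\n' "$col2_clean"
```

Because the pattern is `rs[0-9]+`, the match stops at the dot, so the .1 and .2 suffixes are dropped along with the prefixes.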


I don't want to write the rsIDs to another file. I want the output printed in the same column. Rows where no rs is found should not be printed (they should be omitted).
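A possible awk sketch for that, again assuming a tab-separated file with the IDs in column 2: it rewrites the column, drops rows with no rsID, and then replaces the original file via a temporary file. The file contents here are invented for the example (including the `AFFX-12345` row, which stands in for a row without an rsID), and keeping the header row is an assumption.

```shell
# Hypothetical working file; the row with AFFX-12345 has no rsID.
tmpdir=$(mktemp -d)
file="$tmpdir/test.txt"
printf 'Col1\tCol2\nA\tGSA-rs117306900\nB\tAFFX-12345\nC\tseq-rs794727444.1\n' > "$file"

# Keep the header (assumption). For every other row, print it only when
# column 2 contains "rs<digits>", after replacing the field with that match.
awk -F'\t' -v OFS='\t' -v col=2 '
  NR == 1 { print; next }
  match($col, /rs[0-9]+/) { $col = substr($col, RSTART, RLENGTH); print }
' "$file" > "$file.tmp" && mv "$file.tmp" "$file"

# Show the rewritten file, then remove the temporary directory.
result=$(cat "$file")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Writing to a temporary file and moving it over the original only after awk succeeds avoids truncating the input while it is still being read.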