Question: Splitting string in a column using character
sofie_carolina (Hyderabad) wrote 4 months ago:

I'm trying to split the values in one column into two parts, using a specific string as the separator. So far I have not managed it: most examples available online split on a single delimiter character, but I want a short string (two letters) to act as the separator. Is awk or sed recommended for this? Example:

Col1
BOT-rs10136766
BOT-rs104894363
BOT-rs10774624
BOT-rs111647200
GSA-rs117306900
GSA-rs117306950
GSA-rs117306954
GSA-rs117306975
GSA-rs117306989
BOT-seq-rs532891158.1
BOT-seq-rs794728599
DUP-rs121913344
DUP-rs12979860
DUP-seq-rs397518008
DUP-seq-rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
seq-rs794727444.1
seq-rs794727773.1
seq-rs794728252.1
seq-rs794728252.2

Here, I want to extract only the rsID (rs followed by a numeric ID), separated from the prefixes.

Tags: awk, snp, sed, regex
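On the question of using a two-letter string as the separator: awk's -F option accepts a multi-character separator (it is then treated as a regular expression), so "rs" itself can split each value into a prefix and the rest. A minimal sketch, assuming one ID per line and that "rs" occurs only once in each ID; the function name split_on_rs is made up for illustration:

```shell
# Hypothetical helper: split each ID into prefix and rsID, using the
# two-letter string "rs" as the field separator.
split_on_rs() {
    # $1 is the text before "rs", $2 is the text after it.
    # NF > 1 skips lines that do not contain "rs" at all (e.g. the header).
    awk -F 'rs' 'NF > 1 { print $1 "\t" "rs" $2 }'
}

# Three sample lines taken from the question; the header line is skipped
printf 'BOT-seq-rs532891158.1\nrs6837175\nCol1\n' | split_on_rs
```

This prints the prefix and the rsID separated by a tab; for rs6837175 the prefix is empty. Any .1 or .2 suffix stays attached to the rsID in this sketch.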
sed 's/.*\(rs\w\+\).*/\1/g' test.txt
Col1
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252
Reply written 4 months ago by cpad0112

Maybe move to answer?

Reply written 4 months ago by zx8754

Guessing from .1, .2 suffixes, is this an output from an R script?

Reply written 4 months ago by zx8754
lakhujanivijay (India) wrote 4 months ago:
grep -P 'rs\d+\.?\d+?' test.txt -o

where test.txt is the file containing the ids you have mentioned above

output

rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158.1
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444.1
rs794727773.1
rs794728252.1
rs794728252.2

How do I specify the column here, if I want to use column 1? Also, I don't need the integers after the decimal point: some rsIDs end in .1, .2 and so on, and I don't need those. Can both of these be handled in your script?

Reply written 4 months ago by sofie_carolina

Can you paste an example of how your result should look?

Reply written 4 months ago by lakhujanivijay

I think they just want rsXXX: drop the prefixes (anything before and including the dash) and the suffixes (anything from the dot (.) onward).

Reply written 4 months ago by zx8754
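The rule described in this reply can be written almost literally as two sed substitutions. This is only a sketch under the assumptions stated there (the rsID follows the last dash, and the first dot starts the suffix); the function name strip_affixes is made up for illustration:

```shell
# Hypothetical helper: strip the prefix and suffix around an rsID.
strip_affixes() {
    # 1st expression: delete everything up to and including the last dash
    # 2nd expression: delete the first dot and everything after it
    sed -e 's/^.*-//' -e 's/\..*$//'
}

# Three sample lines taken from the question
printf 'BOT-seq-rs532891158.1\nDUP-rs12979860\nrs6837175\n' | strip_affixes
```

This prints rs532891158, rs12979860 and rs6837175. Unlike grep -o, lines without an rsID (such as the Col1 header) pass through unchanged, so they would still need filtering.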
$ grep -Po '(?<=^|-)rs\w*' test.txt
rs10136766
rs104894363
rs10774624
rs111647200
rs117306900
rs117306950
rs117306954
rs117306975
rs117306989
rs532891158
rs794728599
rs121913344
rs12979860
rs397518008
rs397518039
rs6837175
rs6837180
rs6837215
rs6837250
rs794727444
rs794727773
rs794728252
rs794728252
Reply written 4 months ago by cpad0112

Try this:

grep -P 'rs\d+' test.txt -o
Reply written 4 months ago by lakhujanivijay
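For systems where grep lacks -P (Perl-compatible patterns are a GNU grep feature), the same pattern can probably be expressed with extended regular expressions. A sketch, assuming a grep that supports -o; the function name rsids_only is made up for illustration:

```shell
# Hypothetical helper: print only the rsID part of each input line.
rsids_only() {
    # -E: extended regex, so [0-9] replaces the Perl-style \d
    # -o: print only the matched text, one match per output line
    grep -oE 'rs[0-9]+'
}

# Two IDs with suffixes plus the header, which has no match and is dropped
printf 'BOT-seq-rs532891158.1\nseq-rs794728252.2\nCol1\n' | rsids_only
```

This prints rs532891158 and rs794728252: the .1 and .2 suffixes are gone because the pattern stops at the first non-digit.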

I have the rsIDs in col2. Where do I specify the column name in this script?

Reply written 3 months ago by sofie_carolina

I don't want to write the rsIDs to another file. I want the output printed in the same column. Rows where no rs is found should not be printed; they should be omitted.

Reply written 3 months ago by sofie_carolina
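The grep commands above work on whole lines, so they cannot target one column. One possible way to meet the last two requests is awk, which can rewrite a single field in place and skip rows without a match. A rough sketch, assuming whitespace-separated columns and tab-separated output; the function name extract_rsid_col and the sample rows are made up for illustration:

```shell
# Hypothetical helper: keep only the rsID in a chosen column and
# drop rows where that column contains no rsID.
extract_rsid_col() {
    # $1 = column number (e.g. 2); the table is read from stdin
    awk -v col="$1" 'BEGIN { OFS = "\t" }
    match($col, /rs[0-9]+/) {
        # match() sets RSTART and RLENGTH; overwrite the column in place
        $col = substr($col, RSTART, RLENGTH)
        print
    }'
}

# Made-up three-row input; the middle row has no rsID and is skipped
printf 'a\tBOT-seq-rs532891158.1\nb\tnoid\nc\trs6837175\n' | extract_rsid_col 2
```

This prints two rows, with rs532891158 and rs6837175 in the second column. A header row would also be dropped, since it contains no rsID, so it may need to be printed separately.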