Question: How to use a regular expression to extract the text before "GO:" or "#N/A"
Yingzi Zhang (Jeddah) wrote, 2.2 years ago:

Hi all, I don't know whether it's polite to ask such a direct, simple question on Biostars, but it has troubled me the whole day.

My raw data is like this:

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    #N/A
ENSSSCP00000041172.1    Ras family      Small GTPase superfamily        GO:0003924|GO:0005525
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM      Molybdate-anion transporter     GO:0015098|GO:0015689|GO:0016021
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  GO:0015293|GO:0016021
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic   GO:0003690|GO:0006355

I want to extract all the text before "GO:" or "#N/A". The expected result should look like this:

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1
ENSSSCP00000041172.1    Ras family      Small GTPase superfamily
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM      Molybdate-anion transporter    
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic

The script I wrote was:

for line in rawdata:
    value1 = re.search("^(^'GO:']+)",line)
    value2 = re.search("^(^'#N/A']+)",line)
    if value1:
        print(value1.group(1))
    if value2:
        print(value2.group(1))

No error is reported, but the output is empty. How can I fix this? Thank you for your patience.

Yingzi
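
A note on why the script above prints nothing, with a minimal sketch of one possible fix. In the original pattern both `^` characters act as start-of-line anchors, not as negation; negation only exists inside square brackets, and even there it applies to single characters, not to a whole string such as "GO:". As written, the pattern can only match a line that literally begins with `'GO:']`. The sample lines, the tab separators, and the helper name `strip_annotation` are assumptions made for illustration.

```python
import re

# Two sample lines from the question; tab-separated columns are an assumption.
rawdata = [
    "ENSSSCP00000055957.1\tProtein of unknown function (DUF1466)\tRCS1\t#N/A",
    "ENSSSCP00000041172.1\tRas family\tSmall GTPase superfamily\tGO:0003924|GO:0005525",
]

# The pattern from the question: anchors plus the literal text 'GO:' and one or more "]".
original = re.compile("^(^'GO:']+)")
matches_real_line = original.search(rawdata[1])   # no match on real data
matches_odd_line = original.search("'GO:']")      # matches only this odd input

def strip_annotation(line):
    """Remove a trailing "GO:..." or "#N/A" column and the whitespace before it."""
    return re.sub(r"\s*(?:GO:\S*|#N/A)\s*$", "", line)

cleaned = [strip_annotation(line) for line in rawdata]
print(*cleaned, sep="\n")
```

Lines without a trailing annotation pass through `strip_annotation` unchanged, which may or may not be the behaviour you want.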

import re

for line in rawdata:
    # Capture everything before "GO:" or "#N/A", excluding the whitespace in between
    match = re.search(r"^(.*?)\s*(?:GO:|#N/A)", line)
    if match:
        print(match.group(1))
written 2.2 years ago by mohammadhassanj110

Yingzi Zhang, although you asked for a Python solution, here is another solution in bash:

$  grep -Po '.*\t(?=GO:|#N/A)' test.txt

ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    
ENSSSCP00000041172.1    Ras family  Small GTPase superfamily    
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM  Molybdate-anion transporter 
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic

If your text is well formatted, you can simply drop the last column (based on the OP's text), something like this (note that awk rejoins the remaining fields with single spaces, so the tab separators are lost):

$ awk '{$NF=""}1' test.txt 
ENSSSCP00000055957.1 Protein of unknown function (DUF1466) RCS1 
ENSSSCP00000041172.1 Ras family Small GTPase superfamily 
ENSSSCP00000041839.1 Sugar-tranasporters, 12 TM Molybdate-anion transporter 
ENSSSCP00000004168.3 Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter 
ENSSSCP00000040645.1 mTERF Transcription termination factor, mitochondrial/chloroplastic

In Python you can use zero-length assertions (lookaheads) as well (mind the indentation; test.txt contains the text from the OP):

import re

with open("test.txt", "r") as f:
    test = f.readlines()

# Match up to the last tab that is followed by "GO:" or "#N/A" (zero-length lookahead)
out = [re.search(r'.*\t(?=GO:|#N/A)', i).group(0) for i in test]
print(*out, sep='\n')


ENSSSCP00000055957.1    Protein of unknown function (DUF1466)   RCS1    
ENSSSCP00000041172.1    Ras family  Small GTPase superfamily    
ENSSSCP00000041839.1    Sugar-tranasporters, 12 TM  Molybdate-anion transporter 
ENSSSCP00000004168.3    Sodium:dicarboxylate symporter family   Sodium:dicarboxylate symporter  
ENSSSCP00000040645.1    mTERF   Transcription termination factor, mitochondrial/chloroplastic
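
One caveat with the list comprehension above: `re.search` returns `None` for a line that has no tab followed by "GO:" or "#N/A" (a blank line, for instance), and calling `.group(0)` on `None` raises an `AttributeError`. A guarded variant could look like the sketch below; the generator name `prefixes` and the sample lines are made up for illustration, and stripping the trailing tab is an optional choice.

```python
import re

# Same lookahead idea: everything up to the tab that precedes "GO:" or "#N/A".
PATTERN = re.compile(r".*\t(?=GO:|#N/A)")

def prefixes(lines):
    """Yield the text before the annotation column, skipping lines without one."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # The match ends with the separating tab, so drop it.
            yield match.group(0).rstrip("\t")

sample = [
    "ENSSSCP00000040645.1\tmTERF\tTranscription termination factor, mitochondrial/chloroplastic\tGO:0003690|GO:0006355\n",
    "\n",                               # blank line: silently skipped
    "a line without any annotation\n",  # no tab + GO:/#N/A: silently skipped
]
result = list(prefixes(sample))
print(*result, sep="\n")
```

Whether to skip non-matching lines silently or report them depends on how much you trust the input file.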
written 2.2 years ago by cpad0112

Thank you so much. :)

written 2.2 years ago by Yingzi Zhang

Another solution using bash:

$ cut -f1-3 test.txt
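
For completeness, a rough Python counterpart of the `cut -f1-3` call, assuming the file is tab-delimited with the annotation always in the fourth column. The function name `first_columns` is hypothetical and no regular expression is needed here.

```python
def first_columns(line, n=3):
    """Keep the first n tab-separated fields of a line, like `cut -f1-n`."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[:n])

# Sample lines from the question; the tab separators are an assumption.
sample = [
    "ENSSSCP00000055957.1\tProtein of unknown function (DUF1466)\tRCS1\t#N/A\n",
    "ENSSSCP00000004168.3\tSodium:dicarboxylate symporter family\tSodium:dicarboxylate symporter\tGO:0015293|GO:0016021\n",
]
trimmed = [first_columns(line) for line in sample]
print(*trimmed, sep="\n")
```

Unlike the regex approaches, this relies purely on column position, so it would also drop a fourth column that contains something other than GO terms or "#N/A".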

Why did you post your answer as a comment and not as an answer, @cpad0112?

fin swimmer

written 2.2 years ago by finswimmer

Thank you for the suggestion. I couldn't find an answer button; I could only see "ADD REPLY", so I clicked it. Sorry, are you able to modify this?

Yingzi

written 2.2 years ago by Yingzi Zhang