Question: Python: grep the strings within "[ ]"
horsedog60 wrote:

Hi, I'm using a bioinformatics tool to parse my sequences, and I'd like to extract some information I need. There are thousands of query names corresponding to different sequences, like this:

  >lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

What I need is "[location=(207..914)]". How can I achieve this? The name differs between sequences. I tried to use "split" by space to take the fifth element, but in some cases the "location" is not the fifth one, and sometimes there is no "location" at all, meaning there is no CDS in that sequence, so it should just be skipped. I'm thinking of using "grep" or "re.search", but it didn't work:

for line in open(file,"r").readlines():   
  if "location=" in line:  
    cds = grep “[location = *]” line  
  print(cds)

Does anyone have an idea?
Many thanks!

modified 2.9 years ago by cpad011214k • written 2.9 years ago by horsedog60

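If you want to stick with re, then:

import re

with open("test", "r") as handle:
    for line in handle:
        if "location" in line:
            # split the header on single spaces
            loc = re.split(r" ", line.strip())
            for m in loc:
                # keep only the field that mentions location
                if "location" in m:
                    print(m)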
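
As a hedged alternative to splitting, re.search can pull the bracketed field out directly, which was the original idea in the question. This is only a sketch: the names LOCATION_RE and find_location are made up for illustration, and it assumes the field always has the form "[location=...]" with no "]" inside the value.

```python
import re

# Matches "[location=" followed by anything up to the next "]".
LOCATION_RE = re.compile(r"\[location=[^\]]*\]")


def find_location(header):
    """Return the bracketed location field, or None if the header has none."""
    match = LOCATION_RE.search(header)
    return match.group(0) if match else None


header = (
    ">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
    "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
    "[location=(207..914)] [gbkey=CDS]"
)
print(find_location(header))               # [location=(207..914)]
print(find_location(">no_cds [gbkey=x]"))  # None
```

Because the match does not depend on the field's position, headers where location is not the fifth or sixth token are handled the same way, and headers without a location simply return None.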
modified 2.9 years ago • written 2.9 years ago by genomax92k

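Good one :). Further shortening the code (this assumes the location field is always the sixth whitespace-separated token, as in the input below):

with open("test.txt", "r") as handle:
    for line in handle:
        if "location" in line:
            print(line.split()[5])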

output:

[location=(207..914)]
[location=(2070..9140)]
[location=(20700..91400)]

input:

$ cat test.txt
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]
written 2.9 years ago by cpad011214k

I tried to use "split" by space to take the fifth element, but in some cases the "location" is not the fifth one

otherwise: grep -e 'location' myfile.fasta | cut -f 6 -d ' ' > locations.txt

modified 2.9 years ago • written 2.9 years ago by st.ph.n2.5k
st.ph.n2.5k wrote:

Grep is not a Python command. If you're sticking with Python, and not bash commands, here's a quick script to get you started:

#!/usr/bin/env python
import sys

with open(sys.argv[1], 'r') as f:
    for line in f:
        # find FASTA headers
        if line.startswith(">"):
            # check if 'location' is in the header
            if 'location' in line:
                # split header by spaces into a list
                x = line.strip().split(' ')
                # for each item in the header, check if 'location' is in that item
                for i in x:
                    if 'location' in i:
                        print(i)

Prints:

[location=(207..914)]

Save as find_loc.py and run as: python find_loc.py myfile.fasta > locations.txt
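
If it helps to know which sequence each location belongs to, a variation without regular expressions could look like the sketch below. It uses str.partition to cut out the text between "[location=" and the next "]". The function name iter_locations and the short example headers are hypothetical, and the sketch assumes the sequence ID is the first whitespace-separated token after ">".

```python
def iter_locations(lines):
    """Yield (sequence_id, location) for FASTA headers that carry a location."""
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # 'found' is an empty string when "[location=" is absent.
        _, found, rest = line.partition("[location=")
        if not found:
            continue  # no CDS location on this header, skip it
        value, closing, _ = rest.partition("]")
        if not closing:
            continue  # malformed field without a closing bracket
        seq_id = line[1:].split()[0]
        yield seq_id, value


# Hypothetical input: seq2 has no location and is skipped.
example = [
    ">seq1 [protein=hypothetical protein] [location=(207..914)] [gbkey=CDS]\n",
    "ATGAAA\n",
    ">seq2 [gbkey=CDS]\n",
    "ATGCCC\n",
    ">seq3 [location=(2070..9140)] [protein=x]\n",
]
for seq_id, location in iter_locations(example):
    print(seq_id, location)
```

Passing an open file handle instead of the example list should work the same way, since the function only iterates over lines.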

modified 2.9 years ago • written 2.9 years ago by st.ph.n2.5k
cpad011214k wrote:
>>> import re
>>> with open("test.txt", "r") as t:
...     f = t.read()
...
>>> # raw string, with both dots escaped so they match literal ".." only
>>> pattern = re.compile(r'\[location=\([0-9]+\.\.[0-9]+\)\]')
>>> re.findall(pattern, f)

output:

===========================

['[location=(207..914)]',
 '[location=(2070..9140)]',
 '[location=(20700..91400)]',
 '[location=(207000..914000)]']

==============================

>>> print (f)

=====================================

output:

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207000..914000)] [gbkey=CDS]

======================================
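
Going one step further, the same findall idea can collect every "[key=value]" field of a header into a dictionary, so a missing location is just a missing key. This is a rough sketch under stated assumptions: FIELD_RE, RANGE_RE, parse_header and parse_range are illustrative names, values are assumed not to contain "]", and only simple "start..end" ranges (with or without parentheses) are converted to numbers.

```python
import re

# One "[key=value]" field; the value may contain spaces but not "]".
FIELD_RE = re.compile(r"\[(\w+)=([^\]]*)\]")
# A simple "start..end" range, optionally wrapped in parentheses.
RANGE_RE = re.compile(r"\(?(\d+)\.\.(\d+)\)?")


def parse_header(header):
    """Return a dict of all bracketed key=value fields in a header."""
    return dict(FIELD_RE.findall(header))


def parse_range(location):
    """Return (start, end) as ints, or None for anything more complex."""
    match = RANGE_RE.fullmatch(location)
    return (int(match.group(1)), int(match.group(2))) if match else None


header = (
    ">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
    "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
    "[location=(207..914)] [gbkey=CDS]"
)
fields = parse_header(header)
print(fields.get("location"))                # (207..914)
print(fields["protein"])                     # hypothetical protein
print(parse_range(fields["location"]))       # (207, 914)
print(parse_header(">no_cds").get("location"))  # None
```

Using dict.get for the lookup means headers without a location return None instead of raising an error, which matches the "just skip it" requirement in the question.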

modified 2.9 years ago • written 2.9 years ago by cpad011214k