Question

Python: grep the strings within "[ ]"

0

6.3 years ago

horsedog ▴ 60

Hi, I'm using a bioinformatics tool to parse my sequences, and I'd like to extract some information I need. There are thousands of query names corresponding to different sequences, like this:

  >lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

What I need is "[location=(207..914)]". How can I achieve this? The name differs from sequence to sequence. I tried splitting on spaces and taking the fifth element, but in some cases "location" is not the fifth one, and sometimes there is no "location" at all, meaning there is no CDS in that sequence, so it can just be skipped. I was thinking of using "grep" or "re.search", but it didn't work:

for line in open(file,"r").readlines():   
  if "location=" in line:  
    cds = grep “[location = *]” line  
  print(cds)

Does anyone have an idea?
Many thanks!

python • 1.5k views

updated 6.3 years ago by cpad0112 21k • written 6.3 years ago by horsedog ▴ 60

0

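If you want to stick with re, then:

import re

with open("test", "r") as handle:
    for line in handle:
        if "location" in line:
            # split the header on single spaces and print every field that mentions "location"
            for m in re.split(r" ", line.strip()):
                if "location" in m:
                    print(m)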
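As an alternative to splitting on spaces, here is a minimal sketch that uses re.search to pull out the bracketed field directly, so the position of "location" in the header does not matter and headers without it are skipped. The helper name find_location and the second header are made up for illustration; the pattern assumes the field never contains a "]" before its closing bracket.

```python
import re

# Matches the whole bracketed field, wherever it sits in the header.
LOCATION_FIELD = re.compile(r"\[location=[^\]]*\]")

def find_location(header):
    """Return the '[location=...]' field of a header line, or None if absent."""
    match = LOCATION_FIELD.search(header)
    return match.group(0) if match else None

headers = [
    ">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
    "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
    "[location=(207..914)] [gbkey=CDS]",
    ">seq_without_location [gbkey=CDS]",  # made-up header with no location field
]

for header in headers:
    location = find_location(header)
    if location is not None:  # skip headers that have no location
        print(location)
```

With the two headers above, only the first one produces a line of output.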

6.3 years ago by GenoMax 141k

0

Good one :). Further shortening the code:

with open("test.txt", "r") as handle:
    for line in handle:
        if "location" in line:
            # assumes "location" is always the sixth whitespace-separated field
            print(line.split()[5])

output:

[location=(207..914)]
[location=(2070..9140)]
[location=(20700..91400)]

input:

$ cat test.txt
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]
>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]
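Because values such as "hypothetical protein" contain spaces, the index of a field after a whitespace split depends on how many words precede it. One possible way around that, sketched below, is to parse every bracketed key=value pair into a dict and look fields up by name. The function name parse_header_fields is hypothetical, and the pattern assumes that keys contain no "=" and values contain no "]".

```python
import re

# One bracketed "key=value" field; the value may contain spaces but not "]".
FIELD = re.compile(r"\[([^=\]]+)=([^\]]*)\]")

def parse_header_fields(header):
    """Return the bracketed key=value fields of a FASTA header as a dict."""
    return dict(FIELD.findall(header))

header = (
    ">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
    "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
    "[location=(207..914)] [gbkey=CDS]"
)

fields = parse_header_fields(header)
# .get() returns None instead of raising when the header has no location field.
print(fields.get("location"))
```

For the header above this prints (207..914), without the surrounding brackets or the key.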

6.3 years ago by cpad0112 21k

0

I tried to use "split" by space to take the fifth element, but in some cases the "location" is not the fifth one

otherwise: grep -e 'location' myfile.fasta | cut -f 6 -d ' ' > locations.txt

6.3 years ago by st.ph.n ★ 2.7k

score 1 · Answer 1 · 2018-01-03

Grep is not a Python command. If you're sticking with Python rather than bash commands, here's a quick script to get you started:

#!/usr/bin/env python
import sys
with open(sys.argv[1], 'r') as f:
        for line in f:
                # find FASTA headers. 
                if line.startswith(">"):
                        # check if 'location' in header
                        if 'location' in line:
                                # split header by spaces into list
                                x = line.strip().split(' ')
                                # for each item in header check if 'location' is in that item
                                for i in x:
                                        if 'location' in i:
                                                print(i)

Prints:

[location=(207..914)]

Save as find_loc.py and run as: python find_loc.py myfile.fasta > locations.txt
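If it is useful to know which sequence each location came from, a variation on the same idea could yield the sequence ID together with the location. This is only a sketch: iter_locations is a made-up name, the in-memory io.StringIO stands in for an opened FASTA file, and the second record is invented to show a header being skipped.

```python
import io
import re

# Captures the text between "[location=" and the closing "]".
LOCATION = re.compile(r"\[location=([^\]]*)\]")

def iter_locations(handle):
    """Yield (sequence_id, location) for each FASTA header that has a location."""
    for line in handle:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        match = LOCATION.search(line)
        if match is None:
            continue  # header without a location field: skip it
        seq_id = line[1:].split(None, 1)[0]  # text between ">" and the first whitespace
        yield seq_id, match.group(1)

# In-memory stand-in for a FASTA file; the second record is made up.
example = io.StringIO(
    ">lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] "
    "[protein=hypothetical protein] [protein_id=WP_092665365.1] "
    "[location=(207..914)] [gbkey=CDS]\n"
    "ATGAAA\n"
    ">made_up_record [gbkey=CDS]\n"
    "ATGCCC\n"
)

for seq_id, location in iter_locations(example):
    print(seq_id, location, sep="\t")
```

For a real file, the same function can be called on the handle returned by open(), for example the path passed in sys.argv[1].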

score 1 · Answer 2 · 2018-01-03

>>> import re
>>> with open("test.txt", "r") as t:
...     f = t.read()
...
>>> # raw string, with the two literal dots escaped
>>> pattern = re.compile(r'\[location=\([0-9]+\.\.[0-9]+\)\]')
>>> re.findall(pattern, f)

output:

===========================

['[location=(207..914)]',
 '[location=(2070..9140)]',
 '[location=(20700..91400)]',
 '[location=(207000..914000)]']

==============================

>>> print (f)

=====================================

output:

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207..914)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(2070..9140)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(20700..91400)] [gbkey=CDS]

>lcl|NZ_FOCX01000120.1_cds_WP_092665365.1_1 [locus_tag=BMX17_RS24940] [protein=hypothetical protein] [protein_id=WP_092665365.1] [location=(207000..914000)] [gbkey=CDS]

======================================
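If the coordinates themselves are needed rather than the bracketed string, capture groups can return them as integers. The sketch below assumes every location has the simple "(start..end)" form seen in the test input; a location written in any other form would not match this pattern and would be left out. The function name location_coordinates is made up for illustration.

```python
import re

# Captures the two numbers of a simple "(start..end)" location.
SIMPLE_LOCATION = re.compile(r"\[location=\((\d+)\.\.(\d+)\)\]")

def location_coordinates(text):
    """Return a list of (start, end) integer pairs for every simple location in text."""
    return [(int(start), int(end)) for start, end in SIMPLE_LOCATION.findall(text)]

text = (
    ">a [location=(207..914)] [gbkey=CDS]\n"
    ">b [location=(2070..9140)] [gbkey=CDS]\n"
    ">c [gbkey=CDS]\n"  # no location: contributes nothing to the result
)

for start, end in location_coordinates(text):
    # end - start + 1 is the span length if both coordinates are inclusive
    print(start, end, end - start + 1)
```

The three short headers in text are shortened stand-ins for the full headers shown above.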