Question

Parsing GLOBPROT flat text file format

7.7 years ago

User 6777 ▴ 20

Hi all, I have an IUPred-GlobProt flat-text result file as my input. Part of the file looks like this:

# IUPred 
# Copyright (c) Zsuzsanna Dosztanyi, 2005
#
# Z. Dosztanyi, V. Csizmok, P. Tompa and I. Simon
# J. Mol. Biol. (2005) 347, 827-839. 
#
#
# Prediction output 
# NC_179987
Number of globular domains:     1 
          globular domain       1.    1 - 112 
>NC_179987
MSSKQEISKK IISLLNTLPK EKLKHYSSFK DSQIKRFSDL QKVNQISEQD LKLQYIALKN
LCNDKYKRYY ELDDKLLRPK GNPHYYERLM NEINGEKKEN LFSALRTVVF GK
# IUPred 
# Copyright (c) Zsuzsanna Dosztanyi, 2005
#
# Z. Dosztanyi, V. Csizmok, P. Tompa and I. Simon
# J. Mol. Biol. (2005) 347, 827-839. 
#
#
# Prediction output 
# 68476204
Number of globular domains:     0 
>68476204
dledaydkfa iydkvdngsg geeqqpeldp nvnynevtde epseeessed ssddffedep
pkkd
# IUPred 
# Copyright (c) Zsuzsanna Dosztanyi, 2005
#
# Z. Dosztanyi, V. Csizmok, P. Tompa and I. Simon
# J. Mol. Biol. (2005) 347, 827-839. 
#
#
# Prediction output 
# 684723624
Number of globular domains:     3 
          globular domain       1.    267 - 307 
          globular domain       2.    765 - 829 
          globular domain       3.    1141 - 1197 
>684723624
msetkeapkp tkqesqgilk kltsgdtwvs pfrsqaseed pkkkinlykq fkesnkiehi
kv..
# Copyright (c) Zsuzsanna Dosztanyi, 2005
...
...

From this, I want to extract the start and end positions from the lines that start with "globular domain" for each RefSeq/GI ID (the ID appears on the '>' line just below the 'globular domain' lines, and on the comment line just above the 'Number of globular domains:' line). For the above input, the output should be:

NC_179987: 1 - 112
684723624: 267 - 307, 765 - 829, 1141 - 1197

I have tried:

with open("input.txt") as f:
    first_time = True
    for line in f:
        line = line.rstrip()
        if line.startswith(">"):
            if not first_time:
                if start_ends:
                    print("{}: {}".format(header,", ".join(start_ends)))        
            else:
                first_time = False    
            header = line.lstrip(">")
            start_ends = []
        elif len(line.split()) == 6 and "".join(line.split()[3:]).isnumeric():
            start_ends.append("{}-{}".format(line.split()[3],line.split()[5]))
    if start_ends:
        print("{}: {}".format(header,", ".join(start_ends)))

But I could not get any output.

python • 1.7k views

Written 7.7 years ago by User 6777 • updated 13 months ago by Ram

Is this a different question than the one you just got an answer for or are you trying to come up with a python solution for the same problem?

Reply • 7.7 years ago by GenoMax

Thanks for the reply. It's a different file, generated from an IUPred-GlobProt result; the previous output was generated by a different program. I have tried this in Python, but the script yields no output.

Reply • 7.7 years ago by User 6777

I originally only looked at the expected output but I see the difference now.

Reply • 7.7 years ago by GenoMax

score 2 · Accepted Answer · 2016-08-30

Hope this helps:

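with open("dat.txt") as f:
    # lis: ranges collected for the current record; dic: all records keyed by ID
    lis, dic = [], {}
    for line in f:
        line = line.rstrip()
        # The ">ID" line comes after its "globular domain" lines, so it closes the record
        if line.startswith(">") and len(lis) >= 1:
            dic[line.lstrip(">")] = lis
            print(": ".join([line.lstrip(">"), ", ".join(lis)]))
            lis = []
        # Match "globular domain  N.  start - end" but not "Number of globular domains:"
        if "domain" in line and ":" not in line:
            lis.append(line.split(".")[-1].lstrip(" "))
print(dic)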

Output:

NC_179987: 1 - 112
684723624: 267 - 307, 765 - 829, 1141 - 1197
{'684723624': ['267 - 307', '765 - 829', '1141 - 1197'], 'NC_179987': ['1 - 112']}

I'm also storing the ranges in a dictionary for downstream processing, in case you need them!
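
As for why the attempt in the question prints nothing: `"".join(line.split()[3:])` produces a string such as `"1-112"`, and `str.isnumeric()` returns False whenever a hyphen is present, so the `elif` branch never runs and `start_ends` stays empty. If you would rather have the positions as integers, here is a minimal regex-based sketch of the same idea. The names `parse_globprot` and `DOMAIN_RE` are made up for illustration, and it assumes every record lists its "globular domain" lines before the ">ID" line, as in the excerpt above.

```python
import re

# Hypothetical pattern for lines like "   globular domain   2.   765 - 829"
DOMAIN_RE = re.compile(r"^\s*globular domain\s+\d+\.\s+(\d+)\s*-\s*(\d+)\s*$")


def parse_globprot(lines):
    """Return {record_id: [(start, end), ...]} for records with at least one domain."""
    domains_by_id = {}
    pending = []  # ranges seen since the previous ">ID" line
    for line in lines:
        match = DOMAIN_RE.match(line)
        if match:
            pending.append((int(match.group(1)), int(match.group(2))))
        elif line.startswith(">"):
            # The ID line follows its domain lines, so it closes the record
            if pending:
                domains_by_id[line[1:].strip()] = pending
            pending = []
    return domains_by_id


# Small demo using lines taken from the excerpt in the question
sample = [
    "Number of globular domains:     1 ",
    "          globular domain       1.    1 - 112 ",
    ">NC_179987",
    "Number of globular domains:     0 ",
    ">68476204",
    "Number of globular domains:     3 ",
    "          globular domain       1.    267 - 307 ",
    "          globular domain       2.    765 - 829 ",
    "          globular domain       3.    1141 - 1197 ",
    ">684723624",
]
result = parse_globprot(sample)
for seq_id, ranges in result.items():
    print("{}: {}".format(seq_id, ", ".join("{} - {}".format(s, e) for s, e in ranges)))
```

For a real file you could pass the open file object directly, e.g. `parse_globprot(open("dat.txt"))`, since the function only iterates over lines. Records with zero domains are left out of the result, matching the expected output in the question.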