Parsing GLOBPROT flat text file format
7.7 years ago
User 6777 ▴ 20

Hi all, I have an IUPRED-GLOBPROT flat-text result file as my input. Part of the file looks like this:

# IUPred 
# Copyright (c) Zsuzsanna Dosztanyi, 2005
#
# Z. Dosztanyi, V. Csizmok, P. Tompa and I. Simon
# J. Mol. Biol. (2005) 347, 827-839. 
#
#
# Prediction output 
# NC_179987
Number of globular domains:     1 
          globular domain       1.    1 - 112 
>NC_179987
MSSKQEISKK IISLLNTLPK EKLKHYSSFK DSQIKRFSDL QKVNQISEQD LKLQYIALKN
LCNDKYKRYY ELDDKLLRPK GNPHYYERLM NEINGEKKEN LFSALRTVVF GK
# IUPred 
# Copyright (c) Zsuzsanna Dosztanyi, 2005
#
# Z. Dosztanyi, V. Csizmok, P. Tompa and I. Simon
# J. Mol. Biol. (2005) 347, 827-839. 
#
#
# Prediction output 
# 68476204
Number of globular domains:     0 
>68476204
dledaydkfa iydkvdngsg geeqqpeldp nvnynevtde epseeessed ssddffedep
pkkd
# IUPred 
# Copyright (c) Zsuzsanna Dosztanyi, 2005
#
# Z. Dosztanyi, V. Csizmok, P. Tompa and I. Simon
# J. Mol. Biol. (2005) 347, 827-839. 
#
#
# Prediction output 
# 684723624
Number of globular domains:     3 
          globular domain       1.    267 - 307 
          globular domain       2.    765 - 829 
          globular domain       3.    1141 - 1197 
>684723624
msetkeapkp tkqesqgilk kltsgdtwvs pfrsqaseed pkkkinlykq fkesnkiehi
kv..
# Copyright (c) Zsuzsanna Dosztanyi, 2005
...
...

From this, I want to parse the 'Start - End' positions from the lines that start with "globular domain" for each RefSeq/GI ID (the ID appears on the '>' line below the 'globular domain' lines, and also on the '#' line above the 'Number of globular domains:' line). Records with no globular domains should be skipped. For the above input, the expected output is:

NC_179987: 1 - 112
684723624: 267 - 307, 765 - 829, 1141 - 1197

I have tried:

with open("input.txt") as f:
    first_time = True
    for line in f:
        line = line.rstrip()
        if line.startswith(">"):
            if not first_time:
                if start_ends:
                    print("{}: {}".format(header,", ".join(start_ends)))        
            else:
                first_time = False    
            header = line.lstrip(">")
            start_ends = []
        elif len(line.split()) == 6 and "".join(line.split()[3:]).isnumeric():
            start_ends.append("{}-{}".format(line.split()[3],line.split()[5]))
    if start_ends:
        print("{}: {}".format(header,", ".join(start_ends)))

But I could not get any output.

python • 1.7k views

Is this a different question than the one you just got an answer for or are you trying to come up with a python solution for the same problem?


Thanks for the reply. It's a different file, generated from the IUPred GlobProt result. Previously, the output was generated by a different program. I have tried it in Python, but this script yields no output.


I originally only looked at the expected output but I see the difference now.

7.7 years ago
second_exon ▴ 210

Hope this helps:

with open("dat.txt") as f:
    lis, dic = [], {}
    for line in f:
        line = line.rstrip()
        # The '>' header comes after the domain lines of its record,
        # so the positions collected so far belong to this ID.
        if line.startswith('>') and len(lis) >= 1:
            dic[line.lstrip(">")] = lis
            print(": ".join([line.lstrip(">"), ", ".join(lis)]))
            lis = []
        # Keep "globular domain" lines; skip "Number of globular domains:".
        if 'domain' in line and ':' not in line:
            lis.append(line.split(".")[-1].lstrip(" "))
print(dic)

Output:

NC_179987: 1 - 112
684723624: 267 - 307, 765 - 829, 1141 - 1197
{'684723624': ['267 - 307', '765 - 829', '1141 - 1197'], 'NC_179987': ['1 - 112']}

I'm also storing the results in a dictionary for downstream processing, in case you need it.
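As for why the script in the question prints nothing: on a domain line, `line.split()[3:]` is `['1', '-', '112']`, so the joined string is `"1-112"`, and `str.isnumeric()` returns False for it because of the hyphen. The domain lines also come before the `>` header of their record, so the list has to be flushed at the header rather than reset there. Below is a minimal sketch of the same idea using a regular expression. The names `parse_globprot` and `DOMAIN_RE` and the in-memory `sample` are only illustrative, and the pattern assumes the domain lines look exactly like those in the sample file.

```python
import re

# Matches lines such as "          globular domain       1.    267 - 307".
# Assumption: the layout is the same as in the sample shown in the question.
DOMAIN_RE = re.compile(r"^\s*globular domain\s+\d+\.\s+(\d+)\s*-\s*(\d+)\s*$")


def parse_globprot(lines):
    """Map each record ID to a list of (start, end) integer tuples.

    Assumes the 'globular domain' lines of a record come before its
    '>ID' header line, as in the sample file.
    """
    domains = {}
    pending = []  # domains seen since the previous '>' header
    for line in lines:
        line = line.rstrip()
        match = DOMAIN_RE.match(line)
        if match:
            pending.append((int(match.group(1)), int(match.group(2))))
        elif line.startswith(">"):
            if pending:  # records with zero globular domains are skipped
                domains[line[1:].strip()] = pending
            pending = []
    return domains


# Shortened copy of the sample input, kept in memory so the sketch is
# self-contained; with a real file, pass the open file object instead.
sample = """\
# Prediction output
# NC_179987
Number of globular domains:     1
          globular domain       1.    1 - 112
>NC_179987
MSSKQEISKK IISLLNTLPK
# 68476204
Number of globular domains:     0
>68476204
dledaydkfa iydkvdngsg
# 684723624
Number of globular domains:     3
          globular domain       1.    267 - 307
          globular domain       2.    765 - 829
          globular domain       3.    1141 - 1197
>684723624
msetkeapkp tkqesqgilk
""".splitlines()

result = parse_globprot(sample)
for seq_id, spans in result.items():
    print("{}: {}".format(seq_id, ", ".join("{} - {}".format(s, e) for s, e in spans)))
```

On Python 3.7 or later, where dictionaries keep insertion order, this prints `NC_179987: 1 - 112` followed by `684723624: 267 - 307, 765 - 829, 1141 - 1197`. Storing the positions as integers rather than strings makes later steps, such as computing domain lengths, a little easier.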


Thank you very much, Sir.
