Question

Parsing complex file for extraction of number range

0

7.5 years ago

User 6777 ▴ 20

I have a large file with three tab-separated data columns (and some repeated header lines), like this:

Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...

# ../Output/Split_Seq/NP_416485.4.fasta - gap penalty: 1 - normalized: False
# align_column_number   score   column

0   0.66627 ------MMMMM
1   -1000.00000 -----S-GGGG
2   0.66627 --MMMF-FFFC
3   0.71962 MMAAAF-CYYY
4   0.43673 SSTTTN-TAAT
5   -1000.00000 HRKKKT-GRRR
6   0.61010 YFKKKL-TTTT
7   0.75691 K-RRRT-RRRR
8   0.63134 T-SSSV-HHHH
Sequence ../Output/yy\Programs\YP_026226.4 alignment. Using default output format....

# ../Output/Split_Seq/YP_026226.4.fasta - gap penalty: 1 - normalized: False
# align_column_number   score   column

0   0.91889 MMMMMM
1   0.85379 RRRRRR
2   0.55095 -YTTTH
3   -1000.00000 -L---A
4   -1000.00000 -A---F
5   -1000.00000 AG---L
6   -1000.00000 IM---P
7   -1000.00000 -----A

For every value in the second data column (i.e., score) that is greater than 0.5, I want to extract the corresponding number from the first column, collapsing consecutive numbers into ranges.

For the above input, the output would be:

NP_416485.4: 1, 3-4, 7-9
YP_026226.4: 1-3

Here, "NP_416485.4" and "YP_026226.4" come from the header descriptor (the part after \Programs). (Note that the actual values for "NP_416485.4", for example, would be "NP_416485.4: 0, 2-3, 6-8", but I increased all of them by 1 because I don't want the numbering to start at 0.)

Please help me. How can I generate the desired output? Thanks.

perl python • 1.8k views

ADD COMMENT • link updated 7.5 years ago by Eric Lim ★ 2.1k • written 7.5 years ago by User 6777 ▴ 20

score 1 · Answer 1 · 2016-10-28

1

7.5 years ago

Eric Lim ★ 2.1k

Why exactly do you need the output to be in the proposed format?

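import csv
import itertools
import operator

with open('test.txt', 'r') as fin:
  # Skip '#' comment lines; keep only three-field data rows (drops blank and 'Sequence' lines)
  reader = csv.reader(filter(lambda row: row[0] != '#', fin), delimiter='\t')
  lines = [int(r[0]) for r in reader if len(r) == 3 and float(r[1]) > 0.5]
  # Consecutive numbers share the same (index - value) difference, so groupby puts them in one run
  for k, g in itertools.groupby(enumerate(lines), lambda x: x[0] - x[1]):
    group = list(map(operator.itemgetter(1), g))
    print(group)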

The code snippet above doesn't fully do what you asked, but it should point you in the right direction.
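One way the remaining pieces (per-sequence IDs, the +1 offset, and the "a-b" range formatting) might be filled in is sketched below. This is only an illustration, not the original answer's code: the function names `collapse`, `passing_columns`, and `format_report` are made up for this example, and it assumes that each block starts with a "Sequence ..." line whose ID sits between "\Programs\" and " alignment", and that data rows are whitespace-separated.

```python
import re
from itertools import groupby

# Sample copied from the question (columns separated by whitespace here).
SAMPLE = r"""Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...

# ../Output/Split_Seq/NP_416485.4.fasta - gap penalty: 1 - normalized: False
# align_column_number   score   column

0   0.66627 ------MMMMM
1   -1000.00000 -----S-GGGG
2   0.66627 --MMMF-FFFC
3   0.71962 MMAAAF-CYYY
4   0.43673 SSTTTN-TAAT
5   -1000.00000 HRKKKT-GRRR
6   0.61010 YFKKKL-TTTT
7   0.75691 K-RRRT-RRRR
8   0.63134 T-SSSV-HHHH
Sequence ../Output/yy\Programs\YP_026226.4 alignment. Using default output format....

# ../Output/Split_Seq/YP_026226.4.fasta - gap penalty: 1 - normalized: False
# align_column_number   score   column

0   0.91889 MMMMMM
1   0.85379 RRRRRR
2   0.55095 -YTTTH
3   -1000.00000 -L---A
"""


def collapse(numbers):
    """Collapse sorted integers into strings such as '1' or '3-4'."""
    parts = []
    # Consecutive values share the same (position - value) difference.
    for _, grp in groupby(enumerate(numbers), key=lambda p: p[0] - p[1]):
        run = [value for _, value in grp]
        parts.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
    return parts


def passing_columns(lines, threshold=0.5, offset=1):
    """Map each sequence ID to the column numbers (plus offset) scoring above threshold."""
    results = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment header
        if line.startswith("Sequence"):
            # Assumed header layout: ...\Programs\<ID> alignment. ...
            match = re.search(r"Programs\\(\S+)\s+alignment", line)
            current = match.group(1) if match else line
            results[current] = []
            continue
        fields = line.split()
        if current is not None and len(fields) >= 2 and float(fields[1]) > threshold:
            results[current].append(int(fields[0]) + offset)
    return results


def format_report(results):
    """Render one 'ID: ranges' line per sequence, in input order."""
    return [f"{name}: {', '.join(collapse(cols))}" for name, cols in results.items()]


for report_line in format_report(passing_columns(SAMPLE.splitlines())):
    print(report_line)
# NP_416485.4: 1, 3-4, 7-9
# YP_026226.4: 1-3
```

For a real file, an open file handle can be passed in place of `SAMPLE.splitlines()`, since `passing_columns` only iterates over lines.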

ADD COMMENT • link 7.5 years ago by Eric Lim ★ 2.1k

0

Thanks khericlim. To start with, I used the Python csv module as follows:

import csv

with open('test.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = float(row[1])
        if count > 0.5:
            csvout.writerows([row[0:1] for _ in xrange(count)])

but it gives:

csvout.writerows([row [0:1] for _ in xrange(count)])
TypeError: integer argument expected, got float

Please help. Thanks.

ADD REPLY • link 7.5 years ago by User 6777 ▴ 20

0

xrange, like range, takes integers, but you're passing it a float: count is the result of float(row[1]).
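Since each qualifying row only needs to be written once, the xrange call can probably be dropped altogether. Below is a minimal Python 3 sketch of that idea; the function name `write_passing_rows` is made up for illustration, and it assumes that only rows with exactly three tab-separated fields and a numeric second field are data rows.

```python
import csv
import io


def write_passing_rows(tsv_in, csv_out, threshold=0.5):
    """Write the first column of every data row whose score exceeds threshold."""
    reader = csv.reader(tsv_in, delimiter="\t")
    writer = csv.writer(csv_out)
    written = 0
    for row in reader:
        # Skip blank lines, '#' comments, and 'Sequence ...' headers.
        if len(row) != 3 or row[0].startswith("#"):
            continue
        try:
            score = float(row[1])
        except ValueError:
            continue  # second field is not a number, so not a data row
        if score > threshold:
            writer.writerow(row[0:1])  # one output row per hit, no range() needed
            written += 1
    return written


# Small in-memory demonstration using two data rows from the question.
demo_in = io.StringIO(
    "# align_column_number\tscore\tcolumn\n"
    "\n"
    "0\t0.66627\t------MMMMM\n"
    "1\t-1000.00000\t-----S-GGGG\n"
)
demo_out = io.StringIO()
count = write_passing_rows(demo_in, demo_out)
print(count, demo_out.getvalue().split())
```

With real files in Python 3, the csv module documentation recommends opening them in text mode with newline='' (for example open('test.txt', newline='')) rather than 'rb' and 'wb'.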

ADD REPLY • link 7.5 years ago by Eric Lim ★ 2.1k

score 0 · Answer 2 · 2016-10-28

0

7.5 years ago

zhangz.sci • 0

Do you have any programming experience? If not, you should ask a colleague to write you a script that does what you want.

ADD COMMENT • link 7.5 years ago by zhangz.sci • 0