Parsing complex file for extraction of number range
7.5 years ago
User 6777 ▴ 20

I have a large file with three tab-separated data columns (plus some repeated header lines), like this:

Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...

# ../Output/Split_Seq/NP_416485.4.fasta - gap penalty: 1 - normalized: False
# align_column_number   score   column

0   0.66627 ------MMMMM
1   -1000.00000 -----S-GGGG
2   0.66627 --MMMF-FFFC
3   0.71962 MMAAAF-CYYY
4   0.43673 SSTTTN-TAAT
5   -1000.00000 HRKKKT-GRRR
6   0.61010 YFKKKL-TTTT
7   0.75691 K-RRRT-RRRR
8   0.63134 T-SSSV-HHHH
Sequence ../Output/yy\Programs\YP_026226.4 alignment. Using default output format....

# ../Output/Split_Seq/YP_026226.4.fasta - gap penalty: 1 - normalized: False
# align_column_number   score   column

0   0.91889 MMMMMM
1   0.85379 RRRRRR
2   0.55095 -YTTTH
3   -1000.00000 -L---A
4   -1000.00000 -A---F
5   -1000.00000 AG---L
6   -1000.00000 IM---P
7   -1000.00000 -----A

From the second data column (i.e., the score), for every value greater than 0.5, I want to extract the corresponding first-column number (or the range, when the numbers are consecutive).

For the above Input, the output would be:

NP_416485.4: 1, 3-4, 7-9
YP_026226.4: 1-3

Here, "NP_416485.4" and "YP_026226.4" come from the header line (the part after \Programs). Note that the actual values for "NP_416485.4", for example, would be "NP_416485.4: 0, 2-3, 6-8", but I have increased all of them by 1 because I don't want the numbering to start at 0.

Please help me. How can I generate the desired output? Thanks.

perl python • 1.8k views
7.5 years ago
Eric Lim ★ 2.1k

Why exactly do you need the output to be in the proposed format?

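import csv
import itertools
import operator

with open('test.txt', 'r') as fin:
    # Drop '#' comment lines here; blank and "Sequence ..." lines are skipped below (they lack three fields)
    reader = csv.reader(filter(lambda row: row[0] != '#', fin), delimiter='\t')
    lines = [int(r[0]) for r in reader if len(r) == 3 and float(r[1]) > 0.5]
    # Consecutive numbers share the same (index - value) difference, so they fall into one group
    for k, g in itertools.groupby(enumerate(lines), lambda x: x[0] - x[1]):
        group = list(map(operator.itemgetter(1), g))
        print(group)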

The code snippet above doesn't fully do what you asked, but it should point you in the right direction.
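
For what it's worth, here is a fuller sketch along the same lines that also tracks the sequence name and formats the ranges. It is only a sketch under some assumptions: the function names (`to_ranges`, `summarize`, `report`) are made up for illustration, the header pattern is inferred from the two "Sequence ..." lines in the question, and any line that does not look like "integer, float, text" is simply ignored.

```python
import re
from itertools import groupby

# Assumed header shape: "Sequence <path><separator><ID> alignment. ..."
# The ID is taken as the last path component (after the final / or \).
HEADER = re.compile(r"^Sequence\s+\S*[\\/](\S+)\s+alignment")

def to_ranges(numbers):
    """Collapse sorted integers into strings such as '1' or '3-4'."""
    out = []
    # Within a run of consecutive integers, value - index stays constant
    for _, grp in groupby(enumerate(numbers), lambda p: p[1] - p[0]):
        run = [n for _, n in grp]
        out.append(str(run[0]) if len(run) == 1 else "%d-%d" % (run[0], run[-1]))
    return out

def summarize(lines, threshold=0.5, offset=1):
    """Return (sequence_id, [range strings]) pairs in input order."""
    result = []
    current = None
    for line in lines:
        m = HEADER.match(line)
        if m:
            current = []
            result.append((m.group(1), current))
            continue
        fields = line.split()
        if current is None or len(fields) != 3:
            continue  # blank lines, '#' comment lines, anything before the first header
        try:
            col, score = int(fields[0]), float(fields[1])
        except ValueError:
            continue  # not a data row
        if score > threshold:
            current.append(col + offset)  # offset=1 gives the 1-based numbering asked for
    return [(name, to_ranges(nums)) for name, nums in result]

def report(lines, **kwargs):
    """Format each sequence as 'ID: r1, r2, ...'."""
    return ["%s: %s" % (name, ", ".join(r)) for name, r in summarize(lines, **kwargs)]
```

Passing an open file handle for the input in the question to `report` and printing each returned string should give the two lines from the desired output. Set `offset=0` to keep the original 0-based column numbers.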

Thanks khericlim. To start with, I used the Python csv module as follows:

import csv

with open('test.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = float(row[1])
        if count > 0.5:
            csvout.writerows([row[0:1] for _ in xrange(count)])

but it gives:

csvout.writerows([row [0:1] for _ in xrange(count)])
TypeError: integer argument expected, got float

Please help. Thanks.

xrange, like range, takes an integer, but you're passing it a float (count is float(row[1])).
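
If the goal of that snippet is just to write out the first-column value of each row that passes the threshold, the `xrange` loop is probably not needed at all. A minimal Python 3 sketch, assuming that reading of the intent (the function name `write_passing_columns` is hypothetical, and rows that are not three-field data rows are skipped):

```python
import csv

def write_passing_columns(src_path, dst_path, threshold=0.5):
    """Write the first-column value of every data row whose score exceeds threshold."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(src_path, newline="") as tsvin, open(dst_path, "w", newline="") as csvout:
        writer = csv.writer(csvout)
        for row in csv.reader(tsvin, delimiter="\t"):
            if len(row) != 3 or row[0].startswith("#"):
                continue  # blank lines, "Sequence ..." lines, '#' header lines
            try:
                score = float(row[1])
            except ValueError:
                continue  # second field is not a number
            if score > threshold:
                writer.writerow(row[0:1])  # one output row per passing input row
```

This writes one number per line and does not yet group them by sequence or collapse them into ranges.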

7.5 years ago
zhangz.sci • 0

Do you have any programming experience? If not, you should ask a colleague to give you a script that does what you want.
