Question

Using a loop to extract data from a text file and write it to a new file

0


20 months ago

matt81rd ▴ 10

Hi, I need to extract a certain portion of a file I have just created and write it to a new file. I need to loop through the file, as there are many data points to extract.

I need to extract the information under the name column, e.g. 916830_H20130029501-2. I know sed, awk or grep are probably the best tools for this, but I am unsure what the pattern would look like given the layout of the input file below:

H194880489
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-----------------------+-----+-----+-----+-----+-----+------+------
 745 | 882730_H19488048901-2 | 638 | 597 | 325 | 300 | 153 |   93 |   93
 715 | 850922_H19488048901-2 | 638 | 597 | 325 | 300 | 153 |   93 |   93
(2 rows)

H194660490
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-----------------------+-----+-----+-----+-----+-----+------+------
 709 | 842927_H19466049001-2 | 632 | 592 | 559 | 233 |   6 |    6 |    6
(1 row)

H194620465
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-----------------------+-----+-----+-----+-----+-----+------+------
 707 | 841499_H19462046501-1 | 630 | 590 | 557 | 486 | 378 |  186 |   68
(1 row)

H194420367
 id  |          name           | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-------------------------+-----+-----+-----+-----+-----+------+------
 703 | 833390_H19442036701-2   | 626 | 587 | 555 | 484 | 312 |   36 |   19
 739 | 882806_H19442036703-2   | 653 | 587 | 555 | 484 | 312 |   36 |   19
 756 | 882806_H19442036703_v-1 | 653 | 587 | 555 | 484 | 312 |   36 |   19
(3 rows)

Sometimes a data point has no information to extract, and sometimes, as you can see above, it has two or even three entries under name. In those cases I only need the first one.

The format of the output would look something like this:

882730_H19488048901-2
842927_H19466049001-2
841499_H19462046501-1
833390_H19442036701-2

Any help will be greatly appreciated :)

sed grep awk • 600 views

updated 20 months ago by Joe 21k • written 20 months ago by matt81rd ▴ 10

1


$ awk 'FNR == 3' file.txt | sed 's/ //g' | cut -d '|' -f 2

awk: print only the 3rd line of the file
sed: remove all blanks so that cut can split cleanly on |
cut: take the 2nd |-delimited field, i.e. the name
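Note that FNR == 3 selects a single line per file, so the one-liner above returns at most one value, and in the sample as posted the first data row of a block sits one line below the dashed separator. A minimal sketch of a variant that handles every block in one pass, assuming the layout shown in the question: treat the dashed separator as the marker and print the name from the line right after it. The file names input.txt and names.txt are placeholders, and the "(0 rows)" block with the ID H000000000 is a made-up example of an empty data point, since the question does not show what one looks like.

```shell
# Sample input: two blocks copied from the question, plus one hypothetical
# empty block (H000000000 is an assumed ID, "(0 rows)" an assumed format).
cat > input.txt <<'EOF'
H194880489
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250
-----+-----------------------+-----+-----+-----+-----+-----+------+------
 745 | 882730_H19488048901-2 | 638 | 597 | 325 | 300 | 153 |   93 |   93
 715 | 850922_H19488048901-2 | 638 | 597 | 325 | 300 | 153 |   93 |   93
(2 rows)

H000000000
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250
-----+-----------------------+-----+-----+-----+-----+-----+------+------
(0 rows)

H194420367
 id  |          name           | t0  | t5  | t10 | t25 | t50 | t100 | t250
-----+-------------------------+-----+-----+-----+-----+-----+------+------
 703 | 833390_H19442036701-2   | 626 | 587 | 555 | 484 | 312 |   36 |   19
 739 | 882806_H19442036703-2   | 653 | 587 | 555 | 484 | 312 |   36 |   19
(3 rows)
EOF

awk -F'|' '
  /^-+[+]/ { want = 1; next }      # dashed separator: the next line may be the first data row
  want {
      want = 0                     # only the line directly after the separator is considered
      if (NF > 1) {                # lines without "|" (e.g. "(0 rows)") are skipped
          gsub(/^ +| +$/, "", $2)  # trim the padding around the name
          print $2
      }
  }
' input.txt > names.txt

cat names.txt
```

Because awk reads the file line by line, no explicit shell loop is needed. Only the first row of each block is printed, and blocks with no rows produce no output.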

20 months ago by shenwei356 8.4k

0


Are the files tab-delimited, or are they literally formatted with |, - and + characters as shown?

You wrote: "from a file i have just created"

Since you create the files yourself, you could write them directly in whatever format you need, including just the first name that appears in each block.

20 months ago by shenwei356 8.4k

score 0 · Answer 1 · 2022-08-25

0


20 months ago

Joe 21k

I'd suggest not using sed etc. for this task.

You could apply the approach here and robustly tabulate the whole file with pandas.read_fwf:

https://github.com/jrjhealey/bioinfo-tools/blob/master/tabulateHHpred.py#L41-L55
