Question

code for text file parsing

7.2 years ago

sharmatina189059 ▴ 110

Dear all, can anyone tell me how to parse this type of file using either Perl or awk?

AUO97_RS0005
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

I want to keep only the lines that have data in the columns after the first, and remove duplicates. Output file:

AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

gene • 2.0k views

updated 7.2 years ago by cpad0112 21k • written 7.2 years ago by sharmatina189059 ▴ 110

Your formatting makes any effort on our side impossible. Please put the example into code blocks to preserve new lines and other formatting.

7.2 years ago by kloetzl ★ 1.1k

AUO97_RS0005 
AUO97_RS0005 alpha hydrolase wp_567465 GI:54365463 
AUO97_RS0007 
AUO97_RS0007 beta hydrolase wp_567465 GI:65456475 
AUO97_RS0020
AUO97_RS0020 AUO97_RS0020 gamma hydrolase wp_567465 GI:4536473

updated 7.2 years ago by GenoMax 147k • written 7.2 years ago by sharmatina189059 ▴ 110

Your question is unanswerable as written.

7.2 years ago by Alex Reynolds 35k

The file looks like this:

AUO97_RS0005                
AUO97_RS0005    alpha   hydrolase   wp_567465   GI:54365463
AUO97_RS0007                
AUO97_RS0007    beta    hydrolase   wp_567465   GI:65456475
AUO97_RS0020                
AUO97_RS0020    gamma   hydrolase   wp_567465   GI:4536473

updated 7.2 years ago by GenoMax 147k • written 7.2 years ago by sharmatina189059 ▴ 110

score 2 · Answer 1 · 2017-08-27

7.2 years ago

EagleEye 7.6k

How is the second line in the example above separated (space or TAB)?

If it is TAB-delimited, you can extract the lines that contain a TAB:

grep -P "\t" YOUR_FILE.txt > DESIRED_OUTPUT.txt

If the single-column entries in the example above also end in a single TAB, require a second TAB on the line:

grep -P "\t.*\t" YOUR_FILE.txt > DESIRED_OUTPUT.txt
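
Which of the two patterns applies depends on whitespace that is invisible in a terminal. One way to check is sed's `l` command, which prints each TAB as \t and marks the end of every line with $. A minimal sketch follows; the two-line sample built with printf is an assumption standing in for YOUR_FILE.txt:

```shell
# Build a small sample: a single-column line with a trailing TAB,
# followed by a TAB-delimited multi-column line (stand-in data).
sample=$(mktemp)
printf 'AUO97_RS0005\t\nAUO97_RS0005\talpha\thydrolase\n' > "$sample"

# 'l' prints each line unambiguously: TAB appears as \t, end of line as $
shown=$(sed -n 'l' "$sample")
printf '%s\n' "$shown"
# AUO97_RS0005\t$
# AUO97_RS0005\talpha\thydrolase$

rm -f "$sample"
```

If the single-column lines turn out to carry several trailing TABs, neither grep pattern above separates them from the data lines, and a field-based test is the safer choice.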

It is tab separated, but I have done it using awk:

awk -F '\t' '$2!=""' file2

Thank you so much for your reply.

7.2 years ago by sharmatina189059 ▴ 110

score 2 · Answer 2 · 2017-08-28

7.2 years ago

Alex Reynolds 35k

I suspect this will generally be fastest on larger inputs, since it avoids the cat, sort and uniq operations:

$ awk 'NF > 1 && !a[$0]++' in.txt > answer.txt

Could you please explain what

&& !a[$0]++

does?

7.2 years ago by e.rempel ★ 1.1k

It strips duplicate lines without needing to sort the input. It can be memory hungry, but memory is cheap and fast, these days. If you can do without sorting, go for it, I say.

7.2 years ago by Alex Reynolds 35k
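
To make the idiom concrete: a[$0] is an awk array entry keyed on the whole line, and it starts out empty (zero). The post-increment returns the old count, so !a[$0]++ is true only the first time a given line appears. Here is a sketch of the same filter in long form, run on a few rows from the question with one row deliberately repeated. The array name seen and the temp file are illustrative, not part of the original answer:

```shell
# Sample input: rows from the question, with the alpha row duplicated
sample=$(mktemp)
printf 'AUO97_RS0005\nAUO97_RS0005\talpha\thydrolase\nAUO97_RS0005\talpha\thydrolase\nAUO97_RS0007\nAUO97_RS0007\tbeta\thydrolase\n' > "$sample"

# Compact form from the answer
short=$(awk 'NF > 1 && !a[$0]++' "$sample")

# Long form of the same logic
long=$(awk '
NF > 1 {                  # skip lines with only one field
    if (!($0 in seen)) {  # first time this exact line is seen
        seen[$0] = 1      # remember it
        print
    }
}' "$sample")

printf '%s\n' "$long"
rm -f "$sample"
```

Both forms keep the first occurrence of each line and preserve the input order, whereas a sort-based pipeline reorders the output.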

score 1 · Answer 3 · 2017-08-28

Full bash solution, just for the sake of it!

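#!/bin/bash
# Usage: bash scriptname.sh textfile.tsv
# Note: this drops the single-column lines but does not remove duplicates.

# IFS= and -r keep each line intact (no trimming, no backslash mangling)
while IFS= read -r line ; do
    # Split the line on whitespace into an array
    read -r -a fields <<< "$line"
    # Print the line only if a second field exists
    [ -n "${fields[1]}" ] && printf '%s\n' "$line"
done < "$1"

Or as a one-liner:

while IFS= read -r line ; do read -r -a fields <<< "$line" ; [ -n "${fields[1]}" ] && printf '%s\n' "$line" ; done < textfile.tsv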
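
The loop above only filters. The question also asks for duplicates to be removed, which pure bash can do with an associative array. The following is a sketch under assumptions, not part of the original answer: it needs bash 4 or later for declare -A, and the sample data, the seen array and the temp files are illustrative.

```shell
#!/bin/bash
# Sample input: rows from the question, with the alpha row duplicated
sample=$(mktemp)
filtered=$(mktemp)
printf 'AUO97_RS0005\nAUO97_RS0005\talpha\thydrolase\nAUO97_RS0005\talpha\thydrolase\nAUO97_RS0007\nAUO97_RS0007\tbeta\thydrolase\n' > "$sample"

declare -A seen   # associative array keyed on the full line (bash 4+)

while IFS= read -r line; do
    read -r -a fields <<< "$line"
    # Keep lines with at least two fields that have not been printed yet
    if [ "${#fields[@]}" -gt 1 ] && [ -z "${seen[$line]+x}" ]; then
        seen[$line]=1
        printf '%s\n' "$line"
    fi
done < "$sample" > "$filtered"

cat "$filtered"
rm -f "$sample"
```

For large files the awk one-liner is likely to be much faster, since a bash read loop processes one line per iteration in the shell itself.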

score 0 · Answer 4 · 2017-08-28

7.2 years ago

kloetzl ★ 1.1k

cat file.txt | awk 'NF > 1' | sort | uniq

score 0 · Answer 5 · 2017-08-28

$ cat test.txt 
AUO97_RS0005
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

Output, awk solution:

$ awk '$2 == "" {next} {print}' test.txt
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

Non-awk solution:

$ grep -i GI test.txt 
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473
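
One caveat with grep -i GI: it matches those two letters anywhere on the line and in either case, so an identifier that happens to contain them would be kept even if it has no other columns. Below is a hedged sketch of a stricter pattern, assuming the last column always has the form GI: followed by digits, as in the example. The REGION_0001 identifier is made up to illustrate the false match:

```shell
# Sample input: a made-up single-column ID containing "GI",
# plus one single-column line and one annotated row from the question
sample=$(mktemp)
printf 'REGION_0001\nAUO97_RS0005\nAUO97_RS0005\talpha\thydrolase\twp_567465\tGI:54365463\n' > "$sample"

loose=$(grep -i GI "$sample")              # also keeps REGION_0001
strict=$(grep -E 'GI:[0-9]+$' "$sample")   # keeps only the annotated row

printf '%s\n' "$strict"
rm -f "$sample"
```

Neither grep form removes duplicate lines, so it would still need to be combined with a de-duplication step.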