code for text file parsing
7.3 years ago

Dear all, can anyone tell me how to parse this type of file using either Perl or awk?

AUO97_RS0005
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

I want to retrieve only those lines that have data in the following columns, and remove duplicates. Output file:

AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

Your formatting makes any effort on our side impossible. Please put the example into code blocks to preserve new lines and other formatting.

AUO97_RS0005 
AUO97_RS0005 alpha hydrolase wp_567465 GI:54365463 
AUO97_RS0007 
AUO97_RS0007 beta hydrolase wp_567465 GI:65456475 
AUO97_RS0020
AUO97_RS0020 AUO97_RS0020 gamma hydrolase wp_567465 GI:4536473

Your question is unanswerable as written.


The file is like this:

AUO97_RS0005                
AUO97_RS0005    alpha   hydrolase   wp_567465   GI:54365463
AUO97_RS0007                
AUO97_RS0007    beta    hydrolase   wp_567465   GI:65456475
AUO97_RS0020                
AUO97_RS0020    gamma   hydrolase   wp_567465   GI:4536473
7.3 years ago
EagleEye 7.6k

How is the second line of the example above separated (space or TAB)?

If TAB delimited, you can extract the lines containing TAB delimited entries.

grep -P "\t" YOUR_FILE.txt > DESIRED_OUTPUT.txt

If the single-column entries in the above example also have a single TAB at the end, require at least two TABs per line:

grep -P "\t.*\t" YOUR_FILE.txt > DESIRED_OUTPUT.txt
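
If the single-column lines are padded with several trailing TABs, `\t.*\t` would match them as well. The following is a minimal sketch of a stricter filter, assuming "has data" means that some non-whitespace character appears after a TAB. The sample contents and the function name `keep_rows_with_data` are made up for illustration. It uses a POSIX bracket expression rather than `-P`, which is a GNU grep extension.

```shell
#!/bin/sh
# Hypothetical sample: single-column lines padded with trailing TABs.
tab=$(printf '\t')
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf 'AUO97_RS0005\t\t\t\t\n' > "$sample"
printf 'AUO97_RS0005\talpha\thydrolase\twp_567465\tGI:54365463\n' >> "$sample"
printf 'AUO97_RS0007\t\t\t\t\n' >> "$sample"
printf 'AUO97_RS0007\tbeta\thydrolase\twp_567465\tGI:65456475\n' >> "$sample"

# Keep lines where a TAB is followed, somewhere later on the line,
# by a non-whitespace character.
keep_rows_with_data() {
    grep "${tab}.*[^[:space:]]" "$1"
}

keep_rows_with_data "$sample"
```

On this sample only the two five-column lines are printed; the padded ID-only lines are dropped.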

It is tab-separated, but I have done it using awk:

awk -F '\t' '$2!=""' file2

Thank you so much for your reply.

7.3 years ago

I suspect this will generally be fastest on larger inputs, since it avoids the cat, sort and uniq operations:

$ awk 'NF > 1 && !a[$0]++' in.txt > answer.txt

Could you please explain what

&& !a[$0]++

does?


It strips duplicate lines without needing to sort the input. It can be memory hungry, but memory is cheap and fast, these days. If you can do without sorting, go for it, I say.
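
To make the mechanism concrete, here is a small sketch of the same one-liner on made-up input. The array is renamed from `a` to `seen` purely for readability, and the function name `dedupe_in_order` and the sample lines are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical input with one repeated data line.
input=$(mktemp)
trap 'rm -f "$input"' EXIT
printf '%s\n' \
    'AUO97_RS0007' \
    'AUO97_RS0007 beta hydrolase' \
    'AUO97_RS0005' \
    'AUO97_RS0005 alpha hydrolase' \
    'AUO97_RS0007 beta hydrolase' > "$input"

# seen[$0] counts how often the current line has appeared so far.
# The post-increment yields the old count (0 the first time), so
# !seen[$0]++ is true only for the first occurrence of each line.
# NF > 1 additionally drops lines with a single field.
dedupe_in_order() {
    awk 'NF > 1 && !seen[$0]++' "$1"
}

dedupe_in_order "$input"
# Prints:
# AUO97_RS0007 beta hydrolase
# AUO97_RS0005 alpha hydrolase
```

Unlike a `sort | uniq` pipeline, this keeps the surviving lines in their original input order, at the cost of holding every distinct line in memory.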

7.3 years ago
Joe 21k

Full bash solution, just for the sake of it!

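#!/bin/bash
# Usage: bash scriptname.sh textfile.tsv

# Print every line that has a non-empty second whitespace-separated field.
# IFS= and -r keep leading whitespace and backslashes in the line intact.
while IFS= read -r line ; do
    read -r -a fields <<< "$line"
    [ -n "${fields[1]}" ] && printf '%s\n' "$line"
done < "$1"

Or as a one-liner:

while IFS= read -r line ; do read -r -a fields <<< "$line" ; [ -n "${fields[1]}" ] && printf '%s\n' "$line" ; done < textfile.tsv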
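
The loop above filters out single-column lines but does not remove duplicates, which the question also asks for. As a rough sketch only, the following pure-shell variant does both without arrays, so it should also run under a plain POSIX `sh`. The function name `filter_unique` is hypothetical, and the linear scan of previously printed lines makes it slow on large files, where the awk one-liner is the better choice.

```shell
#!/bin/sh
# A literal newline, used as a delimiter between remembered lines.
nl='
'

# Read lines from stdin; print those with more than one field,
# first occurrence only, in input order.
filter_unique() {
    seen=$nl
    while IFS= read -r line || [ -n "$line" ]; do
        set -f              # disable globbing while splitting
        set -- $line        # split on default IFS (space, TAB, newline)
        set +f
        [ "$#" -gt 1 ] || continue
        case $seen in
            *"$nl$line$nl"*) ;;                 # already printed
            *) printf '%s\n' "$line"
               seen=$seen$line$nl ;;
        esac
    done
}

filter_unique <<'EOF'
AUO97_RS0005
AUO97_RS0005 alpha hydrolase
AUO97_RS0005 alpha hydrolase
AUO97_RS0007 beta hydrolase
EOF
```

For a real file the call would be `filter_unique < textfile.tsv`.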
7.3 years ago
kloetzl ★ 1.1k
cat file.txt | awk 'NF > 1' | sort | uniq
7.3 years ago
$ cat test.txt 
AUO97_RS0005
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

Output: awk solution

$ awk '$2 == "" {next} {print}' test.txt
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473

non-awk solution:

$ grep -i GI test.txt 
AUO97_RS0005    alpha  hydrolase wp_567465 GI:54365463
AUO97_RS0007    beta   hydrolase wp_567465 GI:65456475
AUO97_RS0020    gamma   hydrolase wp_567465 GI:4536473
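
Note that `grep -i GI` keeps any line containing the letters "gi" in either case, so an ID-only line that happens to contain them would slip through. A slightly stricter sketch, assuming every data line carries a whitespace-preceded `GI:` followed by digits, is shown below. The ID `GIX_0001` and the function name `keep_gi_rows` are invented for the demonstration.

```shell
#!/bin/sh
# Hypothetical sample in which one bare ID contains the letters "GI".
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' \
    'GIX_0001' \
    'GIX_0001 alpha hydrolase wp_567465 GI:54365463' \
    'AUO97_RS0007' \
    'AUO97_RS0007 beta hydrolase wp_567465 GI:65456475' > "$sample"

# Keep only lines with a whitespace-preceded "GI:" followed by digits.
keep_gi_rows() {
    grep -E '[[:space:]]GI:[0-9]+' "$1"
}

keep_gi_rows "$sample"
```

Here `grep -i GI` would also print the bare `GIX_0001` line, while the stricter pattern prints only the two annotated lines.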

A similar solution has been used/posted by OP already.
