removing lines that don't match by grep
3.7 years ago
vinayjrao ▴ 200

Hello, I have a file containing gene names of interest (24423 genes), and another file containing the lengths of all the genes (41306 genes). I want the lengths of only the genes of interest, but when I grep using grep -wf file1 file2 or even fgrep -Fwf file1 file2, I get some excess genes. Some genes in my list may have only the sense or only the antisense strand, whereas the reference file may contain both, and that is reflected in the output.

Is there a way to remove from the reference file (file2) all the lines that don't match?

Thank you.

P.S. The question is also on stackoverflow.com

edit -

file1

A1BG
A1BG-AS1
TSPAN6
MYB
MYB-AS1

file2

A1BG      2941
A1BG-AS1      560
TSPAN6      7923
MYB-AS1      362
MYB-AS2      713
MYB-AS3      396

desired_output

A1BG      2941
A1BG-AS1      560
TSPAN6      7923
MYB-AS1      362

But I always get MYB-AS2 and MYB-AS3 as well.

grep file handling • 1.1k views

You'll soon get some negative votes on stackoverflow because you don't show any sample of your files.


Hi, can you post an example of your file1, file2 and desired output?

3.7 years ago
Paul ★ 1.4k

Hi, what about an awk solution:

awk 'FNR==NR {a[$1]; next} $1 in a' file1 file2

Desired output:

A1BG    2941
A1BG-AS1    560
TSPAN6  7923
MYB-AS1 362
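
For readers wondering why this works where grep -w did not: grep -w treats the hyphen as a word boundary, so the pattern MYB matches inside MYB-AS2 and MYB-AS3, whereas awk compares the whole first field. Below is a commented sketch of the same one-liner run against the sample data from the question. The scratch directory and the `wanted` and `matched` names are only illustrative, and the sample file2 is written tab-separated here as an assumption.

```shell
# Recreate the sample files from the question in a scratch directory (illustrative only).
workdir=$(mktemp -d)
printf '%s\n' A1BG A1BG-AS1 TSPAN6 MYB MYB-AS1 > "$workdir/file1"
printf '%s\t%s\n' A1BG 2941 A1BG-AS1 560 TSPAN6 7923 \
    MYB-AS1 362 MYB-AS2 713 MYB-AS3 396 > "$workdir/file2"

matched=$(awk '
    FNR == NR { wanted[$1]; next }  # first file: remember each gene name, print nothing
    $1 in wanted                    # second file: print lines whose first field was remembered
' "$workdir/file1" "$workdir/file2")

printf '%s\n' "$matched"
```

The output keeps the line order of file2, and MYB is silently absent because file2 has no exact MYB entry.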
3.7 years ago
michael.ante ★ 3.7k

Hi, a simple join would be sufficient:

join file1 file2

I tried this solution too, but I did not get the desired result. It gave me fewer genes than the awk output. How exactly does it work?


It compares the first column of both files. Both files should be sorted on that column. If they are not, you'll need to sort them: join <(sort file1) <(sort -k1,1 file2)

[EDIT] It works with your example data
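
A minimal sketch of the sorted join, assuming the sample data from the question in a deliberately shuffled order. The scratch directory, the `.sorted` file names, and the `joined` variable are illustrative, and intermediate files are used instead of `<(...)` only so that the sketch also runs in a plain POSIX shell. Setting LC_ALL=C for both sort and join is an added precaution so that both tools agree on the ordering.

```shell
# Recreate the sample data in a scratch directory, deliberately unsorted (illustrative only).
workdir=$(mktemp -d)
printf '%s\n' TSPAN6 MYB-AS1 A1BG MYB A1BG-AS1 > "$workdir/file1"
printf '%s\t%s\n' TSPAN6 7923 A1BG 2941 MYB-AS3 396 \
    MYB-AS1 362 A1BG-AS1 560 MYB-AS2 713 > "$workdir/file2"

# Sort both inputs on the join field, under the same locale that join will use.
LC_ALL=C sort "$workdir/file1" > "$workdir/file1.sorted"
LC_ALL=C sort -k1,1 "$workdir/file2" > "$workdir/file2.sorted"

# join pairs lines whose first fields are equal; genes missing from either file are dropped.
joined=$(LC_ALL=C join "$workdir/file1.sorted" "$workdir/file2.sorted")

printf '%s\n' "$joined"
```

Note that the result comes out in sorted order rather than in the original order of file2, and join writes a single space between the fields by default.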


This is the correct answer.
