Question: removing lines that don't match by grep
14 months ago by vinayjrao (JNCASR, India)
vinayjrao wrote:

Hello, I have a file containing gene names of interest (24423 genes) and another file containing the lengths of all genes (41306 genes). I want the lengths for only the genes in my list, but when I use grep -wf file1 file2 or even fgrep -Fwf file1 file2, I get some extra genes. Some genes in my list have only the sense or only the antisense entry, whereas the reference file may contain both, and those extra entries show up in the output.

I want to know if there is a way to remove from the reference file (file2) all the lines that don't match?

Thank you.

P.S. The question is also on stackoverflow.com

edit -

file1

A1BG
A1BG-AS1
TSPAN6
MYB
MYB-AS1

file2

A1BG      2941
A1BG-AS1      560
TSPAN6      7923
MYB-AS1      362
MYB-AS2      713
MYB-AS3      396

desired_output

A1BG      2941
A1BG-AS1      560
TSPAN6      7923
MYB-AS1      362

But I always get MYB-AS2 and MYB-AS3 as well.
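
For reference, the extra lines are consistent with how grep -w defines a word: only letters, digits, and the underscore count as word characters, so the hyphen in MYB-AS2 acts as a word boundary and the pattern MYB matches it as a whole word. Below is a minimal sketch that reproduces this with the sample data above and contrasts it with patterns anchored to the first column. The scratch directory, the tab-separated layout, and the variable names are assumptions made for illustration.

```shell
# Recreate the sample files in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' A1BG A1BG-AS1 TSPAN6 MYB MYB-AS1 > "$tmp/file1"
printf '%s\t%s\n' A1BG 2941 A1BG-AS1 560 TSPAN6 7923 \
    MYB-AS1 362 MYB-AS2 713 MYB-AS3 396 > "$tmp/file2"

# -w treats '-' as a word boundary, so the pattern MYB also hits
# MYB-AS1, MYB-AS2 and MYB-AS3.
grep_w=$(grep -Fwf "$tmp/file1" "$tmp/file2" | cut -f1)

# Anchor each name to the start of the line and require whitespace after it,
# so only an exact first-column match counts. The patterns are now regular
# expressions, so this assumes the names contain no regex metacharacters.
sed 's/.*/^&[[:space:]]/' "$tmp/file1" > "$tmp/patterns"
grep_anchored=$(grep -f "$tmp/patterns" "$tmp/file2" | cut -f1)

printf 'grep -w:\n%s\n\nanchored:\n%s\n' "$grep_w" "$grep_anchored"
rm -rf "$tmp"
```

With the sample data, the first command returns all six genes from file2, while the anchored version returns only the four desired ones.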

file handling grep • 507 views
modified 14 months ago by michael.ante • written 14 months ago by vinayjrao

And you'll soon get some negative votes on Stack Overflow because you don't show any sample of your files.

written 14 months ago by Pierre Lindenbaum

Hi, can you post an example of your file1, file2, and desired output?

written 14 months ago by Paul

14 months ago by Paul (European Union)
Paul wrote:

Hi, what about an awk solution:

awk 'FNR==NR {a[$1]; next} $1 in a' file1 file2

Desired output:

A1BG    2941
A1BG-AS1    560
TSPAN6  7923
MYB-AS1 362
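
For readers unfamiliar with the idiom: FNR==NR is true only while awk is reading the first file, so the first block stores every name from file1 as an array key and skips to the next line. For file2, `$1 in a` prints the lines whose first field is exactly one of those names. Because the whole field is compared, MYB in file1 cannot match MYB-AS2. The annotated sketch below runs the same logic on the sample data; the scratch directory, the tab-separated layout, and the array name `wanted` are assumptions for illustration.

```shell
# Recreate the sample files in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' A1BG A1BG-AS1 TSPAN6 MYB MYB-AS1 > "$tmp/file1"
printf '%s\t%s\n' A1BG 2941 A1BG-AS1 560 TSPAN6 7923 \
    MYB-AS1 362 MYB-AS2 713 MYB-AS3 396 > "$tmp/file2"

awk_out=$(awk '
    # First file: record each gene name as an array key, then skip to the next line.
    FNR == NR { wanted[$1]; next }
    # Second file: print the line only if its first field was recorded.
    $1 in wanted
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$awk_out"
rm -rf "$tmp"
```

The output keeps the order of file2. One caveat with this idiom is that an empty file1 makes FNR==NR true for file2 as well, so nothing is printed.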
written 14 months ago by Paul

14 months ago by michael.ante (Austria/Vienna)
michael.ante wrote:

Hi, a simple join would be sufficient:

join file1 file2
written 14 months ago by michael.ante

I tried this solution too, but I did not get the desired result. It gave me fewer genes than the awk output. How exactly does it work?

written 14 months ago by vinayjrao

It compares the first column of both files. Both files must be sorted on that column. If they are not, you'll need to sort them first: join <(sort file1) <(sort -k1,1 file2)

[EDIT] It works with your example data
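
A likely reason for getting fewer genes is unsorted input: join walks both files in parallel and can skip pairs when either file is not sorted on the join field. The sketch below runs the sorted variant on the sample data. It uses temporary files rather than process substitution so that it also works in a plain POSIX shell, and it forces the C locale so that sort and join agree on ordering. The scratch directory, the tab-separated layout, and the file names are assumptions for illustration.

```shell
# Recreate the sample files in a scratch directory (paths are illustrative).
tmp=$(mktemp -d)
printf '%s\n' A1BG A1BG-AS1 TSPAN6 MYB MYB-AS1 > "$tmp/file1"
printf '%s\t%s\n' A1BG 2941 A1BG-AS1 560 TSPAN6 7923 \
    MYB-AS1 362 MYB-AS2 713 MYB-AS3 396 > "$tmp/file2"

# join needs both inputs sorted on the join field with the same collation,
# so use the C locale for sort and join alike.
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# Join on the first field of both files (the default).
join_out=$(LC_ALL=C join "$tmp/file1.sorted" "$tmp/file2.sorted")

printf '%s\n' "$join_out"
rm -rf "$tmp"
```

Note that the result comes out in sorted order rather than in the original order of file2, and join separates the output fields with a single space by default.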

modified 14 months ago • written 14 months ago by michael.ante

This is the correct answer.

written 14 months ago by Pierre Lindenbaum