Question: GREP: out of memory
jaqx00840 wrote:

Hello all, I have some .gff files that I am running grep on. For some of the files grep runs OK, but for others it spits out the error message "grep: out of memory". It seems to be a common problem with grep, but I haven't really seen a solution that helps, either on this forum or elsewhere. Besides, I don't know much about memory management, but my machine has 128GB and I don't know why I am getting this error. Help. Thanks in advance. (My command is below and it runs fine for some of the files.)

grep -f 1.txt 2.gff > 3.gff
mac os deseq2 grep heloci .gff • 775 views
ADD COMMENTlink modified 4 months ago by Alex Reynolds26k • written 4 months ago by jaqx00840

grep -f can indeed consume a ton of memory. It's nice that you received the "out of memory" error, I have crashed a 500Gb RAM server with a nasty grep -f.

ADD REPLYlink written 4 months ago by WouterDeCoster34k

jaqx008: Are you grep'ing something super secret? Can you provide an actual example so we can avoid this endless back and forth? As @ATPoint said, there may be an efficient way of doing whatever you are trying to do without using grep.

ADD REPLYlink modified 4 months ago • written 4 months ago by genomax58k

This is what I am trying to do. There are some gene IDs in a text file A, and lines possibly containing those gene IDs in GFF file B. I am trying to identify the matches in B and output all the lines in B that match (this should output all the columns in B). A has only one column of gene IDs; B has multiple columns, and one of those columns has the gene IDs. So my command should pull out the lines in B that correspond to the IDs in A. Below is the command:

$ grep -f A.txt B.gff > AinB.gff

Is this clear enough? And is there another way to go about this without grep? Thanks
ADD REPLYlink modified 4 months ago • written 4 months ago by jaqx00840
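
One grep-free option, in answer to the question above, is a hash lookup in awk. The sketch below is only an illustration: it assumes the GFF is tab-separated with the gene ID stored as an ID=... entry in the semicolon-separated ninth column, and the sample files, contig names, and IDs are made up.

```shell
# Sketch: look up GFF lines by ID with awk instead of grep -f.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical sample inputs standing in for A.txt and B.gff.
printf '2597849327\n2597849329\n' > "$workdir/A.txt"
printf 'ctg1\tsrc\tCDS\t109\t1620\t.\t+\t0\tID=2597849326;locus_tag=g1\n' >  "$workdir/B.gff"
printf 'ctg1\tsrc\tCDS\t1660\t2763\t.\t+\t0\tID=2597849327;locus_tag=g2\n' >> "$workdir/B.gff"
printf 'ctg1\tsrc\tCDS\t3893\t6310\t.\t+\t0\tID=2597849329;locus_tag=g4\n' >> "$workdir/B.gff"

# First file: store every non-empty ID as a hash key.
# Second file: print a line when its ID= attribute is one of the stored keys.
matches=$(awk -F'\t' '
    NR == FNR { sub(/\r$/, ""); if ($0 != "") wanted[$0] = 1; next }
    {
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            if (substr(attrs[i], 1, 3) == "ID=" && (substr(attrs[i], 4) in wanted)) {
                print
                break
            }
        }
    }
' "$workdir/A.txt" "$workdir/B.gff")

printf '%s\n' "$matches"
```

Because each ID is compared as a whole value, 12345 cannot accidentally hit 123456, and memory use grows only with the number of IDs in the list.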

Can you give us an idea of the numbers we're talking about here? file sizes? list length ?

In most cases a solution like the one offered by arup (i.e. loop through your list and grep each entry) will solve your issue

ADD REPLYlink written 4 months ago by lieven.sterck3.1k

OK. The text files range from 100 bytes to about 50 KB, with a word count of around 1500, in this format:

2597857964
2597857966
2597857965
2597857963
2597860200
2597857153
2597857472

while the .gff files range from about 600 KB to about 900 KB, with a wc of about 5000 to 7000, in the format below:

##gff-version 3

Ga0036900_gi400319889.1 img_core_v400   CDS 109 1620    .   +   0   ID=2597849326;locus_tag=Ga0036900_00001;product=chromosomal replication initiator protein DnaA
Ga0036900_gi400319889.1 img_core_v400   CDS 1660    2763    .   +   0   ID=2597849327;locus_tag=Ga0036900_00002;product=DNA polymerase III, beta subunit
Ga0036900_gi400319889.1 img_core_v400   CDS 2784    3887    .   +   0   ID=2597849328;locus_tag=Ga0036900_00003;product=DNA replication and repair protein RecF
Ga0036900_gi400319889.1 img_core_v400   CDS 3893    6310    .   +   0   ID=2597849329;locus_tag=Ga0036900_00004;product=DNA gyrase subunit B

Also, I did try the loop, but it exits with the error

grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
ADD REPLYlink modified 4 months ago • written 4 months ago by jaqx00840

Where did you get your files from, and on which operating system? Did you open them in Windows or similar? I had to look it up myself, but apparently this 'error' is related to the encoding of your file.

If you have dos2unix (or mac2unix) installed, you might run that first on your files to convert them to proper unix encoding

ADD REPLYlink written 4 months ago by lieven.sterck3.1k
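
A quick way to test the line-ending idea above is sketched here. The file names and contents are hypothetical, tr -d '\r' is used as a stand-in for the line-ending part of dos2unix, and running grep under LC_ALL=C is a commonly used workaround for "illegal byte sequence" that was not suggested in this thread, so treat it as an assumption to try rather than a confirmed fix.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical ID list saved with Windows (CRLF) line endings.
printf '2597857964\r\n2597857966\r\n' > "$workdir/ids_dos.txt"
printf 'ctg1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=2597857964;locus_tag=g1\n' > "$workdir/genes.gff"

# Count lines that end in a carriage return; anything above 0 suggests DOS endings.
cr=$(printf '\r')
cr_lines=$(grep -c "${cr}\$" "$workdir/ids_dos.txt")
echo "lines ending in CR: $cr_lines"

# Strip the carriage returns (the line-ending fix that dos2unix performs).
tr -d '\r' < "$workdir/ids_dos.txt" > "$workdir/ids_unix.txt"

# LC_ALL=C makes grep treat the input as plain bytes, which sidesteps
# locale-dependent "illegal byte sequence" errors.
hits=$(LC_ALL=C grep -c -wF -f "$workdir/ids_unix.txt" "$workdir/genes.gff")
echo "matching GFF lines: $hits"
```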

What is the output of ulimit -a?

ADD REPLYlink written 4 months ago by genomax58k

Oh I see. Well, the ulimit -a output is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited
ADD REPLYlink written 4 months ago by jaqx00840

Your account does not have a limit. So the error is due to one of the possibilities enumerated by others.

ADD REPLYlink written 4 months ago by genomax58k

What are you grep-ing? Maybe a more efficient way could be the use of tabix, depending on what you want to retrieve.

ADD REPLYlink written 4 months ago by ATpoint9.2k

Well, I can't post more than 5 times in 6 hours and that's why I haven't responded. BTW I am grep-ing a text file like

2345
3245
5432
1234

against a .gff file like

4634 - ID=2345
4353 + ID=3245
etc.

It's working for some files and giving a memory complaint for others.

ADD REPLYlink modified 4 months ago • written 4 months ago by jaqx00840

Those patterns are too generic and likely generate many matches. That must be the reason why you are running out of memory. Are you trying to pull out specific genes?

Once you post for a certain number of times (get rep points) that posting limit should go up.

ADD REPLYlink written 4 months ago by genomax58k

Yes, I am trying to pull out certain gene IDs with their corresponding information. I posted in the comment above what my files look like. The thing is, it works fine for some and does not for others.

ADD REPLYlink written 4 months ago by jaqx00840

This section of the GNU parallel manual may help you automate piping of both files, or of a single file: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallel-grep. The manual covers each situation (e.g. limited RAM, limited CPU, etc.)

ADD REPLYlink written 4 months ago by cpad01129.9k
jaqx00840 wrote:

I was able to resolve this issue. If anyone else runs into the same problem, the commands below should help. First split the ID file into chunks of a few lines each using

split -l n file.txt

where n is the number of lines per chunk (I used n = 5), then run

$ for i in x*; do grep -f "$i" file2.gff > "$i".tmp.gff; done && cat *tmp.gff > output.gff
ADD COMMENTlink modified 4 months ago by genomax58k • written 4 months ago by jaqx00840
arup640 wrote:

Rather than running everything at once, pass the 1.txt entries one by one.

while read line; do grep -wF "$line" 2.gff; done < 1.txt

Source: https://askubuntu.com/questions/595426/use-a-list-of-words-to-grep-in-an-other-list

ADD COMMENTlink modified 4 months ago • written 4 months ago by arup640
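
A slightly more defensive variant of the loop above is sketched here. The sample files are hypothetical stand-ins for 1.txt and 2.gff, and the extra guards (IFS= read -r, skipping blank lines, -e before the pattern) are additions for illustration, not part of the original answer.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical stand-ins for 1.txt and 2.gff; note the blank line in the list.
printf '2345\n\n3245\n' > "$workdir/1.txt"
printf 'c1\t4634\t-\tID=2345\nc1\t4353\t+\tID=3245\nc1\t4700\t+\tID=23456\n' > "$workdir/2.gff"

# IFS= and -r keep each line exactly as written (no trimming, no backslash
# processing); the "|| [ -n ... ]" part also handles a last line with no newline.
while IFS= read -r id || [ -n "$id" ]; do
    [ -n "$id" ] || continue                      # skip blank lines
    grep -wF -e "$id" "$workdir/2.gff" || true    # grep exits 1 when an ID has no match
done < "$workdir/1.txt" > "$workdir/3.gff"

cat "$workdir/3.gff"
```

Redirecting once after done, rather than inside the loop, collects all results in a single output file.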

I would personally add -m1 to this, so that grep does not needlessly keep looking through the file once it has found a match

-m, --max-count=NUM stop after NUM matches

ADD REPLYlink written 4 months ago by lieven.sterck3.1k
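
To see what -m changes, here is a small sketch on a made-up file in which one ID appears on two lines. It only illustrates the flag; whether stopping at the first hit is appropriate depends on whether an ID can occur on more than one line of the real GFF.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical GFF-like file in which the same ID appears on two lines.
printf 'c1\tgene\tID=2345\nc1\tCDS\tID=2345\nc1\tgene\tID=3245\n' > "$workdir/2.gff"

all_hits=$(grep -c -wF -e 2345 "$workdir/2.gff")        # scans the whole file
first_only=$(grep -m 1 -wF -e 2345 "$workdir/2.gff")    # stops at the first match

echo "lines matching 2345: $all_hits"
echo "with -m 1: $first_only"
```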

Hey, where in the command would you put -m? I mean where in

while read line; do grep -wF "$line" 2.gff; done < 1.txt
ADD REPLYlink written 4 months ago by jaqx00840

before the -wF part.

I must add that I mentioned this thinking you only needed one line per grep, but that's likely not the case. So it might be a bit tricky to use the -m option

ADD REPLYlink written 4 months ago by lieven.sterck3.1k

It gives a long list that looks like this, then terminates after I run the command

grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: output.txt: No such file or directory
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
ADD REPLYlink written 4 months ago by jaqx00840

are you grepping a pattern with spaces in it?

ADD REPLYlink written 4 months ago by lieven.sterck3.1k
lieven.sterck3.1k wrote:

I'm no sys-admin myself but I believe there is (or can be) a limit on the amount of mem a command is allowed to use, regardless of the total amount of mem of the server (this to avoid one command killing the server).

I see you're doing a grep -f, so what you can do is split your 1.txt file into several parts, run them one after the other, and append the results to the 3.gff file

ADD COMMENTlink modified 4 months ago • written 4 months ago by lieven.sterck3.1k
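
The split-and-append idea above could look roughly like this. The chunk size of 2, the chunk_ prefix, and the sample data are all made up for the sketch, and -wF is added on the assumption that the IDs are fixed strings that should match as whole words.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical stand-ins for 1.txt and 2.gff.
printf '2345\n3245\n5432\n1234\n' > "$workdir/1.txt"
printf 'c1\t4634\t-\tID=2345\nc1\t4353\t+\tID=3245\nc1\t9999\t+\tID=7777\n' > "$workdir/2.gff"

# Split the ID list into chunks of 2 lines each, named chunk_aa, chunk_ab, ...
split -l 2 "$workdir/1.txt" "$workdir/chunk_"

# Run one grep per chunk and append each result to the same output file.
: > "$workdir/3.gff"
for chunk in "$workdir"/chunk_*; do
    grep -wF -f "$chunk" "$workdir/2.gff" >> "$workdir/3.gff" || true   # exit 1 just means no match
done

cat "$workdir/3.gff"
```

The output is ordered by chunk rather than by position in the GFF, and a line containing IDs from two different chunks would appear twice, so a final sort -u may be worth considering.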

Sounds like a good plan. I will try this now.

ADD REPLYlink written 4 months ago by jaqx00840

I did this and it already worked for one of the txt files, but another gave this error. Does it mean no match?

grep: empty (sub)expression
ADD REPLYlink written 4 months ago by jaqx00840

No, I'm guessing you might have empty/blank lines in your 1.txt file, or lines with a '|' in them?

ADD REPLYlink modified 4 months ago • written 4 months ago by lieven.sterck3.1k
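
Checking for blank lines, and removing them before handing the list to grep -f, can be done along these lines. The file names and contents are hypothetical.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical ID list containing a blank line and a whitespace-only line.
printf '2345\n\n3245\n   \n' > "$workdir/1.txt"

# How many lines are empty or whitespace-only?
blank=$(grep -c '^[[:space:]]*$' "$workdir/1.txt")
echo "blank lines: $blank"

# Keep only lines with at least one non-whitespace character.
grep '[^[:space:]]' "$workdir/1.txt" > "$workdir/1.clean.txt"
cat "$workdir/1.clean.txt"
```

As for '|', running grep with -F treats every pattern as a literal string, so the character is no longer interpreted as regex alternation.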
Alex Reynolds26k wrote:

To have grep do exact-word matches from a file of strings, use -w -F:

$ grep -w -F -f 1.txt 2.gff > 3.gff

The -F option treats each line of 1.txt as a fixed string rather than a regular expression. The -w option restricts matches to whole words.

Without -w, grep reports a hit whenever a string from 1.txt appears anywhere inside a line of 2.gff, including as part of a longer string.

In other words, if you have a string like 12345 in 1.txt, then matching without -w will hit 12345 (what you want), but also 123456, 1234567, 012345, etc. found in 2.gff.

The first match to 12345 is desired, but all other matches that contain 12345, like 123456 and 1234567 etc. are probably not what you want.

So by combining -w and -F, you get an exact match for the string you provide. So 12345 will only provide a hit on the word 12345, and not any other matches where 12345 is a prefix or suffix or substring of something else.

As a bonus, using -F makes grep consume a great deal less memory. Regular expressions use lots of memory, but string matching does not need to.

ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds26k
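
A small check of the -w -F combination on made-up data is sketched below; the file names mirror the ones in the command above, but the contents are hypothetical.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical inputs: one ID, and three GFF-like lines that all contain "12345".
printf '12345\n' > "$workdir/1.txt"
printf 'c1\tID=12345\nc1\tID=123456\nc1\tID=012345\n' > "$workdir/2.gff"

substring_hits=$(grep -c -F -f "$workdir/1.txt" "$workdir/2.gff")    # fixed strings, any position
word_hits=$(grep -c -w -F -f "$workdir/1.txt" "$workdir/2.gff")      # fixed strings, whole words only

echo "-F only: $substring_hits lines"
echo "-w -F:   $word_hits line(s)"
```

Only the line whose ID is exactly 12345 survives once -w is added, because in the other two lines the string is flanked by another digit.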