GREP: out of memory
4
0
Entering edit mode
3.9 years ago
jaqx008 ▴ 110

Hello all, I have some .gff files that I am running grep on. For some of the files, grep runs fine, but for others it exits with the error message "grep: out of memory". It seems to be a common problem with grep, but I haven't found a solution that helps, either on this forum or elsewhere. Besides, I don't know much about memory management, but my machine has 128 GB of RAM, so I don't see why I am getting this error. Help. Thanks in advance. (My command is below; it runs fine for some of the files.)

grep -f 1.txt 2.gff > 3.gff

.gff grep Deseq2 HEloci Mac os • 6.2k views
1
Entering edit mode

grep -f can indeed consume a ton of memory. It's nice that you at least received an "out of memory" error; I have crashed a 500 GB RAM server with a nasty grep -f.

1
Entering edit mode

jaqx008 : Are you grep'ing something super secret? Can you provide an actual example so we can avoid this endless back and forth? As @ATPoint said, there may be an efficient way of doing whatever you are trying to do without using grep.

0
Entering edit mode

This is what I am trying to do: there are gene IDs in a text file A, and lines possibly containing those gene IDs in a gff file B. I am trying to identify the matches in B and output every matching line of B (this should output all the columns in B). A has only one column of gene IDs; B has multiple columns, one of which holds the gene IDs. So my command should pull out of B the lines whose IDs appear in A. The command is below:

$ grep -f A.txt B.gff > AinB.gff

Is this clear enough? And is there another way to go about this without grep? Thanks

0
Entering edit mode

Can you give us an idea of the numbers we're talking about here? File sizes? List length? For most cases, a solution like the one offered by arup (i.e. loop through your list and grep each entry) will solve your issue.

0
Entering edit mode

OK. The text files range between 100 bytes and about 50 kB, with a word count around 1500, in this format:

2597857964
2597857966
2597857965
2597857963
2597860200
2597857153
2597857472

while the .gff files range from about 600 kB to about 900 kB, with a wc of about 5000 to 7000, in the format below:

## gff-version 3
Ga0036900_gi400319889.1 img_core_v400 CDS 109 1620 . + 0 ID=2597849326;locus_tag=Ga0036900_00001;product=chromosomal replication initiator protein DnaA
Ga0036900_gi400319889.1 img_core_v400 CDS 1660 2763 . + 0 ID=2597849327;locus_tag=Ga0036900_00002;product=DNA polymerase III, beta subunit
Ga0036900_gi400319889.1 img_core_v400 CDS 2784 3887 . + 0 ID=2597849328;locus_tag=Ga0036900_00003;product=DNA replication and repair protein RecF
Ga0036900_gi400319889.1 img_core_v400 CDS 3893 6310 . + 0 ID=2597849329;locus_tag=Ga0036900_00004;product=DNA gyrase subunit B

Also, I did try the loop, but it exits with the error

grep: illegal byte sequence

repeated many times.

1
Entering edit mode

Where did you get your files from, and on which operating system? Did you open them in Windows or something? I had to look it up myself, but apparently this 'error' is related to the encoding of your file. If you have dos2unix (or mac2unix) installed, you might first run that on your files to convert them to proper Unix encoding.

0
Entering edit mode

What is the output of ulimit -a?

0
Entering edit mode

Oh I see. Well, the ulimit -a output is:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 256
pipe size            (512 bytes, -p) 1
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 709
virtual memory          (kbytes, -v) unlimited

0
Entering edit mode

Your account does not have a limit, so the error is due to the possibilities enumerated by others.

0
Entering edit mode

What are you grep-ing? Maybe a more efficient way could be the use of tabix, depending on what you want to retrieve.

0
Entering edit mode

Well, I can't post more than 5 times in 6 hours, and that's why I haven't responded. BTW, I am grep-ing a text file like

2345
3245
5432
1234

against a .gff file like

4634 - ID=2345
4353 + ID=3245
etc

It's working for some and giving the memory complaint for others.

0
Entering edit mode

Those patterns are too generic and likely generate many matches. That must be the reason why you are running out of memory. Are you trying to pull out specific genes? Once you post a certain number of times (and earn rep points), that posting limit should go up.

0
Entering edit mode

Yes, I am trying to pull out certain gene IDs with their corresponding information. I posted in the comment above what my files look like. The thing is, it works fine for some files and not for others.

0
Entering edit mode

This section of the GNU parallel manual may help you pipe both files, or a single file, automatically: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallel-grep. The manual covers each situation (e.g. limited RAM, limited CPU, etc.).

0
Entering edit mode
3.9 years ago
jaqx008 ▴ 110

I was able to resolve this issue. If anyone else runs into the same problem, the commands below should help. First split the pattern file into pieces of a few lines each using

split -l n file.txt

where n = the number of lines desired (I used n = 5). Then grep with each piece and concatenate the results:

$ for i in x*; do grep -f "$i" file2.gff > "$i".tmp.gff; done && cat *tmp.gff > output.gff
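For completeness: this kind of ID lookup can also be done without grep. Below is a minimal awk sketch (my own addition, not from the thread), assuming a tab-separated B.gff whose ninth column carries a numeric ID= attribute, as in the samples above:

awk -F'\t' '
    NR == FNR { ids[$1]; next }     # first file (A.txt): remember every ID
    match($9, /ID=[0-9]+/) {        # second file (B.gff): extract the ID= value
        if (substr($9, RSTART + 3, RLENGTH - 3) in ids) print
    }' A.txt B.gff > AinB.gff

Because this builds a hash of the IDs and does one lookup per line, memory use stays proportional to the size of the ID list, and no regular expressions are compiled from the patterns themselves.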

2
Entering edit mode
3.9 years ago

Rather than passing them all at once, grep for the 1.txt entries one by one:

while read line; do grep -wF "$line" 2.gff; done < 1.txt

3
Entering edit mode

I would personally add -m1 to this, as it then does not unnecessarily keep looking through the file over and over again:

-m, --max-count=NUM
        stop after NUM matches

0
Entering edit mode

Hey, where in the command would you put -m? I mean, where in

while read line; do grep -wF "$line" 2.gff; done < 1.txt

0
Entering edit mode

Before the -wF part.

I should add that I suggested this thinking you only needed one line per grep, but that's likely not the case, so the -m option might be a bit tricky to use here.
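For concreteness (assuming each ID should yield at most one line, which is what -m 1 gives you), the loop would become:

while read line; do grep -m 1 -wF "$line" 2.gff; done < 1.txt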

0
Entering edit mode

It gives a long list that looks like this and then terminates after I run the command:

grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: output.txt: No such file or directory
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence
grep: illegal byte sequence

0
Entering edit mode

are you grepping a pattern with spaces in it?
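One possible workaround, since the thread is tagged Mac OS: BSD grep reports "illegal byte sequence" when its input contains bytes that are invalid in the current (typically UTF-8) locale. A defensive variant of the loop that forces the byte-oriented C locale, strips Windows carriage returns, and skips blank lines might look like this (a sketch, not tested against the poster's files):

tr -d '\r' < 1.txt | while IFS= read -r line; do
    [ -n "$line" ] && LC_ALL=C grep -wF -- "$line" 2.gff
done > 3.gff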

1
Entering edit mode
3.9 years ago

I'm no sys-admin myself, but I believe there is (or can be) a limit on the amount of memory a single command is allowed to use, regardless of the total amount of memory of the server (this is to keep one command from killing the server).

I see you're doing a grep -f, so what you can do is split up your 1.txt file into several parts, run them one after the other, and append the results to the 3.gff file, as in the sketch below.
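A minimal sketch of that idea (the chunk size of 200 and the chunk_ prefix are arbitrary choices, not from the thread):

split -l 200 1.txt chunk_
for c in chunk_*; do grep -wF -f "$c" 2.gff; done > 3.gff
rm chunk_*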

0
Entering edit mode

Sounds like a good plan. I will try this now.

0
Entering edit mode

I did this and it already worked for one of the txt files, but another gave the error below. Does it mean there was no match?

grep: empty (sub)expression

0
Entering edit mode

No, I'm guessing you might have empty/blank lines in your 1.txt file? Or lines with a '|' in them?
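If blank lines turn out to be the cause, stripping them from the pattern file first should clear the "empty (sub)expression" error (awk 'NF' keeps only lines that contain at least one field):

awk 'NF' 1.txt > 1.clean.txt
grep -wF -f 1.clean.txt 2.gff > 3.gff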

0
Entering edit mode
3.9 years ago

To have grep do exact-word matches from a file of strings, use -w -F:

$ grep -w -F -f 1.txt 2.gff > 3.gff


The -w option restricts matches to whole words; on its own, grep treats each pattern in 1.txt as a regular expression. The -F option switches this to exact-string matching.

Without these options, grep consumes a lot of memory, because it compiles every line of 1.txt into a regular expression and looks for it anywhere inside the lines of 2.gff, including inside longer substrings.

In other words, if you have a string like 12345 in 1.txt, a plain grep will match every line of 2.gff that contains 12345 anywhere: the exact word 12345 (what you want), but also 123456, 1234567, 012345, etc.

The first match, the exact word 12345, is desired, but the other matches that merely contain 12345, like 123456 and 1234567, are probably not what you want.

So by combining -w and -F, you get an exact match for the string you provide. 12345 will only produce a hit on the whole word 12345, and not on any other match where 12345 is a prefix, suffix, or substring of something else.
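A quick toy illustration of the difference at the shell (my own example, not from the original answer):

$ printf '012345\n123456\n12345\n' | grep 12345
012345
123456
12345
$ printf '012345\n123456\n12345\n' | grep -wF 12345
12345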

As a bonus, using -F makes grep consume a great deal less memory: compiling and matching thousands of regular expressions uses a lot of memory, while plain string matching does not.