To have grep do exact-word matches from a file of strings, use:
$ grep -w -F -f 1.txt 2.gff > 3.gff
The -w option does word searches using regular expressions. The -F option modifies this to do exact-string matching.
-w on its own will consume a lot of memory, because it will look for words in 1.txt that are contained in substrings of 2.gff. In other words, if you have a string like 12345 in 1.txt, then using -w on its own will match a larger set of strings that contain 12345: the exact word 12345 (what you want), but also 1234567, 012345, etc. found in 2.gff. The first match to 12345 is desired, but all the other matches that merely contain it, like 1234567, are probably not what you want.
So by combining -w with -F, you get an exact match for the string you provide: 12345 will only produce a hit on the word 12345, and not on any other match where 12345 is a prefix, suffix, or substring of something else.
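A quick toy demonstration of the difference (the file names here are made up for the example):

```shell
# Toy files: one pattern, four candidate lines.
printf '12345\n' > patterns.txt
printf '12345\n1234567\n012345\nfoo 12345 bar\n' > data.txt

# -w -F: fixed-string, whole-word matching.
grep -w -F -f patterns.txt data.txt
# prints "12345" and "foo 12345 bar"; "1234567" and "012345" are skipped
```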
As a bonus, using -F makes grep consume a great deal less memory. Regular-expression matching uses lots of memory, but fixed-string matching does not need to.
grep -f can indeed consume a ton of memory; be glad you only received the "out of memory" error. I have crashed a 500 GB RAM server with a nasty grep -f before.
jaqx008: Are you grep'ing something super secret? Can you provide an actual example so we can avoid this endless back and forth? As @ATPoint said, there may be a more efficient way of doing whatever you are trying to do without using grep.
This is what I am trying to do: there are some gene IDs in a text file A, and lines possibly containing those gene IDs in a gff file B. I am trying to identify the matches in B and output all the lines of B that match (this should output all the columns of B). A has only one column of gene IDs; B has multiple columns, and one of those columns holds the gene IDs. So my command should pull out the lines of B corresponding to the IDs in A. Below is the command
Can you give us an idea of the numbers we're talking about here? File sizes? List length?
For most cases a solution like the one offered by arup (i.e. loop through your list and grep each of them) will solve your issue.
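That loop can be sketched like this (A.txt and B.gff are assumed names for your ID list and gff file, with toy contents for the demo):

```shell
# Toy stand-ins for the real files.
printf '2345\n' > A.txt
printf '4634\t-\tID=2345\n4353\t+\tID=23450\n' > B.gff

# One grep per ID: slow for long lists, but memory use stays tiny
# because only a single pattern is loaded at a time.
while IFS= read -r id; do
    grep -wF "$id" B.gff
done < A.txt > matches.gff
cat matches.gff
# prints only the ID=2345 line; 23450 is rejected by -w
```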
OK. The text files range between 100 bytes and about 50 kb, with word counts around 1500, and in this format
while the .gff files range from about 600 kb to about 900 kb, with a wc of about 5000 to 7000, in the format below
Also, I did try the loop, but it exits with the error
Where did you get your files from, and on which operating system? Did you open them in Windows or so? I had to look it up myself, but apparently this 'error' is related to the encoding of your file.
If you have dos2unix (or mac2unix) installed, you might run it on your files first to convert them to proper Unix encoding.
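A quick way to check for and fix DOS line endings, using tr as a portable stand-in for dos2unix (the file name and contents are made up for the demo):

```shell
# Simulate a file saved on Windows: lines end with \r\n.
printf 'gene1\r\ngene2\r\n' > A.txt

cat -A A.txt          # DOS lines show up as "gene1^M$"

# Strip the carriage returns (this is essentially what dos2unix does).
tr -d '\r' < A.txt > A.unix.txt && mv A.unix.txt A.txt
cat -A A.txt          # now plain "gene1$"
```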
What is the output of ulimit -a?
Oh I see. Well, the ulimit -a output is
Your account does not have a limit, so the error is due to one of the possibilities enumerated by others.
What are you grep-ing? Maybe a more efficient way would be to use tabix, depending on what you want to retrieve.
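If what you actually need is features by coordinate rather than by ID, a tabix sketch looks like this (assumes the htslib tools bgzip and tabix are installed, so the block guards for their presence; B.gff here is a toy file):

```shell
# Toy gff-like file; real files must be sorted by chromosome and start.
printf 'chr1\tsrc\tgene\t4500\t4600\t.\t+\t.\tID=2345\n' > B.gff
(grep '^#' B.gff; grep -v '^#' B.gff | sort -k1,1 -k4,4n) > B.sorted.gff

if command -v bgzip >/dev/null && command -v tabix >/dev/null; then
    bgzip -f B.sorted.gff                  # compress to B.sorted.gff.gz
    tabix -p gff B.sorted.gff.gz           # build the .tbi index
    tabix B.sorted.gff.gz chr1:4000-5000   # retrieve features in a region
fi
```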
Well, I can't post more than 5 times in 6 hours, and that's why I haven't responded. BTW, I am grep-ing a text file like
against a .gff file like
4634 - ID=2345
4353 + ID=3245
etc. It's working for some and giving a memory complaint for others.
Those patterns are too generic and likely generate many matches; that must be why you are running out of memory. Are you trying to pull out specific genes?
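If your IDs appear in the gff as `ID=<number>` (an assumption based on the examples above), one way to make the patterns less generic is to anchor them before grepping:

```shell
# Toy ID list and gff-like lines.
printf '2345\n' > A.txt
printf '4634\t-\tID=2345\n4353\t+\tID=23450\n' > B.gff

# Prefix each bare ID with "ID=" so "2345" cannot hit coordinates
# or longer IDs like 23450.
sed 's/^/ID=/' A.txt > patterns.txt
grep -wF -f patterns.txt B.gff
# prints only the ID=2345 line
```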
Once you post a certain number of times (and get rep points), that posting limit should go up.
Yes, I am trying to pull out certain gene IDs with their corresponding information. I posted in a comment above what my files look like. The thing is, it works fine for some files and not for others.
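A memory-light alternative worth trying for this exact task is an awk hash lookup: load the IDs from A once, extract the `ID=` token from each line of B, and print on an exact match. The file names, and the assumption that the ID lives in an `ID=...` attribute, are guesses based on the examples in this thread:

```shell
# Toy files standing in for A (IDs) and B (gff).
printf '2345\n' > A.txt
printf 'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=2345;x=y\nchr1\tsrc\tgene\t2\t200\t.\t-\t.\tID=23450\n' > B.gff

# Pass 1 (A.txt): store each ID as a hash key.
# Pass 2 (B.gff): pull out the ID=... token and test it exactly.
awk 'NR==FNR { ids[$1]; next }
     match($0, /ID=[^;[:space:]]+/) {
         id = substr($0, RSTART + 3, RLENGTH - 3)
         if (id in ids) print
     }' A.txt B.gff > matches.gff
cat matches.gff
# prints only the ID=2345;x=y line; 23450 fails the exact lookup
```

Because each ID comparison is a hash lookup rather than a pattern scan, memory use is bounded by the size of the ID list, not by the pattern set grep would build.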
This section of the GNU parallel manual may help you pipe both files, or a single file, automatically: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Parallel-grep. The manual covers each situation (e.g. limited RAM, limited CPU, etc.).
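A sketch of the parallel-grep pattern from that manual section, guarded in case GNU parallel is not installed (file names and toy contents are assumptions):

```shell
printf '2345\n' > A.txt
printf '4634\t-\tID=2345\n' > B.gff

if command -v parallel >/dev/null; then
    # Split B.gff into chunks; each grep works on one chunk at a time,
    # bounding memory use. --will-cite silences the citation notice.
    parallel --will-cite --pipepart -a B.gff --block 10M \
        grep -wF -f A.txt > matches.gff
else
    grep -wF -f A.txt B.gff > matches.gff   # plain fallback
fi
cat matches.gff
```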