How To Join Text Files By A Certain Column (Accession Id)
4
3
Entering edit mode
8.8 years ago
2011101101 ▴ 110

I don't know quite how to say it, so here is an example that I hope makes it clear. There are two files, 1.txt and 2.txt. 1.txt looks like this:

gi|25254180|gb|CA674435.1|CA674435    gi|345895254|ref|YP_004841985.1|    96.30    108    4    0    22    345    16    123    4e-51     204
gi|93045967|gb|CJ563048.1|CJ563048    gi|225427470|ref|XP_002270678.1|    63.27    98    34    1    617    330    124    221    3e-48     141
gi|39557933|gb|CK195543.1|CK195543    gi|357139181|ref|XP_003571163.1|    87.95    83    10    0    739    491    64    146    1e-35     148


2.txt is this:

gi|93045967|gb|CJ563048.1|CJ563048


The result file is 3.txt, which should look like this:

gi|93045967|gb|CJ563048.1|CJ563048    gi|225427470|ref|XP_002270678.1|    63.27    98    34    1    617    330    124    221    3e-48     141


That's all. Thank you.

extraction • 4.3k views
13
Entering edit mode
8.8 years ago
lh3 32k

On this forum it is a frequent pitfall to use grep with -f alone. I already gave my answer in Retrieve Single Vcf From List Of Snp Rs#, but I feel it would be better to expand that answer a little.

grep has an option -f. According to the manpage: "-f FILE: Obtain patterns from FILE, one per line." Each line is interpreted as a regex pattern. I speculate that grep -f loads all the patterns into RAM and, for each input line, tries every pattern in turn. Thus if there are 1000 lines in the pattern file, grep -f will be 1000 times as slow as grepping a single pattern (i.e. linear in the number of patterns).

grep also has an option -F, which makes it equivalent to fgrep. fgrep uses the Aho–Corasick algorithm, which constructs a single DFA for all the pattern strings. When grep pushes an input line through the DFA, all patterns are tried at the same time, so the speed of grep -F is largely independent of the number of patterns (i.e. constant in the number of patterns). For fixed-string matching we should therefore use grep -Ff; otherwise it will be extremely slow (see the benchmark in Retrieve Single Vcf From List Of Snp Rs#). In addition, it is recommended to add the -w option to reduce false matches.
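To make the distinction concrete, here is a minimal sketch (the file names and contents are invented for the demo):

```shell
# A one-line pattern file and a two-line data file (names are made up).
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > patterns.txt
printf 'gi|93045967|gb|CJ563048.1|CJ563048\thit\nsomething else\tmiss\n' > data.txt

# -f alone: each pattern line is a REGEX, so '.' is a wildcard and the
# matching cost grows with the number of patterns.
grep -f patterns.txt data.txt

# -F -f: the patterns are fixed strings matched all at once;
# -w additionally requires word boundaries around the match.
grep -wFf patterns.txt data.txt
```

Both commands print the first line here; the difference shows up in speed (and in what '.' and other metacharacters mean) once the pattern file grows.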

I fell into this pitfall for many years myself. I used to ask my colleagues not to use grep -f because it is slow. But one day I suddenly felt something was wrong: the Aho–Corasick algorithm should be as fast as grepping a single pattern, so why was grep -f so slow? When I read the manpage, it became obvious that I had been misusing grep for a long time. That said, I have to say the design of both -f and -F is really confusing.

EDIT: the approximate answer is:

grep -wFf 2.txt 1.txt > 3.txt


which works most of the time, but may give false positives in rare cases where a pattern appears in another column. The accurate answer is:

awk 'BEGIN{while((getline<"2.txt")>0)l[$1]=1}l[$1]' 1.txt > 3.txt


which works all the time.
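A tiny made-up case shows why the awk version is exact: here the ID occurs only in the second column, so grep still fires, but awk, which compares column 1 only, stays silent.

```shell
# Invented data: the accession appears in column 2, not column 1.
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > 2.txt
printf 'queryA\tgi|93045967|gb|CJ563048.1|CJ563048\t99.9\n' > 1.txt

# grep matches anywhere on the line -> false positive here.
grep -wFf 2.txt 1.txt

# awk tests column 1 only -> prints nothing for this line.
awk 'BEGIN{while((getline<"2.txt")>0)l[$1]=1}l[$1]' 1.txt
```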

0
Entering edit mode

grep would have a nicer design if -w broke words only on whitespace. As it is, '|' and other common symbols count as word breaks. I agree, it's easy to get burned. I just wish -w behaved a bit differently, or were configurable in some way; it would make scenarios like this easier.

0
Entering edit mode

You say here to add -w to reduce false positive matches, but in the linked answer you say it may create false positives. That is confusing; it seems the former would be correct, and I don't understand how -w changes the output if -F is already treating the input as fixed strings. It seems like -F alone should do it, but it's hard to tell from a single example.

1
Entering edit mode

-F can match shorter strings, and -w could fix that if it behaved more nicely. If the file contains "hugabear papa" and you use something like grep -F -e "huga" file, it will match. -w enforces that an entire word matches: you'd need to specify -e "hugabear" (or -e "papa") to match. As I stated before, -w has the limitation that it breaks words on many non-whitespace characters, which in such cases essentially degenerates back to plain -F (with the same match-on-shorter-words problem).
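With the accession IDs from this thread, the effect looks like this (a sketch; note how '|' and '.' act as word breaks):

```shell
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > line.txt

# -F alone matches any substring of the line:
grep -F 'CJ5630' line.txt              # matches, though it is only part of the accession

# -w demands word boundaries, but '|' and '.' count as boundaries,
# so any full token between them still matches:
grep -wF 'CJ563048' line.txt           # matches: delimited by '|' and '.'
grep -wF 'CJ5630' line.txt || true     # no match: a word character ('4') follows
```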

0
Entering edit mode

Good example; I can now see clearly how -w matches differently.

0
Entering edit mode

If I have a header in both 1.txt and 2.txt, does this command still work?

4
Entering edit mode
8.8 years ago

Just to add another potential solution [making the assumption that we are continuing to live in a UNIX world]:

Command:

sort -k 1 1.txt | join 2.txt -


Result:

gi|93045967|gb|CJ563048.1|CJ563048 gi|225427470|ref|XP_002270678.1| 63.27 98 34 1 617 330 124 221 3e-48 141


HTH

If 2.txt has more than one entry, it is necessary to also sort it by its key (i.e. the first column) by executing

sort -k 1 2.txt > 2.sorted.txt


and modifying above command to:

sort -k 1 1.txt | join 2.sorted.txt -
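Put together with the thread's data (trimmed to two columns for brevity), the whole recipe looks like:

```shell
# Toy versions of the thread's files (a single extra column stands in for the rest).
printf 'gi|25254180|gb|CA674435.1|CA674435\tA\ngi|93045967|gb|CJ563048.1|CJ563048\tB\n' > 1.txt
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > 2.txt

# Both inputs to join must be sorted the same way on the join key.
sort -k 1 2.txt > 2.sorted.txt
sort -k 1 1.txt | join 2.sorted.txt -
```

Note that join collapses the tab separators to single spaces in its output, which may matter downstream.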

0
Entering edit mode

This problem is so simple to solve with sort and join that I wouldn't be arsed messing around with grep and fgrep, personally

0
Entering edit mode

Typically, 1.txt is a huge file or is already sorted by other fields. Sorting 1.txt back and forth is clumsy, inefficient and may be quite complicated (e.g. in case of VCF, you wouldn't want to sort the header). Unless 2.txt is too large to fit in RAM, the right solution is to load 2.txt in memory and filter 1.txt line by line. Join only becomes better when 2.txt is too large, which rarely happens. As to grep, that is the most efficient solution if you are sure there are no false positives.

0
Entering edit mode

Good comment regarding memory efficiency, but you are making quite a lot of assumptions about the structure/organisation of the data. If every computational problem followed the "typical" cases, we bioinformaticians would be pretty much out of work. We were given a very specific question with very limited data, and several solutions to this particular problem exist. This is one of them.

1
Entering edit mode

Your answer is fine. I have no problems with that. I was responding to the previous comment. For this question, grep/awk is simpler and more efficient. Recommending sort/join for this question while criticizing grep is misleading. BTW, a better version of your answer is: sort -k1 1.txt | join <(sort 2.txt) -.

0
Entering edit mode

Sorry, I obviously misread your comment - and yes, it's a better version :-)

0
Entering edit mode

I wasn't trying to criticise grep; it's just that in cases such as this I find sort, join, etc. much easier, as they're straightforward and do what they say on the tin. I wasn't considering efficiency, as I've never found myself waiting long enough to consider an alternative.

1
Entering edit mode
8.8 years ago
SES 8.5k

Here is one solution (edit: see also Retrieve Single Vcf From List Of Snp Rs# for an explanation of -F [fgrep]):

fgrep -f 2.txt 1.txt > 3.txt

0
Entering edit mode

Also add a -F to make things faster, since you're matching fixed strings (not regular expressions): fgrep -F -f 2.txt 1.txt > 3.txt

1
Entering edit mode

I think fgrep searches fixed strings from -f file. Same as grep -Ff, am I wrong?

0
Entering edit mode

Ah, yes, you are correct: fgrep is the same as grep -Ff. I often forget the various related grep-ish family programs. Sctoopid sjneph. Good catch.

1
Entering edit mode
8.8 years ago
2011101101 ▴ 110

Thank you. I have got my answer: fgrep -w -Ff 2.txt 1.txt > 3.txt

0
Entering edit mode

If you end up going with this specific solution you'll probably want to leave off the -w, as discussed above.

0
Entering edit mode

Why? I think we should use -w. If 1.txt contains a line abcgi|93045967|gb|CJ563048.1|CJ563048, grep -wFf will not match, but grep -Ff will give a false match.
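This is easy to reproduce with a made-up line:

```shell
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > 2.txt
printf 'abcgi|93045967|gb|CJ563048.1|CJ563048\tx\n' > 1.txt

grep -Ff 2.txt 1.txt            # false match: the pattern is a substring of the line
grep -wFf 2.txt 1.txt || true   # no match: 'c' touches 'g', so there is no word boundary
```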

0
Entering edit mode

Based on the previous comments I was concerned that -w would break up the words, but I guess fgrep enforces that it doesn't?