Question: How To Join Text Files By A Certain Column (Accession Id)
2011101101 wrote, 6.3 years ago:

I don't know quite how to describe this, so here is an example; I hope everyone understands me. There are two files, 1.txt and 2.txt. 1.txt looks like this:

gi|25254180|gb|CA674435.1|CA674435    gi|345895254|ref|YP_004841985.1|    96.30    108    4    0    22    345    16    123    4e-51     204
gi|93045967|gb|CJ563048.1|CJ563048    gi|225427470|ref|XP_002270678.1|    63.27    98    34    1    617    330    124    221    3e-48     141
gi|39557933|gb|CK195543.1|CK195543    gi|357139181|ref|XP_003571163.1|    87.95    83    10    0    739    491    64    146    1e-35     148

2.txt is this:

 gi|93045967|gb|CJ563048.1|CJ563048

The desired result file is 3.txt, which looks like this:

gi|93045967|gb|CJ563048.1|CJ563048    gi|225427470|ref|XP_002270678.1|    63.27    98    34    1    617    330    124    221    3e-48     141

That's all. Thank you.

modified 6.3 years ago by Sebastian Kurscheid • written 6.3 years ago by 2011101101
lh3 wrote, 6.3 years ago (United States):

In this forum, using grep with only -f is a frequent pitfall. I previously gave my answer in Retrieve Single Vcf From List Of Snp Rs#, but I feel it is worth expanding that answer a little.

grep has an option -f. According to its manpage, "-f FILE: obtain patterns from FILE, one per line", and each line is interpreted as a regex pattern. I speculate that in this case grep -f loads all the patterns into RAM and, for each input line, tries each pattern in turn. Thus if there are 1000 lines in the pattern file, grep -f will be roughly 1000 times as slow as grepping a single pattern (i.e. linear in the number of patterns).

grep also has an option -F, which makes it equivalent to fgrep. fgrep uses the Aho–Corasick algorithm, which constructs a single DFA for all the pattern strings. When grep pushes an input line through the DFA, all patterns are attempted at the same time, so the speed of grep -F is largely independent of the number of patterns (i.e. constant in the number of patterns). For fixed-string matching we should use grep -Ff; otherwise the speed will be extremely slow (see the benchmark in Retrieve Single Vcf From List Of Snp Rs#). In addition, it is recommended to add option -w to reduce false matches.
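Regex -f is not only slower for fixed strings; it can also be subtly wrong, because the '.' inside an accession is a regex metacharacter that matches any character. A minimal throwaway sketch (file names and sample contents are made up for illustration):

```shell
# Throwaway demo in a temp dir: with regex patterns (-f alone) the '.'
# in an accession matches ANY character, so a near-miss line also hits;
# -F treats the pattern as a literal string.
cd "$(mktemp -d)"
printf 'CJ563048.1 real hit\nCJ563048x1 near miss\n' > 1.txt
printf 'CJ563048.1\n' > 2.txt
grep -c -f  2.txt 1.txt   # prints 2: regex '.' also matched the 'x'
grep -c -Ff 2.txt 1.txt   # prints 1: fixed-string match only
```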

I fell into this pitfall for many years myself. I used to ask my colleagues not to use grep -f because it is slow. But one day I suddenly felt something was wrong: the Aho–Corasick algorithm should be about as fast as grepping one pattern, so why is grep -f so slow? When I read the manpage, it became obvious that I had been misusing grep for a long time. That said, I have to say the design of both -f and -F is really confusing.

EDIT: the approximate answer is:

grep -wFf 2.txt 1.txt > 3.txt

which works most of the time but may give false positives in rare cases where a pattern also appears in another column. The accurate answer is:

awk 'BEGIN{while((getline<"2.txt")>0)l[$1]=1}l[$1]' 1.txt > 3.txt

which works all the time.
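To make the awk answer concrete, here is a self-contained sketch using abbreviated versions of the question's files (throwaway files in a temp directory):

```shell
cd "$(mktemp -d)"
# abbreviated versions of the question's 1.txt and 2.txt
printf 'gi|25254180|gb|CA674435.1|CA674435\tYP_004841985.1\t96.30\n'  > 1.txt
printf 'gi|93045967|gb|CJ563048.1|CJ563048\tXP_002270678.1\t63.27\n' >> 1.txt
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > 2.txt
# BEGIN block reads every key of 2.txt into the array l; each line of
# 1.txt is then printed iff its first column is one of those keys
awk 'BEGIN{while((getline<"2.txt")>0)l[$1]=1} l[$1]' 1.txt > 3.txt
cat 3.txt   # only the CJ563048 record survives
```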

modified 6.3 years ago • written 6.3 years ago by lh3

grep would be a nicer design if -w broke up words only on whitespace. The '|' and other common symbols count as word breaks. I agree, it's easy to get burned. I just wish -w behaved a bit differently, or were configurable in some way. It would make scenarios like these easier.

modified 6.3 years ago • written 6.3 years ago by sjneph

You say here to add -w to reduce false-positive matches, but in the linked answer you say it may create false positives. That is confusing; the former seems correct, and I don't understand how -w changes the output if -F is already treating the input as fixed strings. It seems like -F alone should do it, but it's hard to tell from one example.

written 6.3 years ago by SES

The -F can match shorter strings, and -w could fix that if it behaved more nicely. If the file contains "hugabear papa" and you use something like grep -F -e "huga" file, it will match. The -w enforces that an entire word matches: you'd need to specify -e "hugabear" (or -e "papa") to match. As I said before, -w has the limitation that it breaks up words on many non-whitespace characters, which essentially degenerates back to plain -F (with the same match-on-shorter-words problem).
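Both behaviours can be checked directly; a throwaway sketch (the file names f1 and f2 are made up for illustration):

```shell
cd "$(mktemp -d)"
printf 'hugabear papa\n' > f1
grep -cF  -e 'huga' f1            # prints 1: plain -F matches the substring
grep -cwF -e 'huga' f1 || true    # prints 0: -w demands a whole word
# ...but '|' and '.' count as word breaks, so inside a pipe-delimited
# accession -w still matches the shorter piece:
printf 'gi|CJ563048.1|rest\n' > f2
grep -cwF -e 'CJ563048' f2        # prints 1
```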

written 6.3 years ago by sjneph

Good example; now I can see clearly how -w matches differently.

written 6.3 years ago by SES
Sebastian Kurscheid wrote, 6.3 years ago (Canberra, ACT, Australia):

Just to add another potential solution (assuming we are continuing to live in a UNIX world):

Command:

sort -k 1 1.txt | join 2.txt -

Result:

gi|93045967|gb|CJ563048.1|CJ563048 gi|225427470|ref|XP_002270678.1| 63.27 98 34 1 617 330 124 221 3e-48 141

HTH

If 2.txt has more than one entry, it is also necessary to sort it by its key (e.g. the first column) by executing

sort -k 1 2.txt > 2.sorted.txt

and modifying the above command to:

sort -k 1 1.txt | join 2.sorted.txt -
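Put together, an end-to-end sketch with abbreviated sample data from the question (throwaway files; join needs both inputs sorted on the key):

```shell
cd "$(mktemp -d)"
# abbreviated sample data
printf 'gi|25254180|gb|CA674435.1|CA674435\tYP_004841985.1\n'  > 1.txt
printf 'gi|93045967|gb|CJ563048.1|CJ563048\tXP_002270678.1\n' >> 1.txt
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > 2.txt
# sort both inputs on the first column, then join on it
sort -k 1 2.txt > 2.sorted.txt
sort -k 1 1.txt | join 2.sorted.txt - > 3.txt
cat 3.txt   # the single matching CJ563048 record
```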
modified 6.3 years ago • written 6.3 years ago by Sebastian Kurscheid

This problem is so simple to solve with sort and join that I wouldn't be arsed messing around with grep and fgrep, personally.

written 6.3 years ago by Daniel

Typically, 1.txt is a huge file or is already sorted by other fields. Sorting 1.txt back and forth is clumsy, inefficient, and may be quite complicated (e.g. for a VCF you wouldn't want to sort the header). Unless 2.txt is too large to fit in RAM, the right solution is to load 2.txt into memory and filter 1.txt line by line. join only becomes better when 2.txt is too large, which rarely happens. As for grep, it is the most efficient solution if you are sure there are no false positives.

modified 6.3 years ago • written 6.3 years ago by lh3

Good comment regarding memory efficiency, but you are making quite a lot of assumptions about the structure/organisation of the data. If every computational problem followed "typical" cases, we bioinformaticians would be pretty much out of work. We were given a very specific question with very limited data, and several solutions to this particular problem exist. This is one of them.

written 6.3 years ago by Sebastian Kurscheid

Your answer is fine. I have no problems with that. I was responding to the previous comment. For this question, grep/awk is simpler and more efficient. Recommending sort/join for this question while criticizing grep is misleading. BTW, a better version of your answer is: sort -k1 1.txt | join <(sort 2.txt) -.

modified 6.3 years ago • written 6.3 years ago by lh3

Sorry, I obviously misread your comment - and yes, it's a better version :-)

written 6.3 years ago by Sebastian Kurscheid

I wasn't trying to criticise grep, only saying that in cases such as this I find it much easier with sort and join etc., as it's straightforward and does what it says on the tin. I wasn't considering efficiency, as I've never found myself waiting long enough to consider an alternative.

written 6.3 years ago by Daniel
SES wrote, 6.3 years ago (Vancouver, BC):

Here is one solution (edit: see also Retrieve Single Vcf From List Of Snp Rs# for an explanation of -F [fgrep]):

fgrep -f 2.txt 1.txt > 3.txt
modified 6.3 years ago • written 6.3 years ago by SES

Also add a -F to make things faster, since you're matching fixed strings (not regular expressions): fgrep -F -f 2.txt 1.txt > 3.txt

written 6.3 years ago by sjneph

I think fgrep already searches for fixed strings from the -f file, the same as grep -Ff. Am I wrong?

written 6.3 years ago by SES

Ah, yes, you are correct: fgrep is the same as grep -Ff. I often forget the various related grep-ish family programs. Sctoopid sjneph. Good catch.

modified 6.3 years ago • written 6.3 years ago by sjneph
2011101101 wrote, 6.3 years ago:

Thank you. I have got the answer: fgrep -w -Ff 2.txt 1.txt > 3.txt

written 6.3 years ago by 2011101101

If you end up going with this specific solution you'll probably want to leave off the -w, as discussed above.

written 6.3 years ago by SES

Why? I think we should use -w. If 1.txt contains a line abcgi|93045967|gb|CJ563048.1|CJ563048, grep -wFf will not match it, but grep -Ff will give a false match.
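This scenario can be checked directly; a throwaway sketch (the file contents are made up to reproduce the case described):

```shell
cd "$(mktemp -d)"
# the accession appears with a stray 'abc' prefix glued onto it
printf 'abcgi|93045967|gb|CJ563048.1|CJ563048\trest\n' > 1.txt
printf 'gi|93045967|gb|CJ563048.1|CJ563048\n' > 2.txt
grep -c -Ff  2.txt 1.txt          # prints 1: substring false match
grep -c -wFf 2.txt 1.txt || true  # prints 0: the 'c' before 'gi' blocks -w
```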

written 6.3 years ago by lh3

I was concerned that it would break up the words based on previous comments but I guess fgrep will enforce that it doesn't?

written 6.3 years ago by SES
Powered by Biostar version 2.3.0