How to extract identical words from multiple text files?
3.7 years ago
A_heath ▴ 160

Hi all,

I have done BLASTp searches on multiple genomes, followed by Pfam analysis. So now I have numerous .pfam result files (text files), and I would like to extract only the identical matches between two genomes.

Instead of doing it manually, is there a bash command (or similar) to extract all identical words from text files?

I tried this:

comm -12 file1.txt file2.txt > output.txt

But, as I expected, that only gives me the lines common to both files, while I need all common words...

If you have any ideas, please let me know!

Thank you in advance for your help.

EDIT: file examples

Here is an example of file1:

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
WxL                  PF13731.7  Lactococcus_893      -            3.7e-07   30.4   7.2   5.2e-07   29.9   7.2   1.2   1   0   0   1   1   1   1 WxL domain surface cell wall-binding
DUF916               PF06030.13 Lactococcus_894      -            7.3e-20   71.3   1.3   8.7e-20   71.0   1.3   1.0   1   0   0   1   1   1   1 Bacterial protein of unknown function (DUF916)
DUF3324              PF11797.9  Lactococcus_895      -            3.2e-31  108.2   1.3   3.9e-31  107.9   1.3   1.1   1   0   0   1   1   1   1 Protein of unknown function C-terminal (DUF3324)
APH                  PF01636.24 Lactococcus_896      -            3.2e-07   30.5   0.1   4.4e-07   30.0   0.1   1.1   1   0   0   1   1   1   1 Phosphotransferase enzyme family

Here is an example of file2:

#                                                               --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name        accession  query name           accession    E-value  score  bias   E-value  score  bias   exp reg clu  ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ -----   --- --- --- --- --- --- --- --- ---------------------
Beta-lactamase2      PF13354.7  Lactococcus_549      -            1.9e-33  115.8   0.0   2.8e-33  115.2   0.0   1.3   1   0   0   1   1   1   1 Beta-lactamase enzyme family
S1                   PF00575.24 Lactococcus_550      -              0.014   15.8   0.7      0.47   10.9   0.7   2.3   1   1   0   1   1   1   0 S1 RNA binding domain
DUF3324              PF11797.9 Lactococcus_551       -            3.4e-08   33.2   0.5   3.4e-08   33.2   0.5   1.9   2   0   0   2   2   2   1 Protein of unknown function C-terminal (DUF3324)

Since DUF3324 is the only target name that the two .txt files have in common, I would like the output file to contain simply:

DUF3324

I'd be really grateful if you could help me with this.


I would want in an output file simply like this:

DUF3324

Then simply cut the first column out of two files (unix command, cut) and then find the common words (unix command, comm) in the columns that were cut out.
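As a minimal sketch of that two-step approach (the sample data below stands in for the real .pfam files; `grep -v '^#'` drops the `#`-prefixed header lines):

```shell
# Create tiny stand-ins for the real .pfam result files.
printf '# header\nWxL x\nDUF3324 y\n' > file1.txt
printf '# header\nS1 x\nDUF3324 z\n' > file2.txt

# Step 1: drop header lines, cut out the first column, and sort it.
grep -v '^#' file1.txt | cut -d ' ' -f 1 | sort > ids1.txt
grep -v '^#' file2.txt | cut -d ' ' -f 1 | sort > ids2.txt

# Step 2: comm -12 keeps only the lines common to both sorted inputs.
comm -12 ids1.txt ids2.txt
```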


Is your BLAST output in the regular format, TSV (-outfmt 6), or something else?


Thank you for your reply JC. All my output files are TSV (-outfmt 6). So technically, I would like to extract all the common words in the second column across all my .txt files.
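For tab-separated output like this, the same cut/comm idea applies directly: `cut` splits on tabs by default, and the ID of interest is in column 2. A sketch with made-up sample data (file names are placeholders):

```shell
# Tiny stand-ins for two -outfmt 6 BLAST result files (tab-separated).
printf 'q1\tDUF916\nq2\tDUF3324\n' > hits1.tsv
printf 'q1\tAPH\nq2\tDUF3324\n'    > hits2.tsv

# sort -u also removes duplicate IDs within each file before comparing.
comm -12 <(cut -f 2 hits1.tsv | sort -u) <(cut -f 2 hits2.tsv | sort -u)
```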


Could you give an example of what these BLASTp result files look like?


Why does this not work? Finding common lines in a TSV and then cutting out the column (or vice versa) are equivalent operations. Do your IDs in the match column contain whitespace?

3.7 years ago
Joe 21k

The command you found is correct, you just need to use it correctly:

comm -12 <(tail -n+4 file1.txt | cut -d ' ' -f 1 | sort) <(tail -n+4 file2.txt | cut -d ' ' -f 1 | sort)

You don't need to do all of this in one line if you don't want to, but you do need to:

  1. Trim the header lines out (tail -n+4)
  2. Isolate just the first column, as defined by whitespace (cut -d ' ' -f 1)
  3. sort the files in order for comm to work (if they aren't already sorted, which these aren't).

Yes, it's working perfectly, thank you so much Joe! I didn't think comm was the right command for this, but I was wrong. I've read multiple times that the files need to be sorted with sort before running comm, as you did; why is that necessary? What exactly is a "sorted file"?


A sorted file is exactly what it sounds like: a file whose lines are ordered in some way, whether alphabetically, numerically (e.g. natural/integer sorting), or otherwise.

In brief, comm needs the files to be sorted so that it can tell if the lines in the files are the same, since it doesn't look at the whole file at once.

See the toy example below:

$ cat com1
abc
def
hij

$ cat com2
hij
abc
def

Going line by line, comm can tell immediately that line 1 differs between the two files, but until it has seen the whole file it has no way to know that def also occurs in com2. If the files are sorted, however, comm 'knows' that common lines cannot be scattered throughout the file and must occur adjacent to one another. Otherwise you'll get the following behaviour:

$ comm com1 com2
abc
def
        hij
comm: file 2 is not in sorted order
    abc
    def
$ comm <(sort com1) <(sort com2)
        abc
        def
        hij
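Putting this together with the -12 flag from the answer above leaves only the common lines. The same toy files, recreated here so the snippet is self-contained:

```shell
# Recreate the toy files from the example above.
printf 'abc\ndef\nhij\n' > com1
printf 'hij\nabc\ndef\n' > com2

# Sort both inputs, then suppress columns 1 and 2 (lines unique to either file),
# leaving only the lines common to both.
comm -12 <(sort com1) <(sort com2)
```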

comm, cut, sort and a whole host of other useful text-manipulation tools are part of the Unix coreutils. It will save you hours/days/weeks further down the line if you spend some time getting familiar with them now.


Well, let me just say thank you so much Joe for this clear explanation! So helpful. Sorry for my ignorance; I'm just starting out with bash commands, so I'll definitely look into the coreutils for future use.

And if I understood the comm -12 syntax correctly, it cannot be run on more than two files at once, right? My attempt at a loop failed, but that makes total sense.


When in doubt, check man comm (the manual entry/in-line help for comm). The tool description tells you what it can do.

For comm that produces:

comm -- select or reject lines common to two files (on macOS)

comm - compare two sorted files line by line (on GNU/Linux)

-12 is 'short-hand' for -1 -2. If you look at the help (or manual output as genomax suggests), you'll see that there are 3 options: -1, -2, and -3.

comm --help gives:

With no options, produce three-column output.  Column one contains
lines unique to FILE1, column two contains lines unique to FILE2
and column three contains lines common to both files.

  -1              suppress column 1 (lines unique to FILE1)
  -2              suppress column 2 (lines unique to FILE2)
  -3              suppress column 3 (lines that appear in both files)

You can then combine these flags depending on what you do and don't want to see. By default, comm prints column 1 for lines unique to file 1 (i.e. not in file 2), column 2 for the inverse (lines unique to file 2), and column 3 for lines common to both.

Since the latter (column 3) is what you were interested in, you suppress columns 1 and 2 with -1 -2 (which is the same as -12).

In short, with comm, no you cannot compare more than 2 files at once to do 3 way or greater comparisons. You can artificially compare 1 file with n files by temporarily concatenating them for example, but this is not an entirely equivalent operation.

Other tools exist, and one could certainly write a relatively small script to compare the overlaps between three or more files, but that becomes combinatorially and computationally unwieldy very quickly.
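One common shell idiom for the n-file case, sketched below with placeholder file names and toy data: de-duplicate the IDs within each file, then keep the IDs whose total count equals the number of files. This assumes one ID per line after the per-file extraction step.

```shell
# Toy stand-ins for three per-genome ID lists (one ID per line).
printf 'A\nB\nB\nC\n' > f1.txt
printf 'B\nC\n'       > f2.txt
printf 'C\nA\n'       > f3.txt

# De-duplicate within each file so an ID counts at most once per file.
sort -u f1.txt > u1
sort -u f2.txt > u2
sort -u f3.txt > u3

# An ID common to all 3 files appears exactly 3 times across the
# de-duplicated lists; adjust the 3 to match your number of files.
sort u1 u2 u3 | uniq -c | awk '$1 == 3 {print $2}'
```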

3.7 years ago
JC 13k

You're going to need some scripting; here is how I would do it in Perl:

#!/usr/bin/perl

use strict;
use warnings;

my $num_files = 0;
my %counts = ();
foreach my $file (@ARGV) {
    my %cnt = ();
    $num_files++;
    open (my $fh, "<", $file) or die "cannot read $file\n";
    while (<$fh>) {
        next if /^#/; # skip header/comment lines
        my ($id1, $id2) = split (/\t/, $_);
        next unless defined $id2;
        # increment the global count only the first time this ID is seen in this file
        $counts{$id2}++ unless $cnt{$id2}++;
    }
    close $fh;
}

foreach my $id (keys %counts) {
    print "$id\n" if ($counts{$id} == $num_files);
}

then you can run it as:

perl getCommon.pl *txt > common_ids.txt

Thank you so much for your help. Unfortunately, I got the following error message when I ran this script:

    Global symbol "$counts" requires explicit package name (did you forget to declare "my $counts"?) at getCommon.pl line 24.
Experimental keys on scalar is now forbidden at getCommon.pl line 24.
Type of arg 1 to keys must be hash or array (not scalar dereference) at getCommon.pl line 24, near "$counts) "
Execution of getCommon.pl aborted due to compilation errors.

There seems to be a problem with a package variable, among other things. Do you know how I can fix that?

Also, is there a "standard" method to extract all common words from any .txt files? I have the same issue with .pfam files.

Thank you once again for your precious help.


I misspelled a variable; it is fixed now.
