Hi all,
I have run BLASTp searches and a Pfam analysis on multiple genomes, so I now have numerous .pfam result files (text files), and I would like to extract only the identical matches between two genomes.
Instead of doing it manually, is there a bash command or some other tool to extract all identical words from text files?
I tried this:
comm -12 file1.txt file2.txt > output.txt
But, as I expected, this only gives me the lines common to both files, whereas I need all common words...
If you have any ideas, please let me know!
Thank you in advance for your help.
EDIT: file examples
Here is an example of file1:
# --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name accession query name accession E-value score bias E-value score bias exp reg clu ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- ---------------------
WxL PF13731.7 Lactococcus_893 - 3.7e-07 30.4 7.2 5.2e-07 29.9 7.2 1.2 1 0 0 1 1 1 1 WxL domain surface cell wall-binding
DUF916 PF06030.13 Lactococcus_894 - 7.3e-20 71.3 1.3 8.7e-20 71.0 1.3 1.0 1 0 0 1 1 1 1 Bacterial protein of unknown function (DUF916)
DUF3324 PF11797.9 Lactococcus_895 - 3.2e-31 108.2 1.3 3.9e-31 107.9 1.3 1.1 1 0 0 1 1 1 1 Protein of unknown function C-terminal (DUF3324)
APH PF01636.24 Lactococcus_896 - 3.2e-07 30.5 0.1 4.4e-07 30.0 0.1 1.1 1 0 0 1 1 1 1 Phosphotransferase enzyme family
Here is an example of file 2:
# --- full sequence ---- --- best 1 domain ---- --- domain number estimation ----
# target name accession query name accession E-value score bias E-value score bias exp reg clu ov env dom rep inc description of target
#------------------- ---------- -------------------- ---------- --------- ------ ----- --------- ------ ----- --- --- --- --- --- --- --- --- ---------------------
Beta-lactamase2 PF13354.7 Lactococcus_549 - 1.9e-33 115.8 0.0 2.8e-33 115.2 0.0 1.3 1 0 0 1 1 1 1 Beta-lactamase enzyme family
S1 PF00575.24 Lactococcus_550 - 0.014 15.8 0.7 0.47 10.9 0.7 2.3 1 1 0 1 1 1 0 S1 RNA binding domain
DUF3324 PF11797.9 Lactococcus_551 - 3.4e-08 33.2 0.5 3.4e-08 33.2 0.5 1.9 2 0 0 2 2 2 1 Protein of unknown function C-terminal (DUF3324)
Since DUF3324 is the only target name the two .txt files have in common, I would like an output file that simply looks like this:
DUF3324
I would be really thankful for any help with this...
Then simply cut the first column out of the two files (unix command, cut) and then find the common words (unix command, comm) in the columns that were cut out.
Are your BLAST outputs in the regular format, as TSV (-outfmt 6), or in some other format?
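The cut-then-comm suggestion above can be sketched roughly as follows. This is only a minimal sketch under some assumptions: the sample files are shortened, made-up stand-ins for the two example files in the question, the `workdir` variable and the `first_column` helper are hypothetical names, and awk is swapped in for cut because the example files look space-aligned rather than tab-separated (cut splits on a single tab by default). Note that comm expects sorted input, hence the sort -u.

```bash
#!/usr/bin/env bash
# Hypothetical sample data standing in for the two result files.
workdir=$(mktemp -d)
cat > "$workdir/file1.txt" <<'EOF'
# target name  accession  query name
WxL      PF13731.7   Lactococcus_893
DUF916   PF06030.13  Lactococcus_894
DUF3324  PF11797.9   Lactococcus_895
APH      PF01636.24  Lactococcus_896
EOF
cat > "$workdir/file2.txt" <<'EOF'
# target name  accession  query name
Beta-lactamase2  PF13354.7   Lactococcus_549
S1               PF00575.24  Lactococcus_550
DUF3324          PF11797.9   Lactococcus_551
EOF

# Print the first whitespace-separated field of every non-comment,
# non-empty line, sorted and de-duplicated (comm needs sorted input).
first_column() {
    awk '!/^#/ && NF { print $1 }' "$1" | sort -u
}

# comm -12 suppresses lines unique to either input,
# leaving only the names present in both files.
comm -12 <(first_column "$workdir/file1.txt") \
         <(first_column "$workdir/file2.txt") > "$workdir/common.txt"

cat "$workdir/common.txt"
```

With the sample data above, the only name printed is DUF3324. If the real files are strictly tab-separated, `cut -f1` should work in place of the awk call, though the `#` header lines would still need to be filtered out (for example with `grep -v '^#'`).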
Thank you for your reply, JC. All my output files are TSV (-outfmt 6). So technically, I would like to extract all the words that are common to the second column of all my .txt files.
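For more than two files, comm gets awkward, since it only compares two inputs at a time. One possible approach, sketched here under assumptions, is to let awk count in how many files each second-column value occurs and keep the values seen in every file. The file names, the `blastdir` variable, and the sample rows below are all made up for illustration; real -outfmt 6 output has twelve tab-separated columns, but only the second one matters here.

```bash
#!/usr/bin/env bash
# Hypothetical tab-separated sample files (shortened to three columns).
blastdir=$(mktemp -d)
printf 'q1\tsubjA\t98.5\nq2\tsubjB\t91.0\nq3\tsubjC\t88.2\n' > "$blastdir/genome1.txt"
printf 'q7\tsubjB\t97.1\nq8\tsubjC\t90.3\nq8\tsubjC\t85.0\n' > "$blastdir/genome2.txt"
printf 'q4\tsubjC\t99.0\nq5\tsubjD\t80.4\n' > "$blastdir/genome3.txt"

common_ids=$(awk -F'\t' '
    FNR == 1 { nfiles++ }                   # first line of each input file
    /^#/ || NF < 2 { next }                 # skip comments and short lines
    !seen[FILENAME, $2]++ { count[$2]++ }   # count each ID once per file
    END {
        # keep only the IDs that occurred in every file
        for (id in count) if (count[id] == nfiles) print id
    }
' "$blastdir"/genome*.txt | sort)

echo "$common_ids"
```

With these sample files, only subjC appears in the second column of all three, so that is the only ID printed. One caveat of this sketch: an empty input file is never counted in `nfiles`, so it is silently ignored rather than forcing an empty result.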
Could you give an example of what these BLASTp result files look like?
Why does this not work? Finding common lines in a TSV and then cutting out the column (or vice versa) are equivalent operations. Do the IDs in your match column contain whitespace?