How to extract entire chunks of text from a Profile Hidden Markov .hmm file
1
0
Entering edit mode
5.3 years ago
jxi21 • 0

Hello all, I am currently using a Hidden Markov model-based approach to detect viruses in metagenomics data. I use a profile made by the Pasteur institut based on vFAMs by Peter Skewes-Cox et al., 2014.

After using the profile with HMMer and providing translated contigs in every reading frame, the HMMs were able to identify the expected viruses in positive controls. Nonetheless, a lot of matches (with an evalue of 10^-10 or less for both conditional and independent) match to bacterial regions with 100% identity and ~98% coverage according to BLAST.

These false positives have something in common: according to the HMMs they match to endogenous retroviruses or giant viruses proteins (example: Zn-dependent alcohol dehydrogenase, ABC transporter, etc).

Therefore, I decided to see if I can eliminate these entries from the profile so I can diminish the false positives and made a list of all the families that have an annotation related to retroviruses or giant viruses.

I copy a chunk of my profile here as explanation:

HMMER3/f [3.1b2 | February 2015]
NAME  FAM007957
LENG  1078
ALPH  amino
RF    no
MM    no
CONS  yes
CS    no
MAP   yes
DATE  Fri Oct 12 20:02:22 2018
NSEQ  7
EFFN  0.591309
CKSUM 134316360
STATS LOCAL MSV      -12.5867  0.69540
STATS LOCAL VITERBI  -13.9281  0.69540
STATS LOCAL FORWARD   -6.9899  0.69540
HMM          A        C        D        E        F        G        H        I        K        L        M        N        P        Q        R        S        T        V        W        Y   
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   2.52786  4.09835  2.76055  2.58333  3.30703  2.91930  3.80486  2.88354  2.60376  2.56225  3.71312  2.89938  3.51565  3.18472  2.93829  2.53713  2.89512  2.66587  4.91819  3.50321
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.16684  3.93795  2.00858  0.61958  0.77255  0.00000        *
//
HMMER3/f [3.1b2 | February 2015]
NAME  FAM006805
LENG  283
ALPH  amino
RF    no
MM    no
CONS  yes
CS    no
MAP   yes
DATE  Fri Oct 12 20:20:45 2018
NSEQ  8
EFFN  0.714844
CKSUM 174391985
STATS LOCAL MSV      -11.1126  0.70178
STATS LOCAL VITERBI  -11.7648  0.70178
STATS LOCAL FORWARD   -5.4313  0.70178
HMM          A        C        D        E        F        G        H        I        K        L        M        N        P        Q        R        S        T        V        W        Y   
            m->m     m->i     m->d     i->m     i->i     d->m     d->d
  COMPO   2.58563  4.40070  2.84295  2.49411  3.55282  3.12077  3.71148  2.77600  2.56241  2.36701  3.54429  2.93369  3.66844  3.05176  2.79705  2.67258  2.87961  2.67320  4.73491  3.80457
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.02701  4.02100  4.74335  0.61958  0.77255  0.00000        *
      1   3.09160  4.61822  4.21161  3.81854  3.28069  3.94629  4.51938  2.47147  3.57779  1.85500  1.11955  4.07700  4.40970  3.95105  3.76521  3.45517  3.40087  2.49434  5.14000  3.91374      1 m - - -
          2.68618  4.42225  2.77519  2.73123  3.46354  2.40513  3.72494  3.29354  2.67741  2.69355  4.24690  2.90347  2.73739  3.18146  2.89801  2.37887  2.77519  2.98518  4.58477  3.61503
          0.02701  4.02100  4.74335  0.61958  0.77255  0.48576  0.95510
//

My question is, how could I make a clean cut of the matrix comprised between HMMER3/f [3.1b2 | February 2015] and the // characters and matches to the names in my list (NAME FAM006805 as in the header).

I appreciate any suggestions. Thanks!

Francisco Iturralde-Martinez

gene • 1.3k views
ADD COMMENT
1
Entering edit mode

Can you give an example of the format of the output you’re looking for?

ADD REPLY
0
Entering edit mode

Hello, thanks for your response, I expected to have it in the same matrix format (HMM profile), only eliminating those that match by name.

Your code actually works but is giving me as result those that match to the list, how can I make it so it gives me those that are not in the list?

ADD REPLY
0
Entering edit mode

It is fine, I can also previously filter the list to have it for all the entries that match instead. Thanks so much!

ADD REPLY
0
Entering edit mode

I edit the code so it now only prints results which doesn't match the name.

I moved my comment to an answer.

fin swimmer

ADD REPLY
2
Entering edit mode
5.3 years ago

Something like this?

$ cat list.txt
FAM007956

$ awk 'NR==FNR { id[$1]++; next } {RS="//"; ORS="//"; FS="\n";  match($0,/NAME[ \t]+([^\n]+)/, name); if(name[1] in id == 0) {print}}' list.txt inputfile > outputfile

fin swimmer

EDIT: Now the code prints only those records, which are not listed in list.txt

ADD COMMENT

Login before adding your answer.

Traffic: 1607 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6