Question: Subtractive Genomics Analysis
11 months ago, waqarlodhi93 wrote:

EDITED

This is what my dataset looks like:

  Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
    (strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1

    Length=788
                                                                          Score     E
    Sequences producing significant alignments:                          (Bits)  Value

      sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=...  109     8e-24
      tr|A8K1E1|A8K1E1_HUMAN cDNA FLJ75589, highly similar to Homo sa...  107     4e-23
      sp|P20585|MSH3_HUMAN DNA mismatch repair protein Msh3 OS=Homo s...  107     4e-23
      tr|B4DSB9|B4DSB9_HUMAN cDNA FLJ51069, highly similar to DNA mis...  102     1e-21
      tr|B4DL39|B4DL39_HUMAN cDNA FLJ57316, highly similar to DNA mis...  102     1e-21
      tr|A0A2R8YFH0|A0A2R8YFH0_HUMAN DNA mismatch repair protein OS=H...  101     3e-21
      tr|A0A2R8Y6P0|A0A2R8Y6P0_HUMAN DNA mismatch repair protein OS=H...  101     3e-21
      tr|B4DN49|B4DN49_HUMAN DNA mismatch repair protein OS=Homo sapi...  101     3e-21
      tr|E9PHA6|E9PHA6_HUMAN DNA mismatch repair protein OS=Homo sapi...  101     3e-21
      sp|P43246|MSH2_HUMAN DNA mismatch repair protein Msh2 OS=Homo s...  101     3e-21
      tr|Q53GS1|Q53GS1_HUMAN DNA mismatch repair protein (Fragment) O...  101     3e-21
      tr|A0A2R8YG02|A0A2R8YG02_HUMAN DNA mismatch repair protein OS=H...  101     3e-21
      tr|Q53FK0|Q53FK0_HUMAN DNA mismatch repair protein (Fragment) O...  100     6e-21
      tr|B4DZX3|B4DZX3_HUMAN cDNA FLJ54211, highly similar to MutS pr...  90.1    5e-18
      tr|A0A0G2JJ70|A0A0G2JJ70_HUMAN MSH5-SAPCD1 readthrough (NMD can...  89.7    5e-18
      tr|A2ABF0|A2ABF0_HUMAN cDNA FLJ39914 fis, clone SPLEN2018732, h...  89.7    5e-18
      tr|Q9UFG2|Q9UFG2_HUMAN Uncharacterized protein DKFZp434C1615 (F...  87.0    6e-18
      tr|H0YF11|H0YF11_HUMAN MSH5-SAPCD1 readthrough (NMD candidate) ...  87.0    6e-18

    > sp|O15457|MSH4_HUMAN MutS protein homolog 4 OS=Homo sapiens OX=9606 
    GN=MSH4 PE=1 SV=2
    Length=936

     Score = 109 bits (273),  Expect = 8e-24, Method: Compositional matrix adjust.
     Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)



    > tr|Q0QEN7|Q0QEN7_HUMAN ATP synthase subunit beta (Fragment) OS=Homo 
    sapiens OX=9606 GN=ATP5B PE=2 SV=1
    Length=445

     Score = 590 bits (1522),  Expect = 0.0, Method: Compositional matrix adjust.
     Identities = 300/448 (67%), Positives = 357/448 (80%), Gaps = 12/448 (3%)
    --
    Query  423  SYVPVAETVRGFKEILEGKHDNLPEEAF  450
                  VP+ ET++GF++IL G++D+LPE+AF
    Sbjct  416  KLVPLKETIKGFQQILAGEYDHLPEQAF  443


    > tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo 
    sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
    Length=362

     Score = 459 bits (1182),  Expect = 1e-158, Method: Compositional matrix adjust.
     Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
    --
    Query  342  DPLASSSSALAPEIVGEEHYEVATEVQ  368
                DPL S+S  + P IVG EHY+VA  VQ
    Sbjct  336  DPLDSTSRIMDPNIVGSEHYDVARGVQ  362


    > tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial 
    (Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
    Length=270

     Score = 281 bits (720),  Expect = 1e-90, Method: Compositional matrix adjust.
     Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
    --
    Query  265  LGRMPSAVGYQPTLATEMGQLQERITSTKKGSITSIQAIYVPADDYTD  312
                LGR+PSAVGYQPTLAT+MG +QERIT+TKKGSITS+QAIYVPADD TD
    Sbjct  223  LGRIPSAVGYQPTLATDMGTMQERITTTKKGSITSVQAIYVPADDLTD  270





    The output I want is:

    Query= sp|Q835H3|MUTS2_ENTFA Endonuclease MutS2 OS=Enterococcus faecalis
    (strain ATCC 700802 / V583) OX=226185 GN=mutS2 PE=3 SV=1

    > tr|H0YH81|H0YH81_HUMAN ATP synthase subunit beta (Fragment) OS=Homo 
    sapiens OX=9606 GN=ATP5F1B PE=1 SV=1
    Length=362

     Score = 459 bits (1182),  Expect = 1e-158, Method: Compositional matrix adjust.
     Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)

    > tr|F8W0P7|F8W0P7_HUMAN ATP synthase subunit beta, mitochondrial 
    (Fragment) OS=Homo sapiens OX=9606 GN=ATP5F1B PE=1 SV=2
    Length=270

     Score = 281 bits (720),  Expect = 1e-90, Method: Compositional matrix adjust.
     Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)

    I want the Query of each strain along with the hits that have Identities of 70% or greater.

What have you tried so far? Please post your current code so people can provide you feedback on that and help you with this question.

written 11 months ago by Sej Modha

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.

written 11 months ago by WouterDeCoster

Thanks for your help, but unfortunately it's not working. I actually want to keep the header and then process the rest with your grep command, so it would be great if you could help me out with this.

written 11 months ago by waqarlodhi93

Hi there,

Please note that it is not recommended to post additional comments and follow-up questions as answers; please use ADD REPLY to comment on the posted solution. I have reformatted this for you this time.

written 11 months ago by Sej Modha

Hello waqarlodhi93,

See https://www.gnu.org/software/grep/manual/grep.html for more information about context line control, and check out the -A parameter.

2.1.5 Context Line Control
Context lines are non-matching lines that are near a matching line. They are output only if one of the following options are used. Regardless of how these options are set, grep never outputs any given line more than once. If the -o (--only-matching) option is specified, these options have no effect and a warning is given upon their use.

-A num
--after-context=num
Print num lines of trailing context after matching lines.

-B num
--before-context=num
Print num lines of leading context before matching lines
written 11 months ago by Sej Modha
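
As a small illustration of the context options quoted above, here is a sketch that runs grep with -B and -A on a made-up sample. The file contents and the variable names (sample, before, after) are invented for the demo and are not part of the original data.

```shell
# Build a tiny two-record sample file (made-up data, for illustration only).
sample=$(mktemp)
printf '%s\n' \
    'header A' 'Length=10' 'Identities = 5/10 (50%)' 'tail A' \
    'header B' 'Length=20' 'Identities = 18/20 (90%)' 'tail B' > "$sample"

# -F treats the pattern as a fixed string, so the parentheses are literal.
# -B 2 prints the two lines before each matching line.
before=$(grep -F -B 2 '(90%)' "$sample")
# -A 1 prints one line after each matching line.
after=$(grep -F -A 1 '(90%)' "$sample")

printf 'With -B 2:\n%s\n\n' "$before"
printf 'With -A 1:\n%s\n' "$after"
rm -f "$sample"
```

When several matches are far apart, GNU grep separates the groups of context with a line containing only --, which is probably where the -- lines in the dataset above come from.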
11 months ago, sacha wrote:

Use a regular expression with grep to select lines such as Identities = 228/327 (70%) and print the 5 lines before each match (-B 5). 70% or more can be expressed as: ([7-9]\d|100)

grep -P -B5 'Identities = \d+/\d+\s\(([7-9]\d|100)%' your_file.txt

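
To see which lines the percentage pattern accepts, the following sketch checks a few Identities lines copied from the question. The helper name is_high_identity is hypothetical, and the pattern is rewritten with grep -E and [0-9] instead of -P and \d, on the assumption that PCRE support may not be available in every grep build.

```shell
# Hypothetical helper: succeeds when the Identities percentage is 70 to 100.
# [7-9][0-9] covers 70-99, and the alternative 100 covers the top value.
is_high_identity() {
    printf '%s\n' "$1" |
        grep -Eq 'Identities = [0-9]+/[0-9]+ \(([7-9][0-9]|100)%\)'
}

# Try the pattern on Identities lines taken from the question.
while IFS= read -r line; do
    if is_high_identity "$line"; then
        echo "keep: $line"
    else
        echo "skip: $line"
    fi
done <<'EOF'
 Identities = 71/228 (31%), Positives = 118/228 (52%), Gaps = 8/228 (4%)
 Identities = 300/448 (67%), Positives = 357/448 (80%), Gaps = 12/448 (3%)
 Identities = 228/327 (70%), Positives = 265/327 (81%), Gaps = 7/327 (2%)
 Identities = 137/168 (82%), Positives = 151/168 (90%), Gaps = 6/168 (4%)
EOF
```

Because the pattern is tied to the text Identities =, a high Positives value on the same line (80% in the second sample) does not cause a match.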

Thanks @sacha, your command is really helpful, but I want something more. My dataset and the output I want are the same as shown in the question above: I want to keep the Query header of each strain and list only the hits that have Identities of 70% or greater.
written 11 months ago by waqarlodhi93

So, just remove the header (with awk, for instance) and apply my previous command line.

cat test.txt |awk 'BEGIN{keep=0}{if ($0 ~ "^>"){keep=1} if (keep == 1) print($0)}'|grep -P -B5 'Identities = \d+/\d+\s\(([7-9]\d|100)%'
written 11 months ago by sacha (modified 11 months ago by Sej Modha)
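
For the variant asked about in this thread (keep the Query= header and print only hits with Identities of 70% or more), one possible approach is a single awk pass instead of awk plus grep. This is only a sketch and not the command posted above: the function name filter_hits and the file name in the usage comment are hypothetical. It assumes each hit starts with a > line, that the Query= header ends at the first blank line, and it looks only at the first Identities line of each hit.

```shell
# Hypothetical helper: filter_hits FILE [MIN_PERCENT]
# Prints each Query= header once, followed by the hits at or above MIN_PERCENT.
filter_hits() {
    awk -v min_id="${2:-70}" '
        # Print the buffered hit if it passed; print the query header first.
        function emit() {
            if (hit != "" && keep) {
                if (!shown) { print qhdr; print ""; shown = 1 }
                print hit; print ""
            }
            hit = ""; keep = 0
        }
        # A new query: remember its header lines up to the next blank line.
        /^[ \t]*Query=/ { emit(); qhdr = $0; in_q = 1; shown = 0; in_hit = 0; next }
        in_q { if ($0 ~ /^[ \t]*$/) in_q = 0; else qhdr = qhdr "\n" $0; next }
        # A new hit: start buffering its lines.
        /^[ \t]*>/ { emit(); hit = $0; in_hit = 1; seen = 0; next }
        # Buffer hit lines up to and including the first Identities line.
        in_hit && !seen {
            hit = hit "\n" $0
            if ($0 ~ /Identities = /) {
                pct = 0
                if (match($0, /\([0-9]+%\)/))
                    pct = substr($0, RSTART + 1, RLENGTH - 3) + 0
                keep = (pct >= min_id + 0)
                seen = 1
            }
        }
        END { emit() }
    ' "$1"
}

# Example call (blast_output.txt is a placeholder file name):
# filter_hits blast_output.txt 70 > filtered.txt
```

Alignment rows such as Query  423 ... are not mistaken for headers because the header pattern requires Query= with the equals sign. A query with no hit above the threshold produces no output at all.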