Dear all, I am new to writing code and simple scripts. I have a local blastp result file that has multiple hits for a single accession number. Using the "grep" command, I was able to extract all the accession numbers and their "Score" lines into a separate file, and the results were as expected. Now I want to clean it up further. My file looks like this:
>QAA79571.1 polyprotein [West Nile virus]
Score = 1548 bits (4007), Expect = 0.0, Method: Compositional matrix adjust.
>QAA79502.1 polyprotein [West Nile virus]
Score = 1540 bits (3988), Expect = 0.0, Method: Compositional matrix adjust.
Score = 198 bits (504), Expect = 4e-52, Method: Compositional matrix adjust.
>QAA79547.1 polyprotein [West Nile virus]
Score = 1539 bits (3985), Expect = 0.0, Method: Compositional matrix adjust.
Score = 199 bits (505), Expect = 3e-52, Method: Compositional matrix adjust.
>QAA79538.1 polyprotein [West Nile virus]
Score = 1529 bits (3958), Expect = 0.0, Method: Compositional matrix adjust.
>QAA79567.1 polyprotein, partial [West Nile virus]
Score = 1499 bits (3882), Expect = 0.0, Method: Compositional matrix adjust.
I would like to keep only the records that have more than one "Score" line, together with their accession number. Expected output:
>QAA79502.1 polyprotein [West Nile virus]
Score = 1540 bits (3988), Expect = 0.0, Method: Compositional matrix adjust.
Score = 198 bits (504), Expect = 4e-52, Method: Compositional matrix adjust.
>QAA79547.1 polyprotein [West Nile virus]
Score = 1539 bits (3985), Expect = 0.0, Method: Compositional matrix adjust.
Score = 199 bits (505), Expect = 3e-52, Method: Compositional matrix adjust.
I tried some solutions with awk and grep but have not been able to get a working script. If the question has been asked before, please direct me to it. If not, please guide me on what a solution could be. I am also trying to learn R, but since I have only just started with scripting, it takes me a while to learn. Thank you so much. Regards
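One way to do this on the file you already have, as a minimal Python sketch rather than a definitive solution. It assumes every record starts with a ">" header line followed by its "Score" lines, exactly as in the sample above. The names filter_multi_score and min_scores are made up for this example.

```python
def filter_multi_score(lines, min_scores=2):
    """Return header and Score lines of records with at least min_scores Score lines."""
    records = []  # each record is [header, score_line, score_line, ...]
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A new accession header starts a new record.
            records.append([line])
        elif line.lstrip().startswith("Score") and records:
            # Attach the Score line to the most recent header.
            records[-1].append(line)

    kept = []
    for record in records:
        # record[0] is the header, the rest are Score lines.
        if len(record) - 1 >= min_scores:
            kept.extend(record)
    return kept


# Small demo using three records from the question.
sample = [
    ">QAA79571.1 polyprotein [West Nile virus]",
    "Score = 1548 bits (4007), Expect = 0.0, Method: Compositional matrix adjust.",
    ">QAA79502.1 polyprotein [West Nile virus]",
    "Score = 1540 bits (3988), Expect = 0.0, Method: Compositional matrix adjust.",
    "Score = 198 bits (504), Expect = 4e-52, Method: Compositional matrix adjust.",
    ">QAA79538.1 polyprotein [West Nile virus]",
    "Score = 1529 bits (3958), Expect = 0.0, Method: Compositional matrix adjust.",
]
print("\n".join(filter_multi_score(sample)))
```

On the demo input this should print only the QAA79502.1 record with its two "Score" lines. To run it on a real file, pass an open file handle instead of the list, for example filter_multi_score(open("scores.txt")), where scores.txt is a placeholder for your grep output file.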
I see that you are using the default alignment output of blast. You might consider (re)running blast with tabular output (the -outfmt 6 or -outfmt 7 option). That format is much easier to parse, and judging by what you want to get, it should also contain all the info you're looking for.
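To illustrate why the tabular format is easier to work with, here is a rough Python sketch. It assumes the default 12 tab-separated columns of -outfmt 6 (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), where each line is one alignment (HSP), and it skips the "#" comment lines that -outfmt 7 adds. The function name multi_hsp_subjects and the parameter min_hsps are made up for this example.

```python
from collections import defaultdict


def multi_hsp_subjects(lines, min_hsps=2):
    """Map (query, subject) to a list of (evalue, bitscore), keeping pairs with >= min_hsps rows."""
    hsps = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            # Skip blank lines and the comment lines of -outfmt 7.
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            # Not a default tabular row; ignore it (assumption: default columns).
            continue
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        hsps[(query, subject)].append((evalue, bitscore))

    # Keep only the query/subject pairs reported with several alignments.
    return {pair: rows for pair, rows in hsps.items() if len(rows) >= min_hsps}
```

Each key in the returned dictionary is a query/subject pair that appears on at least two rows, which corresponds to an accession with multiple "Score" lines in the default output.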