Is There A Perl Script To Parse A Blast File According To Gene Name (Gn=??) ?
1
0
Entering edit mode
11.8 years ago
tanjafiegel ▴ 40

I generate large BLAST files.

I have a perl script from http://www.bios.niu.edu/johns/bioinform/perl_blast_parser.htm which sorts my BLAST output file into the individual protein sequences within "my" database and lists the e-scores along with HSP and identities.

Is there a similar script that lists them under gene name? This would make sorting for me easier.

Many thanks.

• 4.1k views
ADD COMMENT
1
Entering edit mode

You should write a test case. Write an example of input file, and an example of the output that you want to get, and then post it here.

ADD REPLY
0
Entering edit mode

Do you have the gene name in the fasta headers or a mapping between the identifiers in fasta to gene name ?

ADD REPLY
0
Entering edit mode

No I don't have the gene nemae in the fasta headers. I don't think anyway.

I use the blastall -p blastx command to run reads against a protein-sequence-database I made with formatdb. So the BLAST output file I get gives me all the hits for each protein which lists the protein name in the header along with gene name as, for example, GN=mecA.

So the perl script I am incapable of writing myself, within this year anyway, sorts the hits like this

tr|H6LSH4|H6LSH4_STAAU:

tr|H6LSH4|H6LSH4_STAAU Undecaprenyl-diphosphatase OS=Staphylococcus aureus subsp. aureus VC40 GN=uppP PE=3 SV=1 Length = 291 3e-37 72 0 72 Plus 216 1 103 174

but it would be better to have them sorted according to GN=mecA or similar.

I just wondered if anyone knew if there was a script that did that.

:)

ADD REPLY
0
Entering edit mode

I agree with Giovanni's comment, because then it is easy to figure out what you need

ADD REPLY
1
Entering edit mode
11.8 years ago
Joachim ★ 2.9k

Save the output from perlblastparser into a file, which I will call test_data in the following.

The contents of test_data should look somewhat like this (note the empty line at the end of the file):

SW:X2:
     >SW:X2 A0KFJ6 Homoserine O-succinyltransferase GN=a PE=3 SV=1
     Length = 317
          e-118 201 0 299 Plus 1 299 1 299

SW:X3:
     >SW:X3 A0KFJ6 Homoserine O-succinyltransferase GN=b PE=3 SV=1
     Length = 317
          e-118 201 0 299 Plus

SW:X1:
     >SW:X1 A0KFJ6 Homoserine O-succinyltransferase GN=c PE=3 SV=1
     Length = 317
          e-118 201 0 299 Plus 1 299 1 299

Then create a Ruby script sortBy.rb like this:

lines = IO.readlines("test_data")

sortBy='GN='

entries = {}
n = 0

while n + 4 < lines.size do
        key = lines[n + 1].match(sortBy + '[^ ]+')[0].sub(sortBy, '')

        entries[key] = lines[n] + lines[n + 1] + lines[n + 2] + lines[n + 3] + lines[n + 4]

        n += 5
end

entries.keys.sort.each { |key|
        puts entries[key]
}

You can then sort your output by running: ruby sortBy.rb. Simply change the variable sortBy if you want to sort on something else than the GN field.

ADD COMMENT
0
Entering edit mode

Thank you very much for actually writing a ruby script!

Unfortunately it does give me a null pointer error message in line 9: "undefined method `[]' for nil:NilClass (NoMethodError)". This has probably to do with the file I get after the perl script, rather than your code. Unfortunately, I don't know how to fix it. But many thanks anyway!

ADD REPLY
0
Entering edit mode

You are welcome. The script will work for data that is formatted as in my example (SW:X2:, etc). If you have additional lines on top of your input, or at the very end (for example, a newline at the end of the input), then the scripts breaks. It might be good to post an example of your input next time, which you can format correctly by putting it in a separate paragraph and indenting each line by four spaces.

ADD REPLY

Login before adding your answer.

Traffic: 2532 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6