Inquiry: How to parse blast.txt standard output?
9.0 years ago
hlyates ▴ 10

Are there any existing tools or scripts that can parse standard blast.txt output? I can provide an example of my output if that would help.

I am learning bioinformatics and recently ran my first major scripts, which used blastn and produced standard blast.txt output. I need a tool that parses the output and reports the following:

  1. Significant e-value
  2. Source sequence (my fasta sequence that was my input against the nt database)
  3. Target species ID, name, etc. (the nt sequence my input aligned with)

I am somewhat familiar with Python and feel it would be the best tool for me to start with. Thank you in advance for your assistance and for the time you spent reading this question.

blast python • 4.6k views

I am thinking about rerunning this and producing tabular output. If I did, would there be a tool to easily do the same thing as I wrote above?


Tabular output is much easier to parse than the text format or some of the other formats. You can also more easily customize the output in tabular formats.


Can Biopython easily parse this format as well?


Not sure about Biopython, but parsing tab-delimited files in Python is very easy.
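To illustrate the point about tab-delimited files, here is a minimal sketch using only the standard library. It assumes the file was produced with plain `-outfmt 6`, so each line has the 12 default columns. The function name `read_blast_tab` and the example line are made up for illustration.

```python
import csv
import io

# Column names of the default -outfmt 6 layout (12 tab-separated fields).
DEFAULT_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_tab(handle, fields=DEFAULT_FIELDS):
    """Yield one dict per hit line, keyed by column name (values stay strings)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and the '#' comment lines that -outfmt 7 adds.
        if not row or row[0].startswith("#"):
            continue
        yield dict(zip(fields, row))


# Made-up line standing in for a real results file.
example = "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t501\t700\t1e-95\t350\n"
hits = list(read_blast_tab(io.StringIO(example)))
print(hits[0]["qseqid"], hits[0]["sseqid"], float(hits[0]["evalue"]))
```

For a real file, you would pass `open("results.tsv")` (a hypothetical filename) instead of the `io.StringIO` object.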

9.0 years ago
arnstrm ★ 1.8k

If you're familiar with Python, then use Biopython. You can easily parse all the information you want from the standard BLAST output.

Yes, tabular output (format 6) will give you items 1 and 2. For taxonomy, I would recommend adding the staxids field to the default tabular columns:

-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids"

The taxonomy IDs can then be used to get the full lineage information from the taxonomy database (you can also do it with GI IDs, but this is more straightforward).
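As a rough sketch of how the extra column could be read, the snippet below collects the taxonomy IDs seen for each query. It assumes the exact 13-column layout from the `-outfmt` string above, with `staxids` last and multiple IDs separated by ';'. The function name `taxids_per_query` and the example hits are invented, and the lineage lookup itself is left out.

```python
from collections import defaultdict

# Column order matching the -outfmt string above (staxids is the 13th field).
FIELDS = ("qseqid sseqid pident length mismatch gapopen "
          "qstart qend sstart send evalue bitscore staxids").split()


def taxids_per_query(lines):
    """Map each query ID to the set of subject taxonomy IDs among its hits."""
    result = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(FIELDS, line.split("\t")))
        ids = result[hit["qseqid"]]  # registers the query even without taxids
        for taxid in hit["staxids"].split(";"):
            # Non-numeric values such as "N/A" are skipped.
            if taxid.isdigit():
                ids.add(int(taxid))
    return dict(result)


# Made-up hits for illustration only.
example_lines = [
    "q1\tsubjA\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\t9606\n",
    "q1\tsubjB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t150\t9606;10090\n",
    "q2\tsubjC\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t120\tN/A\n",
]
taxids = taxids_per_query(example_lines)
print(taxids)
```

The resulting integer IDs are what you would hand to whichever taxonomy lookup you choose.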


Interesting. When I read the blastn --help documentation, for example, I thought it would be -m 6. In reality, it is -outfmt "6 whatever field values I need"? I assume this is the same for blastx? If so, can you please provide me the documentation source where qseqid, sseqid, pident, and so on are documented and described? I would like to learn more please.

Most importantly, many humble thanks for your helpful response.


Yes, just type blastn -help and you will find the documentation for these options. The complete list is as follows:

   Options 6, 7, and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
            qseqid means Query Seq-id
               qgi means Query GI
              qacc means Query accesion
           qaccver means Query accesion.version
              qlen means Query sequence length
            sseqid means Subject Seq-id
         sallseqid means All subject Seq-id(s), separated by a ';'
               sgi means Subject GI
            sallgi means All subject GIs
              sacc means Subject accession
           saccver means Subject accession.version
           sallacc means All subject accessions
              slen means Subject sequence length
            qstart means Start of alignment in query
              qend means End of alignment in query
            sstart means Start of alignment in subject
              send means End of alignment in subject
              qseq means Aligned part of query sequence
              sseq means Aligned part of subject sequence
            evalue means Expect value
          bitscore means Bit score
             score means Raw score
            length means Alignment length
            pident means Percentage of identical matches
            nident means Number of identical matches
          mismatch means Number of mismatches
          positive means Number of positive-scoring matches
           gapopen means Number of gap openings
              gaps means Total number of gaps
              ppos means Percentage of positive-scoring matches
            frames means Query and subject frames separated by a '/'
            qframe means Query frame
            sframe means Subject frame
              btop means Blast traceback operations (BTOP)
           staxids means unique Subject Taxonomy ID(s), separated by a ';'
                         (in numerical order)
         sscinames means unique Subject Scientific Name(s), separated by a ';'
         scomnames means unique Subject Common Name(s), separated by a ';'
        sblastnames means unique Subject Blast Name(s), separated by a ';'
                         (in alphabetical order)
        sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                         (in alphabetical order)
            stitle means Subject Title
        salltitles means All Subject Title(s), separated by a '<>'
           sstrand means Subject Strand
             qcovs means Query Coverage Per Subject
           qcovhsp means Query Coverage Per HSP
   When not provided, the default value is:
   'qseqid sseqid pident length mismatch gapopen qstart qend sstart send

It seems that if I use the command above with tabular output, I might not even need Biopython to parse it, except for showing only the results that are statistically significant? I'm only interested in hits, of course. :)


You don't need biopython for that either, just parse them in vanilla python. You can tell blast to only report hits that meet various thresholds as well.

You might want to check out the BLAST+ documentation.
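For the vanilla Python route, a filter on the e-value column might look like the sketch below. It assumes default `-outfmt 6` column order (e-value in column 11, bit score in column 12). The 1e-5 cutoff is an arbitrary placeholder rather than a recommendation, and the function names and example hits are made up.

```python
def significant_hits(lines, max_evalue=1e-5):
    """Yield the split columns of hits whose e-value is at or below max_evalue."""
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete default-format hit line
        if float(cols[10]) <= max_evalue:  # column 11 is evalue
            yield cols


def best_hit_per_query(hits):
    """Keep, for each query ID, the hit with the highest bit score."""
    best = {}
    for cols in hits:
        query = cols[0]
        if query not in best or float(cols[11]) > float(best[query][11]):
            best[query] = cols
    return best


# Made-up hits for illustration only.
example_lines = [
    "q1\tsubjA\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t180\n",
    "q1\tsubjB\t80.0\t30\t6\t0\t1\t30\t1\t30\t2e-3\t40\n",
    "q1\tsubjD\t100.0\t110\t0\t0\t1\t110\t1\t110\t1e-60\t210\n",
    "q2\tsubjC\t75.0\t20\t5\t0\t1\t20\t1\t20\t5.0\t20\n",
]
kept = list(significant_hits(example_lines))
best = best_hit_per_query(kept)
print([cols[1] for cols in kept])
print({query: cols[1] for query, cols in best.items()})
```

Alternatively, the `-evalue` option of blastn sets the threshold at search time, so weaker hits are never written to the file in the first place.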
