Showing Queries With No Hit In Tabbed Blast Output?
2
0
Entering edit mode
12.3 years ago
Prohan ▴ 350

Hi, I was wondering if anyone could tell me if there's a way to get the tabbed BLAST output to show if there's not hit. I don't mind using either blastall or blast+.

The reason I ask is that I'm trying to track big BLAST jobs - so big that BLAST XML gets too big to store and parse (slow) using Biopython. BLAST XML show's if there's no hit though.

Thanks

blast blast • 8.0k views
ADD COMMENT
2
Entering edit mode
12.3 years ago

extract and sort the name of the queries from the blast results and test for those who are missing in an original list using comm.

**update:

say blast will generate the following output:

AF178033    EMORG:AF178033  100.00  811 0   0   1   811 1   811 0.0 1566.6
AF178033    EMORG:AF031394  99.63   811 3   0   1   811 99  909 0.0 1542.8
AF178033    EMORG:AF031393  95.07   811 40  0   1   811 99  909 0.0 1249.4
AF178034    EMORG:AF178031  94.82   811 42  0   1   811 1   811 0.0 1233.5
AF178034    EMORG:AF178032  94.57   811 44  0   1   811 1   811 0.0 1217.7
AF178035    EMORG:AF353195  85.93   803 113 0   1   803 99  901 0.0 670.5
AF178038    EMORG:AF353192  85.86   806 114 0   1   806 99  904 0.0 670.5
AF178038    EMORG:AF353196  85.48   806 117 0   1   806 99  904 0.0 644.8

a second file contains the ordered query id:

$cat ids.txt 

AF178033
AF178034
AF178035
AF178036
AF178037
AF178038
AF178039
AF178040

to get the unmapped ids:

(your blastallcommand) | tee redirect.result.txt | cut -d '    ' -f 1 | uniq |\
    comm -31 - ids.txt

AF178036
AF178037
AF178039
AF178040
ADD COMMENT
0
Entering edit mode

I think I may have not explained correctly - I would like to be able to do this while the BLAST job is running.

Using comm the BLAST run would have to be complete.

ADD REPLY
0
Entering edit mode

but "comm" can use stdin (use '-' as one filename) and use 'tee' to save your original blast output in the pipeline

ADD REPLY
0
Entering edit mode

The point might be, while the blast process is still running, this approach cannot differentiate, whether there is a query sequence missing in the output because there was no hit or it hasn't been processed yet.

ADD REPLY
0
Entering edit mode

@Michael, but comm will only print the result after EOF or if the current query name is greater than the expected one (it's the same behavior than 'uniq' )

ADD REPLY
0
Entering edit mode

@Pierre - Can you explain this again. I don't quite understand.

ADD REPLY
0
Entering edit mode

Thanks - makes sense now.

ADD REPLY
0
Entering edit mode

This is a brilliant solution. One caveat though. I would use "sort -u" instead of uniq in the last command, because sometimes input.fasta file might not be in the sorted order, which causes blast output not be sorted either. so, (your blastallcommand) | tee redirect.result.txt | cut -d ' ' -f 1 | sort -u |\ comm -31 - ids.txt

ADD REPLY
1
Entering edit mode
12.1 years ago
Prohan ▴ 350

Thanks to Pierre's inspiration I had implemented into a Python script. Eventually though I came up with a bash one-liner:

lasthit=$(tail -n 1 input.fasta.blastn | awk '{print $1}') && grep '^> input.fasta | grep -n $lasthit

Basically - the last hit of the tab blast output is read by "tail -n 1" and the first column is read (the query id). This is assigned to the variable "lasthit". Next to find out how many hits have been processed you have to look at the query's spot in the original fasta. "grep '^>' gets all of the description lines in the fasta this is then piped again to grep where "-n" tells you the line that $lasthit is on - which should be the same as the number of sequences processed up til that hit.

ADD COMMENT

Login before adding your answer.

Traffic: 1517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6