Question: Showing Queries With No Hit In Tabbed Blast Output?
0
gravatar for Prohan
7.7 years ago by
Prohan350
United States
Prohan350 wrote:

Hi, I was wondering if anyone could tell me if there's a way to get the tabbed BLAST output to show if there's not hit. I don't mind using either blastall or blast+.

The reason I ask is that I'm trying to track big BLAST jobs - so big that BLAST XML gets too big to store and parse (slow) using Biopython. BLAST XML show's if there's no hit though.

Thanks

blast • 4.6k views
ADD COMMENTlink modified 7.5 years ago • written 7.7 years ago by Prohan350
2
gravatar for Pierre Lindenbaum
7.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

extract and sort the name of the queries from the blast results and test for those who are missing in an original list using comm.

**update:

say blast will generate the following output:

AF178033    EMORG:AF178033  100.00  811 0   0   1   811 1   811 0.0 1566.6
AF178033    EMORG:AF031394  99.63   811 3   0   1   811 99  909 0.0 1542.8
AF178033    EMORG:AF031393  95.07   811 40  0   1   811 99  909 0.0 1249.4
AF178034    EMORG:AF178031  94.82   811 42  0   1   811 1   811 0.0 1233.5
AF178034    EMORG:AF178032  94.57   811 44  0   1   811 1   811 0.0 1217.7
AF178035    EMORG:AF353195  85.93   803 113 0   1   803 99  901 0.0 670.5
AF178038    EMORG:AF353192  85.86   806 114 0   1   806 99  904 0.0 670.5
AF178038    EMORG:AF353196  85.48   806 117 0   1   806 99  904 0.0 644.8

a second file contains the ordered query id:

$cat ids.txt 

AF178033
AF178034
AF178035
AF178036
AF178037
AF178038
AF178039
AF178040

to get the unmapped ids:

(your blastallcommand) | tee redirect.result.txt | cut -d '    ' -f 1 | uniq |\
    comm -31 - ids.txt

AF178036
AF178037
AF178039
AF178040
ADD COMMENTlink modified 7.7 years ago • written 7.7 years ago by Pierre Lindenbaum122k

I think I may have not explained correctly - I would like to be able to do this while the BLAST job is running.

Using comm the BLAST run would have to be complete.

ADD REPLYlink written 7.7 years ago by Prohan350

but "comm" can use stdin (use '-' as one filename) and use 'tee' to save your original blast output in the pipeline

ADD REPLYlink written 7.7 years ago by Pierre Lindenbaum122k

The point might be, while the blast process is still running, this approach cannot differentiate, whether there is a query sequence missing in the output because there was no hit or it hasn't been processed yet.

ADD REPLYlink written 7.7 years ago by Michael Dondrup46k

@Michael, but comm will only print the result after EOF or if the current query name is greater than the expected one (it's the same behavior than 'uniq' )

ADD REPLYlink written 7.7 years ago by Pierre Lindenbaum122k

@Pierre - Can you explain this again. I don't quite understand.

ADD REPLYlink written 7.7 years ago by Prohan350

Thanks - makes sense now.

ADD REPLYlink written 7.7 years ago by Prohan350

This is a brilliant solution. One caveat though. I would use "sort -u" instead of uniq in the last command, because sometimes input.fasta file might not be in the sorted order, which causes blast output not be sorted either. so, (your blastallcommand) | tee redirect.result.txt | cut -d ' ' -f 1 | sort -u |\ comm -31 - ids.txt

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by Alby90
1
gravatar for Prohan
7.5 years ago by
Prohan350
United States
Prohan350 wrote:

Thanks to Pierre's inspiration I had implemented into a Python script. Eventually though I came up with a bash one-liner:

lasthit=$(tail -n 1 input.fasta.blastn | awk '{print $1}') && grep '^> input.fasta | grep -n $lasthit

Basically - the last hit of the tab blast output is read by "tail -n 1" and the first column is read (the query id). This is assigned to the variable "lasthit". Next to find out how many hits have been processed you have to look at the query's spot in the original fasta. "grep '^>' gets all of the description lines in the fasta this is then piped again to grep where "-n" tells you the line that $lasthit is on - which should be the same as the number of sequences processed up til that hit.

ADD COMMENTlink written 7.5 years ago by Prohan350
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1931 users visited in the last hour