Blastx No hit query?
3
1
Entering edit mode
7.8 years ago
Buffo ★ 2.4k

Hi guys, I`m doing blastx on linux command line, my output is on outmft 6 because i want to annotate the results, the problem is that in the current release of blastx the querys whithout hit does not appear on the table result, somebody know how get that sequences in other file? I have read the manual and it looks like there is no option for doing that, somebody have a perl or python script for that? or some ideas? Thanks everybody!!

blastx outfmt6 nohit • 1.9k views
ADD COMMENT
1
Entering edit mode

use comm with your input+output to find the missing queries

ADD REPLY
1
Entering edit mode

Mmm.. I`m not sure if that works with tabular format :-S

ADD REPLY
1
Entering edit mode

extract the query name from both files. sort. compute the intersection with comm. And, yes, it works.

ADD REPLY
0
Entering edit mode

More complicated solution, but works perfect too :-) I posted a python script in other comment to do that. Thanks Pierre :-)

ADD REPLY
2
Entering edit mode
7.8 years ago

Try this python script:

import sys
from Bio import SeqIO
faFile = open(sys.argv[1],'r')
queries = set([strrecord.id) for record in SeqIO.parse(faFile,'fasta')])
hits = set([x.split()[0] for x in open(sys.argv[2],'r').read().strip().split('\n')])
noHits = queries - hits
print '\n'.join(noHits)

Save as noHits.py and run by:

python noHits.py queries.fasta blastx.tab.output > noHits.ids

This script basically just gets a list of query ids and list of hit ids. Then it prints out ids of any query id that's not in the hit id list.

ADD COMMENT
0
Entering edit mode

Thank you very much Damian, there is a mistake:

queries = set([strrecord.id) for record in seqIO.parse(faFile,'fasta')]) ^ SyntaxError: invalid syntax

ADD REPLY
0
Entering edit mode

the 's' in seqIO should be uppercase. I've edited the original post.

ADD REPLY
0
Entering edit mode

Still wrong, I posted new version that works in other comment, but thank you very much Damian :-)

ADD REPLY
0
Entering edit mode
7.8 years ago
5heikki 11k

It's basically like this:

export LC_ALL=C; comm -3 <(grep "^>" file.fasta | sort) <(cut -f1 blast.tsv | sort)

It might not work because blast may report shorter query sequence IDs than full headers. In this case, I would need to see a few headers and blast output lines so the first process substitution could be modified accordingly.

ADD COMMENT
0
Entering edit mode
7.8 years ago
Buffo ★ 2.4k

It works perfect;

ussage python python_script.py file1 file2

from __future__ import division
import sys
from Bio import SeqIO
import os

file1 = open(sys.argv[1],'r')
lista1 = []

for line in file1:
        line = line.rstrip('\n')
        lista1.append(line)
file1.close()

file2 = open(sys.argv[2],'r')
lista2 = []

for line in file2:
        line = line.rstrip('\n')
        lista2.append(line)
file2.close()

same = set(lista1).intersection(set(lista2))

file_out = open('some_output_file.txt', 'w')
for line in same:
    file_out.write(line)

file_out.close()
ADD COMMENT

Login before adding your answer.

Traffic: 3063 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6