Commandline BLAST - errors?
1
0
Entering edit mode
6 weeks ago
bry.th • 0

Hi,

I'm running command line blastx and blastp against a number of databases. However, running the exact same script on the exact same input files against the exact same databases occasionally seems to output different filesizes.

I can only assume that this is because the task is failing - but how can I QC the outputs to tell which output file was successful? I know I can look at the log files, but some of these I am inheriting and I want to be able to double check if they were successful.

The outputs tend to be outfmt6 or asn then converted to asn (I'm following a specific methodology, this isn't my choice)

blastx \
-query /input-file.fa \
-db nr \
-max_target_seqs 10 \
-outfmt 11 \
-out output-file.asn

blast_formatter \
-archive output-file.asn \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" > \
output-file.outfmt6


Thanks!

line BLAST RNA-seq command • 438 views
1
Entering edit mode

May be you can fetch first column and uniq it to see how many sequences have got hits in both the files. Like this:

cut -f1 output-file.outfmt6 | sort | uniq | wc -l


Further, if you need, you can take it's output in file and can compare both files using "diff" command to check for the difference in sequence IDs covered in both the files.

0
Entering edit mode

Hi,

Thanks for this but I don't understand how it helps (that's an indictment of me not your help!)

You say I should use the code to see how many get hits in both the files, but there's only one input file. Assuming one file comes out with more, is that the file I should be using?

1
Entering edit mode

I might be wrong here, but what I mean is when you run blastp for 2 times and getting different file size, there might be the possibility that, in one of the file, one or more sequences might have missed/skipped. In this case you can use the above command and select file having more number of sequence IDs.

Also, you can convert .asn file to fmt 5 or 16, to see whether all the sequences given as input where processed by blast or not as these format gives information of all the sequences irrespective of whether there is hit obtained or "No hits found". For eg. in case of fm5, you can use:

grep -c "<Iteration_query-def>"  filename.xml


Or if you want to just know the sequences having hits each time you do blast, you can use below command for fmt 5:

 grep -c "<Hit_num>1</Hit_num>"  filename.xml


In this way you can know that whenever you do blast, number of sequences processed and having hits remains same or not.

1
Entering edit mode
6 weeks ago

You could check the exit status of blast, at least blastx and blast_formatter report proper exit codes.

blastx \
-query /input-file.fa \
-db nr \
-max_target_seqs 10 \
-outfmt 11 \
-out output-file.asn 2> blastx_STDERR

return_value=$? if [${return_value} -ne 0 ]; then
echo "Error in blastx run!"
echo "Error message was:"
cat STDERR
fi

blast_formatter \
-archive output-file.asn \
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" > \
output-file.outfmt6 2> STDERR

return_value=$? if [${return_value} -ne 0 ]; then
echo "Error in blast_formatter run!"
echo "Error message was:"
cat STDERR
fi

0
Entering edit mode

Thanks I'll give this a try