I'm running command line blastx and blastp against a number of databases. However, running the exact same script on the exact same input files against the exact same databases occasionally seems to output different filesizes.
I can only assume that this is because the task is failing - but how can I QC the outputs to tell which output file was successful? I know I can look at the log files, but some of these I am inheriting and I want to be able to double check if they were successful.
The outputs tend to be outfmt6 or asn then converted to asn (I'm following a specific methodology, this isn't my choice)
blastx \ -query /input-file.fa \ -db nr \ -num_threads 24 \ -max_target_seqs 10 \ -outfmt 11 \ -out output-file.asn blast_formatter \ -archive output-file.asn \ -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" > \ output-file.outfmt6
May be you can fetch first column and uniq it to see how many sequences have got hits in both the files. Like this:
Further, if you need, you can take it's output in file and can compare both files using "diff" command to check for the difference in sequence IDs covered in both the files.
Thanks for this but I don't understand how it helps (that's an indictment of me not your help!)
You say I should use the code to see how many get hits in both the files, but there's only one input file. Assuming one file comes out with more, is that the file I should be using?
I might be wrong here, but what I mean is when you run blastp for 2 times and getting different file size, there might be the possibility that, in one of the file, one or more sequences might have missed/skipped. In this case you can use the above command and select file having more number of sequence IDs.
Also, you can convert .asn file to fmt 5 or 16, to see whether all the sequences given as input where processed by blast or not as these format gives information of all the sequences irrespective of whether there is hit obtained or "No hits found". For eg. in case of fm5, you can use:
Or if you want to just know the sequences having hits each time you do blast, you can use below command for fmt 5:
In this way you can know that whenever you do blast, number of sequences processed and having hits remains same or not.