I've been running blastp with a large number of queries on a cluster, but very obvious hits are unexpectedly missing from the output. I did RNA-Seq and de novo transcriptome assembly on an organism of interest (with no sequenced genome) taken straight out of a mouse, and I'm trying to use BLAST to remove mouse genes and annotate the remaining genes. From this I have a set of ~40,000 protein sequences, many of which are mouse contamination while the rest are from my organism of interest. To determine which are mouse, I created a BLAST database from all the mouse proteins on UniProt (~100,000 proteins) and then ran the following command on a cluster:
blastp -query my_sequences.fasta -num_threads 16 -db uniprot-mouse -max_target_seqs 1 -outfmt "6 std stitle" -evalue 1e-30 -word_size 7 > output.txt
I used a word size of 7 and a low E-value threshold since I expect contaminating mouse proteins to align with almost perfect identity.
My problem is that the output is missing a lot of hits. I've found many query proteins that align with near-perfect identity to mouse proteins but are simply not listed in my BLAST output. When I BLAST them on the web server, they are identified with E-values well below my 1e-30 threshold, and I've double-checked that these proteins are present in my mouse protein database. Does anyone have any idea why this might be happening?
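To quantify how many queries are affected, something like the following could list every query ID that has no line in the tabular output. This is a minimal sketch rather than something from the actual run: the function name `missing_queries` is made up for illustration, and it assumes that column 1 of the `-outfmt 6` output matches the first whitespace-delimited token of each FASTA header.

```shell
# Hypothetical helper: print query IDs present in the FASTA file but absent
# from column 1 of a BLAST tabular (outfmt 6) file.
# Assumption: the query ID in column 1 equals the first token of the header.
missing_queries() {
    # $1 = query FASTA, $2 = BLAST tabular output
    awk '
        # First file (BLAST output): remember every query ID that got a hit.
        FILENAME == ARGV[1] { seen[$1] = 1; next }
        # Second file (FASTA): report header IDs that were never seen.
        /^>/ { id = substr($1, 2); if (!(id in seen)) print id }
    ' "$2" "$1"
}

# Example usage:
#   missing_queries my_sequences.fasta output.txt > no_hit_ids.txt
```

The resulting ID list can then be compared against the proteins that do get near-perfect mouse hits on the web server.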
For extra information, I'm running this job on a cluster with the following parameters:
--time=48:00:00 --cpus-per-task=16 --mem=32000 --mail-type=FAIL
And am using
And I am not getting a failure email, so it doesn't look like the run is simply timing out and thus truncating my output.
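A more direct check for truncation than the absence of a failure email would be to see how far through the query file the output actually reaches. The sketch below is only an illustration: `last_reported_position` is a hypothetical name, and it assumes blastp writes results in the same order as the queries appear in the FASTA file and that query IDs match the first token of each header.

```shell
# Hypothetical helper: report where the last query ID in the BLAST output
# sits within the query FASTA file.
# Assumption: blastp reports queries in input order, so a last-reported
# query far from the end of the FASTA would suggest a truncated run.
last_reported_position() {
    # $1 = query FASTA, $2 = BLAST tabular output
    last_id=$(tail -n 1 "$2" | cut -f1)
    awk -v last="$last_id" '
        /^>/ { n++; if (substr($1, 2) == last) pos = n }
        END  { printf "%s is query %d of %d\n", last, pos, n }
    ' "$1"
}

# Example usage:
#   last_reported_position my_sequences.fasta output.txt
```

If the reported position is close to the total number of queries, the run most likely reached the end of the input. Because queries without hits produce no lines at all, the last reported query does not have to be the very last sequence in the file.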
Thank you very much, any insights would be extremely appreciated!