I am having an issue with running blastx (using the ncbi-blast-2.6.0+ binaries) via the cluster and I was wondering if anyone could help.
I split my genome (in .fa format) into 100 pieces (to allow parallelisation and speed up the process) and then submitted each piece as a separate cluster job, running blastx against the nr database with the following command:
blastx -db nr -num_threads 8 -evalue 1e-10 -query part-001.fa -out part-001.fa.RESULT.xml -outfmt 14
The input query file, part-001.fa, has 210 genes in it (the entire genome has ~21,000 genes), but the output XML file (part-001.fa.RESULT.xml), when loaded into blast2go, only has 75 genes. So I'm missing about 65% of my genes in the final output. When I load all 100 result XML files, I have around 600 genes.
I had a closer look at the genes in the output file and the ones in the input file, and it seems as though blastx only analysed the first 75 sequences in the fasta file and then didn't bother with the rest. For the second file, it only analysed 60 out of 188 sequences, and so on.
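For anyone wanting to reproduce the comparison outside blast2go, here is a rough sketch of how the counts could be checked from the shell. It is not necessarily how the numbers above were obtained. The function names are made up for illustration, and the XML count assumes that each query appears exactly once as a <query-title> element in the file being counted; if the results for one job are spread over several XML files, it would need to be run on each of them.

```shell
# Count FASTA records in a query file: one header line starting with '>' per sequence.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Count queries reported in a BLAST XML2 result file.
# Assumption: every query is written once as a <query-title> element.
count_xml_queries() {
    grep -o '<query-title>' "$1" | wc -l | tr -d ' '
}

# Example usage (file names taken from the command above):
#   count_fasta_records part-001.fa
#   count_xml_queries part-001.fa.RESULT.xml
```

If the two numbers differ for a given part, the shortfall is already present in the blastx output rather than being introduced when the file is loaded into blast2go.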
Obviously this isn't ideal, as I'm missing about 65% of my total output! Can anyone offer any ideas as to why blastx might only be reading the first ~35% of sequences in each fasta file?
EDIT: I should also point out that I requested 1 GB of memory per job on the cluster and each output file was about 60 MB, so I don't think the jobs are terminating early because of a lack of memory.