Hi,
I am having an issue with running blastx (using the ncbi-blast-2.6.0+ binaries) via the cluster and I was wondering if anyone could help.
I split my genome (in .fa format) into 100 pieces (to allow parallelisation and speed up the process) and then submitted each section to blastx against the nr database using the following command (via a cluster):
blastx -db nr -num_threads 8 -evalue 1e-10 -query part-001.fa -out part-001.fa.RESULT.xml -outfmt 14
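For comparing against the results later, it can help to record how many sequences go into each chunk up front. A minimal sketch (the file contents and names below are stand-ins for the demo, not the real genome chunks):

```shell
# Create a stand-in chunk so the loop below has something to count.
printf '>g1\nACGT\n>g2\nTTTT\n' > part-001.fa

# Print a per-chunk sequence count; any queries missing from the
# corresponding RESULT.xml are then easy to spot.
for f in part-*.fa; do
    printf '%s\t%d\n' "$f" "$(grep -c '^>' "$f")"
done
```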
The input query file, part-001.fa, has 210 genes in it (the entire genome has ~21,000 genes), but the output XML file (part-001.fa.RESULT.xml), when loaded into Blast2GO, only contains 75 genes, so I'm missing about 65% of my genes in the final output. When I load all 100 result XML files, I have around 600 genes in total.
I had a closer look at the genes in the output file versus those in the input file, and it seems as though blast only analysed the first 75 sequences in the fasta file and then skipped the rest. In the second file, it only analysed 60 sequences out of 188, and so on.
Obviously this isn't ideal, as I'm missing about 65% of my total output! Can anyone offer any ideas as to why blastx might only be reading the first ~35% of sequences in each fasta file?
Thanks
EDIT: I should also point out that I requested 1 GB per job on the cluster and each output file was about 60 MB, so I don't think the jobs are terminating early because of a lack of memory.
Are the blast results sequentially missed? Your sequences may not be passing your e-value threshold. What happens when you make it less stringent?
I should have said: it is always the last 65% of the genes which are missed, regardless of their e-value. Some of the genes it returns actually don't have any blast hits, so it can't be that it is systematically dropping genes without hits.
How did you split your genome.fa file into 100 pieces? I ask because this sounds to me like an error in the formatting of your FASTA file(s).
When you say the input query file has 210 "genes" in it, do you mean FASTA-formatted sequences? With a header line that starts with a ">", followed by the sequence identifier, a new line, and then the nucleotide sequence? If so, are you certain that your FASTA file truly has 210 sequences? And that the sequence identifiers are all unique? You can test this with:

grep '^>' part-001.fa | sort -u | wc -l

Note that

grep -c '>' part-001.fa

will only count lines containing a ">"; we want to make sure the ">" is at the start of a line (that is, not appended to the end of the previous nucleotide sequence by some error in splitting). We also sort uniquely to make sure duplicate sequence names aren't being searched, which could explain why Blast2GO reports a smaller number of "genes", as it would only report unique entries.

Also be careful if you used split to split your original fasta if it was a multi-line fasta. Easiest would be to linearize your sequences (if they aren't already), then split your file.

Thanks for your reply. The problem isn't coming from Blast2GO, as the output from blastx gives you 1 gene per .xml output file, and the number of genes registered in B2G always corresponds with the number of input .xml files.
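The linearize-then-split advice can be sketched with awk (the input below is a stand-in file for the demo; swap in your real genome FASTA):

```shell
# Stand-in multi-line FASTA: seq1's sequence is wrapped over two lines.
cat > genome.fa <<'EOF'
>seq1
ACGT
ACGT
>seq2
TTTT
EOF

# Join wrapped sequence lines so each record is exactly two lines
# (header + sequence).
awk '/^>/{if (seq) print seq; print; seq=""} !/^>/{seq=seq $0} END{if (seq) print seq}' \
    genome.fa > genome.linear.fa

# With two lines per record, split can now cut safely on record
# boundaries, e.g. 2 records (4 lines) per chunk:
split -l 4 genome.linear.fa part-
```

Splitting the wrapped original directly with `split -l` risks cutting a record in half, which can leave a ">" mid-line and break downstream parsing.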
I am 99% sure that the issue is the blastx program itself, as the genes in the rest of the pipeline always seem to add up.
Hi, thanks for your reply.
I used http://kirill-kryukov.com/study/tools/fasta-splitter/ I checked that the splitter hadn't accidentally dropped sequences by merging the files back together again and counting the number of genes in the fasta. I also checked that it is a properly formatted fasta.
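That merge-and-count check can be done in one comparison (the files below are stand-ins for the demo; use the real genome and the splitter's output):

```shell
# Stand-in original and split chunks.
printf '>a\nACGT\n>b\nGGGG\n' > genome.fa
printf '>a\nACGT\n' > part-001.fa
printf '>b\nGGGG\n' > part-002.fa

# The header counts of all parts should sum to the original's count.
total=$(grep -c '^>' genome.fa)
parts=$(cat part-*.fa | grep -c '^>')
echo "original=$total parts=$parts"
```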
Maybe it's a problem with your cluster's scheduler, then? It could be that your jobs are timing out, not because of a lack of memory but because they've reached their time limit. That would explain why only the first ~35% of each input FASTA is being searched.
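One way to check for a timed-out job without involving the sysadmin: a run killed mid-write usually leaves a truncated XML file, so the closing root element never gets written. A rough sketch (the result file below is a stand-in; run the check on your real RESULT.xml files, whose root tag name will differ):

```shell
# Stand-in for a real BLAST result file: a complete XML document ends
# with a closing root element on its last line.
printf '<Report>\n<hits>...</hits>\n</Report>\n' > part-001.fa.RESULT.xml

if tail -1 part-001.fa.RESULT.xml | grep -q '</'; then
    echo "looks complete"
else
    echo "possibly truncated - job may have been killed early"
fi
```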
Ah ok - that's not something that I'd have thought of, but it definitely makes sense! It didn't leave any kind of error, but I'll email the sysadmin and ask.