Question: blastx only blasting a small subset of sequences
1
gravatar for sm15766
2.4 years ago by
sm1576620
sm1576620 wrote:

Hi,

I am having an issue with running blastx (using the ncbi-blast-2.6.0+ binaries) via the cluster and I was wondering if anyone could help.

I split my genome (in .fa format) into 100 pieces (to allow parallelisation and speed up the process) and then submitted each section to blastx against the nr database using the following command (via a cluster)

blastx -db nr -num_threads 8 -evalue 1e-10 -query part-001.fa -out part-001.fa.RESULT.xml -outfmt 14

The input query file, part-001.fa has 210 genes in it (the entire genome has ~ 21 000 genes), but the output XMl file (part-001.fa.RESULT.xml), when loaded into blast2go, only has 75 genes. So I'm missing about 65% of my genes in the final output. When I load all 100 result xml files, I have around 600 genes.

I had a closer look at the genes in the ouput file and the ones in the input file and it seemed as though blast was only analysing the first 75 files in each fasta sequence and then not bothering with the rest. The second file, it only analysed 60 out of 188 and so on.

Obviously this isn't ideal, as I'm missing about 65% of my total output! Can anyone offer any ideas as to why blastx might only be reading the first ~35% of sequences in each fasta file?

Thanks

EDIT: I should also point out that I requested 1GB per job on the cluster and each outfile was about 60mb, so I don't think it's because the jobs are terminating early because of a lack of memory.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by sm1576620

Are the blast results sequentially missed? Your sequences may not be passing your e-value threshold. What happens when you make it less stringent?

ADD REPLYlink written 2.4 years ago by st.ph.n2.5k
1

I should have said - it is always the last 65% of the genes which are missed, regardless of their e-value. Some of the genes it returns actually don't have any blast hits, so it can't be that it is systemically not returning genes which don't have hits.

ADD REPLYlink written 2.4 years ago by sm1576620

How did you split your genome.fa file into 100 pieces? I ask because this sounds to me like an error in the formatting of your FASTA file(s).

When you say the input query file has 210 "genes" in it, do you mean FASTA-formatted sequences? With a header line that starts with a ">", followed by the sequence identifier, a new line, and then the nucleotide sequence? If so, are you certain that your FASTA file truly has 210 sequences? And that the sequence identifiers are all unique? (You can test this with: grep '^>' part-001.fa | sort -u | wc -l.

ADD REPLYlink written 2.4 years ago by jonfoox10
grep -c '>' part-001.fa will also work.

Also be careful if you used split to split your original fasta if it was a multi-line fasta. Easiest would be to linearize your sequences (if they aren't already), then split your file.

ADD REPLYlink written 2.4 years ago by st.ph.n2.5k

grep -c '>' part-001.fa will only count lines containing a ">"; we want to make sure they're at the start of a line (that is, not adjacent to the end of the previous nucleotide sequence, caused by some error in splitting). We also want to sort uniquely to make sure duplicate sequence names aren't being searched, which could explain why BLAST2GO is reporting a smaller number of "genes", as it would only report unique entries.

ADD REPLYlink written 2.4 years ago by jonfoox10

Thanks for your reply. The problem isn't coming from blast2go, as the output from blastx gives you 1 gene per .xml output file, and the number of genes registered in b2g always corresponds with the number of input .xml files.

I am 99% sure that the issue is the blastx program itself, as the genes in the rest of the pipeline always seem to add up.

ADD REPLYlink written 2.4 years ago by sm1576620

Hi, thanks for your reply.

I used http://kirill-kryukov.com/study/tools/fasta-splitter/ I checked that the splitter hadn't accidentally removed sequenced by merging the files back together again and using

grep -c "^>"

To count the number of genes in the fasta. I also checked and it is a proplerly formatted fasta.

ADD REPLYlink written 2.4 years ago by sm1576620
1

Maybe it's a problem with your cluster's scheduler, then? It could be that your jobs are timing out, not because of a lack of memory but because they're reached their time limit. This could explain why only the first ~35% of your input FASTAs are being searched.

ADD REPLYlink written 2.4 years ago by jonfoox10

Ah ok - that's not something that I'd have thought of - but it definitely makes sense! It didn't leave any kind of error but I'll emial the sysadmin and ask..

ADD REPLYlink written 2.4 years ago by sm1576620
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1802 users visited in the last hour