Question: blastp command line not reporting hits
0
gravatar for egerrick
11 months ago by
egerrick0
egerrick0 wrote:

Hi,

I've been running blastp using a large number of queries on a cluster, but am having an unexpected issue of very obvious hits not being reported in the output. Essentially I did RNA-Seq and de novo transcriptome assembly on an organism of interest (with no sequenced genome) straight out of a mouse, and am trying to use BLAST to remove mouse genes and annotate the remaining genes. So I have a list of protein sequences from this (~40,000 proteins) of which many are mouse contamination, and others are from my organism of interest. To determine which are mouse, I created a blast database from all the mouse proteins on uniprot (~100,000 proteins), and then ran the following command on a cluster:

blastp -query my_sequences.fasta -num_threads 16 -db uniprot-mouse -max_target_seqs 1 -outfmt "6 std stitle" -evalue 1e-30 -word_size 7 > output.txt

I used word size of 7 and a low e-value since I expect contaminating mouse proteins to align with almost perfect identity.

My problem is that the output is missing a lot of hits. I've found many query proteins that align with near-perfect identity to mouse proteins that were simply not listed in my blast output. When I blast them on the webserver they are identified with evalues well below my 1e-30 threshold, and I've double checked my mouse protein database and these proteins are in there. Does anyone have any idea why this might be happening?

For extra information, I'm running this job on a cluster with the following parameters:

--time=48:00:00
--cpus-per-task=16
--mem=32000
--mail-type = FAIL

And am using blast+/2.7.1

And I am not getting a failure email, so it doesn't look like the run is simply timing out and thus truncating my output.

Thank you very much, any insights would be extremely appreciated!

blast software error • 392 views
ADD COMMENTlink modified 11 months ago by RamRS27k • written 11 months ago by egerrick0
1

Couple of things you could try.

  1. Use blat to search your sequences against mouse proteome. As long as your assembly worked reasonably well, you should be able to identify those proteins and then remove them.
  2. You could use bbmap.sh from BBMap suite to align to align your original reads against the mouse genome and then collect reads that do not map. Then repeat the transcriptome assembly on just those reads. Those should be things you are interested in.

I would recommend #2.

ADD REPLYlink modified 11 months ago • written 11 months ago by genomax85k

Not sure if this will solve your issue, but these two papers may help you understand your issue a bit more.

Paper1 Paper2

ADD REPLYlink written 11 months ago by Giovanni.madrigal12160

Thank you very much for your reply. My understanding from the issue addressed in those papers is that the "max_target_seqs" parameter can, in certain situations, sometimes choose hits to output in an arbitrary manner (i.e. based on their order in the database). But my issue is that there is simply no hit in the output at all, despite an almost perfect match being present in the database.

ADD REPLYlink written 11 months ago by egerrick0

If you are missing a lot of hits, but getting some, then is just a sensitivity problem.

Try to understand what parameters are used on the webserver and use exactly those. Otherwise, you can just slightly relax your evalue.

ADD REPLYlink written 11 months ago by Fabio Marroni2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 839 users visited in the last hour