Blast 2.11 not blasting all sequences in a file?
1
0
Entering edit mode
15 months ago
DNAngel ▴ 210

Hi all,

Just wondering if anyone has experienced a similar issue. I'm using blast+ version 2.11 on my school's cluster to blast fasta files of size about 23-60 Mb (~1000-4000 sequences).

Specifically I am using blastn with the nt database.

What is weird is that once my blast is done, sequences are missing. For example I am blasting contigs from metaspades and I notice that if my contig file has 1000 sequences, blast will only give me hits for 500 of them. I run it a few times and it is always the same contigs that are missing from blast. I am not using any thresholds because I just want to see what is being matched to these sequences. I thought that maybe these sequences just don't have any hits - so I extract the missing sequences separately from the contig file and blast a few of them but I do get a result!

Why is blastn just skipping some contigs entirely?? The exact command I am using is below:

blastn -db nt -num_threads 32 -max_target_seqs 1 -max_hsps 1 \
-outfmt "6 qseqid sseqid qlen slen pident evalue score staxids stitle" \
-query contig.fasta > contig_blast.output && echo "DONE" contig_blast.output


I just want the first top hit for each of my contigs but I need it for all the contigs in the file.

blast • 929 views
0
Entering edit mode

Hi, the number 500 is oddly specific one and seems like some threshold is in play. The blastn has two settings which are 500 by default: num_descriptions and max_target_seqs (from the https://www.ncbi.nlm.nih.gov/books/NBK279684/). Btw, do those "missing" hits have higher or lower bitscore when you find them separately, then those reported in the big run?

0
Entering edit mode

That was jsut an example but it isn't always 500. In general it just seems like half are being blasted and half are being skipped but it isn't a strict 50% reduction. They have high bitscore and in fact, I thought the contig length was coming into play and maybe these were super long - but the ones missing are often the shorter contigs. They also have high percent identity and good evalue scores so I really have no explanation as to why these are being skipped...

0
Entering edit mode

What happens to the output when you get rid of && echo "DONE" contig_blast.output? Perhaps the installation is buggy? Have you tried grabbing a version off of conda? Or try using another sequence search tool (e.g., MMseqs2).

0
Entering edit mode

I did do a run before without "DONE" but I thought maybe I was just missing an error and needed some validation that blast was finishing fine. I added the "DONE" to ensure this. Looks like it didn't matter because regardless, sequences are being skipped. Unfortunately our servers don't support conda so I can't grab it off that. Never heard of MMseqs2 but I'll check it out thanks!

0
Entering edit mode

Your servers don't support conda? As in the user cannot even write anything, even to their own directories? That doesn't make any sense.

0
Entering edit mode
15 months ago

There is a very simple explanation: there could be no hits for the contigs that are missing in the output.

I run it a few times and it is always the same contigs that are missing from blast.

It is almost certain that this is the case. (Blast runs should be deterministic in that regard.)

so I extract the missing sequences separately from the contig file and blast a few of them but I do get a result!

This is very hard to believe, the size of the query should not influence the result in any way, could you give an example? IF that is the case it looks almost like a bug specific to this version to me.

Using format 6 you cannot see those missing contigs, use blast format 7 to see those in a tabular format (including # comments for each processed query, stating # 0 hits found for no hits) or better use asn1 format and convert into any format afterwards. Also, using only the best hit for taxonomic classification is not recommended.

0
Entering edit mode

I know it is not recommended but I don't need the top 500 hits for each contig when I have over 1000s of contigs for over 100 samples. What is the point of -max_hsps and -max_target_seqs if we're never allowed to use them?

1
Entering edit mode

You might try to keep at least 5 to 10 best hits, but that is a different problem. Let's try to focus on why you don't get hits for some contigs when blasting a multi-fasta but do with single sequences. I recommend to run blastn again with -outfmt 7 and see what happens to the missing contigs.

0
Entering edit mode

Server has been slow - I have still been waiting to test this out.

0
Entering edit mode

What is the point of -max_hsps and -max_target_seqs if we're never allowed to use them?

Michael said: "using only the best hit for taxonomic classification is not recommended." (emphasis added by me)

Do you see the point now?

0
Entering edit mode

Not really...If we obtain the top 5-10 hits, they all say the same species. I just need the top hit for it. The species that is MOST likely to be the sequence. In the rare chance that two species can match to it at the exact same percent identity, I would still have to just pick one of them to discuss. What else am I missing here?