Question

How to accelerate BLASTING

0

10 weeks ago

anikcropscience ▴ 230

Hi, I have 10,000 unassembled contigs from a metagenomic analysis, and I have no idea which sequence belongs to which species. I ran Kraken classification, but that was not enough because half of the reads remain unclassified. The raw material for nanopore sequencing came from an infected plant, and I do not know what kind of pathogen (virus, fungus, or bacterium) caused the infection. So I tried to BLAST each contig remotely using the following command:

blastn -query filtered.assembly.not.aligned.fasta -remote -db nr -out blastoutput_unassembled.txt -outfmt '6 qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames stitle' -max_target_seqs 1

This process has been running for 7 days, and results are available for only 3,000 contigs so far.

Could you please suggest how this process can be accelerated, or an alternative approach for this purpose?

NCBI Blast sequence • 690 views

1

Kraken (or I think Centrifuge?) would be the way to go IMO. They are specifically designed for this task, which BLAST really isn't. Check that you are using the latest database versions. I could be wrong, but I would expect their databases to be close to, if not the same as, NR in completeness.

The most immediate answer to your question, though, is: don't use remote BLAST. Install a local copy.
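As a rough sketch of what a local run could look like, the snippet below assembles the same `blastn` call from the question without `-remote` and with `-num_threads`. The database name `nt`, the thread count, and the helper name `build_local_blastn_cmd` are assumptions for illustration. It also assumes the database has already been downloaded locally (for example with `update_blastdb.pl`, which ships with BLAST+) and that the `taxdb` files are present so that `sscinames` can be resolved.

```python
import shlex

# Output columns copied from the command in the question.
OUTFMT = "6 qseqid sseqid evalue bitscore sgi sacc staxids sscinames scomnames stitle"


def build_local_blastn_cmd(query, db="nt", out="blastoutput_unassembled.txt", threads=16):
    """Return a blastn argument list for a local (non-remote) search.

    `db` and `threads` are illustrative defaults, not recommendations.
    """
    return [
        "blastn",
        "-query", query,
        "-db", db,                     # local database, so no -remote flag
        "-out", out,
        "-outfmt", OUTFMT,
        "-max_target_seqs", "1",
        "-num_threads", str(threads),  # use several CPU cores on one machine
    ]


if __name__ == "__main__":
    cmd = build_local_blastn_cmd("filtered.assembly.not.aligned.fasta")
    # Print a copy-pastable command line; to execute it from Python instead,
    # pass the list to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```

Building the command as a list keeps the quoting of the `-outfmt` string unambiguous, whether it is printed for the shell or handed to `subprocess`.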

10 weeks ago by Joe 21k

0

Thank you for the suggestion. Do you know of any NCBI database only for bacteria, viruses, and fungi? I have run Kraken and did not get what I was looking for since many of the reads were unclassified.

10 weeks ago by anikcropscience ▴ 230

0

Which Kraken database did you use? I agree that running a local version of BLAST should dramatically improve runtime, but I think I would alter the approach a bit. If the majority of your reads are not being classified as expected by a tool like Kraken, I would look closely at a handful of the unclassified reads: BLAST them, and check them against the other available databases.

It sounds like you may have a contamination issue. Provided you used the correct one, the Kraken databases are very good at general classification, so reads going unclassified sets off a few alarm bells for me.
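To pull out a handful of unclassified reads for a manual BLAST, something along these lines could work. It is a minimal sketch that assumes Kraken 2's standard tab-separated per-read output, where the first column is `C` or `U` and the second is the read ID. The function names and the demo lines are made up for illustration.

```python
import random


def unclassified_ids(kraken_lines):
    """Collect read IDs flagged 'U' in Kraken 2's standard per-read output."""
    ids = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 1 is 'C' (classified) or 'U' (unclassified); column 2 is the read ID.
        if len(fields) >= 2 and fields[0] == "U":
            ids.append(fields[1])
    return ids


def pick_handful(ids, n=10, seed=0):
    """Return up to n IDs chosen at random (a fixed seed keeps the pick reproducible)."""
    rng = random.Random(seed)
    return rng.sample(ids, min(n, len(ids)))


if __name__ == "__main__":
    # Made-up example lines; in practice, iterate over the file written by
    # kraken2's --output option instead.
    demo = [
        "C\tread_1\t562\t1500\t562:1466\n",
        "U\tread_2\t0\t900\t0:866\n",
        "U\tread_3\t0\t1200\t0:1166\n",
    ]
    for read_id in pick_handful(unclassified_ids(demo), n=2):
        print(read_id)
```

The selected IDs can then be used to extract the matching sequences from the FASTA/FASTQ file and submit just those few to BLAST.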

10 weeks ago by dthorbur ★ 1.9k

0

I used the Standard and Viral databases from this source: https://benlangmead.github.io/aws-indexes/k2

I have around 120K reads; 50% of them were classified and the other half remained unclassified.

10 weeks ago by anikcropscience ▴ 230

0

NCBI databases aren't broken down in that way as such; you have to filter by taxonomy ID (taxid).
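For a local search, the taxid filtering could be sketched as below, using the `-taxids` option of BLAST+ (which needs a version 5 database). The NCBI Taxonomy IDs used are 2 (Bacteria), 10239 (Viruses) and 4751 (Fungi); the helper name is hypothetical. Depending on the BLAST+ version, higher-level IDs such as these may first need to be expanded to species-level IDs, which the `get_species_taxids.sh` script shipped with BLAST+ can do, so treat this as a starting point and check your version's documentation.

```python
# NCBI Taxonomy IDs for the three groups asked about.
GROUP_TAXIDS = {"Bacteria": 2, "Viruses": 10239, "Fungi": 4751}


def taxid_filter_args(groups):
    """Return the extra blastn arguments that limit hits to the given groups."""
    taxids = [str(GROUP_TAXIDS[group]) for group in groups]
    # -taxids takes a comma-separated list of taxonomy IDs.
    return ["-taxids", ",".join(taxids)]


if __name__ == "__main__":
    # These arguments would be appended to a local blastn command line.
    print(" ".join(taxid_filter_args(["Bacteria", "Viruses", "Fungi"])))
```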

10 weeks ago by Joe 21k

score 2 · Answer 1 · 2024-02-15

2

10 weeks ago

pbioinf ▴ 70

I've had fairly pleasant experiences using BAT/CAT (https://github.com/dutilh/CAT). It is DIAMOND-based, so it should be faster than BLAST while using the same alignment-based logic. You'll still need a decently powerful HPC for it.

0

Thanks a lot for suggesting this.

10 weeks ago by anikcropscience ▴ 230

score 1 · Answer 2 · 2024-02-15

1

10 weeks ago

Mensur Dlakic ★ 27k

For your specific problem, more memory, many CPUs, and a faster disk are the only ways to speed up the process.
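One way to put many CPUs to work, beyond `blastn`'s own `-num_threads` option, is to split the contigs into chunks and run one local BLAST job per chunk (for example as separate jobs on a cluster). The following is only a sketch of the splitting step, with hypothetical helper names and a tiny made-up input. It does not launch BLAST itself.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_round_robin(records, n_chunks):
    """Distribute records over n_chunks lists so each chunk gets a similar count."""
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


if __name__ == "__main__":
    # Made-up contigs; in practice, pass an open FASTA file handle to read_fasta.
    demo = [">c1", "ACGT", ">c2", "GG", "CC", ">c3", "TTTT"]
    for idx, chunk in enumerate(split_round_robin(read_fasta(demo), 2)):
        print(idx, [name for name, _ in chunk])
```

Each chunk would then be written to its own FASTA file and searched independently, and the tabular outputs concatenated afterwards. Round-robin assignment balances the number of contigs per chunk, not their total length.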

Another way of classifying contigs without BLASTing is to bin the contigs and then annotate using GTDB classification:

This typically takes hours or a day at most.

0

Oh that is cool to know. Thank you. I will check it out.

10 weeks ago by anikcropscience ▴ 230