How do you limit the result from the BLAST executable?
1
0
Entering edit mode
8 weeks ago

Hello! I am sorry if this question seems silly, but I wanted to ask if there is a way to limit the result that I get from running the BLAST command on my samples.

I am trying to do determine the microbial composition of environmental samples by sequencing 16S amplicons from environmental samples. After trimming for the 16S region and doing QC, I converted the FASTQ files to FASTA and used blastn against the NCBI's 16S Ribosomal RNA database. After that, I import the blastn files into MEGAN and start my analysis. This is so that I can then inspect/extract the reads associated with the species/genus later.

My question comes from the fact that running thousands of reads through the blastn program lead to VERY large files, with running about 100,000 reads returning files that are more than 250 GB.

The command that I used is as follows:

blastn -db ~/NCBIdb/16S_ribosomal_RNA -query query.fasta -num_threads 12 -out query.fasta.blastn


I tried the -max_target_seqs option with a value of 100 and compared it to the default 500, and I noticed very big changes to the bacterial composition of my sample.

This led me down the rabbit hole, with Shah et. al. and the NCBI team, and a whole lot of other searching, but I still could not find out whether using the option is advisable or not.

Thus, I was wondering if anyone had tried doing the same thing; is it better to stick to the default 500 or go for a different value? I assumed that the -max_target_seqs option would give me the best hit out of the whole database, but it seems to not be the case. Or is there another way to reduce the computational load and file size of the result? Because I have about 130 samples, all with more than 50,000 reads each.

Edit: Added some information in an attempt to make it clearer.

blast • 609 views
0
Entering edit mode

Sorry, but why are you even using BLAST for this? I presume you're trying to align short reads? You should be using something like Kraken2 (against its 16S database) instead. You should be able to find its 16S DBs here.

1
Entering edit mode

Thank you very much for the reply!

I'm sorry for the lack of information. I've sequenced 16S rRNA amplicons from environmental samples and have the FASTQ files as a result of the sequencing. After QCing the FASTQ files, I'm trying to determine what kind of bacteria there are in the sample, so I converted the FASTQs to FASTAs and I ran them against the 16S database to determine the bacterial species.

I'll check Kraken2 out too.

0
Entering edit mode

Thank you for clarifying that, and no worries!! I think Kraken2 would be the right tool here. Good luck!!

1
Entering edit mode
8 weeks ago
gb ★ 2.0k

Besides I am wondering how you will handle 100 results per read you can start with outfmt 6. I can't give a more detailed answer because I dont know your goal. So your command would look like blastn -db ~/NCBIdb/16S_ribosomal_RNA -query query.fasta -num_threads 12 -outfmt 6 -max_target_seqs 100 -out query.fasta.blastn

Mostly you don't just convert the FASTQ to FASTA with this kind of analyses. You first do quality and primer trimming, merge the forward and reverse reads (if possible), remove duplicates, use a cluster tool and then blast the clusters.

I personally like the the tools of usearch en vsearch and especially UNOISE:

https://drive5.com/usearch/manual/pipe_otus.html

https://drive5.com/usearch/manual/unoise_algo.html

Most other people use qiime (https://qiime2.org/)

0
Entering edit mode

Hello gb,

Thank you for the reply! Thank you for the idea on outfmt 6 and the tools recommendation.

I assumed that if I did not specify the output format, it would automatically be outfmt 6, but I'll try running the command with my downstream analysis. There has been debate on the usage of -max_target_seqs and I wonder if changing the output format would change anything?

1
Entering edit mode

The hits that you get won't change. As far as I know MEGAN can use the fmt 6 output. The LCA algorithm of MEGAN also only takes the top % hits based on a calculation with the bitscore (So maybe you can even do less than 100). Here you can find the blast defaults https://www.ncbi.nlm.nih.gov/books/NBK279684/. Good luck

0
Entering edit mode

Yes you are right that changing the option -max_target_seqs will change the results compared to default but changing the output format won't have any effect on the results. When you are working on large NGS data set like environmental sample, it is not even preferable to work with 100 tophits as it will increase overall run time along with output file size. If you want read wise hits better use diamond which is less sensitive compare to blast but much faster compare to blast making it preferable tool for metagenome analysis. There are 5 outfmt option in diamond which you can explore. You can then give diamond's output to MEGAN for further analysis.

0
Entering edit mode

diamond will not be usefull because the OP is not doing metagenome analysis. He/she is looking at a specific barcode (amplicon sequencing). diamond uses amino acid sequences as input (unless they changed it) and when translating a 16S sequence you would loose to much information. Besides the reference sequences are nucleotide sequences.

0
Entering edit mode

Ya true. I over looked that information that he is working with 16S amplicon and not metagenome.

0
Entering edit mode

Thank you gb and toralmanvar for the insights. You are right, in that my current workflow (blastn > megan) is consuming a lot of computational resources and the space. I thought that since I only amplified the 16S region of my environmental sample with PCR before sequencing, I could get away with a smaller -max_target_seqs but due to the heuristic approach of BLAST, it seems that that may not be the case. I suppose it is further compounded by the fact that I used single molecule sequencing, with it's error rate, I need more weight with the hits for each read.

Is there a way to reduce the computational load for this sort of workflow?

1
Entering edit mode

Better use complete 16S amplicon analysis pipeline like QIIME2 or as already mentioned by gb you can follow this:

Mostly you don't just convert the FASTQ to FASTA with this kind of analyses. You first do quality and primer trimming, merge the forward and reverse reads (if possible), remove duplicates, use a cluster tool and then blast the clusters.

Upto cluster generation stage you can again use QIIME2 and the representative set thus generated can be searched for similarity against NCBI's 16S database. Using only representative set will reduce the number of sequences for blast to many folds.

1
Entering edit mode

This question was already answered. You have to do a few steps before blasting. Because you did a PCR you have many duplicate sequences as well. Also dont forget to trim of the 16S primers (important). Another reason it does not make sense to blast your raw reads is because there will be a lot of "junk" reads that you dont need (to short, PCR errors, sequence errors, chimera's etc) that need to be filtered out for a good and useable end result even if computational load does not matter. I gave a simple explanation about dereplication a few years back: Removing Duplicates before aligning maybe it helps you to.