Hello! I am sorry if this question seems silly, but I wanted to ask if there is a way to limit the size of the output that I get from running BLAST on my samples.
I am trying to determine the microbial composition of environmental samples by sequencing their 16S amplicons.
After trimming for the 16S region and doing QC, I converted the FASTQ files to FASTA and used blastn against NCBI's 16S Ribosomal RNA database.
After that, I import the blastn output into MEGAN and start my analysis, so that I can later inspect and extract the reads assigned to each species or genus.
My question comes from the fact that running thousands of reads through blastn leads to VERY large files: running about 100,000 reads returns files that are more than 250 GB.
The command that I used is as follows:
blastn -db ~/NCBIdb/16S_ribosomal_RNA -query query.fasta -num_threads 12 -out query.fasta.blastn
I tried the -max_target_seqs option with a value of 100 and compared it to the default of 500, and I noticed very big changes in the bacterial composition of my sample.
This led me down the rabbit hole of Shah et al., the NCBI team, and a whole lot of other searching, but I still could not find out whether using the option is advisable or not.
Thus, I was wondering if anyone has tried doing the same thing: is it better to stick to the default of 500 or to go for a different value? I assumed that the -max_target_seqs option would give me the best hit out of the whole database, but that does not seem to be the case. Or is there another way to reduce the computational load and the file size of the result? I ask because I have about 130 samples, each with more than 50,000 reads.
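For scale, here is a rough lower-bound estimate of the total output. It assumes that output size grows linearly with the number of reads, which is an assumption rather than something measured, and the variable names are only illustrative:

```shell
# Back-of-the-envelope estimate of total blastn output size.
# ASSUMPTION: output size scales linearly with the number of query reads.
gb_per_100k_reads=250     # observed: ~100,000 reads gave more than 250 GB
samples=130               # approximate number of samples
reads_per_sample=50000    # lower bound; every sample has more than this

total_reads=$(( samples * reads_per_sample ))
total_gb=$(( total_reads * gb_per_100k_reads / 100000 ))

echo "total reads (lower bound): ${total_reads}"
echo "estimated output (lower bound): ${total_gb} GB"
```

Under that assumption, the full data set would produce on the order of 16 TB of BLAST output with the settings above.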
Thank you in advance,
Edit: Added some information in an attempt to make it clearer.