Hi, I have been having some difficulties with blasts I am running on several species of fish. I have been trying to combine the FASTA files as it will make my blasts much easier but have noticed some oddities. If I run a blast on a database of two fish the results I get appear to be missing matches I get when I run a blast on the two fish as individuals. The code I used for the combined blast and individual are below and both do yield results, but it seems odd to me that there sometimes isn't overlap with individuals when there should be.
blastn –db lamprey_zebra.fasta –query zAsic2.fasta -task blastn -outfmt 7 -max_target_seqs 100 blastn –db zebra.fasta –query zAsic2.fasta -task blastn -outfmt 7 -max_target_seqs 100
I thought this may be an issue as I manually combined the FASTA files by converting to .txt copy and pasting and then converting back to .fa. So I tried to combined through powershell using
cat *.fa > combined.fa
As I had seen this was used to combine FASTA files by others. However, when I try to convert that into a database I get the error
Blast option error: file db\combined.fa does not match the input format type, default input type is FASTA. So I am currently stuck as my manually combined FASTA files are not recognizing matches that should be there and combining them another way has also hit a roadblock. Any help would be greatly appreciated!
Thanks very much for your response. In general the returns I am getting are 5 or less hits so It's not due to too many hits, I was wondering what the code to lift my e-value cut off? Ideally I'd want to combine them as I have to run a lot of blasts against the species. I'll have a go with blastdb_aliastool as this looks like it could hopefully be what I'm looking for. Cheers, D
I would simply try with the aliastool combined db first, if you are still missing potential hits form the combined vs. each alone try parameter
-evalue 100even though this then is a pretty bad hit. Also, try to remove the -max_target_seqs parameter anyway, default is 500. The documentation of this parameter says: "Ties are broken by order of sequences in the database." So this is the most likely culprit.
If that doesn't help, I would need to see the exact command calls and blast versions used, starting from makeblastdb, blastdb_aliastool, and blast calls. Ideally also some reproducible input data or examples of alignments found in one setting but not the other.
Thank you very much Michael. I will be able to give these steps a go over the weekend and hopefully it will resolve the issues! I'll let you know how it goes either way! Cheers, D
Thanks again for your help - I am unfortunately still having some difficulty as even the database combined using
blastdb_aliastoolis only returning results from one of the databases included and changing the evalue inflates the matches beyond what I had expected.
First I created 5 databases using:
Next I combined them using:
Finally I blasted the combined database using:
and the individuals using:
The combined database only returned the same results as the elasmo_full (4 hits) when the combined should be returning a total of 21 which was the sum of all the individual blasts.
If I change the evalue to 100 then I get 43 hits. I guess I don't understand why the change is so different between the individuals as elasmo_full is by far the biggest file containing 9815 sequences, while the rest all contain 26,26,13,13. The % identity matches are all above 80%, with most being 90-100%. If I inflate the evalue, would the results not be considered very valuable or noteworthy? As an additional note an evalue of 23 gives 16 returns and the next jump is 78 which moves to the 43 returns.
Any further insight would be greatly appreciated!
Ok, so we are nearing to build a minimal reproducible case here. Some more things to try:
Edit: I think I got it: it is in fact due to the imbalanced size of your db's. If you combine a large database and a few small ones, the large db predominantly determines all E-values. When you blast against a tiny db essentially any hit has E-value close to 0, when combined with the large db, the same sequence now gets scored with the size of the much larger db. Even though it might be a little tricky to calculate exact e-values from blast, the sequence sizes (query and db) factor in as factors m,n in the formula (see here: Blast E Value Calculation )
E = K m n exp (- lambda S)
For example, if your largest database is ~1000x the size of the smallest, after combining them, the E-values for hits to the smallest db would be >1000x inflated, while the E-values of hits to the largest db would stay almost the same.
Unfortunately, that doesn't leave you many options, except balancing the db sizes by adding more sequences or filtering on score, identity, and query coverage instead of e-value
Could you please, build all databases with
-parse_seqids. That may give important hints during db building and makes it easier to extract hits. (sorry, you did this)
Could you change the output format to standard (0) and inspect the different hits. Do you notice anything, like complete duplicates?
After you run blastn separately, please extract all the sequences that have a hit, and inspect the sequences. Do you see any pattern, like same identifiers, identical sequences? Build a blast database only containing the hit sequences, does the problem persist?
Is all this from public data, in that case, could you share all the fasta files on figshare?
Alternatively, share the genbank/refseq genome ids for the species genomes, but only if they were not further in any way (except renamed) processed after download. Still the query file would be required.
And I think elasmo_full refers to all sharks, so this file may already be a compound file, so I need this too.
Hi Michael, Thanks for your response - after playing around with the db's more from reading your comments I believe you are correct that it is the significant size difference that is causing the issue with the matches. I have decided to try to filter on other measures as you suggested but wanted to double check score referred to bitscore (and what kind of score I would be aiming for?) and what query coverage is?
The table headings I get currently from outputs are: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue and bit score. Thanks again - your insights on this have been extremely helpful.
Hi, great it helps. So, yes I think bit score should be fine, from the Blast book:
Most importantly, it doesn't depend on the database size.
Regarding the output format, I think at least you should include
qcovsfor query coverage. I also regularly recommend to use generic asn1 as output format, then further use blastformatter to convert the output into the exact format required, especially for long running jobs, as there is no way to restore the alignment from tabular blast format. It might not be necessary for your case, but if you for example wish to inspect the alignments using default text format, you have to run blast again, if not using asn1.
I hope this helps and good luck with your project.
Hi Michael, I was just curious about what qcov is actually trying to tell me and the numbers I am getting. All my qcov scores are 3 or less, but the % identity is mostly 100. Is this because my query sequence is much smaller than the whole genomes I am running it against? I am trying to match small genetic sequences known to code for certain things from other species against whole genomes of sharks (and testing it against whole genomes of related species too). Interestingly when I run the small sequence against the whole genome of the species it is taken from I get a bitscore of 22.9 and a qcov of 0 (a lot of the hits from sharks are getting higher bit scores and qcovs of 1-3). Let me know your thoughts and thanks again for your guidance and help! Cheers, D
Sorry I made a mistake, the name of the parameter
qcovsnot qcov and should be a percentage. Could you try again? For an explanation see BLAST definition and difference between 'qcovs' and 'qcovhsp'
Using the following format modifier
% query coverage per subject appears in the last column.
Hi Michael, I figured out that the databases I was using were only partial sequences and not the complete genome - now that I have the complete genome I am getting more appropriate scores! Thanks again for your help!