Use BLAST Command Line Applications on combined FASTA files
2
0
Entering edit mode
29 days ago
daver.v ▴ 20

Hi, I have been having some difficulties with blasts I am running on several species of fish. I have been trying to combine the FASTA files as it will make my blasts much easier but have noticied some oddities. If I run a blast on a database of two fish the results I get appear to be missing matches I get when I run a blast on the two fish as individuals. The code I used for the combined blast and individual are below and both do yield results, but it seems odd to me that there sometimes isn't overlap with individuals when there should be.

blastn –db lamprey_zebra.fasta –query zAsic2.fasta -task blastn -outfmt 7 -max_target_seqs 100

blastn –db zebra.fasta –query zAsic2.fasta -task blastn -outfmt 7 -max_target_seqs 100).

I thought this may be an issue as I manually combined the FASTA files by converting to .txt copy and pasting and then converting back to .fa. So I tried to combined through powershell using

cat *.fa > combined.fa

As I had seen this was used to combine FASTA files by others. However, when I try to convert that into a database I get the error "Blast option error: file db\combined.fa does not match the input format type, default input type is FASTA". So I am currently stuck as my manually combined FASTA files are not recognizing matches that should be there and combining them another way has also hit a roadblock. Any help would be greatly appreciated! Thanks, D

Line code Beginner Applications BLAST command • 743 views
2
Entering edit mode
29 days ago

When you combine databases, you get a larger database and larger databases lead to larger (as of worse) E-values. In addition, when using -max_target_seqs and one species has many good hits it may have used all the 100 hits before the other ones get any. In addition blast may do more adjustments. There is nothing wrong with running multiple blasts against single species if your task is exploratory. Otherwise just lift your e-value cutoff.

For your second question, for combining blast databases better use blastdb_aliastool. You can save the error-prone conversion steps and space for the intermediate data, and can still use the combined or each individual db. In your case, it might for example be a problem with unix vs. windows line breaks, or a *.fa file could not contain what you think it does.

0
Entering edit mode

Thanks very much for your response. In general the returns I am getting are 5 or less hits so It's not due to too many hits, I was wondering what the code to lift my e-value cut off? Ideally I'd want to combine them as I have to run a lot of blasts against the species. I'll have a go with blastdb_aliastool as this looks like it could hopefully be what I'm looking for. Cheers, D

1
Entering edit mode

I would simply try with the aliastool combined db first, if you are still missing potential hits form the combined vs. each alone try parameter -evalue 100 even though this then is a pretty bad hit. Also, try to remove the -max_target_seqs parameter anyway, default is 500. The documentation of this parameter says: "Ties are broken by order of sequences in the database." So this is the most likely culprit.

If that doesn't help, I would need to see the exact command calls and blast versions used, starting from makeblastdb, blastdb_aliastool, and blast calls. Ideally also some reproducible input data or examples of alignments found in one setting but not the other.

0
Entering edit mode

Thank you very much Michael. I will be able to give these steps a go over the weekend and hopefully it will resolve the issues! I'll let you know how it goes either way! Cheers, D

0
Entering edit mode
26 days ago
daver.v ▴ 20

Hi Michael, Thanks again for your help - I am unfortunately still having some difficulty as even the database combined using blastdb_aliastool is only returning results from one of the databases included and changing the evalue inflates the matches beyond what I had expected. First I created 5 databases using:

makeblastdb -in db\elasmo_full.fasta -parse_seqids -blastdb_version 5 -title "elasmo_full" -dbtype nucl

makeblastdb -in db\Hagfish.fasta -parse_seqids -blastdb_version 5 -title "Hagfish" -dbtype nucl

makeblastdb -in db\Lamprey.fasta -parse_seqids -blastdb_version 5 -title "Lamprey" -dbtype nucl

makeblastdb -in db\mugloogobius_abei.fasta -parse_seqids -blastdb_version 5 -title "mugloogobius_abei" -dbtype nucl

makeblastdb -in db\Zebrafish.fasta -parse_seqids -blastdb_version 5 -title "Zebrafish" -dbtype nucl

Next I combined them using:

blastdb_aliastool -dblist "elasmo_full.fasta Hagfish.fasta Lamprey.fasta mugloogobius_abei.fasta Zebrafish.fasta" -dbtype nucl -out fish_all -title "fish_all"

Finally I blasted the combined database using:

blastn –db fish_all –query zAsic2.fasta -task blastn -outfmt 7

and the individuals using:

blastn –db elasmo_full.fasta –query zAsic2.fasta -task blastn -outfmt 7

blastn –db Hagfish.fasta –query zAsic2.fasta -task blastn -outfmt 7

blastn –db Lamprey.fasta –query zAsic2.fasta -task blastn -outfmt 7

blastn –db mugloogobius_abei.fasta –query zAsic2.fasta -task blastn -outfmt 7

blastn –db Zebrafish.fasta –query zAsic2.fasta -task blastn -outfmt 7

The combined database only returned the same results as the elasmo_full (4 hits) when the combined should be returning a total of 21 which was the sum of all the individual blasts. If I change the evalue to 100 then I get 43 hits. I guess I don't understand why the change is so different between the individuals as elasmo_full is by far the biggest file containing 9815 sequences, while the rest all contain 26,26,13,13. The % identity matches are all above 80%, with most being 90-100%. If I inflate the evalue, would the results not be considered very valuable or noteworthy? As an additional note an evalue of 23 gives 16 returns and the next jump is 78 which moves to the 43 returns. Any further insight would be greatly appreciated! Many Thanks, D

1
Entering edit mode

Ok, so we are nearing to build a minimal reproducible case here. Some more things to try:

Edit: I think I got it: it is in fact due to the imbalanced size of your db's. If you combine a large database and a few small ones, the large db predominantly determines all E-values. When you blast against a tiny db essentially any hit has E-value close to 0, when combined with the large db, the same sequence now gets scored with the size of the much larger db. Even though it might be a little tricky to calculate exact e-values from blast, the sequence sizes (query and db) factor in as factors m,n in the formula (see here: Blast E Value Calculation )

E = K m n exp (- lambda S)

For example, if your largest database is ~1000x the size of the smallest, after combining them, the E-values for hits to the smallest db would be >1000x inflated, while the E-values of hits to the largest db would stay almost the same.

Unfortunately, that doesn't leave you many options, except balancing the db sizes by adding more sequences or filtering on score, identity, and query coverage instead of e-value

• Can you get a difference in hits with only using two databases? Would make it easier to debug.
• Could you please, build all databases with -parse_seqids. That may give important hints during db building and makes it easier to extract hits. (sorry, you did this)

• Could you change the output format to standard (0) and inspect the different hits. Do you notice anything, like complete duplicates?

• After you run blastn separately, please extract all the sequences that have a hit, and inspect the sequences. Do you see any pattern, like same identifiers, identical sequences? Build a blast database only containing the hit sequences, does the problem persist?

• Is all this from public data, in that case, could you share all the fasta files on figshare?

Alternatively, share the genbank/refseq genome ids for the species genomes, but only if they were not further in any way (except renamed) processed after download. Still the query file would be required.

And I think elasmo_full refers to all sharks, so this file may already be a compound file, so I need this too.

0
Entering edit mode

Hi Michael, Thanks for your response - after playing around with the db's more from reading your comments I believe you are correct that it is the significant size difference that is causing the issue with the matches. I have decided to try to filter on other measures as you suggested but wanted to double check score referred to bitscore (and what kind of score I would be aiming for?) and what query coverage is?

The table headings I get currently from outputs are: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue and bit score. Thanks again - your insights on this have been extremely helpful.

Cheers,

D

0
Entering edit mode

Hi, great it helps. So, yes I think bit score should be fine, from the Blast book:

bit score - The bit score, S', is derived from the raw alignment score, S, taking the statistical properties of the scoring system into account. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.

Most importantly, it doesn't depend on the database size.

Regarding the output format, I think at least you should include qcovs for query coverage. I also regularly recommend to use generic asn1 as output format, then further use blastformatter to convert the output into the exact format required, especially for long running jobs, as there is no way to restore the alignment from tabular blast format. It might not be necessary for your case, but if you for example wish to inspect the alignments using default text format, you have to run blast again, if not using asn1.

I hope this helps and good luck with your project.

0
Entering edit mode

Hi Michael, I was just curious about what qcov is actually trying to tell me and the numbers I am getting. All my qcov scores are 3 or less, but the % identity is mostly 100. Is this because my query sequence is much smaller than the whole genomes I am running it against? I am trying to match small genetic sequences known to code for certain things from other species against whole genomes of sharks (and testing it against whole genomes of related species too). Interestingly when I run the small sequence against the whole genome of the species it is taken from I get a bitscore of 22.9 and a qcov of 0 (a lot of the hits from sharks are getting higher bit scores and qcovs of 1-3). Let me know your thoughts and thanks again for your guidance and help! Cheers, D

0
Entering edit mode

Sorry I made a mistake, the name of the parameter qcovs not qcov and should be a percentage. Could you try again? For an explanation see BLAST definition and difference between 'qcovs' and 'qcovhsp'

Using the following format modifier

    -outfmt "7 std qcovs"


% query coverage per subject appears in the last column.

0
Entering edit mode

Hi Michael, I figured out that the databases I was using were only partial sequences and not the complete genome - now that I have the complete genome I am getting more appropriate scores! Thanks again for your help!

Cheers,

D