Hello everyone, I made a blast research and I got the output in the table format 6.
So I got several hits and I would like to retrieve each fasta sequences from the blast database with the start and end corresponding of the hit.
I saw that that the command blastdbcmd should be able do to the trick but I cannot find where you put the blast output to indicate where to take the sequence in the database (with start and end informations).
Thank you for your help.
I don't think you can do exactly what you want to do with
blastdbcmd, as I don't think takes blast output specifically. According to the docs, it looks like you can specify coordinates based on the identifiers though:Extract different sequence ranges from the BLAST databases
The command below will extract two different sequences: bases 40-80 in human chromosome Y (GI 13626247) with the masked regions in lowercase characters (notice argument 30, the masking algorithm ID which is available in this BLAST database) and bases 1-10 in the minus strand of human chromosome 20 (GI 14772189).
I'm rusty when it comes to
blastdbcmdbut I think broadly what you'll need to do is: - Convert blast output to just a list of IDs of interest. - Pass this file with coordinates in a manner that matches the IDs of the sequences you created the database from, and print each sequence accordingly.For instance, in the above, you're replace the
printfcommand with a manipulation of the columnar data from the blastoutput (seecut/awketc).Hi everyone, I have some news concerning the idea you proposed me. The issue is in term on time because if I write a command line for each hit blast, it is about
310 000commande lines to execute in order to recover the fasta sequences in the database corresponding to the range start and end.So I have my file with the 310 000
blastdbcmd -db ....to execute by doingfor line in read -r file_script.txt line ; do command "$line"; doneand it takes a lot a time, at least 4 hours to execute all the blastdbcmd commande in the file.So I was wondering if there where not another faster way to recover a fasta sequences in database with the range start and end ?
Please don’t add answers unless you are actually answering the top level post.
As for your speed issue, retrieving that many hits from a blast database is never going to be especially fast, but you should be able to parallelise the process using
GNU parallelinstead of a loop.You could use a RAM disk or something (if you have plenty of RAM available) and read the database into memory. That would be about the fastest way to do this using blast.
You could also look into using
pyfaidxorsamtools faidx(e.g. Extract User Defined Region From An Fasta File ) with intervals to get the sequence from original database file. This would likely be the fastest way.