Question

How to perform a BLAST search with a large number of query sequences?

0

9.8 years ago

siddharth.avadhanam ▴ 30

I need to BLAST a large number of query sequences in one go. How might I go about this? I have an .rtf file containing a large number of query sequences in this format:

"MNKNEFTSIEVIPGYLGGKPFIKGTGVRVSEILDLLLAGIS
ILREYPGICNHDIDSAVSFLEAKLEMARQSQYTHEKVS"
"MNHIVYKNLKNYKYQLVKSYNFQTEIKTDLSLKIRKSEVKVFVN
LDPEGLLKIEAGYAWDGPSGPTIDTKTFIRGSLIHDALYQLMREEKLDRIKYRENADQ
LKKICLEDGMNSFRASYVYQFVRWFGESAARPKDESKEWEVAP"

where the sequences are delimited by double quotes ("). Any ideas on how I might run BLAST searches on each of them against the same database in one go?

sequence blast • 5.9k views

updated 2.5 years ago by Ram 44k • written 9.8 years ago by siddharth.avadhanam ▴ 30

Ram · Answer 1 · 2014-10-08

4

9.8 years ago

Michael 54k

Convert your input into FASTA format, then run a local BLAST search.

updated 2.5 years ago by Ram 44k • written 9.8 years ago by Michael 54k

0

First I would save the RTF file as plain text, and then try to write a script to convert this into FASTA format. You will need to invent identifiers. Watch out for different quote characters (e.g. pretty left and right quotes) which may complicate this.

In future, create a plain text FASTA file directly whenever you collect sequences manually, and give them useful identifiers too.
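A minimal sketch of such a conversion script, assuming the RTF has already been saved as plain text and that every sequence sits between a pair of double quotes, as in the question. The function name `quoted_text_to_fasta` and the invented identifiers (`seq_1`, `seq_2`, ...) are illustrative only.

```python
import re


def quoted_text_to_fasta(text, prefix="seq", width=60):
    """Convert quote-delimited sequences into FASTA text.

    Assumes each sequence is wrapped in double quotes and may span
    several lines. Identifiers are invented as <prefix>_<n>.
    """
    # Normalise curly ("pretty") quotes to plain double quotes.
    for curly in "\u201c\u201d":
        text = text.replace(curly, '"')

    records = []
    for chunk in re.findall(r'"([^"]+)"', text):
        # Drop line breaks and spaces inside the sequence.
        seq = re.sub(r"\s+", "", chunk)
        if not seq:
            continue
        # Wrap the sequence at a fixed line width.
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        header = f">{prefix}_{len(records) + 1}"
        records.append(header + "\n" + "\n".join(lines))
    return "\n".join(records) + ("\n" if records else "")
```

Reading the plain-text file, passing its contents to this function, and writing the result to a `.fasta` file would give a single multi-sequence query file.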

9.8 years ago by Peter 6.0k

0

I didn't manually collect them. I extracted these sequences from a .gbk file using a Python script. I think I've got it appropriately formatted now. I could use some help with the local BLAST stuff though. Is there any guide or tutorial that you could point me to? The NCBI website has me really confused.

updated 2.5 years ago by Ram 44k • written 9.8 years ago by siddharth.avadhanam ▴ 30

0

In that case, I would fix your Python script to get the protein sequences from GenBank files output directly in FASTA format. See e.g. http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/genbank2fasta/
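As a rough illustration of going from GenBank to FASTA directly, here is a standard-library-only sketch that pulls every `/translation="..."` qualifier out of GenBank text. It is a simplification, not the approach described at the link: it invents identifiers (`protein_1`, ...) instead of using locus tags or protein IDs. A real GenBank parser such as Biopython's `SeqIO` is the more robust option.

```python
import re

# Matches the protein translation qualifier of a CDS feature,
# which may be wrapped over several lines in a GenBank file.
TRANSLATION_RE = re.compile(r'/translation="([^"]+)"')


def genbank_translations_to_fasta(genbank_text, prefix="protein"):
    """Return FASTA text with one record per /translation qualifier.

    Identifiers are invented (<prefix>_<n>); this sketch does not
    read locus tags, gene names, or protein IDs.
    """
    records = []
    for n, match in enumerate(TRANSLATION_RE.finditer(genbank_text), start=1):
        # Remove the line breaks and indentation inside the qualifier.
        seq = re.sub(r"\s+", "", match.group(1))
        records.append(f">{prefix}_{n}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")
```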

9.8 years ago by Peter 6.0k

Ram · Answer 2 · 2014-10-08

As said above, you need to have your input in FASTA format and set up a local BLAST installation. Some important tips:

Use the latest BLAST+ executables and not the legacy ones. They are a lot more up to date, have fewer bugs, and generally work better.
Be sure to put your queries in one single FASTA file rather than splitting them. This is called query concatenation, and it speeds up the searches a lot. See the "Concatenation of queries" section here.
If you want XML output and use query concatenation, beware that there are some inconsistencies in the XML output, which are being discussed right now. Future versions of BLAST+ will probably correct that.
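To make the tips above concrete, here is a hedged sketch of driving a local BLAST+ search from Python. It assumes the BLAST+ programs `makeblastdb` and `blastp` are installed and on the PATH, and that both queries and database are protein sequences. The file names (`queries.fasta`, `database.fasta`, `mydb`, `results.tsv`) and the default e-value and thread count are placeholders, not recommendations.

```python
import shutil
import subprocess


def makeblastdb_command(fasta, db_name, dbtype="prot"):
    """Build the command that formats a FASTA file as a BLAST database."""
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


def blastp_command(query, db_name, out, evalue=1e-5, threads=4):
    """Build a blastp command for one multi-sequence query file.

    All queries live in a single FASTA file (query concatenation);
    -outfmt 6 requests tab-separated output instead of XML.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db_name,
        "-out", out,
        "-outfmt", "6",
        "-evalue", str(evalue),
        "-num_threads", str(threads),
    ]


def run(cmd):
    """Run a command, failing early if the executable is missing."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found; is BLAST+ installed?")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder file names; replace with your own.
    run(makeblastdb_command("database.fasta", "mydb"))
    run(blastp_command("queries.fasta", "mydb", "results.tsv"))
```

Tabular output (`-outfmt 6`) is used here partly because it sidesteps the XML inconsistencies mentioned above. Each row carries the query identifier, so results from a concatenated query file can still be separated per sequence.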

Ram · Answer 3 · 2014-10-08

0

9.8 years ago

vaskin90 ▴ 290

There are BLAST elements in UGENE Workflow Designer. You can create a scheme with either local or remote BLAST elements and feed it with all the sequences that you want to process.

updated 2.5 years ago by Ram 44k • written 9.8 years ago by vaskin90 ▴ 290