I've run into a problem generating PSSMs from a multiple-sequence query with PSI-BLAST. I want a separate PSSM output for each query sequence, or at least one large file containing every sequence's PSSM.
With a single-sequence query, this is easy.
With a multiple-sequence query, only the PSSM for the last sequence in the FASTA input is written to the PSSM file, even though the main PSI-BLAST output contains results for every sequence, with the number of iterations I specified.
Is there any way to solve this?
EDITED: I've thought about putting each sequence in a separate FASTA file and running them individually. In my case that is hard, because I would need to submit about 15 thousand jobs to my college server, which would use a lot of resources and affect other users' jobs. (end of edit)
My command is something like this:
psiblast -query ./a_multiple_seq_test.fasta -db nr -num_iterations 2 -out_ascii_pssm pssm.chk
Thanks in advance. Any suggestions will be highly appreciated.
Have you considered running separate jobs for each of your query sequences?
I've edited the OP. Thanks for your answer :D
If you have 15,000 sequences and need a PSSM for each one, then there is no other way but to run that many jobs. If there are redundant sequences, you may be able to remove the redundancy with a tool like CD-HIT and use only the remaining unique sequences.
That's sad to hear. Since the psiblast output includes results for each sequence, I thought I could somehow get the PSSM for each sequence in a single file. Maybe I'll have to divide my input into chunks and submit those. Thank you.
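One way to avoid thousands of separate submissions is to loop over the sequences inside a single job (or a handful of chunked jobs) and call psiblast once per sequence, so that each call writes its own PSSM. The following is a minimal Python sketch of that idea, not a tested pipeline. It reuses only the psiblast options from the command in the question; the function names, the output directory layout, and the file-naming scheme are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def safe_name(header, index):
    """Build a file-name-friendly label from the first word of the header."""
    parts = header.split()
    first = parts[0] if parts else "seq"
    cleaned = "".join(c if c.isalnum() or c in "._-" else "_" for c in first)
    return f"{index:05d}_{cleaned}"


def psiblast_command(query, pssm, db="nr", iterations=2):
    """Assemble the psiblast call from the question for one query file."""
    return ["psiblast", "-query", str(query), "-db", db,
            "-num_iterations", str(iterations), "-out_ascii_pssm", str(pssm)]


def run_all(fasta, out_dir, db="nr", iterations=2):
    """Run psiblast once per sequence, writing one PSSM file per sequence."""
    out_dir = Path(out_dir)  # hypothetical layout: everything in one directory
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(fasta) as handle:
        records = list(parse_fasta(handle))
    for index, (header, sequence) in enumerate(records, start=1):
        name = safe_name(header, index)
        query = out_dir / f"{name}.fasta"
        pssm = out_dir / f"{name}.pssm"
        if pssm.exists():
            continue  # already computed; lets an interrupted job resume
        query.write_text(f">{header}\n{sequence}\n")
        subprocess.run(psiblast_command(query, pssm, db, iterations), check=True)
```

Calling `run_all("a_multiple_seq_test.fasta", "pssm_out")` from a single submitted job would then process the sequences one after another. To spread the work over a few jobs rather than 15 thousand, the input could be divided into several FASTA chunks and `run_all` called on each chunk in its own job.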
I am not exactly sure what you are trying to do (do you just need a PSSM for each sequence?). Perhaps there is a better, simpler way.
I don't have the answer, but I understand what you are trying to do. When I needed profiles (I forget whether it was HMM, PSSM, or both) for a large proteome, I submitted 100 thousand jobs to Sun Grid Engine, and SGE got stuck (ouch!). After that, I wrote a script that checked the number of jobs in the queue every 5 minutes and, whenever that number was small enough, submitted some of the remaining jobs.
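A throttled submission loop of the kind described above might look like the following Python sketch. It is not the original script: it assumes an SGE-style scheduler where `qstat -u <user>` lists one job per line beginning with a numeric job ID and `qsub <script>` submits a job script. The function names, the queue limit of 200, and the 5-minute polling interval are illustrative assumptions.

```python
import subprocess
import time


def count_jobs(qstat_output):
    """Count job lines in qstat output (lines whose first field is a job ID)."""
    count = 0
    for line in qstat_output.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():
            count += 1
    return count


def next_batch(pending, in_queue, max_in_queue):
    """Split pending scripts into (submit now, keep waiting)."""
    free = max(0, max_in_queue - in_queue)
    return pending[:free], pending[free:]


def submit_throttled(scripts, user, max_in_queue=200, poll_seconds=300):
    """Submit job scripts gradually so the queue never exceeds max_in_queue."""
    pending = list(scripts)
    while pending:
        # Assumption: the scheduler provides SGE-style qstat and qsub commands.
        result = subprocess.run(["qstat", "-u", user],
                                capture_output=True, text=True, check=True)
        batch, pending = next_batch(pending, count_jobs(result.stdout),
                                    max_in_queue)
        for script in batch:
            subprocess.run(["qsub", script], check=True)
        if pending:
            time.sleep(poll_seconds)  # wait before checking the queue again
```

Keeping the counting and batching logic in small pure functions makes it possible to check them against sample `qstat` output before pointing the loop at a real cluster.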
Depending on how stringent your search is and how close the matches in BLAST are, it may not take as long as you think. There are also BLAST-like tools that run much faster than NCBI BLAST. PSI-BLAST may add a bit of extra compute time, but I've BLASTed hundreds of thousands of NGS reads in the past, generating 191 million BLAST hits, and it completed in under a week on our private server, using only about half the available cores (~18).