Getting BLAST Clusters Classifier performance for Protein classes
9.0 years ago
ddofer ▴ 30

I have several protein datasets to which I applied my machine learning method (ProFET) for classification.

The reviewers want a comparison of our method's performance against PSI-BLAST.

I have various sets of protein sequences, each in its own multi-fasta file, with each file corresponding to a functional group or class (e.g. "Neuropeptide" or "Not Neuropeptide").

I want to run an all-vs-all BLAST/PSI-BLAST on the data, build as many clusters as there are classes, and then see how well the clusters correspond to the classes. (I'll be doing this with a binary classification test case.)

I've never used BLAST locally, and I don't know any tools for doing this quickly. I just need to get the all vs all blast, get clusters from the distance matrix, then get the assignments to the clusters (and preferably the statistics).
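The clustering and scoring steps described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, assuming the distance matrix is already available as a list of lists. The function names (average_linkage, purity), the toy matrix, and the class labels are all hypothetical, and average-linkage clustering with a purity score is just one reasonable choice, not the only way to do this.

```python
from collections import Counter


def average_linkage(dist, k):
    """Merge the two closest clusters (mean pairwise distance) until k remain.

    Naive implementation, only suitable for small datasets.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best_pair, best_d = None, float("inf")
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if d < best_d:
                    best_pair, best_d = (a, b), d
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def purity(cluster_labels, true_labels):
    """Fraction of sequences that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, t in zip(cluster_labels, true_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(true_labels)


# Toy data (made up): two tight pairs of sequences, far from each other.
toy_dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
toy_classes = ["Neuropeptide", "Neuropeptide", "Not Neuropeptide", "Not Neuropeptide"]

toy_labels = average_linkage(toy_dist, k=2)
toy_purity = purity(toy_labels, toy_classes)
```

For anything beyond a few hundred sequences, scipy.cluster.hierarchy (linkage, fcluster) would be a faster way to do the clustering, and sklearn.metrics.adjusted_rand_score gives an agreement score that corrects for chance, which purity does not.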

My whole pipeline is with scikit-learn / python. (I'm a programming n00b).

Anything simple and fast would be great. Emphasis on simple, I just need this as a one-off.

Thank you very much!

proteins machine-learning sequence blast
9.0 years ago
mark.ziemann ★ 1.9k

Hi Ddofer, you will need to install BLAST+ from NCBI for your OS; it includes PSI-BLAST and all other BLAST flavours. Check out the docs for more info. The -db option expects a formatted BLAST database rather than a raw FASTA file, so build one with makeblastdb first. Running this BLAST job on Linux would look like this:

makeblastdb -in proteins.fa -dbtype prot
psiblast -db proteins.fa -query proteins.fa -out result.txt -outfmt 6 -max_target_seqs 500 -num_threads 8

Where proteins.fa is your multifasta file (makeblastdb names the database after it by default), result.txt is your output file, the output format is tabular, the maximum number of target sequences reported per query is 500, and the search runs on 8 threads. You can modify all of these parameters depending on your system and desired output.
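Since the rest of your pipeline is in Python, here is a rough sketch of turning the tabular output into a distance matrix. It assumes the default -outfmt 6 column order and uses only the standard library. The function names and the example rows are made up for illustration, and the distance itself (1 minus the bit score divided by the smaller of the two self-hit bit scores, with 1.0 for pairs without a hit) is just one possible normalization, not something BLAST provides.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def read_best_bitscores(handle):
    """Return {(query, subject): best bit score} from tabular BLAST output."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        pair = (hit["qseqid"], hit["sseqid"])
        score = float(hit["bitscore"])
        if score > best.get(pair, 0.0):
            best[pair] = score
    return best


def distance_matrix(best, ids):
    """Symmetric distances in [0, 1]; pairs without a hit get 1.0 (assumption)."""
    n = len(ids)
    dist = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = ids[i], ids[j]
            score = max(best.get((a, b), 0.0), best.get((b, a), 0.0))
            self_score = min(best.get((a, a), 0.0), best.get((b, b), 0.0))
            if score > 0 and self_score > 0:
                d = max(0.0, 1.0 - score / self_score)
                dist[i][j] = dist[j][i] = d
    return dist


# Made-up example rows; for real data use open("result.txt", newline="").
example = (
    "p1\tp1\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-30\t100.0\n"
    "p1\tp2\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-20\t60.0\n"
    "p2\tp2\t100.0\t50\t0\t0\t1\t50\t1\t50\t1e-30\t120.0\n"
    "p2\tp1\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-20\t60.0\n"
    "p3\tp3\t100.0\t40\t0\t0\t1\t40\t1\t40\t1e-25\t90.0\n"
)
ids = ["p1", "p2", "p3"]
best = read_best_bitscores(io.StringIO(example))
dist = distance_matrix(best, ids)
```

The resulting list of lists can be passed to whatever clustering routine you prefer. The normalization relies on each sequence hitting itself in the all-vs-all search; if a self-hit is missing, the pair simply keeps the default distance of 1.0.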
