Question: How to classify large-scale ORF data
3.6 years ago by
akhilvbioinfo130
India, Chennai
akhilvbioinfo130 wrote:

Hi,

I have all possible ORF regions of a bacterial species (about 50 Mb). I want to annotate which proteins these ORFs correspond to. Is there any tool available for this other than BLAST?

tool sequence assembly gene • 870 views
modified 3.6 years ago by Michael Dondrup45k • written 3.6 years ago by akhilvbioinfo130

For full-length gene/protein homology search, BLAST is the gold standard, or at least many consider it so. Alternatively, you can annotate protein domains (e.g., Pfam) using HMMER or InterProScan. From these annotations you may be able to find other proteins that are likely homologous by comparing their domain composition. But I do not know whether this is what you want to do. What is your reason against BLAST? A better solution can probably be found if you describe your problem more precisely.

modified 3.6 years ago • written 3.6 years ago by Manuel Landesfeind1.2k

Actually, I have about 284,517 ORFs, so it is not possible to analyse them online.

written 3.6 years ago by akhilvbioinfo130

I also want to classify these proteins as hypothetical or not.

written 3.6 years ago by akhilvbioinfo130

Why not run BLAST locally at your facility or institute? In that case you could also create your own BLAST database containing only bacterial species, which holds far fewer sequences than the full NCBI nr database. With such a restricted database you should definitely be able to run BLAST locally, right?
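
A local run along these lines could be scripted roughly as follows. This is a minimal sketch, assuming the BLAST+ command-line tools (`makeblastdb`, `blastp`) are installed and on the PATH. The helper names are made up for illustration, and the E-value cutoff and thread count are illustrative defaults, not recommendations.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta, db_name):
    """Command that builds a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query, db_name, out_tsv, threads=8, evalue=1e-5):
    """Command that searches protein queries against a local database.

    -outfmt 6 requests tab-separated output, one line per hit.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db_name,
        "-out", out_tsv,
        "-outfmt", "6",
        "-evalue", str(evalue),
        "-num_threads", str(threads),
    ]


def run_local_blast(ref_fasta, query_fasta, db_name, out_tsv):
    """Build the database, then run the search (needs BLAST+ on PATH)."""
    for tool in ("makeblastdb", "blastp"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(makeblastdb_cmd(ref_fasta, db_name), check=True)
    subprocess.run(blastp_cmd(query_fasta, db_name, out_tsv), check=True)
```

With placeholder file names, a call such as `run_local_blast("bacteria.faa", "orfs.faa", "bacteria_db", "orfs_vs_bacteria.tsv")` would build the restricted database once and then search all ORFs against it.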

written 3.6 years ago by Manuel Landesfeind1.2k

BLASTP of 300k ORFs against nr, split over 128 cores (16 BLAST jobs with 8 threads each), would take about one week. You could speed this up tremendously by reducing the number of query sequences through pre-clustering at a high percent identity, say 85-95%. Alternatively, you could try a faster BLAST-like program such as USEARCH or DIAMOND (which claims a 20,000-fold speed-up over BLAST). Another way to speed things up is to pick a smaller reference database, e.g. refseq_protein or UniRef50/90.
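
Splitting the query set into separate jobs could look something like the sketch below. It assumes the ORFs are protein sequences in a plain FASTA file. The function names and the `chunk_NNN.faa` naming scheme are invented for illustration, and the records are simply dealt out round-robin so that each chunk gets a similar number of sequences.

```python
import os


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_fasta(path, n_chunks, out_dir):
    """Distribute records round-robin over n_chunks files; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"chunk_{i:03d}.faa") for i in range(n_chunks)]
    handles = [open(p, "w") for p in paths]
    try:
        for i, (header, seq) in enumerate(read_fasta(path)):
            handles[i % n_chunks].write(f">{header}\n{seq}\n")
    finally:
        for handle in handles:
            handle.close()
    return paths
```

For the setup described above, `split_fasta("orfs.faa", 16, "chunks")` would produce 16 query files, each of which can be given to its own 8-thread search job.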

modified 3.6 years ago • written 3.6 years ago by 5heikki8.3k
3.6 years ago by
Bergen, Norway
Michael Dondrup45k wrote:

The standard approach is to run a bacterial gene predictor, e.g. Glimmer, and then analyze only the ORFs that are predicted to be coding.
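
Once the predictor has run, its output has to be read back in to get the list of predicted genes. The sketch below assumes Glimmer3's `.predict` layout as I understand it: `>` lines name the input sequence, and every other line holds an identifier, start, end, reading frame and score. Please check this against the files your Glimmer version produces. The function name is made up for illustration.

```python
def parse_glimmer_predict(lines):
    """Return {orf_id: (start, end, frame, score)} from .predict lines.

    Assumed layout: '>' header lines name the sequence; every other
    non-empty line is 'id start end frame score'.
    """
    genes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        orf_id, start, end, frame, score = line.split()[:5]
        genes[orf_id] = (int(start), int(end), frame, float(score))
    return genes
```

Passing an open file handle, e.g. `parse_glimmer_predict(open("run1.predict"))` with a placeholder file name, gives the coordinates of the predicted genes, and only those regions then need to go into the homology search.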
 

written 3.6 years ago by Michael Dondrup45k