Question

How to classify large-scale ORF data

8.7 years ago

Eva_Maria ▴ 190

Hi

I have all possible ORF regions of a bacterial species (about 50 mb). I want to annotate which protein each of these ORFs corresponds to. Is there any tool available for this other than BLAST?

Assembly gene sequence • 1.6k views

Updated 18 months ago by Ram 43k • written 8.7 years ago by Eva_Maria ▴ 190

For full-length gene/protein homology search, BLAST is the gold standard, or at least many consider it to be. Alternatively, you can annotate protein domains (e.g., Pfam) using HMMER or InterProScan. With these annotations you may be able to find other proteins that are likely homologous by comparing their protein domain composition. But I do not know whether this is what you want to do. What is your reason for avoiding BLAST? A better solution can probably be found if you describe your problem more precisely.

8.7 years ago by Manuel Landesfeind ★ 1.4k
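
To illustrate the idea of comparing domain composition, here is a minimal Python sketch. It assumes the domain annotations (for example Pfam accessions from HMMER or InterProScan) have already been parsed into a mapping from protein name to a list of domain identifiers. The function names, the protein names, and the 0.5 similarity cutoff are illustrative assumptions, not part of either tool.

```python
def jaccard(domains_a, domains_b):
    """Jaccard similarity between two collections of domain identifiers."""
    a, b = set(domains_a), set(domains_b)
    if not a and not b:
        return 0.0  # two proteins without domains are not treated as similar
    return len(a & b) / len(a | b)


def similar_by_domains(query_domains, reference, min_similarity=0.5):
    """Rank reference proteins by shared domain composition.

    reference: dict mapping protein name -> list of domain identifiers.
    min_similarity is an arbitrary cutoff chosen for illustration.
    """
    hits = []
    for name, domains in reference.items():
        score = jaccard(query_domains, domains)
        if score >= min_similarity:
            hits.append((name, score))
    # Highest similarity first, ties broken by name for stable output.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


if __name__ == "__main__":
    # Hypothetical annotation table; accessions are used only as examples.
    reference = {
        "protA": ["PF00005", "PF00664"],
        "protB": ["PF00005"],
        "protC": ["PF00072"],
    }
    print(similar_by_domains(["PF00005", "PF00664"], reference))
```

Set-based comparison ignores domain order and copy number, so it is only a coarse indication of possible homology.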

Actually, I have about 284,517 ORFs, so it is not possible to analyse them online.

Updated 18 months ago by Ram 43k • written 8.7 years ago by Eva_Maria ▴ 190

I would also like to classify these proteins as hypothetical or not.

Updated 18 months ago by Ram 43k • written 8.7 years ago by Eva_Maria ▴ 190
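
One possible way to make such a hypothetical/annotated call, once a homology search has produced a hit table, is sketched below in Python. It assumes a tab-separated table with five columns (`qseqid`, `sseqid`, `evalue`, `bitscore`, `stitle`), as BLAST+ can produce with a custom `-outfmt "6 qseqid sseqid evalue bitscore stitle"`. The keyword list, the e-value cutoff, and the three labels are assumptions made for illustration, not an established standard.

```python
import csv

# Assumed markers of an uninformative hit description.
UNINFORMATIVE = ("hypothetical protein", "uncharacterized", "unknown function")


def read_blast_tsv(path):
    """Yield rows (lists of 5 strings) from a tab-separated hit table."""
    with open(path, newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def classify_orfs(orf_ids, blast_rows, max_evalue=1e-5):
    """Label each ORF as 'annotated', 'hypothetical', or 'no_hit'.

    An ORF is 'annotated' if at least one significant hit has an
    informative title, 'hypothetical' if all of its significant hits are
    uninformative, and 'no_hit' if nothing passes the e-value cutoff.
    """
    labels = {orf_id: "no_hit" for orf_id in orf_ids}
    for qseqid, _sseqid, evalue, _bitscore, stitle in blast_rows:
        if qseqid not in labels or float(evalue) > max_evalue:
            continue
        title = stitle.lower()
        if not any(keyword in title for keyword in UNINFORMATIVE):
            labels[qseqid] = "annotated"
        elif labels[qseqid] == "no_hit":
            labels[qseqid] = "hypothetical"
    return labels


if __name__ == "__main__":
    # Made-up rows, only to show the expected input shape.
    rows = [
        ("orf1", "ref1", "1e-50", "200", "DNA gyrase subunit A"),
        ("orf2", "ref2", "1e-20", "90", "hypothetical protein ABC"),
    ]
    print(classify_orfs(["orf1", "orf2", "orf3"], rows))
```

ORFs labelled `no_hit` may be genuinely novel or simply non-coding, so they are kept separate from `hypothetical` here.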

Why not run BLAST locally at your facility or institute? In that case you could also create your own BLAST database containing only bacterial sequences, which would hold fewer sequences than the full NCBI nr database. With such a restricted database you should definitely be able to run BLAST locally, shouldn't you?

8.7 years ago by Manuel Landesfeind ★ 1.4k
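
As a rough sketch of what a local run against a restricted database could look like, the Python snippet below assembles the two BLAST+ command lines involved: `makeblastdb` to build a protein database and `blastp` to search it. The file names (`bacteria.faa`, `orfs.faa`), the database name, and the chosen e-value, thread count, and hit limit are placeholders and assumptions. The script only prints the commands rather than executing them.

```python
import shlex


def makeblastdb_cmd(fasta, db_name):
    """Command to build a protein BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query, db_name, out_tsv, threads=8, evalue=1e-5):
    """Command to search protein queries against a local database.

    Output format 6 is the tabular format; the e-value cutoff and the
    number of reported targets are example values, not recommendations.
    """
    return [
        "blastp",
        "-query", query,
        "-db", db_name,
        "-out", out_tsv,
        "-outfmt", "6",
        "-evalue", str(evalue),
        "-num_threads", str(threads),
        "-max_target_seqs", "5",
    ]


if __name__ == "__main__":
    # Placeholder file names; pass a list to subprocess.run(cmd, check=True)
    # to actually execute a command once BLAST+ is installed.
    for cmd in (
        makeblastdb_cmd("bacteria.faa", "bacteria_db"),
        blastp_cmd("orfs.faa", "bacteria_db", "orfs_vs_bacteria.tsv"),
    ):
        print(shlex.join(cmd))
```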

A BLASTP search of 300k ORFs against nr, split over 128 cores (16 BLAST jobs with 8 threads each), would take about one week. You could speed this up tremendously by reducing the number of query sequences through pre-clustering at a high percent identity, say 85-95%. Alternatively, you could try a faster BLAST-like program such as USEARCH or DIAMOND (which claims a 20,000x speed-up over BLAST). Another way to speed things up is to pick a smaller reference database such as refseq_protein or UniRef50/UniRef90.

8.7 years ago by 5heikki 11k
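
Splitting the query set over several parallel jobs, as in the 16-job example above, requires cutting the ORF FASTA file into chunks first. The following Python sketch shows one simple way to do that with a round-robin split. The function names, the `chunk` file prefix, and the `.faa` extension are assumptions chosen for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin into n_chunks lists."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return chunks


def write_chunks(fasta_path, n_chunks, prefix="chunk"):
    """Write n_chunks FASTA files and return their paths."""
    with open(fasta_path) as handle:
        chunks = split_records(read_fasta(handle), n_chunks)
    paths = []
    for index, chunk in enumerate(chunks):
        path = f"{prefix}_{index:03d}.faa"
        with open(path, "w") as out:
            for header, seq in chunk:
                out.write(f">{header}\n{seq}\n")
        paths.append(path)
    return paths
```

For example, `write_chunks("orfs.faa", 16)` would produce `chunk_000.faa` through `chunk_015.faa`, each of which can be searched as a separate job. Round-robin assignment balances the number of sequences per chunk, not their total length, so run times may still differ somewhat between jobs.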

Ram · Answer 1 · 2015-08-13

8.7 years ago

Michael 54k

The standard approach is to run a bacterial gene predictor, e.g. Glimmer, and then analyze only the ORFs that are predicted to be coding.

Updated 18 months ago by Ram 43k • written 8.7 years ago by Michael 54k