Question

Predicting Gene Ontology function in OMA for large datasets

0

3.4 years ago

crl111222 ▴ 10

Hello,

I need to predict Gene Ontology terms for a dataset of approximately one and a half million sequences. That is too large for the function prediction tool at https://omabrowser.org/oma/functions/. What would be the best way to use OMA for a dataset of this size?

oma omabrowser gene_onthology • 1.2k views

ADD COMMENT • link updated 3.4 years ago by Adrian Altenhoff ★ 1.1k • written 3.4 years ago by crl111222 ▴ 10

0

I'm not sure about the OMA approach (so this may not directly answer your question), but you could consider running the sequences through InterProScan. It also assigns GO terms to the input proteins. Keep in mind, though, that running 1.5M proteins through InterProScan will also take a considerable amount of time.
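As a rough illustration of how the GO terms could be collected afterwards, here is a minimal Python sketch. It assumes InterProScan was run with TSV output and GO lookup switched on, so that the protein accession is in column 1 and the GO annotations are in column 14 of each row. The function name `go_terms_by_protein` is made up for this example, and the column layout should be checked against the output of your InterProScan version.

```python
import re
from collections import defaultdict

# GO identifiers have the form "GO:" followed by seven digits.
GO_ID = re.compile(r"GO:\d{7}")


def go_terms_by_protein(tsv_lines):
    """Collect the GO terms found for each protein in InterProScan-style TSV rows.

    Assumption: column 1 holds the protein accession and column 14 holds the
    GO annotations ("-" or empty when there are none).
    """
    terms = defaultdict(set)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 14:
            continue  # row has no GO column
        found = GO_ID.findall(fields[13])
        if found:
            terms[fields[0]].update(found)
    return dict(terms)
```

The function accepts any iterable of lines, so an open file handle on the TSV output can be passed in directly. Proteins without any GO term are simply absent from the returned dictionary.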

ADD REPLY • link 3.4 years ago by lieven.sterck 15k

0

Hi Lieven, thanks for your answer. I am aware of InterProScan and am currently running my sequences through it as well. And yes, sadly you are right, it is taking some time.

ADD REPLY • link 3.4 years ago by crl111222 ▴ 10

score 1 · Answer 1 · 2020-11-16

1

3.4 years ago

Adrian Altenhoff ★ 1.1k

Hi @crl111222

After reading your question, we decided to increase the maximum size of the (gzip-compressed) FASTA file for OMA's function prediction to 50 MB for now, and we will increase it further in the future.

In case you observe any problem with it, please get in touch with us again.

Best wishes, Adrian
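For a dataset that still exceeds the limit after compression, one option is to split it into several gzip-compressed files and upload them separately. The following is a minimal Python sketch of that idea, not something provided by OMA. The names `read_fasta_records` and `split_fasta` are made up for this example, and reading "50 MB" as 50,000,000 bytes is an assumption. The cap is applied to the uncompressed size of each chunk, which is conservative because FASTA text normally compresses to less than its raw size.

```python
import gzip
from pathlib import Path


def read_fasta_records(handle):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    for line in handle:
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            # Keep non-empty lines and make sure each ends with a newline.
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_fasta(fasta_path, out_dir, max_raw_bytes=50 * 1000 * 1000):
    """Write the records of fasta_path into numbered .fasta.gz chunks.

    Each chunk holds at most max_raw_bytes of uncompressed text, unless a
    single record is larger than that. Returns the list of chunk paths.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out = None
    written = 0
    with open(fasta_path, "rt") as handle:
        for record in read_fasta_records(handle):
            size = len(record.encode("utf-8"))
            # Start a new chunk when the current one would exceed the cap.
            if out is None or (written > 0 and written + size > max_raw_bytes):
                if out is not None:
                    out.close()
                path = out_dir / f"chunk_{len(chunk_paths) + 1:04d}.fasta.gz"
                out = gzip.open(path, "wt")
                chunk_paths.append(path)
                written = 0
            out.write(record)
            written += size
    if out is not None:
        out.close()
    return chunk_paths
```

Records are never split across chunks, so each output file is a valid FASTA file on its own. Because the cap is on the raw size, the compressed chunks will usually be well below the limit, and a larger `max_raw_bytes` can be tried if fewer uploads are preferred.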

ADD COMMENT • link 3.4 years ago by Adrian Altenhoff ★ 1.1k

0

Thanks, Adrian, for your answer. I would like to ask something else: is there any way to use the standalone version for this task? I believe that OMA standalone needs to download some annotated genomes for Gene Ontology propagation. I have found that, when I do this, the results are not as good as those from the online version, probably because the online version has a much larger set of genomes to compare against and propagate labels from. Is there a way to make the standalone version perform as well as the online version for this purpose?

ADD REPLY • link 3.4 years ago by crl111222 ▴ 10

1

Indeed, the function prediction tool on the website uses the annotations of all annotated protein sequences in OMA. If you use OMA standalone, you will only be able to use the annotations from the exported genomes. However, if your query species is well covered by the set of exported species, OMA standalone should also work very well. The biggest difference is that OMA standalone predicts functions from all annotated orthologous sequences, whereas the function prediction on the website transfers the annotations of the closest sequence.
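To make the distinction between the two strategies concrete, here is a toy Python sketch. It is not OMA's actual implementation: the data structure, the `similarity` field, and both function names are hypothetical and exist only for illustration.

```python
def predict_from_all(hits):
    """Union of the GO terms of every annotated ortholog (standalone-like strategy)."""
    terms = set()
    for hit in hits:
        terms |= set(hit["go_terms"])
    return terms


def predict_from_closest(hits):
    """GO terms of the single most similar sequence (web-tool-like strategy)."""
    if not hits:
        return set()
    best = max(hits, key=lambda hit: hit["similarity"])
    return set(best["go_terms"])


# Hypothetical hits for one query protein; the similarity values are invented.
hits = [
    {"id": "orthologA", "similarity": 0.92, "go_terms": {"GO:0005515"}},
    {"id": "orthologB", "similarity": 0.71, "go_terms": {"GO:0005515", "GO:0003677"}},
]

all_terms = predict_from_all(hits)          # terms from both orthologs
closest_terms = predict_from_closest(hits)  # terms from orthologA only
```

In this toy case the first strategy returns a superset of the second, which shows why the two tools can give different annotations for the same query even when the same reference sequences are available.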

ADD REPLY • link 3.4 years ago by Adrian Altenhoff ★ 1.1k

0

Hi again, Dr. Altenhoff. I have been trying to upload gzip files of around 28 MB, but I keep getting an error saying that the file is too big (413 Request Entity Too Large).

ADD REPLY • link 3.4 years ago by crl111222 ▴ 10

1

Hi, sorry about this. I forgot to change the setting in one place. It should now work with files up to 50 MB. Best, Adrian

ADD REPLY • link 3.4 years ago by Adrian Altenhoff ★ 1.1k

0

Sorry to bother you again. I was wondering whether something is not working on the website. Whenever I upload compressed files, the status is "error", even for small files: "Your dataset is currently being prepared. Its status is "error". Depending on the size of the uploaded dataset, this may take another couple of minutes." The compressed files I am using have the extension .gz, which I believe is the correct one for upload.
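When an upload of a compressed file fails like this, it can help to rule out a problem with the file itself before suspecting the server. The following is a minimal Python sketch of such a local check. The function name `is_valid_gzip` is made up for this example, and the check says nothing about what the website does with the file.

```python
import gzip
import zlib


def is_valid_gzip(path):
    """Return True if the file starts with the gzip magic bytes and decompresses fully."""
    with open(path, "rb") as raw:
        if raw.read(2) != b"\x1f\x8b":
            return False  # not a gzip file, whatever the extension says
    try:
        with gzip.open(path, "rb") as handle:
            # Read in 1 MiB blocks so large files do not need to fit in memory.
            while handle.read(1024 * 1024):
                pass
    except (OSError, EOFError, zlib.error):
        return False  # corrupt or truncated archive
    return True
```

A file that passes this check is a well-formed gzip archive, so a remaining "error" status is more likely to come from the server side or from the FASTA content than from the compression.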

ADD REPLY • link 3.4 years ago by crl111222 ▴ 10

1

Oops, you were right. There was a change in the API of one of the functions we use, and it no longer supports gzipped files. This should be solved now. I am currently deploying the updated version, so you should finally be able to use gzipped files in a couple of minutes. Sorry for the problems this has caused.

ADD REPLY • link 3.4 years ago by Adrian Altenhoff ★ 1.1k