Making a Python script for downloading genomes for OMA analysis
2.9 years ago
anasjamshed ▴ 140

I want to make a script that will be able to do the following:

  1. Given a list of NCBI species IDs, download all genome assemblies for these species. Edit: please see the comments for the definition of the task.
  2. Run OMA standalone on these downloaded genomes to infer hierarchical orthology groups.
  3. Add GO annotations to all loci used in OMA analysis.

My plan is to use Biopython to fetch the species, then run pyham (https://lab.dessimoz.org/blog/2017/06/29/pyham) to infer hierarchical orthology groups, and then use goatools (https://github.com/tanghaibao/goatools) to add GO annotations.

Is this possible using all three of these, or should I do something else?
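For step 3, the core operation is a lookup from locus ID to GO terms. Below is a minimal sketch of that lookup using only the standard library, assuming the annotations come as a GAF file (tab-separated, object ID in column 2, qualifier in column 4, GO ID in column 5). It is not goatools' API, and the example rows and IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def go_terms_by_locus(gaf_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each object ID (GAF column 2) to the set of its GO IDs (column 5)."""
    terms: Dict[str, Set[str]] = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and "!" header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # malformed row
        # Negated annotations carry "NOT" in the qualifier column; skip them.
        if "NOT" in cols[3].split("|"):
            continue
        terms[cols[1]].add(cols[4])
    return dict(terms)


# Hypothetical rows, for illustration only.
example = [
    "!gaf-version: 2.2",
    "ExampleDB\tLOCUS_A\tgeneA\tenables\tGO:0003674",
    "ExampleDB\tLOCUS_A\tgeneA\tinvolved_in\tGO:0008150",
    "ExampleDB\tLOCUS_B\tgeneB\tNOT|involved_in\tGO:0008150",
]
annotations = go_terms_by_locus(example)
```

The locus IDs used in the OMA run have to match the object IDs in the annotation file, so an ID mapping step may be needed in between.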

orthology python OMA • 3.1k views

You want to do orthologue identification with OMA and therefore the first task you describe above needs some correction to be successful:

  1. Given a list of species, the documentation says you have to download a proteome file in FASTA format for each one. In particular, you do not need all (or any) assemblies per species, only the single representative proteome file. The file name should be the name of the genome.
  2. Run OMA or another orthologue identification tool on these files as described in the software's documentation.
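The file-naming step in point 1 could be sketched in Python as follows. This assumes OMA standalone reads one `.fa` file per genome from a `DB` directory, which is my understanding of its documentation; the function names are hypothetical.

```python
import re
import shutil
import tempfile
from pathlib import Path


def genome_filename(species_name: str) -> str:
    """Turn a species name into a file name such as 'Homo_sapiens.fa'."""
    stem = re.sub(r"[^A-Za-z0-9]+", "_", species_name.strip()).strip("_")
    if not stem:
        raise ValueError("empty species name")
    return stem + ".fa"


def stage_proteome(fasta: Path, species_name: str, db_dir: Path) -> Path:
    """Copy a downloaded proteome into db_dir under the genome's name."""
    db_dir.mkdir(parents=True, exist_ok=True)
    target = db_dir / genome_filename(species_name)
    shutil.copyfile(fasta, target)
    return target


# Usage with a throwaway directory and a dummy FASTA record.
with tempfile.TemporaryDirectory() as tmp:
    downloaded = Path(tmp) / "download.faa"
    downloaded.write_text(">protein1\nMKT\n")
    staged = stage_proteome(downloaded, "Homo sapiens", Path(tmp) / "DB")
    staged_name = staged.name
```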

I have a simple shell script that can download the proteome of the representative genome automatically if it exists. For genomes where the gene predictions pipeline has not been run, it cannot give you anything, however.


I want to fetch genomes from NCBI, not from OMA.


First, I want to provide NCBI species IDs to download the genomes.

2.9 years ago
Michael 55k

The following shell script can download the genome assembly and proteome file from NCBI for a list of species names. It does a little more than is needed for step 1, but you will figure that out. Be careful: there is not much error checking, so if you have typos in the species list, or a species doesn't have an annotated proteome, this may fail miserably. It also leaves the results of the Entrez queries around for your records.

You need to have the Entrez e-utils installed and available on your PATH.
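If the download is driven from Python, a quick pre-flight check can confirm that the tools are on the PATH before anything runs. This is only a sketch: the command names are those of NCBI's Entrez Direct suite, and which of them the script actually calls is an assumption.

```python
import shutil
from typing import Iterable, List

# Entrez Direct command names; adjust to whatever your script actually calls.
EDIRECT_TOOLS = ("esearch", "esummary", "efetch", "xtract")


def missing_tools(tools: Iterable[str] = EDIRECT_TOOLS) -> List[str]:
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


absent = missing_tools()
if absent:
    print("Not found on PATH:", ", ".join(absent))
else:
    print("All Entrez tools found.")
```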

I am ignoring the python tag here because it is not important to make things work.


I need to do it in either Python or R.


Why, if it works? After you download everything, the next step is to invoke OMA via the command line. Whether you wrap this process in Python or R makes no difference. Of course, you can write code similar to the above in Python or R. For R, there is the package biomartr, which can download genomic data from different sources. For Python, there should be a solution in Biopython, and there is a related question on Biostars here: Download NCBI genome sequences from Python
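To illustrate how thin such a wrapper is, here is a minimal Python sketch. It assumes the executable is called `oma`, that `-n` sets the number of parallel processes, and that the run directory contains a `DB` folder with the proteomes; please check all three against your OMA standalone installation.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_oma_command(executable: str = "oma", n_procs: int = 1) -> List[str]:
    """Build the argument list; '-n' is assumed to set parallel processes."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    cmd = [executable]
    if n_procs > 1:
        cmd += ["-n", str(n_procs)]
    return cmd


def run_oma(workdir: Path, n_procs: int = 1, executable: str = "oma") -> int:
    """Run OMA inside workdir and return its exit code."""
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} not found on PATH")
    if not (workdir / "DB").is_dir():
        raise FileNotFoundError(f"no DB directory in {workdir}")
    completed = subprocess.run(
        build_oma_command(executable, n_procs), cwd=workdir, check=False
    )
    return completed.returncode
```

Calling `run_oma(Path("my_analysis"), n_procs=4)` would then be the whole "Python part" of step 2; the R equivalent is a single `system2()` call.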

Possibly, someone else can help you with such an implementation, but it won't be substantially easier or less error-prone than using my script.


It is definitely possible but futile for OMA analysis. There is no genome annotation for Abrostola tripartita, hence no proteome, and the other links point to multiple taxa. I am not sure why you insist on Python (guessing 'assignment' or 'order from your boss'), but if you need such a Python solution, I cannot help you. I am bumping this post to allow others to see it and possibly help out, but I personally think that it is best to approach the problem in a solution-oriented way, not a tool-centric one.


Is this doable in R?


Yes definitely :)


I mean, using R for any genome present in the links?


If you want to infer OMA HOGs, you will need the protein sequences for all your genomes. Either you restrict yourself to genomes that already have annotated protein sequences available, or you first need to infer them yourself. There are tons of tools and pipelines for that, but it won't be very easy to do.

Michael's script is very helpful for downloading the genomes and also the protein sequences, if available. In my view, you shouldn't insist on it being a Python script. His code makes use of the Entrez tools from NCBI, which is perfect. Biopython also has a wrapper for them, so you could rewrite Michael's script in Python if you (or your boss) insist.
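As a rough idea of what such a rewrite involves, the sketch below builds the two URLs that matter, using only the standard library rather than Biopython. It is not a translation of Michael's script. The assumptions are that assemblies are searched by taxon ID with the `txid…[Organism:exp]` term, and that the proteome sits in the assembly's FTP directory as `<assembly>_protein.faa.gz`; the assembly path in the usage lines is made up.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def assembly_search_url(taxid: int, retmax: int = 20) -> str:
    """E-utilities esearch URL listing assembly records for a taxon ID."""
    params = {
        "db": "assembly",
        "term": f"txid{taxid}[Organism:exp]",
        "retmode": "json",
        "retmax": retmax,
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def proteome_url(ftp_path: str) -> str:
    """Derive the protein FASTA URL from an assembly's FTP directory path."""
    base = ftp_path.rstrip("/")
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    assembly_name = base.rsplit("/", 1)[-1]
    return f"{base}/{assembly_name}_protein.faa.gz"


# Hypothetical assembly directory, for illustration only.
search = assembly_search_url(9606)
faa = proteome_url(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/000/000/GCF_000000000.1_ExampleAsm"
)
```

The FTP path would come from the summary record of each assembly returned by the search. If the `_protein.faa.gz` file does not exist, the assembly has no annotation and cannot be used for OMA as is.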

To download the genomes from OMA, you also have an export function ( https://omabrowser.org/export ) where you can select your genomes of interest and export a tarball that includes OMA standalone and the precomputed all-vs-all homology search files.

Cheers, Adrian


[..] you also have an export function ( https://omabrowser.org/export ) where you can select your genomes of interest and export a tarball including oma standalone and the precomputed All-vs-All homology search files

More on this in How to build phylogenetic species trees with OMA - (Protocol 2). Hth.
