Question: Correct formatting for IDs in OMA standalone .splice files to properly identify splice variants
eschang110 wrote:

I see that for the input .splice files, OMA standalone requires that the individual IDs are unique prefixes of your FASTA headers, and proteins that are splice variants of the same gene should be listed in the splice file like "ENSP00000384207; ENSP00000263741; ENSP00000353094".

It looks like NCBI and Ensembl have annotation tables that can be downloaded that will make associating proteins with genes fairly straightforward. In the annotation tables, proteins are usually identified by the shortest version of their name, something like NP_001027594.1.
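As a rough illustration of that association step, here is a minimal Python sketch that groups protein IDs by gene and emits one semicolon-separated line per gene. The function name `splice_lines` and the (gene, protein) pair input are hypothetical; reading the actual NCBI or Ensembl table is left out because the column layout differs between downloads. It also assumes that genes with a single protein need no line in the splice file.

```python
from collections import defaultdict


def splice_lines(gene_protein_pairs):
    """Group protein IDs by gene; return one '; '-joined line per multi-protein gene.

    Assumption: genes with only one protein are left out of the splice file.
    """
    by_gene = defaultdict(list)
    for gene, protein in gene_protein_pairs:
        # Skip repeated (gene, protein) rows so each ID appears once per line.
        if protein not in by_gene[gene]:
            by_gene[gene].append(protein)
    return ["; ".join(proteins) for proteins in by_gene.values() if len(proteins) > 1]


# Placeholder IDs, not real annotations.
pairs = [("geneA", "protA.1"), ("geneA", "protA.2"), ("geneB", "protB.1")]
for line in splice_lines(pairs):
    print(line)
```

Each returned string can then be written as one line of the .splice file for that genome.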

To keep it brief, will OMA be able to recognize a splice file line like "NP_001027594.1;NP_001027593.1" if the actual FASTA headers are more complicated, i.e. something like:

NP_001027594.1 homeobox transcription factor Pax1/9 [Ciona intestinalis]

NP_001027593.1 DEAD-Box Protein [Ciona intestinalis]

Is this what the manual means by "unique prefixes of FASTA headers"? Just wanted to make sure that I didn't need to reformat my FASTA headers before diving into

And secondly, does the All vs All step use the splice variant information? Or is it possible to do the All vs. All and then try running the OMA orthology algorithms with and without this information?

Thank you so much! Running OMA standalone on some of my own test data sets so far has been super smooth.

Cheers, Sally

orthology oma orthologs • 93 views
ADD COMMENTlink modified 4 weeks ago • written 5 weeks ago by eschang110

Okay great, this is all in line with what I had gathered from the manual, but I wanted to clarify before I started to put together those .splice files. Thanks so much! Cheers, Sally

ADD REPLYlink written 4 weeks ago by eschang110
adrian.altenhoff520 wrote:

Hi,

Yes, this is exactly what is meant by "unique prefixes of FASTA headers". You don't have to specify the full FASTA headers; the protein ID (or even part of it) is sufficient as long as it identifies the protein uniquely.
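For anyone who wants to sanity-check a splice file before running, the following is a small Python sketch of the "unique prefix" idea. It is not OMA's own code: it assumes plain string-prefix matching against the header text after the leading ">", and `check_splice_ids` is a hypothetical helper name.

```python
def check_splice_ids(fasta_headers, splice_ids):
    """Return {id: matching_headers} for every ID that does not match exactly one header."""
    problems = {}
    for sid in splice_ids:
        # Assumption: an ID "matches" a header if the header text starts with it.
        hits = [h for h in fasta_headers if h.lstrip(">").startswith(sid)]
        if len(hits) != 1:
            problems[sid] = hits
    return problems


headers = [
    ">NP_001027594.1 homeobox transcription factor Pax1/9 [Ciona intestinalis]",
    ">NP_001027593.1 DEAD-Box Protein [Ciona intestinalis]",
]

# Full accessions each match exactly one header, so nothing is reported.
print(check_splice_ids(headers, ["NP_001027594.1", "NP_001027593.1"]))

# A prefix that is too short matches both headers and is reported as ambiguous.
print(check_splice_ids(headers, ["NP_00102759"]))
```

An empty result means every ID identifies exactly one header; any entry that is returned is either missing from the FASTA file or ambiguous.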

The All-vs-All step does not make use of the splicing information. We still compute the All-vs-All for all proteins and only select the best variant (based on the total number of homologous hits with all other genomes) in a later step. You can turn the UseOnlyOneSplicingVariant parameter on or off and compare the output that gets produced. But changing the *.splice files after the All-vs-All will not work: the internal database will not be updated or, if no splice file had been defined previously, invalidated.

In case you notice a problem with the splicing variants after the All-vs-All, it might still be possible to update it manually. It might become a feature in a future release of OMA standalone.

Cheers Adrian

ADD COMMENTlink written 5 weeks ago by adrian.altenhoff520

Realized I just responded to my own post: Okay great, this is all in line with what I had gathered from the manual, but I wanted to clarify before I started to put together those .splice files. Thanks so much! Cheers, Sally

ADD REPLYlink written 26 days ago by eschang110