Question: OMA-standalone and sequence gaps
4 weeks ago, emmzarrr wrote:


I'm using OMA-standalone on my own data as well as data downloaded from NCBI. I'm working with CDS nucleotide/DNA files. However, these CDS files have some gaps in them, represented as Ns. I believe this happens when annotations cross contig boundaries or gaps in the genome assembly.

When running OMA standalone I'm getting many warnings like the ones below, and whilst I know they're probably just because of these gaps, I'm worried this will cause erroneous results, due to the X's being misaligned.

WARNING: IUPAC ambiguity characters for DNA/RNA not supported. Will replace them with 'X'

Pat index with 18353224 entries sorted, from "A</seq></e>\n" to "XXXXXXXXXXXXXXXXXXX"

Pat index with 41395238 entries sorted, from "A</seq></e>\n" to "XXXXXXXXXAAAATATATC"

So my main question is: does OMA standalone account for these gaps? Should I just leave the Ns in, or is there a better way to go about this? And is using the CDS better than using the full genome?

For context, I'm trying to get orthologous groups to help build a species tree and I'm working with insect genomes.

Thank you, Emma

oma-standalone oma genome • 114 views
4 weeks ago, adrian.altenhoff wrote:

Dear Emma,

The IUPAC ambiguity characters the warning refers to are not the Ns (or Xs), which are the unknown character, but rather R, Y, S, W, K, M, B, D, H and V. These indicate ambiguous nucleotides, e.g. an R indicates that the position could be either an A or a G.

OMA will just replace all of those ambiguous characters with the unknown character. While doing the alignments, the unknown character will be aligned without penalty or score (because it could be anything). However, if a gap needs to be created, this is still penalized.
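To make the replacement concrete, here is a minimal Python sketch of the behaviour described above. It is only an illustration and not OMA's actual code; the function name `mask_ambiguous` is made up for this example, and leaving N untouched is an assumption of the sketch.

```python
# Illustration only: mimics the replacement described above, not OMA's real code.
# The ten IUPAC nucleotide ambiguity codes listed in the answer.
IUPAC_AMBIGUOUS = set("RYSWKMBDHV")


def mask_ambiguous(seq, unknown="X"):
    """Replace IUPAC ambiguity codes with the unknown character.

    A, C, G, T and N are left untouched (assumption of this sketch).
    """
    return "".join(
        unknown if base.upper() in IUPAC_AMBIGUOUS else base for base in seq
    )


print(mask_ambiguous("ATGRCCNNYTAA"))  # prints ATGXCCNNXTAA
```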

You definitely want to use the CDS sequences and not the whole genome sequences for OMA. In most cases it is even wiser to use the protein sequences, as this is faster and better tested than running with the CDS sequences.
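If only CDS files are at hand, one way to get protein sequences is to translate them. The sketch below is a rough illustration, assuming the standard genetic code (NCBI translation table 1) and that every CDS starts in frame at its first base; the helper name `translate` is hypothetical. Codons containing N or any other non-ACGT character become X. Records flagged with translation exceptions may not match the annotated protein, so where the annotated protein FASTA is available it is probably the safer input.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame CDS; unknown codons (e.g. with N) become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3)
    )


print(translate("ATGGCCNNNTAA"))  # prints MAX*
```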

I hope this answers your questions. Best wishes, Adrian


Hi Adrian,

Thanks for your reply. I was confused by this because I checked my sequences for any characters which were not A,T,C,G or N and did not find any. I did try a dos2unix in case, but this didn't make any difference. And I've definitely set "InputDataType := 'DNA';" in the parameters file.

Is it possible it's complaining about the headers in my CDS files? I might need to trim them. Each one is on a single line, and they look like this:

> lcl|NW_003803422.1_cds_NP_018493878.1_374 [gene=LOC100909045] [db_xref=GeneID:100909045] [protein=LOW QUALITY PROTEIN: GMP synthase [glutamine-hydrolyzing]-like] [exception=unclassified translation discrepancy] [protein_id=NP_018493878.1] [location=join(44861..45017,45206..45418,45752..46636,46639..46680,46683..47308,47433..47534)] [gbkey=CDS]

Thank you,


written 4 weeks ago by emmzarrr • modified 4 weeks ago by GenoMax
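The character check described in the comment above could be done along these lines. This is a minimal sketch, not part of OMA; the function name `unexpected_characters` and the file name in the usage comment are made up for illustration. It reports, per record, every sequence character other than A, C, G, T or N (case-insensitive), which also catches stray spaces or `*` symbols.

```python
from collections import Counter

ALLOWED = set("ACGTN")


def unexpected_characters(fasta_lines):
    """Map each FASTA header to a Counter of characters outside ACGTN."""
    report = {}
    header = None
    for line in fasta_lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            header = line[1:].strip()
        elif header is not None:
            odd = Counter(ch for ch in line if ch.upper() not in ALLOWED)
            if odd:
                report.setdefault(header, Counter()).update(odd)
    return report


# Small in-memory example; for a real file one would pass an open file handle,
# e.g. `with open("my_cds.fna") as fh:` (hypothetical file name).
example = [">seq1", "ATGNNNCCA", ">seq2", "ATGRCC TAA", ">seq3", "atgcc*"]
for name, counts in unexpected_characters(example).items():
    print(name, dict(counts))
```

An empty result would support the observation that only A, C, G, T and N are present in the sequence lines.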

In any case, it might not be a bad idea to shorten those FASTA header lines. I'm not sure OMA has a problem with them, but I'm sure other programs do (so perhaps OMA does as well). Just remove the parts that are not informative for this analysis, but do keep the headers unique, of course!

written 4 weeks ago by lieven.sterck
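As a rough sketch of the header shortening suggested above: the snippet below keeps only the `protein_id` value from an NCBI-style CDS header and falls back to the first whitespace-separated token when that field is missing. Which field to keep is an assumption made for this example, and `short_header` is a hypothetical helper, not an OMA or NCBI tool. Repeated identifiers get a numeric suffix so that headers stay unique.

```python
import re

PROTEIN_ID = re.compile(r"\[protein_id=([^\]]+)\]")


def short_header(header, seen):
    """Return a short, unique ID for one FASTA header (text after '>').

    `seen` is a dict that tracks how often each ID has been used so far.
    """
    match = PROTEIN_ID.search(header)
    parts = header.split()
    ident = match.group(1) if match else (parts[0] if parts else "unnamed")
    seen[ident] = seen.get(ident, 0) + 1
    # Append a counter only when the same ID shows up again.
    return ident if seen[ident] == 1 else f"{ident}_{seen[ident]}"


seen = {}
header = ("lcl|NW_003803422.1_cds_NP_018493878.1_374 [gene=LOC100909045] "
          "[protein_id=NP_018493878.1] [gbkey=CDS]")
print(">" + short_header(header, seen))  # prints >NP_018493878.1
```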

Hi Emma,

I don't think the length of the FASTA header is a problem per se. Please make sure the '>' is really the first character on the line, so no spaces before it. If you're still stuck, can you please send me an example directly or paste a link to a file here?

Best Adrian

written 24 days ago by adrian.altenhoff
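A quick way to check the point about '>' being the first character is sketched below. It is only an illustration (the function name `suspicious_header_lines` is made up) and flags header lines where spaces, tabs or a byte order mark come before the '>'.

```python
def suspicious_header_lines(lines):
    """Yield (line_number, line) where '>' is preceded by stray characters."""
    for number, line in enumerate(lines, start=1):
        # Strip spaces, tabs and a possible byte order mark from the left.
        stripped = line.lstrip(" \t\ufeff")
        if stripped.startswith(">") and not line.startswith(">"):
            yield number, line.rstrip("\r\n")


example = [">ok", " >leading space", "ATGC", "\t>leading tab"]
for number, line in suspicious_header_lines(example):
    print(number, repr(line))
```

No output for a file means every header line starts with '>' directly.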