OMA-standalone and sequence gaps
12 months ago
emmzarrr • 0

Hi,

I'm using OMA-standalone on my own data as well as data downloaded from NCBI. I'm working with CDS nucleotide/DNA files. However, these CDS files have some gaps in them, represented as Ns. I believe this happens when annotations cross contig boundaries or gaps in the genome assembly.

When running OMA standalone, I'm getting many warnings like the ones below. Whilst I know they're probably just caused by these gaps, I'm worried they will lead to erroneous results because of the X's being misaligned.

WARNING: IUPAC ambiguity characters for DNA/RNA not supported. Will replace them with 'X'

Pat index with 18353224 entries sorted, from "A</seq></e>\n" to "XXXXXXXXXXXXXXXXXXX"

Pat index with 41395238 entries sorted, from "A</seq></e>\n" to "XXXXXXXXXAAAATATATC"

So my main question is: does OMA standalone account for these gaps? Should I just leave the Ns in, or is there a better way to go about this? And is using the CDS better than using the full genome?

For context, I'm trying to get orthologous groups to help build a species tree and I'm working with insect genomes.

Thank you, Emma

oma oma-standalone genome • 485 views
12 months ago

Dear Emma,

The IUPAC ambiguity characters are not the Ns (or X), which are the unknown character, but rather R, Y, S, W, K, M, B, D, H and V. These indicate ambiguous nucleotides; e.g. an R indicates that the base at this position could be either an A or a G. See https://droog.gs.washington.edu/parc/images/iupac.html

OMA will just replace all of those ambiguous characters with the unknown character. During the alignments, the unknown character is aligned without penalty or score (because it could be anything). However, if a gap needs to be created, this is still penalized.
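To see which sequences actually contain the ambiguity codes listed above, a small scan of the FASTA input can help. The following is only a minimal sketch, not part of OMA: the function names `read_fasta` and `ambiguity_counts` and the example records are made up for illustration.

```python
from collections import Counter

# IUPAC ambiguity codes named in the reply above (N is deliberately excluded).
IUPAC_AMBIGUITY = set("RYSWKMBDHV")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def ambiguity_counts(lines):
    """Map header -> Counter of ambiguity codes, only for sequences containing any."""
    report = {}
    for header, seq in read_fasta(lines):
        counts = Counter(c for c in seq.upper() if c in IUPAC_AMBIGUITY)
        if counts:
            report[header] = counts
    return report


# Hypothetical records: seq1 contains an R and a Y, seq2 contains only Ns.
example = [">seq1", "ATGCRNNNA", "YTGA", ">seq2", "ATGNNNTGA"]
print(ambiguity_counts(example))
```

For a real file, the lines could come from `open("my_cds.fa")` (the file name is a placeholder). Sequences that contain only A, C, G, T and N do not appear in the report.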

You definitely want to use the CDS sequences and not the whole genome sequences for OMA. In most cases it is even wiser to use the protein sequences, as this is faster and better tested than running with the CDS sequences.
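If only CDS files are at hand, protein sequences can in principle be derived by translation. The sketch below is an illustration under stated assumptions, not something OMA does for you: it assumes the standard genetic code, assumes each CDS starts in frame, and turns any codon containing a character other than A, C, G or T into X. The names `CODON_TABLE` and `translate` are made up for this example. Where NCBI already provides protein FASTA files for an assembly, using those directly is likely the simpler route.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear genes).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate an in-frame CDS; codons with N or other non-ACGT characters become X."""
    cds = cds.upper()
    protein = []
    # A trailing partial codon (length not a multiple of 3) is ignored.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCNNNTAA"))
```

Here `*` marks a stop codon; whether to keep or strip it depends on what the downstream tool expects.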


Thanks for your reply. I was confused by this because I checked my sequences for any characters other than A, T, C, G or N and did not find any. I also ran dos2unix just in case, but this didn't make any difference. And I've definitely set "InputDataType := 'DNA';" in the parameters file.

Is it possible it's complaining about the headers in my CDS files? I might need to trim them. Each header is on a single line, and they look like this:

> lcl|NW_003803422.1_cds_NP_018493878.1_374 [gene=LOC100909045] [db_xref=GeneID:100909045] [protein=LOW QUALITY PROTEIN: GMP synthase [glutamine-hydrolyzing]-like] [exception=unclassified translation discrepancy] [protein_id=NP_018493878.1] [location=join(44861..45017,45206..45418,45752..46636,46639..46680,46683..47308,47433..47534)] [gbkey=CDS]


Thank you,

Emma


Hi Emma,

I don't think the length of the FASTA header is a problem per se. Please make sure the '>' is really the first character on the line, with no spaces before it. If you're still stuck, can you please send me an example directly or paste a link to a file here?
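One way to check the point about '>' being the first character is to look for lines where it is preceded only by whitespace (or a byte-order mark). This is just a sketch; the function name `suspicious_header_lines` and the example lines are hypothetical.

```python
def suspicious_header_lines(lines):
    """Return (line_number, line) for lines where '>' follows leading whitespace or a BOM."""
    hits = []
    for number, line in enumerate(lines, start=1):
        stripped = line.lstrip(" \t\ufeff")
        # A header-like line that does not literally start with '>' is suspicious.
        if stripped.startswith(">") and not line.startswith(">"):
            hits.append((number, line.rstrip("\r\n")))
    return hits


# Hypothetical input: the third line has two spaces before '>'.
example = [">ok_header\n", "ATGC\n", "  >indented_header\n", "ATGN\n"]
print(suspicious_header_lines(example))
```

An empty result means every header-like line starts with '>' in the first column.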


In any case, it might not be a bad idea to shorten those FASTA header lines. I'm not sure OMA has a problem with them, but I'm sure other programs do, so perhaps that is the case here as well. Just remove the parts that are not informative for this analysis, but do keep the headers unique, of course!
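As a rough sketch of such shortening, one could keep only the first whitespace-delimited token of each header and stop if two headers collapse to the same ID. The function name `shorten_headers` is made up, the example header is an abbreviated copy of the one posted above, and whether the remaining ID format suits OMA is an assumption that would need testing.

```python
def shorten_headers(lines):
    """Keep only the first token of each FASTA header; fail on duplicate IDs."""
    seen = set()
    out = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("empty FASTA header")
            short_id = fields[0]
            if short_id in seen:
                raise ValueError(f"duplicate ID after shortening: {short_id}")
            seen.add(short_id)
            out.append(">" + short_id + "\n")
        else:
            # Sequence lines are passed through unchanged.
            out.append(line)
    return out


# Abbreviated version of the header from this thread, followed by a dummy sequence line.
example = [
    "> lcl|NW_003803422.1_cds_NP_018493878.1_374 [gene=LOC100909045] [gbkey=CDS]\n",
    "ATGC\n",
]
print("".join(shorten_headers(example)), end="")
```

Raising an error on a collision, rather than silently renaming, keeps it obvious when the chosen token is not enough to keep headers unique.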