Question

Reconstructing Alignments From Cigar Strings

2

Entering edit mode

13.4 years ago

Revh ▴ 60

I have a number of aligned sequences, and their corresponding CIGAR strings. What I want to do is to reconstruct the alignment based on this. Basically I want to do what: http://www.mathworks.co.jp/help/toolbox/bioinfo/ref/cigar2align.html does, only I do not have access to Matlab.

Do you guys have any suggestions on how to accomplish this? Programs, packages etc. A perl script would be perfect, but writing one is beyond my capabilities.

Using a multiple alignment tool like MAFFT is not doable due to the nature of my sequences.

multiple cigar alignment • 6.8k views

ADD COMMENT • link updated 11.4 years ago by Biostar 20 • written 13.4 years ago by Revh ▴ 60

score 2 · Answer 1 · 2012-02-28

Here is a simplistic C implementation. I've just handled the 'M/D/I/P' operators with the example given at http://www.mathworks.co.jp/help/toolbox/bioinfo/ref/cigar2align.html

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int main(int argc,char** argv)
    {
    int i;
    for(i=1;i+1< argc;i+=2)
        {
        char* seq=argv[i];
        size_t pos=0UL;
        char* c=argv[i+1];
        while(*c!=0)
            {
            char* p2;
            if(!isdigit(*c))
                {
                fprintf(stderr,"bad cigar string: %s\n",argv[i+1]);
                return EXIT_FAILURE;
                }
            int n=strtol(c,&p2,10);
            if(n<=0)
                {
                fprintf(stderr,"bad cigar string: %s\n",argv[i+1]);
                return EXIT_FAILURE;
                }
            switch(*p2)
                {
                case 'M':
                    {
                    while(n>0)
                        {
                        fputc(seq[pos++],stdout);
                        --n;
                        }
                    break;
                    }
                case 'D':
                    {
                    while(n>0)
                        {
                        fputc('-',stdout);
                        --n;
                        }
                    break;
                    }
                case 'I':
                case 'P':
                    {
                    /* ignore */
                    break;
                    }
              default:
                {
                fprintf(stderr,"unsupported operator %c\n",*p2);
                return EXIT_FAILURE;
                }
                }
            c=p2+1;
            }
        fputc('\n',stdout);
        }
    return EXIT_SUCCESS;
    }

Compilation:

gcc -Wall -o cigar2string cigar2string.c

test:

$ ./cigar2string  ACGACTGC 3M1D1M1I3M ACGTTGC 4M1D1P3M AGGTATC 5M1P1M1D1M
ACG-ACTG
ACGT-TGC
AGGTAT-C

Ram · Answer 2 · 2012-02-28

I have had a similar question referring to percent identity. In the end I found it easier to recalculate the alignments using Smith-Waterman in R for small numbers of sequences.

If you output the alignments instead of calculating the percent identity that should work. I'm not saying that it doesn't work using the CIGAR string but it was just simpler for me to code it that way.