Reconstructing Alignments From Cigar Strings
2
2
Entering edit mode
12.2 years ago
Revh ▴ 60

I have a number of aligned sequences, and their corresponding CIGAR strings. What I want to do is to reconstruct the alignment based on this. Basically I want to do what: http://www.mathworks.co.jp/help/toolbox/bioinfo/ref/cigar2align.html does, only I do not have access to Matlab.

Do you guys have any suggestions on how to accomplish this? Programs, packages etc. A perl script would be perfect, but writing one is beyond my capabilities.

Using a multiple alignment tool like MAFFT is not doable due to the nature of my sequences.

multiple cigar alignment • 6.3k views
ADD COMMENT
2
Entering edit mode
12.1 years ago

Here is a simplistic C implementation. I've just handled the 'M/D/I/P' operators with the example given at http://www.mathworks.co.jp/help/toolbox/bioinfo/ref/cigar2align.html

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

int main(int argc,char** argv)
    {
    int i;
    for(i=1;i+1< argc;i+=2)
        {
        char* seq=argv[i];
        size_t pos=0UL;
        char* c=argv[i+1];
        while(*c!=0)
            {
            char* p2;
            if(!isdigit(*c))
                {
                fprintf(stderr,"bad cigar string: %s\n",argv[i+1]);
                return EXIT_FAILURE;
                }
            int n=strtol(c,&p2,10);
            if(n<=0)
                {
                fprintf(stderr,"bad cigar string: %s\n",argv[i+1]);
                return EXIT_FAILURE;
                }
            switch(*p2)
                {
                case 'M':
                    {
                    while(n>0)
                        {
                        fputc(seq[pos++],stdout);
                        --n;
                        }
                    break;
                    }
                case 'D':
                    {
                    while(n>0)
                        {
                        fputc('-',stdout);
                        --n;
                        }
                    break;
                    }
                case 'I':
                case 'P':
                    {
                    /* ignore */
                    break;
                    }
              default:
                {
                fprintf(stderr,"unsupported operator %c\n",*p2);
                return EXIT_FAILURE;
                }
                }
            c=p2+1;
            }
        fputc('\n',stdout);
        }
    return EXIT_SUCCESS;
    }

Compilation:

gcc -Wall -o cigar2string cigar2string.c

test:

$ ./cigar2string  ACGACTGC 3M1D1M1I3M ACGTTGC 4M1D1P3M AGGTATC 5M1P1M1D1M
ACG-ACTG
ACGT-TGC
AGGTAT-C
ADD COMMENT
1
Entering edit mode

This works for me (with a little modification), but more complex CIGAR strings (containing multiple insertions and deletions) seem to cause it problems.

ADD REPLY
0
Entering edit mode
12.1 years ago
Michael 54k

I have had a similar question referring to percent identity. In the end I found it easier to recalculate the alignments using Smith-Waterman in R for small numbers of sequences.

If you output the alignments instead of calculating the percent identity that should work. I'm not saying that it doesn't work using the CIGAR string but it was just simpler for me to code it that way.

ADD COMMENT

Login before adding your answer.

Traffic: 2657 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6