Question: C Api For Bam: Iterating Over A Cigar String.
4
gravatar for Pierre Lindenbaum
8.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum117k wrote:

Hi, this question is somehow related to this previous question. I'm playing with the C API for BAM and I wrote the following code: https://gist.github.com/736059

I'm now trying to find the indexes of the genomic bases covered by the CIGAR string (my final goal is to create a WIG file containing the coverage of the genome)

What would be the correct way to change my code to get the genomic indexes of the bases covered by each CIGAR element ?

(...)
for( k=0;k< b->core.n_cigar;++k)
{
int cop =cigar[k] & BAM_CIGAR_MASK; // operation
int cl = cigar[k] >> BAM_CIGAR_SHIFT; // length
switch(cop)
 {
 case BAM_CMATCH: printf("M");break;
 case BAM_CINS: printf("I");break;
 case BAM_CDEL: printf("D");break;
 case BAM_CREF_SKIP: printf("N"); break;
 case BAM_CSOFT_CLIP: printf("S");break;
 case BAM_CHARD_CLIP: printf("R");break;
 case BAM_CPAD: printf("P");break;
 default:printf("?");break;
 }
printf("%d",cl);
}
(...)

Thanks

cigar wiggle bam coverage • 3.3k views
ADD COMMENTlink written 8.2 years ago by Pierre Lindenbaum117k
2

Have you looked at bamtools: https://github.com/pezmaster31/bamtools ? It provides a very clean C++ BAM API.

ADD REPLYlink written 8.2 years ago by Erik Garrison2.1k

(That noted, it won't help make this any easier.)

ADD REPLYlink written 8.2 years ago by Erik Garrison2.1k

Yes thanks, I saw it, but I want to learn the 'official' API and how the data should be handled

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum117k
3
gravatar for lh3
8.2 years ago by
lh331k
United States
lh331k wrote:

For example

int depth[REF_LEN], x, j, k;
uint32_t *cigar = bam1_cigar(b);
for (k = 0, x = b->core.pos; k < b->core.n_cigar; ++k) {
  int op = cigar[k]&16;
  int l = cigar[k]>>4;
  if (op == BAM_CMATCH) {
    for (j = x; j < x + l; ++j) ++depth[j];
    x += l;
  } else if (op == BAM_CREF_SKIP || op == BAM_CDEL) x += l;
}

Something like this...

ADD COMMENTlink modified 6.8 years ago by Istvan Albert ♦♦ 79k • written 8.2 years ago by lh331k

While this does work for an unsorted file, on a sorted file I think a progressive approach (ie. one that does not build a genome size array) would be preferable.

ADD REPLYlink written 8.2 years ago by Jonathan Manning620

That calDepth.c only aims to exemplify how to use pileup APIs. For depth only, the code block above is the fastest and easiest. Given a sorted file, you can allocate an array per chromosome and then free it. The peak memory is only about 1GB.

ADD REPLYlink written 8.2 years ago by lh331k
2
gravatar for Jonathan Manning
8.2 years ago by
Near Boston, MA
Jonathan Manning620 wrote:

Short answer, see samtools/examples/calDepth.c for a pileup based solution. The CIGAR is resolved within the pileup, avoiding the need to walk the cigar string yourself.

Still, to answer the direct question - 'get the genomic indexes of the bases covered by each CIGAR element', I would implement that switch statement as:

/* NB: Closed ranges */
switch (cop) {
 case BAM_CMATCH:
      printf("M");
      printf("[%d-%d]", pos, pos + cl - 1);
      pos+=cl;
      break;

 case BAM_CHARD_CLIP:
      printf("H");
      /* printf("[%d]", pos);  // No coverage */
      /* pos is not advanced by this operation */
      break;

 case BAM_CSOFT_CLIP:
      printf("S");
      /* printf("[%d]", pos);  // No coverage */
      /* pos is not advanced by this operation */
      break;

 case BAM_CDEL:
      printf("D");
      /* printf("[%d-%d]", pos, pos + cl - 1);  // Spans positions, No Coverage */
      pos+=cl;
      break;

 case BAM_CPAD:
      printf("P");
      /* printf("[%d-%d]", pos, pos + cl - 1); // Spans positions, No Coverage */
      pos+=cl;
      break;

 case BAM_CINS:
      printf("I");
      /* printf("[%d]", pos); // Special case - adds <cl> bp "throughput", but not genomic position "coverage" */
      /* How you handle this is application dependent */
      /* pos is not advanced by this operation */
      break;

 case BAM_CREF_SKIP:
      printf("S");
      /* printf("[%d-%d]", pos, pos + cl - 1); /* Spans positions, No Coverage */
      pos+=cl;
      break;

 default:
      fprintf(stderr, "Unhandled cigar_op %d:%d\n", cop, cl);
      printf("?");
}
ADD COMMENTlink written 8.2 years ago by Jonathan Manning620

That's what I was looking for ! Many thanks !

ADD REPLYlink written 8.2 years ago by Pierre Lindenbaum117k

James Bonfield also posted an alternative pileup implementation (with a usable switch/case block), but with some subtly different edge cases, to the mailing list recently. It seems to be omitted from the archives, but was called sam_depth.c.gz, and was the first message in this thread: http://sourceforge.net/mailarchive/message.php?msg_id=26502507

ADD REPLYlink written 8.2 years ago by Jonathan Manning620
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1391 users visited in the last hour