Question

Coverage In Bam Alignment

3

Entering edit mode

12.0 years ago

MarioRio ▴ 40

Hi. I have a C++ program that scan a BAM sorted file and returns some informations, but i don't know how to manage the coverage of the alignment. this is the program

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include "bam.h"
#include "sam.h"

int main(int argc, char *argv[]) {

  samfile_t *fp_in = NULL;
  bam1_t *b=NULL;
  if(argc!=2) return EXIT_FAILURE;

  fp_in = samopen(argv[1], "rb", 0);

  if(NULL == fp_in)
        {
       fprintf(stderr,"Could not open file\n");
       return EXIT_FAILURE;
       }

  b = bam_init1();
  int pos=0;
  int lastpos=0;

  while(samread(fp_in, b) > 0)
    {
    lastpos = pos;
    pos = b->core.pos;


     printf("tid  : %d\n",b->core.tid);
     if(b->core.tid>=0)
        {
        printf("seqname: %s\n", fp_in->header->target_name[b->core.tid]);
        printf("seqlength: %d\n", fp_in->header->target_len[b->core.tid]);
        }
      printf("pos  : %d\n",b->core.pos);
      char *name  = bam1_qname(b);
      unsigned char *qual  = bam1_qual(b);
      uint32_t* cigar= bam1_cigar(b);
      int n=0;
      char *qseq = (char *) malloc(b->core.l_qseq+1);
      unsigned char *s   = bam1_seq(b);
      for(n=0;n<(b->core.l_qseq);n++) {
        int v = bam1_seqi(s,n);
        qseq[n] = bam_nt16_rev_table[v];
      }
      qseq[n] = 0;

      int cigar_leng=bam_cigar2qlen(&(b->core),cigar);
      printf("name : %s\n",name);
      printf("qseq : %s\n",qseq);
      printf("cigar-size: %d\n",cigar_leng);
       printf("cigar:");
      int k;
      for( k=0;k< b->core.n_cigar;++k)
        {
        int cop =cigar[k] & BAM_CIGAR_MASK; // operation
        int cl = cigar[k] >> BAM_CIGAR_SHIFT; // length
        switch(cop)
                {
                case BAM_CMATCH: printf("M");break;
                case BAM_CINS: printf("I");break;
                case BAM_CDEL: printf("D");break;
                case BAM_CREF_SKIP: printf("N"); break;
                case BAM_CSOFT_CLIP: printf("S");break;
                case BAM_CHARD_CLIP: printf("R");break;
                case BAM_CPAD: printf("P");break;
                default:printf("?");assert(0);break;
                }
        printf("%d",cl);
        }
      printf("\n");

      printf("qual :");
      for(n=0;n<(b->core.l_qseq);n++) {
        printf(" %d",qual[n]);
      }
      printf("\n\n");

    bam_destroy1(b);
    b = bam_init1();
  }
  bam_destroy1(b);

  samclose(fp_in);
  return 0;
}

i don't how to get the coverage of the alignments from the BAM file.

thanks in advance

coverage bam alignment • 5.6k views

ADD COMMENT • link updated 12.0 years ago by Arun 2.4k • written 12.0 years ago by MarioRio ▴ 40

score 4 · Answer 1 · 2012-05-14

4

Entering edit mode

12.0 years ago

Richard ▴ 590

Hi MarioRio,

It probably goes against your strategy of coding your own solution. However, getting the depth is really easy with tools like SAMtools or BEDTools.

SAMtools (search for "precise depth"): http://samtools.sourceforge.net/mpileup.shtml

BedTools (check out the examples posted here): http://code.google.com/p/bedtools/wiki/UsageAdvanced

ADD COMMENT • link 12.0 years ago by Richard ▴ 590

1

Entering edit mode

If I understand it, you should not use tools to do it, but you have to write code, and as coverage you mean how many times are the bases in a certain position of chromosome

ADD REPLY • link 12.0 years ago by TwoFaces ▴ 10

score 4 · Answer 2 · 2012-05-15

I had to do the same for a colleague of mine some months back. Here's the perl version. You should have BIO::DB::Sam installed.

#!/usr/bin/perl 
use strict;
use warnings;
use Bio::DB::Sam;

# all necessary inputs (I use Getopt::Long to obtain input parameters)
my $opt_i     = "./input.bam"; # should have an index file named input.bam.bai
my $opt_f     = "./species.fasta"; 
my $chr_id    = 'Chr1'; # your chromosome-id
my $snp_pos = 251902; # position you want to call all variants in this position 
# create the object
my $sam = Bio::DB::Sam->new( -fasta => $opt_f, -bam => $opt_i );

# get all reads that cover this SNP -- therefore, start and end set to $snp_pos
my @alignments = $sam->get_features_by_location(-seq_id => $chr_id, -start => $snp_pos, -end => $snp_pos );

my %res; # output hash that'll contain the count of each available nucleotide (or blank if the read covering the SNP is spliced in this position).
# loop over all alignments
for my $cAlign (@alignments) { 
    # parameters we'll need from our bam file
    my  $start     = $cAlign->start;# get start coordinate
    my $end       = $cAlign->end;    # get end coordinate
    my $ref_seq   = $cAlign->dna;    # get reference DNA sequence
    my $read_seq  = $cAlign->query->dna;    # get query DNA sequence
    my $cigar     = $cAlign->cigar_str;# get CIGAR sequence
    my $cigar_ref = $cAlign->cigar_array;    # probably the important useful of all. splits cigar to array of array reference
      # Ex: $cigar = 20M100N50M => $cigar_ref = [ [ 'M' 20 ] [ 'N' 100] ['M' 50] ]

    my $ref_cntr     = $start;     # $ref_cntr = assign start to counter variable for reference. This will ultimately
                           # keep track of the current position on the chromosome
    my $read_cntr    = 0;           # $read_cntr = computes offset on the read
    my $read_snp_pos = "";    # variable to hold base at $snp_pos from current read

foreach my $deref ( @$cigar_ref ) {
my $cigar_chr = $deref->[0];# cigar character (ex: M, N, I, D etc...)
    my $len_chr   = $deref->[1];# number corresponding to `this` cigar character ( ex: 20, 100 )

# NOTE: I => insertion in to the reference, meaning the read has the base(s) but the REFERENCE fasta does NOT
#       D => deletion from the reference, meaning the READ does NOT have the base(s), but the reference does

        # modify reference counter only on M, N or D occurrence
        if( $cigar_chr eq "M"  || $cigar_chr eq "N" || $cigar_chr eq "D" ) {
            $ref_cntr += $len_chr;
        }

        # modify offset only on M or I occurrence
        if( $cigar_chr eq "M" || $cigar_chr eq "I" ) {
            $read_cntr += $len_chr;
        }

        # now, check if the current operation takes ref_cntr past the SNP position. If it does,
        # 1) If the current operation is NOT "M", then its either "N" or "D". Both of them mean
        # that the read doesn't have a base at the SNP. So, this read is either spliced or has 
        # a deletion at that point and is not useful for your SNP location.
        # 2) If the current position IS "M", then the current operation has gotten is past SNP
        # location. So, we FIRST SUBTRACT what we added in this operation for ref_cntr and read_cntr, 
        # and then just add the difference ($snp_pos - $ref_cntr + 1)

        if( $ref_cntr > $snp_pos ) { 
            if( $cigar_chr eq "M" ) {
                $ref_cntr  -= $len_chr;
                $read_cntr -= $len_chr;
                $read_cntr += ( $snp_pos - $ref_cntr + 1 );
                $read_snp_pos = substr( $read_seq, $read_cntr-1, 1 );
            }
            # if $cigar_chr is "N" or "D", do nothing. $read_snp_pos is set to ""
            # add value of $read_snp_pos to hash and get out of loop - to next read
            $res{$read_snp_pos}++;
            last;
        }
    }
}

# Here, I am just printing to output.
print "Count\n";
foreach my $key ( keys %res ) {
    print "$key:$res{$key}\n";
}
print "--\n";

I am not sure if the comments are so useful. I created them sometime back and did not read thro' them now. Just save this to a file with .pl extension and run perl file.pl after setting the path of your bam, fasta, chr_id and snp_pos. To extend this over many snp positions you could either loop over this code or write shell script to call this code in parallel (make sure NOT to run ALL SNPs at the same time :) ). I tested this code and it works.

Oh I should mention, even though its obvious, that I did not implement "S" - soft clip. But I guess they don't affect the positions as the POS_0 of the SAM file contains the first match... Right?

Good luck!

score 2 · Answer 3 · 2012-05-14

2

Entering edit mode

12.0 years ago

Pierre Lindenbaum 161k

You need to use the pileup API in the samtools package: see this example: http://samtools.sourceforge.net/sam-exam.shtml

ADD COMMENT • link 12.0 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

Thanks to you both. For each positions and chromosomes i also need to know what bases are in and how many times. For example in position 2688164 in chr1 there are 12 A and 15 T, i would like to extract this information.

ADD REPLY • link 12.0 years ago by MarioRio ▴ 40