Samtools Api: Sequence And Quality Are Inconsistent
2
1
Entering edit mode
10.8 years ago

I am writing my first program that uses the SAMtools API, and I am running into an error. I have created a simple example program that isolates the error, which is shown below. Essentially, when I run the program on a .sam file generated by bwa aln and bwa samse, I get a warning message about a missing CIGAR sequence, and then an error message about the sequence and quality being inconsistent.

Is this a problem with my code or with the data?


Here is the stripped-down program....

#include <stdio.h>
#include "sam.h"

int main(int argc, char **argv)
{
  samfile_t *sam = samopen(argv[1], "r", NULL);
  if(sam == NULL)
  {
    fprintf(stderr, "file open error: '%s'\n", argv[1]);
    return 1;
  }

  bam_header_t *header = sam_header_read(sam->x.tamr);
  if(header == NULL)
  {
    fprintf(stderr, "SAM header parse error\n");
    return 1;
  }

  bam1_t *alignment = bam_init1();
  int bytesread;
  while((bytesread = samread(sam, alignment)) > 0)
  {
    if(alignment->core.flag & 4)
    {
      // unmapped; ignore
      continue;
    }

    char *seqid = header->target_name[alignment->core.tid];
    unsigned long left = alignment->core.pos + 1;
    unsigned long right = bam_calend(&alignment->core, bam1_cigar(alignment)) + 1;
    char strand = (alignment->core.flag & 16) ? '+' : '-';
    printf("%s[%lu, %lu]%c\n", seqid, left, right, strand);
  }

  bam_header_destroy(header);
  bam_destroy1(alignment);
  samclose(sam);
  return 0;
}

....the first few lines of the .sam file...

@SQ    SN:chr1    LN:249250621
HWUSI-EAS566_0006:1:1:1018:13259#0    0    chr1    162546618    37    26M    *    0    0    GGCAGGGGACAAGACTCGGCGTTGCA    e\cT\`bbT\]Y`Y^^a^^c]\L_]`    XT:A:U    NM:i:0    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:26
HWUSI-EAS566_0006:1:1:1019:11220#0    4    *    0    0    *    *    0    0    AGAAAGAGCTCATGCAGCGTCTTAAG    ab\a\b^ccacaaaL]]]]Ra^^^^a
HWUSI-EAS566_0006:1:1:1025:16447#0    16    chr1    183604878    25    26M    *    0    0    ACTGATTCCCTTCCCTCTTCCGCTCT    \_Y]\Y^^L^^Xa`a`^_JacL\_b^    XT:A:U    NM:i:2    X0:i:1    X1:i:0    XM:i:2    XO:i:0    XG:i:0    MD:Z:0C7A17
HWUSI-EAS566_0006:1:1:1026:12890#0    4    *    0    0    *    *    0    0    TGGCTAAGAGGGAGTGGGTGTTGCGG    cccc`T]`Y`\\O]Wb\b_b^caLc\
HWUSI-EAS566_0006:1:1:1027:12527#0    4    *    0    0    *    *    0    0    CCTTCTCTCTTCCTCGGCGCTGCCTA    dbd`dad^bda^^b\^acaaY__T\`
HWUSI-EAS566_0006:1:1:1028:10989#0    4    *    0    0    *    *    0    0    AGTTCCGGCGCAGCCGGGATTTGGGT    \ddddccLacddcdadd`[_Y]__]^
HWUSI-EAS566_0006:1:1:1031:18950#0    4    *    0    0    *    *    0    0    GTAATGGCTTCAAGGACTATCAATGC    cdYLccTbcb`^\bTHU]_Qbb\T`\
HWUSI-EAS566_0006:1:1:1037:2137#0    4    *    0    0    *    *    0    0    GGCTTGCTAAGCAGAGGCCGGAAGCG    eYeeea`^bb_\aY\_d_daY^^^^_
HWUSI-EAS566_0006:1:1:1037:19136#0    4    *    0    0    *    *    0    0    ATATCTGCTTCCGACACAGCTGCAAT    aT\YT_Yb_Y\aa``TY`]YT^\Y`a

...and the error that gets generated.

[standage@lappy ~] gcc -Wall -I /path/to/samtools/ -L /path/to/samtools/ -o test test.c -lbam -lz
[standage@lappy ~] ./test test.sam
[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '162546618' is recognized as '*'.
Parse warning at line 2: mapped sequence without CIGAR
Parse error at line 2: sequence and quality are inconsistent
Abort trap: 6
[standage@lappy ~]
samtools api sam error • 6.3k views
ADD COMMENT
0
Entering edit mode

compile with '-g', and run the GNU debuger

    $ gdb test
   > run test.sam
   > (....)
  > backtrace
ADD REPLY
0
Entering edit mode
[standage@lappy ~] gdb test
GNU gdb 6.3.50-20050815 (Apple version gdb-1822) (Sun Aug  5 03:00:42 UTC 2012)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries ... done

(gdb) run test.sam 
Starting program: /Users/standage/test test.sam
Reading symbols for shared libraries ++.............................. done
[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '162546618' is recognized as '*'.
Parse warning at line 2: mapped sequence without CIGAR
Parse error at line 2: sequence and quality are inconsistent

Program received signal SIGABRT, Aborted.
0x00007fff91ff7d46 in __kill ()
(gdb) backtrace
#0  0x00007fff91ff7d46 in __kill ()
#1  0x00007fff8911cdf0 in abort ()
#2  0x0000000100007ce6 in sam_read1 (fp=0x576a, header=0x7fff5fbff670, b=0x7fff5fbff670) at bam_import.c:167
#3  0x0000000100000c6a in main (argc=2, argv=0x7fff5fbff720) at test.c:22
(gdb) q
The program is running.  Exit anyway? (y or n) y
[standage@lappy ~]
ADD REPLY
0
Entering edit mode

The sam_read1 message seems to indicate that '162546618' is being interpreted as the reference sequence name, when it's clear that this is the position of the alignment.

ADD REPLY
0
Entering edit mode

Maybe using the BAMtools(c++) or Picard (Java) API is a bit more user-friendly (and documented).

ADD REPLY
5
Entering edit mode
10.8 years ago

The problem is with the code, which uses sam.h's higher-level samfile_t API incorrectly.

It is, sadly, completely under-documented, but samopen() reads the SAM/BAM headers for you. If you compare the code in sam_view.c's main_samview(), you will see that there is no need to call sam_header_read() yourself; instead just use sam->headers. These headers will be destroyed for you by samclose(), so there is no need to call bam_header_destroy() yourself either.

Your code effectively calls sam_header_read() twice, and that function reads an extra field from the next line to see whether it starts with @. samread() expects to need to allow for one field perhaps already having been read but not two fields, thus leading to the symptoms you have seen.

The solution to this lack of documentation and flexibility in the API is that samtools is moving to use the htslib API underneath: see this samtools-devel thread for details. We will then focus on making the htslib API well-documented, efficient, and usable.

So I am afraid now is not a great time to be learning the existing Samtools API, though there will still be a transitional wrapper library exposing (most of) the existing API in the short term. Organisational aspects of the htslib API are in flux at the moment too, and the API is currently somewhat incomplete. But in the medium term the htslib API will be more fun for developers than the old one.

ADD COMMENT
1
Entering edit mode

Thanks for the answer and a heads up to these exciting new developments for samtools. I want to note though that the samtools-devel mailing list in near unreadable via the web - even in this day an age Sourceforge seems to be unable to wrap lines from different mailers. That precludes interested parties from just trying to keep up with what is new without subscribing to the list.

ADD REPLY
0
Entering edit mode

It is indeed amazing how bad the SF list archives are! I tend to use those archive URLs as pointers, and assume that anyone actually wanting to read the text will use the subject and date to find the messages in their own local mbox archives -- we're all subscribed to these lists, right? :-) There must be dozens of SF tickets filed about this, and a brief search suggests there is reason to hope this will be fixed, though I suppose they'll break all the archival URLs to the archives...

ADD REPLY
0
Entering edit mode

Since you are already moving the code to github perhaps it would be a good time to also move the email list to say Google Groups?

ADD REPLY
0
Entering edit mode

(The source code repository moved to GitHub over a year ago.) The plan is to take advantage of functionality of both SourceForge and GitHub where appropriate, e.g., GitHub offers neither file downloads for releases nor mailing lists, so those won't be moving. Note BTW that the mailing lists also host SAM specification, Picard, and other tools discussions as well as samtools, so moving them would cause quite a lot of disruption. So it wouldn't really be up to the samtools maintainers to move the lists, and even if it was, we've no plans to do so.

ADD REPLY
0
Entering edit mode

Thanks for the thorough response!

ADD REPLY
0
Entering edit mode
10.8 years ago

Your quality string encoding does not seem to be in Sanger format, see http://en.wikipedia.org/wiki/FASTQ_format

ADD COMMENT
0
Entering edit mode

Definitely a silly oversight on my part. But even after converting the Fastq data to Sanger encoding and re-running BWA (which may have been overkill), the error remains.

@SQ    SN:chr1    LN:249250621
HWUSI-EAS566_0006:1:1:1018:13259#0    0    chr1    162546618    37    26M    *    0    0    GGCAGGGGACAAGACTCGGCGTTGCA    F=D5=ACC5=>:A:??B??D>=-@>A    XT:A:U    NM:i:0    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:26
HWUSI-EAS566_0006:1:1:1019:11220#0    4    *    0    0    *    *    0    0    AGAAAGAGCTCATGCAGCGTCTTAAG    BC=B=C?DDBDBBB->>>>3B????B
HWUSI-EAS566_0006:1:1:1025:16447#0    16    chr1    183604878    25    26M    *    0    0    ACTGATTCCCTTCCCTCTTCCGCTCT    =@:>=:??-??9BABA?@+BD-=@C?    XT:A:U    NM:i:2    X0:i:1    X1:i:0    XM:i:2    XO:i:0    XG:i:0    MD:Z:0C7A17
HWUSI-EAS566_0006:1:1:1026:12890#0    4    *    0    0    *    *    0    0    TGGCTAAGAGGGAGTGGGTGTTGCGG    DDDDA5>A:A==0>8C=C@C?DB-D=
HWUSI-EAS566_0006:1:1:1027:12527#0    4    *    0    0    *    *    0    0    CCTTCTCTCTTCCTCGGCGCTGCCTA    ECEAEBE?CEB??C=?BDBB:@@5=A
HWUSI-EAS566_0006:1:1:1028:10989#0    4    *    0    0    *    *    0    0    AGTTCCGGCGCAGCCGGGATTTGGGT    =EEEEDD-BDEEDEBEEA<@:>@@>?
HWUSI-EAS566_0006:1:1:1031:18950#0    4    *    0    0    *    *    0    0    GTAATGGCTTCAAGGACTATCAATGC    DE:-DD5CDCA?=C5)6>@2CC=5A=
HWUSI-EAS566_0006:1:1:1037:2137#0    4    *    0    0    *    *    0    0    GGCTTGCTAAGCAGAGGCCGGAAGCG    F:FFFBA?CC@=B:=@E@EB:????@
HWUSI-EAS566_0006:1:1:1037:19136#0    4    *    0    0    *    *    0    0    ATATCTGCTTCCGACACAGCTGCAAT    B5=:5@:C@:=BBAA5:A>:5?=:AB
ADD REPLY

Login before adding your answer.

Traffic: 2417 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6