Question: Samtools Api: Sequence And Quality Are Inconsistent
1
gravatar for Daniel Standage
5.7 years ago by
Daniel Standage3.8k
Davis, California, USA
Daniel Standage3.8k wrote:

I am writing my first program that uses the SAMtools API, and I am running into an error. I have created a simple example program that isolates the error, which is shown below. Essentially, when I run the program on a .sam file generated by bwa aln and bwa samse, I get a warning message about a missing CIGAR sequence, and then an error message about the sequence and quality being inconsistent.

Is this a problem with my code or with the data?


Here is the stripped-down program....

#include <stdio.h>
#include "sam.h"

int main(int argc, char **argv)
{
  samfile_t *sam = samopen(argv[1], "r", NULL);
  if(sam == NULL)
  {
    fprintf(stderr, "file open error: '%s'\n", argv[1]);
    return 1;
  }

  bam_header_t *header = sam_header_read(sam->x.tamr);
  if(header == NULL)
  {
    fprintf(stderr, "SAM header parse error\n");
    return 1;
  }

  bam1_t *alignment = bam_init1();
  int bytesread;
  while((bytesread = samread(sam, alignment)) > 0)
  {
    if(alignment->core.flag & 4)
    {
      // unmapped; ignore
      continue;
    }

    char *seqid = header->target_name[alignment->core.tid];
    unsigned long left = alignment->core.pos + 1;
    unsigned long right = bam_calend(&alignment->core, bam1_cigar(alignment)) + 1;
    char strand = (alignment->core.flag & 16) ? '+' : '-';
    printf("%s[%lu, %lu]%c\n", seqid, left, right, strand);
  }

  bam_header_destroy(header);
  bam_destroy1(alignment);
  samclose(sam);
  return 0;
}

....the first few lines of the .sam file...

@SQ    SN:chr1    LN:249250621
HWUSI-EAS566_0006:1:1:1018:13259#0    0    chr1    162546618    37    26M    *    0    0    GGCAGGGGACAAGACTCGGCGTTGCA    e\cT\`bbT\]Y`Y^^a^^c]\L_]`    XT:A:U    NM:i:0    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:26
HWUSI-EAS566_0006:1:1:1019:11220#0    4    *    0    0    *    *    0    0    AGAAAGAGCTCATGCAGCGTCTTAAG    ab\a\b^ccacaaaL]]]]Ra^^^^a
HWUSI-EAS566_0006:1:1:1025:16447#0    16    chr1    183604878    25    26M    *    0    0    ACTGATTCCCTTCCCTCTTCCGCTCT    \_Y]\Y^^L^^Xa`a`^_JacL\_b^    XT:A:U    NM:i:2    X0:i:1    X1:i:0    XM:i:2    XO:i:0    XG:i:0    MD:Z:0C7A17
HWUSI-EAS566_0006:1:1:1026:12890#0    4    *    0    0    *    *    0    0    TGGCTAAGAGGGAGTGGGTGTTGCGG    cccc`T]`Y`\\O]Wb\b_b^caLc\
HWUSI-EAS566_0006:1:1:1027:12527#0    4    *    0    0    *    *    0    0    CCTTCTCTCTTCCTCGGCGCTGCCTA    dbd`dad^bda^^b\^acaaY__T\`
HWUSI-EAS566_0006:1:1:1028:10989#0    4    *    0    0    *    *    0    0    AGTTCCGGCGCAGCCGGGATTTGGGT    \ddddccLacddcdadd`[_Y]__]^
HWUSI-EAS566_0006:1:1:1031:18950#0    4    *    0    0    *    *    0    0    GTAATGGCTTCAAGGACTATCAATGC    cdYLccTbcb`^\bTHU]_Qbb\T`\
HWUSI-EAS566_0006:1:1:1037:2137#0    4    *    0    0    *    *    0    0    GGCTTGCTAAGCAGAGGCCGGAAGCG    eYeeea`^bb_\aY\_d_daY^^^^_
HWUSI-EAS566_0006:1:1:1037:19136#0    4    *    0    0    *    *    0    0    ATATCTGCTTCCGACACAGCTGCAAT    aT\YT_Yb_Y\aa``TY`]YT^\Y`a

...and the error that gets generated.

[standage@lappy ~] gcc -Wall -I /path/to/samtools/ -L /path/to/samtools/ -o test test.c -lbam -lz
[standage@lappy ~] ./test test.sam
[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '162546618' is recognized as '*'.
Parse warning at line 2: mapped sequence without CIGAR
Parse error at line 2: sequence and quality are inconsistent
Abort trap: 6
[standage@lappy ~]
api samtools sam error • 4.2k views
ADD COMMENTlink modified 5.7 years ago by John Marshall1.5k • written 5.7 years ago by Daniel Standage3.8k

compile with '-g', and run the GNU debuger

    $ gdb test
   > run test.sam
   > (....)
  > backtrace
ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Pierre Lindenbaum117k
[standage@lappy ~] gdb test
GNU gdb 6.3.50-20050815 (Apple version gdb-1822) (Sun Aug  5 03:00:42 UTC 2012)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries ... done

(gdb) run test.sam 
Starting program: /Users/standage/test test.sam
Reading symbols for shared libraries ++.............................. done
[samopen] SAM header is present: 1 sequences.
[sam_read1] reference '162546618' is recognized as '*'.
Parse warning at line 2: mapped sequence without CIGAR
Parse error at line 2: sequence and quality are inconsistent

Program received signal SIGABRT, Aborted.
0x00007fff91ff7d46 in __kill ()
(gdb) backtrace
#0  0x00007fff91ff7d46 in __kill ()
#1  0x00007fff8911cdf0 in abort ()
#2  0x0000000100007ce6 in sam_read1 (fp=0x576a, header=0x7fff5fbff670, b=0x7fff5fbff670) at bam_import.c:167
#3  0x0000000100000c6a in main (argc=2, argv=0x7fff5fbff720) at test.c:22
(gdb) q
The program is running.  Exit anyway? (y or n) y
[standage@lappy ~]
ADD REPLYlink written 5.7 years ago by Daniel Standage3.8k

The sam_read1 message seems to indicate that '162546618' is being interpreted as the reference sequence name, when it's clear that this is the position of the alignment.

ADD REPLYlink written 5.7 years ago by Daniel Standage3.8k

Maybe using the BAMtools(c++) or Picard (Java) API is a bit more user-friendly (and documented).

ADD REPLYlink written 5.7 years ago by William4.4k
5
gravatar for John Marshall
5.7 years ago by
John Marshall1.5k
Glasgow, Scotland
John Marshall1.5k wrote:

The problem is with the code, which uses sam.h's higher-level samfile_t API incorrectly.

It is, sadly, completely under-documented, but samopen() reads the SAM/BAM headers for you. If you compare the code in sam_view.c's main_samview(), you will see that there is no need to call sam_header_read() yourself; instead just use sam->headers. These headers will be destroyed for you by samclose(), so there is no need to call bam_header_destroy() yourself either.

Your code effectively calls sam_header_read() twice, and that function reads an extra field from the next line to see whether it starts with @. samread() expects to need to allow for one field perhaps already having been read but not two fields, thus leading to the symptoms you have seen.

The solution to this lack of documentation and flexibility in the API is that samtools is moving to use the htslib API underneath: see this samtools-devel thread for details. We will then focus on making the htslib API well-documented, efficient, and usable.

So I am afraid now is not a great time to be learning the existing Samtools API, though there will still be a transitional wrapper library exposing (most of) the existing API in the short term. Organisational aspects of the htslib API are in flux at the moment too, and the API is currently somewhat incomplete. But in the medium term the htslib API will be more fun for developers than the old one.

ADD COMMENTlink modified 5.7 years ago • written 5.7 years ago by John Marshall1.5k
1

Thanks for the answer and a heads up to these exciting new developments for samtools. I want to note though that the samtools-devel mailing list in near unreadable via the web - even in this day an age Sourceforge seems to be unable to wrap lines from different mailers. That precludes interested parties from just trying to keep up with what is new without subscribing to the list.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Istvan Albert ♦♦ 79k

It is indeed amazing how bad the SF list archives are! I tend to use those archive URLs as pointers, and assume that anyone actually wanting to read the text will use the subject and date to find the messages in their own local mbox archives -- we're all subscribed to these lists, right? :-) There must be dozens of SF tickets filed about this, and a brief search suggests there is reason to hope this will be fixed, though I suppose they'll break all the archival URLs to the archives...

ADD REPLYlink written 5.7 years ago by John Marshall1.5k

Since you are already moving the code to github perhaps it would be a good time to also move the email list to say Google Groups?

ADD REPLYlink written 5.7 years ago by Istvan Albert ♦♦ 79k

(The source code repository moved to GitHub over a year ago.) The plan is to take advantage of functionality of both SourceForge and GitHub where appropriate, e.g., GitHub offers neither file downloads for releases nor mailing lists, so those won't be moving. Note BTW that the mailing lists also host SAM specification, Picard, and other tools discussions as well as samtools, so moving them would cause quite a lot of disruption. So it wouldn't really be up to the samtools maintainers to move the lists, and even if it was, we've no plans to do so.

ADD REPLYlink written 5.7 years ago by John Marshall1.5k

Thanks for the thorough response!

ADD REPLYlink written 5.7 years ago by Daniel Standage3.8k
0
gravatar for Istvan Albert
5.7 years ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

Your quality string encoding does not seem to be in Sanger format, see http://en.wikipedia.org/wiki/FASTQ_format

ADD COMMENTlink written 5.7 years ago by Istvan Albert ♦♦ 79k

Definitely a silly oversight on my part. But even after converting the Fastq data to Sanger encoding and re-running BWA (which may have been overkill), the error remains.

@SQ    SN:chr1    LN:249250621
HWUSI-EAS566_0006:1:1:1018:13259#0    0    chr1    162546618    37    26M    *    0    0    GGCAGGGGACAAGACTCGGCGTTGCA    F=D5=ACC5=>:A:??B??D>=-@>A    XT:A:U    NM:i:0    X0:i:1    X1:i:0    XM:i:0    XO:i:0    XG:i:0    MD:Z:26
HWUSI-EAS566_0006:1:1:1019:11220#0    4    *    0    0    *    *    0    0    AGAAAGAGCTCATGCAGCGTCTTAAG    BC=B=C?DDBDBBB->>>>3B????B
HWUSI-EAS566_0006:1:1:1025:16447#0    16    chr1    183604878    25    26M    *    0    0    ACTGATTCCCTTCCCTCTTCCGCTCT    =@:>=:??-??9BABA?@+BD-=@C?    XT:A:U    NM:i:2    X0:i:1    X1:i:0    XM:i:2    XO:i:0    XG:i:0    MD:Z:0C7A17
HWUSI-EAS566_0006:1:1:1026:12890#0    4    *    0    0    *    *    0    0    TGGCTAAGAGGGAGTGGGTGTTGCGG    DDDDA5>A:A==0>8C=C@C?DB-D=
HWUSI-EAS566_0006:1:1:1027:12527#0    4    *    0    0    *    *    0    0    CCTTCTCTCTTCCTCGGCGCTGCCTA    ECEAEBE?CEB??C=?BDBB:@@5=A
HWUSI-EAS566_0006:1:1:1028:10989#0    4    *    0    0    *    *    0    0    AGTTCCGGCGCAGCCGGGATTTGGGT    =EEEEDD-BDEEDEBEEA<@:>@@>?
HWUSI-EAS566_0006:1:1:1031:18950#0    4    *    0    0    *    *    0    0    GTAATGGCTTCAAGGACTATCAATGC    DE:-DD5CDCA?=C5)6>@2CC=5A=
HWUSI-EAS566_0006:1:1:1037:2137#0    4    *    0    0    *    *    0    0    GGCTTGCTAAGCAGAGGCCGGAAGCG    F:FFFBA?CC@=B:=@E@EB:????@
HWUSI-EAS566_0006:1:1:1037:19136#0    4    *    0    0    *    *    0    0    ATATCTGCTTCCGACACAGCTGCAAT    B5=:5@:C@:=BBAA5:A>:5?=:AB
ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by Daniel Standage3.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1717 users visited in the last hour