Forum: Stranger Things: unexpected limitations of popular tools
Istvan Albert (University Park, USA) wrote, 13 days ago:

Twice in as many days, we were left scratching our heads as mainstream tools broke unexpectedly and spectacularly on what seemed to be normal tasks.


Cigar Break

For example, take the SAM file that we have now fittingly named cigarbreak.sam:

 samtools view -b http://data.biostarhandbook.com/strange/cigarbreak.sam > output.bam

Alas, it's close, but no cigar:

[E::bam_write1] too many CIGAR operations (95243 >= 64K for QNAME "placed")
samtools view: writing to standard output failed: Value too large to be stored in data type

Turns out this particular SAM file cannot be turned into a BAM file. Surprised? Rightfully so. Why does the BAM format limit how many operations may be present in the alignment?

EDIT:

Summarizing the comments/answers:

  1. This limitation is caused by the way the BAM format stores the value internally: a 16-bit short, so 2^16 = 65536 max operations.
  2. The CRAM format does not have this limitation.
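The arithmetic behind the error is simple enough to sketch in a couple of lines (the 95243 count comes from the samtools message above; the constant name is my own):

```python
# BAM stores the CIGAR operation count in an unsigned 16-bit field,
# so any alignment with 2**16 or more operations cannot be represented.
MAX_CIGAR_OPS = 2 ** 16   # 65536 -- the "64K" in the samtools error

n_ops = 95243             # the operation count reported for QNAME "placed"

print(MAX_CIGAR_OPS)            # 65536
print(n_ops >= MAX_CIGAR_OPS)   # True -> the record cannot be written as BAM
```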

Align it like it's still 1999

Here is another unexpected behavior. The hisat2 aligner will, occasionally, refuse to align a read that has an insertion in it.

wget http://data.biostarhandbook.com/strange/reference.fa
wget http://data.biostarhandbook.com/strange/query.fa

bwa index reference.fa
hisat2-build reference.fa reference.fa

The two sequences are identical except the query has an insertion after position 49. Let's do the alignments:

bwa mem reference.fa query.fa | cut -f 1,2,6

among other things it prints:

sequence    0   49M1I134M

This is the correct answer. The query sequence has 49 matches, 1 insertion, followed by 134 matches. There are no mismatches, by the way. Now let's run the same sequence through hisat2:

hisat2 -f -x reference.fa query.fa | cut -f 1,2,6

the output is:

sequence    4   *

the query sequence is not aligned at all. Make no mistake: hisat2 is supposed to be an indel-aware aligner, yet it fails for this particular sequence. Amusingly, we detected this when we used hisat2 to "check and validate" results produced via bwa mem.
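As a sanity check on the CIGAR reported by bwa mem above, here is a minimal parser (my own sketch, not part of either tool) confirming that 49M1I134M accounts for the full query:

```python
import re

def cigar_ops(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def query_length(cigar):
    """Query bases consumed: M, I, S, = and X operations (per the SAM spec)."""
    return sum(n for n, op in cigar_ops(cigar) if op in "MIS=X")

print(cigar_ops("49M1I134M"))       # [(49, 'M'), (1, 'I'), (134, 'M')]
print(query_length("49M1I134M"))    # 184 = 49 + 1 + 134
```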


Post your own "stranger things" observations as followup answers.

written 13 days ago by Istvan Albert

Turns out the CIGAR string limitation was already mentioned here: Splitting SAM/BAM alignments

written 13 days ago by Istvan Albert

The hisat2 example seems like a bug to me. Did you file a bug report?

edit: turns out the hisat2 indel issue has been reported before (Q: HISAT2 (graph index) alignment over InDels), and vivekbhr reported the issue to the authors.

edit 2: which version of hisat2? There is one fresh out of the oven - HISAT2 2.1.0, released 6/8/2017 - which may fix the issue?

written 13 days ago by h.mon

Fascinating. Looks like there is nothing new under the Biostar: both issues have already been reported.

In my observation, in general, insertions and deletions are handled correctly - the failure only occurs in some cases.

I've checked out the latest source of hisat2, the problem still persists.

Filed an issue here: https://github.com/infphilo/hisat2/issues/129

written 13 days ago by Istvan Albert

Figuring out what to allocate memory for can be a hard problem at the outset, especially for tools written when memory and disk resources were considerably more constrained. Programming is hard work! Especially when feature creep sets in and people expect more from a format than it initially allowed.

written 13 days ago by Alex Reynolds
WouterDeCoster (Belgium) wrote, 12 days ago:

R has a weird bug in which counting of arrays starts with index 1.

written 12 days ago by WouterDeCoster

The only language that gets this correct is Fortran, where arrays can start at any number you want :) Yes, knowing this and having used Fortran makes me old.

written 12 days ago by Devon Ryan

As far as I know, Julia also does this. Which is nice, because initially it started at 1...

written 12 days ago by WouterDeCoster
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 13 days ago:

Why does the BAM format limit how many operations may be present in the alignment?

Currently the number of CIGAR operations is stored in a 16-bit short, so 2^16 = 65536 max operations.

https://github.com/samtools/htslib/issues/437

written 13 days ago by Pierre Lindenbaum

Good to know - but it demonstrates very well what Donald Knuth called "premature optimization is the root of all evil" (from a software development perspective, that is).

The SAM/BAM formats are chock-full of premature optimizations: from the weird and semi-useless flag system, to the two different kinds of CIGAR strings (column 6 and the MD tag), to the weird tagging system, to the index not being part of the file but a separate one, etc.

It was all developed in an era when people thought that shuffling tens of millions of 76bp Illumina reads back and forth was what bioinformatics would always be about.

written 13 days ago by Istvan Albert

I always think the biggest mistake in SAM (not BAM) is the integer flag, because it is not human readable. I proposed letter flags, using one letter per flag bit, but the proposal fell through. Probably I should have acted more firmly at the time. On the other hand, I am not sure the binary flag in BAM is a big issue. How else would you encode strand, mate strand, unmapped, read1/2, etc. efficiently and compactly?
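For reference, the flag bits being discussed can be decoded like this (the bit values are from the SAM spec; the short names are my own):

```python
# Standard SAM flag bits; decoding shows why the integer form is
# compact but not human readable.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "read1", 0x80: "read2",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of the bits set in an integer SAM flag."""
    return [name for bit, name in sorted(FLAG_BITS.items()) if flag & bit]

print(decode_flag(4))    # ['unmapped'] -- the hisat2 output in the question
print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'read1']
```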

The second major mistake is the 65535 cigar limitation. The decision was short-sighted. A pull request is pending to fix it for future tools.

On CIGAR+MD: MD can be regenerated and is unimportant. Furthermore, there is no efficient and convenient way to encode CIGAR and MD together – you would need variable-length arrays and heap allocation, which are slow, error-prone and hard to work with.

A separate index is due to stream-ability. It was not clear how to achieve both at the beginning. Later after we found the EOF marker, we realized it is possible to append the index to the end of BAM, but apparently not many were interested in this.

SAM was designed with full genome alignment in mind. That's why it has hard clipping and uses binning index, which would not be necessary for 36bp reads at the time. Due to the lack of real-world long-read data and applications at the time, SAM did not work well with long reads; the situation is not much improved even with the addition of supplementary alignment and SA several years later. I asked myself from time to time how we could design a better format for long reads if we had started over, but I could not with all the corners we have to cut here and there. It is easier to complain than to come up with a good solution.

The internal of a program can be continuously optimized without interfering with users. A format can't because later optimization may break compatibility, arguably the most important factor to a format. You have to bake early optimization decisions upfront even for imaginary use cases (e.g. long reads). A few are mistakes but some are still amendable (e.g. long cigar). The rest make BAM survive today.

written 12 days ago by lh3

First of all, I do recognize that criticizing and commenting - especially in hindsight - is trivial. The problem with premature optimization is that it leaks into the solution - we end up looking only at things that we think we can optimize today.

The main problem with the SAM format is that it relies too much on existing paradigms - where the order of columns carries information, where we must think in terms of a fixed number of columns plus a variable number of unordered tags, etc. It is not an extensible format that can grow seamlessly and admit new information in the future; it is not a format that could replace other formats - it only adds one more burden.

The right way to design SAM would have been to make it a viable replacement for all interval-type data - BED and GFF (as well as the endless variety of ad hoc tab-delimited formats). A format where no conversion to another similar format would ever be necessary.

For that a shift in thinking is required - it is like "duck-typing" in high level languages. Any file should work as a SAM file as long as it has the columns that a SAM format requires. Adding 100 more columns to a SAM format should not matter. Adding columns in a different order should not matter. Since the index is a separate entity anyway, there is a lot of room to implement indexing and optimization that allows for fast access to the underlying data.

I think the universal bioinformatics interval format is a header- and column-defined, sortable (offering multiple indices), bgzippable, tab-delimited format - working similarly to how tabix works. Making this file compact, efficient, etc. should be independent of what the file actually contains.

The job of the "standardization" committee would be to prescribe what each column means: for example, that the column whose header is start has to contain a long, positive, non-zero integer in a 1-based coordinate system. Then, in addition, the committee could state that for a file to be a valid SAM it has to have a certain set of columns and what each of those columns must contain.

Note how much simpler life has just become. A tool can look at a file and provide meaningful feedback on what might be missing and how to add that information. There is no need to "convert" files - this is the key - no conversion would ever be needed. You'd only need to add information to a file without losing what is already there.
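A toy sketch of this "header-defined" idea (the file content and the REQUIRED set are invented for illustration): meaning comes from the header names, so column order is immaterial and validation reduces to a set comparison.

```python
import csv
import io

# A hypothetical interval file: column order is arbitrary,
# the header names carry the meaning.
data = "end\tchrom\tstart\n200\tchr1\t100\n500\tchr1\t300\n"

REQUIRED = {"chrom", "start", "end"}   # what a valid interval file must have

rows = list(csv.DictReader(io.StringIO(data), delimiter="\t"))
missing = REQUIRED - rows[0].keys()

print(missing or "valid interval file")   # empty set -> "valid interval file"
print([int(r["start"]) for r in rows])    # [100, 300], regardless of order
```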

written 12 days ago by Istvan Albert

My understanding is that to an alignment format, you still propose to have fixed columns and optional columns. The major difference from SAM would be that each line ought to have a fixed number of optional columns, as are defined in the header. This could work, but I am not sure how much it improves over SAM. The committee can't freely add fixed columns. Doing this pushes the compatibility issues to the downstream tools. In addition, the several problems with BAM you talked about (flag, cigar and MD) are due to binary encoding. Your proposal will be exposed to the same potential pitfalls.

written 12 days ago by lh3

Can't you achieve most of what you propose by just adding user-defined SAM tags? These are similar to "optional" columns, and have the benefit of namespacing and static typing. I guess one big loss here would be the ability to naively sort the file on tags, but sorting shouldn't be too hard, and would be just another reason to rely on a samtools API.
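The static typing mentioned here comes from the TAG:TYPE:VALUE layout of SAM optional fields; a minimal parser (my own sketch, with made-up example values) looks like:

```python
# SAM optional fields are TAG:TYPE:VALUE triplets; the TYPE character
# says how to interpret VALUE ('i' integer, 'f' float, 'Z' string, ...).
CASTS = {"i": int, "f": float, "A": str, "Z": str, "H": str}

def parse_tags(fields):
    """Parse a list of optional-field strings into a {tag: value} dict."""
    out = {}
    for field in fields:
        tag, typ, value = field.split(":", 2)
        out[tag] = CASTS.get(typ, str)(value)
    return out

tags = parse_tags(["AS:i:184", "NM:i:0", "MD:Z:183"])  # illustrative values
print(tags)         # {'AS': 184, 'NM': 0, 'MD': '183'}
print(tags["AS"])   # 184, already an int thanks to the type character
```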

I think you have three features: format simplicity, ease of parsing, and ease of validation, but it's hard to optimize all three.

written 11 days ago by Matt Shirley

Tags are the problem - they turn a column based format into a record based format.

The file is now a mix of vertical columns and unordered horizontal records where you have to fully parse a row to figure out what type of information it might contain.

Tags should be the last resort for adding information to a file - used when we need to store a piece of information that is very peculiar yet still important enough to include. Note what they are called: tags. Yet tags are first-class citizens today: essential information that characterizes every single alignment has to be a tag (AS, MD, etc.) because there is no other way to add a new column.

As I said before, the real problem is the paradigm: we need to disentangle the order of a column from what the column contains. If we could do that, all the pieces would fall into place. But as long as people think in terms of "the first column will contain this, the second column will contain that...", we'll keep ending up with files whose utility diminishes greatly over time.

It is similar to how databases work - you don't write an SQL query against column 1 and column 2 just because you know what is in column 1. The position of a column should be immaterial to the actions that we perform.

written 11 days ago by Istvan Albert

"the first column will contain this, the second columns will contain that..."

Isn't a definition like that required to codify a format though?

written 11 days ago by genomax

The reason that we have so many formats is that scientists confuse the "information" that a file contains with the order of its columns:

start    end 
100      200

and

 end     start
 200     100

are identical from the point of view of the useful information that they contain. But somehow, in bioinformatics, it is "impossible" to even think of these two files as identical. Hence the paradigm shift that is needed.

Encoding external information into the order of the columns is like saying that the same file will mean something else when it is in a different folder. It kind of works - until it all breaks down.

written 11 days ago by Istvan Albert

While I agree with your point, during the analytical process one needs to be able to examine a file/dataset to make sure that things are on track. Having ordered data (at least during manual verification) makes this simple for mortals (like me). As an end user, I don't care how the information is internally stored as long as it is done right.

written 10 days ago by genomax

Having ordered data is also nice for interoperability with normal unix tools, since you don't need to tweak your awk commands after looking at headers.

written 10 days ago by Devon Ryan

That's why bioawk exists - to do just this. Heng has actually had all the right ideas all along: bioawk and tabix are the tools we should build bioinformatics around - not new formats.

written 10 days ago by Istvan Albert

Why would it be simpler to validate a file format where you have to know beforehand that column 2 is start versus a tabular file where the header tells you that you have a column named start?

What is simpler to communicate to a "mere mortal": that their file is missing the column named transcript_id, or that they need a correct GTF file format?

You can still validate and enforce formats to be GTF and BED and GFF if they have the right columns, but now you have the advantage that you could store other columns.

You could have a file that is BED and GFF at the same time. Once you have genes.bed.gff, all of a sudden we don't need to convert, and we don't need to keep the same information in two different files.

written 10 days ago by Istvan Albert

I was referring to visual examination of the contents (where it is easier to follow left-to-right/top-to-bottom order) and not validating file format (which would be hard to do visually).

A side note. While this order may work for many, it must be equally foreign to those who use languages that are written right-to-left.

Would the order-free columns have a key at the beginning of the file that tells where each data column is located, so the headers don't need to be parsed?

written 10 days ago by genomax

yeah ! there's only RDF ! let's use RDF to encode BAM !! :-)


written 10 days ago by Pierre Lindenbaum

I guess I just don't see too much of a problem with record-based formats. If you need to efficiently index columns, you'll need a database. For most unix tools you're relying on readline and so any operation on columns has to read the entire row anyway. With a record-based format you're at least validating the content of each row/column as you iterate through the file, and I gather this is your main issue - that parsing a record takes more time than splitting a line on fixed delimiters. For most of my use cases it's not the record parsing efficiency that needs optimizing.

written 10 days ago by Matt Shirley

I can understand why you say yours is a paradigm shift: optional columns unify multiple formats, but optional tags don't. If you had implemented this 20 years ago and a toolbox working with it as efficiently as other unix tools, your proposal might have greatly influenced bio format design – I can't really predict. If we look at it today, unfortunately, it is not as appealing. As I said above, as an alignment format, it is not really better than SAM. When you go down to binary and implementation, your proposal will be less efficient than BAM/htslib, both speed and memory wise, and will probably have less convenient C APIs (e.g. you can't access data directly via a plain C variable). These are the trade-offs you pay for flexibility. That said, I am actually interested in trying some of your ideas, but won't be soon.

written 10 days ago by lh3

As far as I know, CRAM can handle longer CIGARs. It's not the first time I've read about this issue... see for example this thread: Splitting SAM/BAM alignments

written 13 days ago by WouterDeCoster

Yes, indeed it does.

efetch -id NZ_AP014854 -format fasta -db nuccore > NZ_AP014854.fa
samtools view -O cram -T NZ_AP014854.fa http://data.biostarhandbook.com/strange/cigarbreak.sam > output.cram

If there ever was a reason to switch to CRAM, this is it.

written 13 days ago by Istvan Albert

weird and semi useless flag system

Amen to that.

written 13 days ago by genomax

Its greatest flaw, beyond all that, is that it is both confusing and plain wrong. The forward strand should not be thought of as the opposite of reverse, first-in-pair is not the opposite of second-in-pair, and so on. This generates endless confusion - an immense waste of effort and money that is omnipresent yet hard to capture, demonstrate, and quantify.

written 13 days ago by Istvan Albert
Philipp Bayer (Australia/Perth/UWA) wrote, 13 days ago:

SOAPaligner hasn't been updated since 2011 and still replaces N by G in alignments: SOAPaligner 2.21 - does it replace all Ns by Gs in reads?
But then again, I don't think people still use this aligner.

OrthoMCL seems to have a bug when calculating paralogs where it keeps self-hits - PorthoMCL (reimplementation without MySQL) has a fix but also a flag to keep that bug for 'better' reproducibility: https://github.com/etabari/PorthoMCL/wiki/Manual#step-5-finding-best-hits
Description of the bug: https://github.com/etabari/PorthoMCL/wiki/Comparison#orthomcl-bug

written 13 days ago by Philipp Bayer
microfuge wrote, 12 days ago:

The use of GFF/GTF by so many tools. Although it is an efficient and hierarchical format, it is still a nightmare. BED12 is so much easier. I never managed to cleanly convert GFF/GTF -> BED12.

written 12 days ago by microfuge

You can't store a full GTF worth of info in bed12, the entries are always transcript-centric.

written 12 days ago by Devon Ryan

The main problem with the BED format is that you can only store a very limited and predetermined type of information.

written 12 days ago by Istvan Albert
h.mon (Brazil) wrote, 12 days ago:

A very annoying (and wasteful, really) feature of several of the public databases of raw reads is that uploaders are not required to provide the phenotype associated with the reads.

written 12 days ago by h.mon

Having uploaded data to the SRA, I never had any phenotypes for that data in the first place (but this is, in my opinion, one of the biggest problems in bioinformatics - mountains of genomic data, hills of phenotype data).

written 12 days ago by Philipp Bayer

Could not agree more! I would not even call it hills...

written 11 days ago by agolicz
dariober (Glasgow - UK) wrote, 12 days ago:

This may be a question in itself but maybe it fits well in this thread...

What's the purpose of the SRA format? Every question about it seems to be about converting from sra to fastq which makes me wonder whether the sra format was at all necessary.

(Then of course, in hindsight it's easy to be critical and say this should have been done better)

written 12 days ago by dariober

What's the purpose of the SRA format?

To annoy the rest of the world, seemingly.

written 12 days ago by Devon Ryan

The purpose of SRA format is to have a column-oriented datastore where each column can have a separate datatype-specific compression applied. Sequences, sequence names, quality scores, alignment information... all get compressed more efficiently than just gzipping all the data. Rows in SRA are individually addressable, so you can get a specific subset of the data without reading the whole file. The reference assembly is also linked to the data, so any operations that require comparison to the reference, e.g CRAM-like compression or calculating SAM tags, will be done against the correct genome.
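The "individually addressable rows" idea can be sketched with a toy offset index (the record content here is invented, and real SRA is far more elaborate): build a table of byte offsets once, then seek straight to any record without scanning the preceding ones.

```python
import io

# Toy record store: 1000 variable-length records in one blob.
records = [f"read{i}\tACGT{i % 4}\n" for i in range(1000)]

blob = io.BytesIO()
offsets = []                 # byte offset of each record -> the "index"
for rec in records:
    offsets.append(blob.tell())
    blob.write(rec.encode())

def fetch(n):
    """Seek directly to record n using the offset index."""
    blob.seek(offsets[n])
    if n + 1 < len(offsets):
        return blob.read(offsets[n + 1] - offsets[n]).decode()
    return blob.read().decode()   # last record: read to end

print(fetch(500))   # 'read500\tACGT0\n' without touching records 0..499
```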

@genomax is right that it's for NCBI's internal use, and I totally agree that they should have provided an easier way to dump FASTQ files without requiring their client library.

written 12 days ago by Matt Shirley

It perhaps made sense in how the data is handled internally at NCBI. They should have made it easy to recover fastq files transparently though. Fortunately EBI-ENA provides that functionality for most SRA accessions.

If I recall right, implementation of SRA database was done by external contractors. Rest you can imagine.

Wikipedia has the following quote:

Internally the SRA relies on the NCBI SRA Toolkit, used at all three INSDC member databases, to provide flexible data compression, API access and conversion to other formats such as FASTQ

written 12 days ago by genomax
Powered by Biostar version 2.3.0