Leveraging Flow Order and Flow Signals
Posted by Nils Homer on Nov 8, 2011
A major strength of the Ion Torrent semiconductor sequencing system is that a SNP is not equivalent to an error, or more precisely single base substitutions are not equivalent to a single sequencing error. While the base calls that are provided are accurate, even greater levels of performance are available by taking into account flow orders and flow signals.
Homopolymers are observed one-at-a-time during sequencing, with the given homopolymer base determined by the flow order. For example, a flow order of ACGT would cycle through each base one a time. More intricate flow orders, like TACGTACGTCTGAGCATCGATCGATGTACAGC, can yield significant advantages to overcome carry-forward (signals from out-of-phase templates) and incomplete extension (templates that do not fully extend). Furthermore, flow orders also imply something about the likelihood of an error versus a variant (more to come below).
While base calls are easy to use and fit the expectation of the output of the system, flow signals are also given, namely the predicted amount of incorporation for each flow. The flow signals (or flowgram) with carry forward and incomplete extension removed is stored in an SFF file, along with the flow order and other useful information. A 5-base homopolymer in the base calls (FASTQ) can correspond to a flow signal between 4.5 and 5.5. If the flow signal is 4.5 versus 5.0, the "fifth" base could be lest trustworthy than the previous four. This may all be captured in base qualities, but consider a signal of 5.5 versus a signal of 5.0. The flow signal of 5.5 confirms strong evidence that there are at least five bases observed, but possibly six. The base quality scores directly map to a particular called base, and so base qualities cannot encapsulate the concept of missing or added bases. There would be no base quality for the "sixth" potential base, and so information is lost.
Consider the following: overcalls and undercalls do not change the number of observations in the flow sequence. Denote the flow order as F, the read/query as Q, and the reference/target as T. For example, for an inserted C, the flow sequence:
F: ACGTAC Q: 112002
cannot change into:
F: AGTAC Q: 12002
since the flow order would be altered; the flow order is always fixed prior to sequencing. But the first flow sequence can be changed into:
F: ACGTAC Q: 102002
since the flow order does not change, just the flow signals themselves (the second flow (C) changes its value from 1 to 0). While the latter two flow sequences imply the same base sequence (ACGGCC), they are fundamentally different in flow space.
Extra pieces of information are available to help distinguish between errors and variants. Examples:
Example #1: A single base mutation in a homopolymer stretch is not likely to be an error.
Q: TCTC | || T: TTTC
The above alignment implies the query and target differ by a single base. How easy is it to explain the query by overcalls and undercalls? Suppose the flow order here is TCTGC. The flow sequence for the query is 11101, while the flow sequence for the target is 31. What this says is that if the C was truly a T (an error in observation), the three Ts should have all been observed in the first flow, and that the ending C, should have been observed in the second flow. Well the ending C is observed in the fifth flow. Therefore, if the underlying sequence was truly TTTC instead of TCTC the system would have observed the sequence in different flows altogether! Lets look at a flow space alignment:
F: TCTGC Q: 11101 |??? T: 31
You can see the problem more clearly, the two sequences are out of phase compared to each other. This comes down to how the flow order constrains the observations. The first sequence cannot be logically transform into the other using only overcall/undercall edits according to the flow order, since it would then imply a totally different flow sequence, which disagrees with the observations. Observe that two base sequences of the same lengths may have vastly different flow sequence lengths.
Example #2: a single base mutation not likely to be an error.
Q: ATG | | T: ACG
Suppose the flow order is ACGTACGT. Lets go through the same exercise as in Example #1. The flow sequence implied by the query is 1001001, while the flow sequence implied by the target is 111. Clearly the two flow sequences are out of phase, and so a combination of overcalls/undercalls cannot explain this single base difference. It is far more likely that this alignment pattern represents a real variant as opposed sequencing error.
F: ACGTACG Q: 1001001 | ???? T: 111
Example #3: a single base mutation could be a combination of errors.
So far the two examples above show when flow order can lead to a decreased likelihood of error. Lets look at how flow order can lead to an increased likelihood of error.
Q: TAGT | || T: TCGT
Suppose the flow order is TACGT. Then the flow sequences are 11011 and 10111 for the query and target respectively. So far so good, the two flow sequences are of the same length. Lets look at them in flow space.
F: TACGT Q: 11011 | || T: 10111
Now the two flow sequences are in phase (same length, start/end on the same flows). But now Q can be transformed into T by simply reducing the flow signal by one for the second flow, and increasing the flow signal by one on the third flow: two observational errors. Now there are two hypotheses:
The query and target differ by one base (a mutation). The query and target are the same, but the query has two errors (no mutation).
Which hypothesis to prefer depends on many factors, including having multiple reads observing this mutation (but what if a low frequency mutation is being observed?), a priori error rates, and the likelihood of a mutation in this sample. Of course this is all handled by a variant caller that is hopefully aware of flow order.
Example #4: using flow signals to add some more color.
Example #3 shows a case where it is difficult to distinguish a combination of errors versus a variant. While base qualities may help, there would be no base quality for uncalled bases (i.e. the empty C in the third flow). This information is useful as seen above. But flow signals can help color the situation.
Looking back at example #3, suppose the flow signal call was 0.53 for the A and 0.49 for the C. This could dramatically increase the likelihood of error, since the overcall of an A is now much closer to zero than the integer base call one, and the undercall of a C is much closer to one than the integer base call of zero. When using a base call, the two flow sequences are two units of edit away (one for the A, one for the C). Coloring the example with flow signals, the two flow sequences are 1.04 edits away ((0.53-0) + (1-0.49)). A variant may still be called, but with much weaker evidence than originally thought given the flow signals.
On the other hand, suppose the flow signal call was 1.20 for the A and 0.01 for the C. The A signal is strong, while the C signal is almost non-existent. The edit distance would be 2.19 ((1.20-0) + (1-0.01), which would mean that the mutation hypothesis is strengthened with the use of flow signal information.
Example #5: changing an insertion to a mutation.
This pattern was used to detect a low frequency variant in a real-world sample.
Q: TCCCGT ||||-| T: TCCC-T
An inserted G could be inferred from this alignment. However, if the flow signal is 2.51 for the C and 1.01 for the G, an alternative alignment may be:
Q: TCCGT ||| | T: TCCCT
The flow signals give an alternative hypothesis, namely a single base mutation. Which hypothesis to prefer depends on observing this haplotype in other reads, the priors on observing the various types of variation, and the priors on this type of error.
Notwithstanding, these examples show single reads and single alignments. Suppose in the simplest case of resequencing a haploid genome, any single homopolymer would be observed by multiple reads. Using the mean flow signal (or weighted sum of flow signals) would yield a more accurate estimate of the homopolymer length than using the mean homopolymer length implied by the base call. Flow signals also have value when aggregated.
The examples also show that flow orders can help distinguish between errors and variants. The constraints a flow order makes not only depend on the underlying sequence context, but also where the sequence is observed within the flow order. This is why alternate flow orders (i.e. more complex flow orders) are potentially powered to detect variation. If multiple reads observing the same variant, it is likely the reads will observe the variant at a different position in the flow order. Therefore while one read may have ambiguity about the variant, another read in a different flow order context may absolutely constrain the set of possibilities. Hence the power of flow order.
Flow sequences (flow order and flow signals) contain evidence about the absence of incorporation, specifically in empty flows. Without the flow order, the position of empty flows is not known and information is lost. Common alignment formats like the SAM format cannot represent alignments in flow space, but do support storing flow order and flow signals fortunately. This allows the conversion of a SAM record into flow space. You can find a tool released in the 1.5.1 Torrent Suite called "SamToFlowSpace.jar", found in the AmpliSeqVariantCaller plugin.