We ran the PhraseWriter decoder described in the previous section on a corpus of sentences, simulating various touch input behaviors and levels of abbreviation. The purpose of these studies was to understand the statistical characteristics of PhraseWriter as an instance of a C-PAK decoder, and to establish its basic limits in motor action and error rate reduction relative to a baseline technique. We ran three simulations, each focused on a particular boundary condition of input behavior, with the first two corresponding to two types of typing habits. These studies highlight the PhraseWriter decoder’s maximum benefits under these boundary conditions. The following section examines the extent to which these benefits were realized in its first hour of use.
The simulations were run on a sentence set sampled from the Enron email corpus [22]. The corpus consists of real-world human communication and can be used with minimal privacy concerns. We randomly sampled \(10 \,\%\) of the email bodies, segmented each body into sentences at punctuation characters surrounded by at least one space, and kept sentences containing 2–10 words. All text was lowercased. This yielded a corpus of 15,087 sentences containing 86,411 words (7,737 unique). Figure 6 illustrates the structure and process of our simulation studies.
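The preprocessing steps above can be sketched as follows. The segmentation rule (splitting at sentence-final punctuation flanked by whitespace) and the function name `sample_sentences` are our reading of the description, not the study's released code:

```python
import re

def sample_sentences(email_bodies, min_words=2, max_words=10):
    """Segment email bodies into sentences at ., !, or ? adjacent to
    whitespace, lowercase them, and keep 2-10 word sentences."""
    sentences = []
    for body in email_bodies:
        # Split at punctuation preceded or followed by a space.
        parts = re.split(r'(?:(?<=\s)[.!?]|[.!?](?=\s))', body)
        for part in parts:
            words = part.lower().split()
            if min_words <= len(words) <= max_words:
                sentences.append(" ".join(words))
    return sentences
```

Applied to a toy body such as `"Looks good to me . Be right there !"`, this keeps both sentences while dropping fragments outside the 2–10 word range.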
5.3 Simulation I: Motor Action Savings with Error-Free Partial Input
To discover the upper bound of the keystroke savings of PhraseWriter as a C-PAK decoder, we simulated an error-free partial input condition to find the minimum number of keystrokes required to correctly decode each sentence with each decoder. This reflects a typing strategy of accurately tapping each key (i.e., within its key boundary) and selecting one of the top three suggestions whenever it matches the user’s intention. The user is assumed to evaluate the suggested candidates as frequently as needed.
For the baseline, it is straightforward to find the minimum number of keystrokes required: each letter is progressively typed within its boundary. Whenever one of the top three suggestions matches the intended word, it is selected.
For the C-PAK condition, we need to find the minimum set of prefix combinations that can produce the intended phrase, for which the number of possible combinations compounds exponentially with the number of words in the phrase. To do this we executed an exhaustive search to find the shortest partial input for the sentences in the corpus: for each sentence, we started with the initials of each word and progressively enumerated all possible prefix combinations until the first correct prediction was found in the top three suggestions. We cached the intermediate results and used heuristic pruning to accelerate the prefix enumeration and verification process.
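The enumeration can be sketched as below. The `predicts` hook and the toy phrase list are hypothetical stand-ins for the actual PhraseWriter decoder, and the caching and pruning heuristics mentioned above are omitted for clarity:

```python
from itertools import product

def shortest_prefix_input(sentence, predicts, top_k=3):
    """Enumerate prefix combinations, shortest total length first, and
    return the first combination for which the decoder ranks `sentence`
    within its top-k suggestions. `predicts` maps a tuple of prefixes
    to a ranked list of candidate phrases."""
    words = sentence.split()
    combos = sorted(product(*[range(1, len(w) + 1) for w in words]), key=sum)
    for lengths in combos:
        prefixes = tuple(w[:n] for w, n in zip(words, lengths))
        if sentence in predicts(prefixes)[:top_k]:
            return prefixes
    return None

# Toy stand-in decoder over a three-phrase "vocabulary".
PHRASES = ["when would you call", "when will you come", "we want your car"]

def toy_predict(prefixes):
    return [s for s in PHRASES
            if len(s.split()) == len(prefixes)
            and all(w.startswith(p) for w, p in zip(s.split(), prefixes))]
```

With the top-3 rule the initials `("w", "w", "y", "c")` suffice for "when will you come"; with only the top suggestion accepted, one extra letter is needed to disambiguate it from "when would you call".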
Sometimes splitting a phrase into smaller commit units can further save keystrokes. For example, for the phrase “when will you come” the partial input “w w y c” has too many legitimate suggestions to be presented (e.g., “when would/will you come/call”); however, if the input is divided into “when will” with “w w” and “you come” with “y c”, the phrase can be entered successfully with just its initials. This does not affect the number of keystrokes required to enter the text (as taps on the spacebar have simply been substituted with taps on suggestions), but it requires the user to have a certain commit strategy.
We used a greedy algorithm to find these strategies and approximate the smallest number of suggestion taps needed for a particular set of prefix combinations. The algorithm tried to find the longest sequence of prefixes that could correctly complete part of the corresponding intended text, and then repeated this search until the entire phrase had been completed. This ensured that we could find the near-optimal suggestion-tap savings, while also achieving the best keystroke savings. However, users cannot be expected to know the optimal set of prefix combinations and commit units for all sentences. In practice, they may start to learn optimal prefixes for the most common phrases (e.g., “looks good to me” or “be right there”) and expand from there as their expertise develops (discussed later).
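A minimal sketch of this greedy commit-splitting, with a hypothetical `decodes` hook standing in for the real decoder:

```python
def greedy_commits(prefixes, words, decodes, top_k=3):
    """Greedily split a prefix sequence into commit units: at each
    position, commit the longest remaining run of prefixes that the
    decoder can complete to the matching span of intended words.
    `decodes` maps a prefix run to a ranked candidate list."""
    commits, i = [], 0
    while i < len(prefixes):
        for j in range(len(prefixes), i, -1):  # try longest span first
            target = " ".join(words[i:j])
            if target in decodes(prefixes[i:j])[:top_k]:
                commits.append(target)
                i = j
                break
        else:
            return None  # no decodable span from position i
    return commits

# Toy decoder that only knows two-word units, mimicking the
# "when will you come" example above.
UNITS = {
    ("w", "w"): ["when will", "we want", "when would"],
    ("y", "c"): ["you come", "your car"],
}

def toy_decode(run):
    return UNITS.get(tuple(run), [])
```

On the example above, the greedy search fails on the full four-prefix run and falls back to two two-word commits, matching the splitting strategy described in the text.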
5.3.1 Results.
The best keystroke saving rate of the baseline decoder was \(46.3 \,\%\). In comparison, the best keystroke saving rate of the PhraseWriter decoder was \(49.4 \,\%\), a \(6.7 \,\%\) relative improvement over the baseline.
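As a quick sanity check, the \(6.7 \,\%\) figure is the relative improvement between the two keystroke saving rates:

```python
baseline_ksr = 0.463      # baseline best keystroke saving rate
phrasewriter_ksr = 0.494  # PhraseWriter best keystroke saving rate

# Relative improvement: (0.494 - 0.463) / 0.463 ≈ 0.067, i.e. 6.7%
relative_improvement = (phrasewriter_ksr - baseline_ksr) / baseline_ksr
```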
The suggestion-tap saving rate (STSR) of the PhraseWriter decoder was \(52.6 \,\%\), with an average of \(2.39\) words entered per commit (rather than only 1 with the baseline decoder). When only considering sentences with at least four words, the suggestion-tap saving rate is raised to \(58.9 \,\%\), with an average of \(2.70\) words entered per commit.
5.3.2 Case Analysis.
The following example shows how the two decoders saved keystrokes and suggestion taps differently. The text in black represents the characters actually typed (the literal input), and the brackets indicate one commit via a suggestion tap. To enter the sentence “I’ll be at your place on Thursday to see her”, the input to the baseline decoder that maximizes keystroke savings is:
[I’ ll ] [b e ] [a t ] [y our ] [p lace ] [o n ] [T hursday ] [t o ] [s ee ] [h er ]
That is, entering the first one or two characters of each word and choosing the correct suggestion. In contrast, the input to the PhraseWriter decoder that maximizes keystroke savings is:
[I ’ll b e a t y our p lace o n T hursday t o s ee ] [h er ]
That is, entering the first character of the first nine words before choosing a completion, and then entering the first letter of the final word and choosing a completion.
In this ideal situation, both PhraseWriter and the baseline decoder achieve substantial keystroke savings (PhraseWriter saves one more keystroke). However, the cost of evaluating and selecting suggestions differs. The PhraseWriter decoder can complete the sentence with only two commits, and in the first commit it successfully completes nine words from only their initials. Conversely, with the baseline decoder users have to choose from three candidates and make an explicit selection for each word. This suggests that the keystroke saving rate alone is not sufficient to judge potential motor cost savings, and that PhraseWriter can potentially reduce the cost of switching between text entry and suggestion selection.
Overall, Simulation I shows that in error-free conditions, PhraseWriter can increase keystroke savings from the baseline’s \(46.3 \,\%\) to \(49.4 \,\%\), and provides a suggestion-tap saving rate of \(52.6 \,\%\), using exactly the same language model.
5.4 Simulation II: Error Correction with Noisy and Complete Input
While Simulation I showed the maximum keystroke and selection-tap savings, this simulation measures PhraseWriter’s correction ability under a different boundary condition: all characters are entered, so the language model’s power is focused entirely on error correction. In contrast to conventional word-level touch keyboards that decode one word at a time, PhraseWriter can take input at the phrase level. Intuitively, the greater bidirectional phrase-level context could correct more errors from noisy touchscreen typing, given the additional context from words across the phrase. In a very different simulation set-up, Vertanen and colleagues [45, 48] have shown that their (complete) phrase-level decoding could reduce CER from \(2.3 \,\%\) to \(1.8 \,\%\). The current simulation studies the amount of error reduction the PhraseWriter decoder yields compared to a single-word forward-decoding baseline. Note that this baseline still used the preceding words up to the current word as context in decoding.
We made this measurement by simulating noisy (or “sloppy”), but complete, unabbreviated input. The simulated strategy is also a common one: typing quickly and imprecisely, paying minimal attention to suggestions, and relying on the correction capabilities of the decoder. Specifically, we simulated taps on the keyboard driven by a noisy input model such that they may fall outside the target key boundary.
For the baseline, the top suggestion was accepted after each word (i.e., on the space between words). For PhraseWriter, the top suggestion was accepted after the entire sentence was typed. In both cases, the entire sentence was typed without abbreviation.
The noisy input was generated from a Gaussian model with the same distribution parameters as Fowler et al. [16] to reflect human-like typing behavior with a \(9.8 \,\%\) error rate, as studied in Azenkot and Zhai [4]. We did not simulate insertion or deletion errors.
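A minimal sketch of such a noisy tap model follows; the key layout and sigma values here are illustrative placeholders, not the actual keyboard geometry or the exact parameters from Fowler et al.:

```python
import random

# Illustrative key centers in key-width units (one keyboard row).
KEYS = {"q": (0.0, 0.0), "w": (1.0, 0.0), "e": (2.0, 0.0)}

def noisy_tap(key, sigma_x=0.3, sigma_y=0.3, rng=random):
    """Sample a simulated tap around a key's center from a 2D Gaussian,
    so taps may fall outside the intended key's boundary."""
    cx, cy = KEYS[key]
    return rng.gauss(cx, sigma_x), rng.gauss(cy, sigma_y)

def nearest_key(point):
    """Resolve a tap to the closest key center (a simplification of
    rectangular key-boundary hit testing)."""
    px, py = point
    return min(KEYS, key=lambda k: (KEYS[k][0] - px) ** 2 + (KEYS[k][1] - py) ** 2)
```

With the sigmas set to zero the tap lands exactly on the intended key; increasing them raises the character-level error rate, which can be tuned toward the reported \(9.8\,\%\).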
5.4.1 Results.
The CER of the baseline decoder was \(1.76 \,\%\), versus \(1.53 \,\%\) for PhraseWriter. Although the baseline result was already very strong in comparison with prior studies [16], the PhraseWriter decoder still decreased the CER by \(13.1 \,\%\). The WER of the baseline decoder was \(4.08 \,\%\), versus \(3.48 \,\%\) for PhraseWriter (a decrease of \(14.7 \,\%\)).
As another reference, the raw error rate without correction (i.e., the noisy input generated using the Gaussian model) would be \(9.63 \,\%\) CER and \(39.10 \,\%\) WER.
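CER and WER are conventionally computed from the Levenshtein edit distance over characters and words, respectively. A self-contained sketch (our implementation, not the paper's evaluation code):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, via the standard
    single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: word edits per reference word."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)
```

For instance, the uncorrected "mafch" from the case analysis below is one substitution away from "march", giving a CER of \(1/5 = 20\,\%\) on that word.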
5.4.2 Case Analysis.
The following examples show a target sentence, an example of the characters typed (given the noisy input model), and the output from the baseline and PhraseWriter decoders. The errors in the noisy input and errors in the decoded outputs are underlined.
This example illustrates a case where the baseline erred when the PhraseWriter decoder did not: a spatial error at the beginning of the sentence that happened to convert the intended input to another common word (“is” \(\rightarrow\) “us”). The incorrect inference on the first word further increased the difficulty of decoding the second word correctly using the baseline. However, the PhraseWriter decoder performed better due to the re-ranking of candidates based on the subsequent context.
The following example shows a case where both decoders made errors:
In this example, the PhraseWriter decoder corrected “bt” into “by” correctly when the baseline decoder mistakenly corrected it into “but”. Both “but” and “by” are reasonable suggestions given the previous context, but with the benefit of the subsequent context it is obvious that “by” makes more sense than “but”. However, both decoders failed to correct “mafch” to “march”. This example shows that although the future context carries sufficient information, the current language model was not powerful enough to always make semantically meaningful suggestions.
Overall, Simulation II shows that the bidirectional sentence-level decoding by PhraseWriter can improve WER over the word-level decoding baseline by \(14.7 \,\%\), as illustrated in the specific case analyses, even though both are powered by the same n-gram language model.
5.5 Simulation III: First Letter Noisy Input vs. Phrase Commonality
This simulation measures PhraseWriter’s phrase completion ability under another boundary condition: for each word in a phrase, only the first character is entered (e.g., “l g t m” \(\rightarrow\) “looks good to me”). Given an n-gram language model’s pruning and back-off algorithms [e.g., 21], we expect that common phrases can be more accurately completed from their prefixes. We ran this initial-character-only prefix input against sentences with different levels of commonality. Each sentence was 10 words long to ensure a reasonable challenge for the decoding task given a 5-gram language model.
To establish a list of common phrases independently, we used the number of individuals (\(n=90\)) in the corpus who used a phrase as a measure of its commonality, rather than referring to the language score calculated by the language model. All sequences of between 2 and 10 consecutive words appearing in the corpus were considered phrases. We selected phrases that were used by at least 5, 20, 50, and 80 people to represent four levels of phrase commonality. Table 2 shows the coverage of these phrase levels, i.e., the percentage of words that can form a common phrase with adjacent words.
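The commonality measure, counting how many distinct individuals used each 2-to-10-word sequence, can be sketched as follows (the function name and data layout are illustrative):

```python
from collections import defaultdict

def phrase_users(sentences_by_user, min_len=2, max_len=10):
    """Count the distinct users who produced each word n-gram of
    2-10 consecutive words, the commonality measure above."""
    users = defaultdict(set)
    for user, sentences in sentences_by_user.items():
        for sent in sentences:
            words = sent.split()
            for n in range(min_len, min(max_len, len(words)) + 1):
                for i in range(len(words) - n + 1):
                    users[" ".join(words[i:i + n])].add(user)
    return {phrase: len(u) for phrase, u in users.items()}
```

Thresholding the resulting counts at 5, 20, 50, and 80 then yields the four commonality levels.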
Although the number of unique common phrases is not high, the high coverage reflects their frequent use. If the PhraseWriter decoder can complete these phrases, users could achieve considerable savings by focusing on a limited set of common phrases.
We simulated the input of each phrase using the noisy input model described for Simulation II. Only the first character of each word was input (including for the baseline decoder) and the top suggestion was selected after each step.
5.5.1 Results.
We calculated the keystroke savings rate and CER on different subsets of the corpus: (1) all text, (2) common phrases (four levels of commonality), and (3) uncommon expressions (text not appearing in any common phrase). Each subset is tied to a different commonality, and introduces a different view of the results.
The PhraseWriter decoder consistently outperformed the baseline in error rate on all subsets of the dataset (Table 3), and the benefits of PhraseWriter were more salient for common phrases. Table 3 also shows that the keystroke savings rate with a fixed, one-letter prefix is smaller for common phrases than for the other subsets, which indicates that common phrases often consist of relatively short words, and that phrase groups with higher commonality contain even shorter words in general.
5.5.2 Case Analysis.
Here, we show how the decoding accuracy varies with the commonality of the expression with two examples. All errors are underlined.
In this example, the target sentence contains many common expressions such as “I’ll be” (55 users), “at your” (60 users), “your place” (17 users), “on Thursday” (66 users), and “to see” (88 users). The PhraseWriter decoder almost completed the entire sentence correctly, while the baseline completely missed the target sentence. This example shows that longer context allows the PhraseWriter decoder to complete common expressions correctly with very little input—corresponding to the major improvement demonstrated in the statistical results.
In this example, the target sentence conveys a more specific meaning and contains fewer common phrases. Both the baseline and PhraseWriter decoders produce an incorrect prediction.
We also tested a variation of this example, using the first two letters of each word:
Although the predictions are still far from the target sentence, the PhraseWriter decoder was able to complete the common expression “is named after”. This suggests that users may need to adjust the length of their input based on the commonality of the expression.
Overall, this simulation study shows that PhraseWriter, as an instance of a C-PAK decoder, has a higher potential success rate with more common phrases.
5.6 Discussion
Using these computational experiments we can obtain a better understanding of the C-PAK style input method. The findings help us understand the upper bound performance that can be achieved by PhraseWriter (Simulations I and II), as well as a strategy implication for using PhraseWriter (Simulation III). To illustrate the implications of our findings from a more holistic perspective, we reflect on the five main task components that users undertake when using a predictive keyboard:
(1) The preprocessing and planning of the characters to be typed.
(2) The execution of typing actions to enter the characters.
(3) The evaluation of suggested completions.
(4) The selection or rejection of correct or incorrect completion suggestions, respectively.
(5) The cognitive task of estimating, learning, or recalling the abbreviated strings.
Simulation I showed that PhraseWriter can achieve more keystroke savings than word-level input, suggesting that it could reduce the frequency of task component 2 and the cost of 1—relaxing the requirement for spelling accuracy. PhraseWriter can also reduce the context switching between text entry and suggestion selection as suggested by the 52.6% STSR, which means that it could reduce the frequency of task component 4. However, each suggestion evaluated during task component 3 will be longer (phrase vs. word). Furthermore, task component 5 is a unique challenge to PhraseWriter (and to C-PAK style input in general).
Simulation II suggested that the PhraseWriter decoder can correct more errors than word-level input. The different benefits achieved in the first two simulations also suggest that different target users (frequent word-completion users and full-sentence typists) may find C-PAK style input helpful in different aspects of their text entry experience.
Finally, Simulation III showed that C-PAK style input may be especially effective for entering common phrases.
While the simulation studies show the improved upper-bound performance of PhraseWriter, as a practical implementation of C-PAK style input, over the traditional keyboard, both in keystroke savings and in error correction ability, the magnitude of these improvements is limited by the decoding technology. How much further these upper bounds can be raised with more advanced technologies, such as large neural network language models, is a topic for future research.