BY-NC-ND 3.0 license Open Access Published by De Gruyter June 21, 2013

An Approach for Generating Pattern-Based Shorthand Using Speech-to-Text Conversion and Machine Learning

  • K. R. Abhinand and H. K. Anasuya Devi

Abstract

Rapid handwriting, popularly known as shorthand, involves writing symbols and abbreviations in lieu of common words or phrases. This method increases the speed of transcription and is primarily used to record oral dictation. Someone skilled in shorthand can write as fast as the dictation occurs, and the resulting patterns are later transcribed back into natural-language words. A new kind of rapid handwriting scheme, called Pattern-Based Shorthand, is proposed. Typing a word on a keyboard involves pressing a unique sequence of keys in a particular order. This sequence forms a pattern that defines the word, and that pattern serves as the shorthand for the word. Speech recognition is the identification, by a machine, of the words spoken by a speaker. The spoken words form speech input signals to a computer that is equipped to recognize them and take further action, such as converting them to text. From this text input, the system generates unique shorthand patterns. The system employs machine learning to improve its performance with experience: it builds a dictionary of mappings from words to patterns in such a way that access to existing patterns becomes faster over time. This forms a new knowledge representation schema that reduces redundancy in the storage of words and the length of the information content. In short, speech is converted into textual form and then reconstructed as Pattern-Based Shorthand.

1 Introduction

1.1 Rapid Handwriting

Rapid handwriting involves quick transcription of spoken words. The transcription is performed using either the natural script of a language such as English, or a special script such as shorthand. Shorthand is a system for rapid writing that uses symbols or abbreviations for letters, words, or phrases [10]. Various forms of shorthand exist in modern times, the most popular being Pitman and Gregg shorthand (Figures 1 and 2). Most shorthand schemes are phonetic, meaning that they are based on how a word sounds rather than how it is spelt. They exploit this characteristic by writing similar symbols for similar-sounding letters.

Figure 1 Pitman Shorthand.

Figure 2 Gregg Shorthand.

In both the Pitman and Gregg schemes, the thickness, length, and position of the strokes are all significant. Shorthand schemes allow transcription at around 120 words per minute (WPM) [6], whereas regular typing averages 33 WPM for transcription and 19 WPM for composition [4]. An average professional typist types at 50–80 WPM [1].

1.2 Pattern-Based Shorthand

For every word that can be typed on a computer keyboard, a definite typing pattern exists. A computer keyboard has a certain layout for its keys. Various keyboard layouts exist, such as QWERTY, DVORAK [3], and ATOMIK [5, 7]. For example, on the popular QWERTY layout (Figure 3), to type the word “TAN,” the finger goes to the first row and presses T, then goes to the second row and presses A, then goes to the third row and presses N.
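The “TAN” example can be traced programmatically. The following is a minimal sketch, assuming only the three letter rows of a standard QWERTY keyboard; the function name key_path and the 1-based (row, column) convention are illustrative choices, not part of the described system.

```python
# Letter rows of a standard QWERTY keyboard, top to bottom.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def key_path(word):
    """Return the 1-based (row, column) of each key pressed to type `word`."""
    path = []
    for letter in word.lower():
        for row_number, row in enumerate(QWERTY_ROWS, start=1):
            if letter in row:
                path.append((row_number, row.index(letter) + 1))
                break
        else:
            # Reached only when no row contains the character.
            raise ValueError(f"no QWERTY letter key for {letter!r}")
    return path


print(key_path("TAN"))  # [(1, 5), (2, 1), (3, 6)]
```

The row numbers 1, 2, 3 in the output match the first-row, second-row, third-row movement described above.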

Figure 3 QWERTY Keyboard Layout.

This path of key presses can be imagined as a continuous pattern consisting of vertices and line segments. The ATOMIK keyboard layout (Figure 4) provides a unique pattern for each English word. Using this layout, the shorthand scheme offers a constant line width, definite length, well-defined edges, and overall performance optimization [14] for the written pattern of each word. As ascertained during the testing of the system, the pattern for an English word in this layout cannot be confused with the pattern of any other English word.

Figure 4 ATOMIK Keyboard Layout [14].

Such a pattern can form the shorthand symbol for the word, and the same pattern is drawn by a person during transcription (Figure 5). In Figures 5–8, the word or phrase corresponding to each pattern is written below it for reference. The number indicates the keyboard row in which the pattern starts. Certain writing rules exist, such as writing a loop for repeated letters (as in “roof”; Figure 5), stopping the pattern after a particular degree of identification is obtained, and so on.

Figure 5 Patterns for Words.

Figure 6 Pattern for “Posit.”

Figure 7 Pattern for “Than.”

Figure 8 Pattern for “Week.”

1.3 Speech-to-Text Conversion

Speech recognition involves translation of human speech into written form. Speech recognition (or speech-to-text conversion) systems process the human voice as signals, map voice patterns to specific words and phrases, and output these as ASCII text. This text forms the input to the Pattern-Based Shorthand system, which converts it into the pattern based on machine learning techniques.

Speech recognition is implemented using Dragonfly [2] and Windows Speech Recognition [12]. Dragonfly is a speech recognition framework that offers a high-level object model and allows its users to easily write scripts, macros, and programs that use speech recognition [2]. Windows Speech Recognition supplies Dragonfly with the speech signals.

1.4 Dictionary-Based Machine Learning

Machine learning is the study of computer algorithms that improve automatically through experience [8]. Of the various machine learning techniques, dictionary-based machine learning is a simple form, and it is the one used in this system. A dictionary consists of a mapping from a key (K) to a value (V), as given in Eq. (1).

$$F : K \rightarrow V \tag{1}$$

Here, F assigns a certain meaning to the key K to provide a value V. To illustrate the process of this machine learning technique, consider a machine that must learn and provide the meaning of a word. The first time the program runs, it does not know the meaning, and hence, it has to be provided with the meaning by a human. Then, a dictionary is constructed consisting of the meaning of the word, and this dictionary is later accessed by the machine. This makes the machine learn the meaning of the word, which enables faster access. Indeed, access time is an important performance measure to indicate the efficiency of the learning process.

This forms a methodology for dictionary-based machine learning where a system learns and optimizes the knowledge representation based on information already learnt or mapped. This kind of knowledge representation provides the advantage of reduced space to store the mappings, reduced length of information content, and elimination of the need to store a complex grammar of the language. It also solves the problem of redundancy by avoiding recreation of the meaning of an entity by recursively using the meanings already present in its repository of knowledge and information.
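The word-meaning illustration above can be sketched in a few lines. This is a minimal sketch under stated assumptions: make_learner, lookup, and teacher are hypothetical names, and the "human" who supplies an unknown meaning is modelled as a plain callback.

```python
def make_learner(ask_human):
    """Return a lookup function that calls `ask_human` only for unseen keys."""
    dictionary = {}  # the K -> V mappings learned so far

    def lookup(key):
        if key not in dictionary:
            # First encounter: the value has to be supplied from outside.
            dictionary[key] = ask_human(key)
        # Every later encounter is a plain dictionary access.
        return dictionary[key]

    return lookup, dictionary


calls = []


def teacher(word):
    """Stand-in for the human; records how often it is consulted."""
    calls.append(word)
    return f"meaning of {word}"


lookup, learned = make_learner(teacher)
lookup("hard")
lookup("hard")
print(len(calls))  # 1
```

The second lookup is answered from the dictionary without consulting the callback, which is the faster access the text refers to.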

2 Methodology

2.1 Conversion from Speech to Text

The speech recognition system converts the spoken words into text, and this text is later converted to a shorthand pattern. This is a continuous process, and the text-to-pattern conversion follows the speech recognition. The speech recognition is performed using the Dragonfly Python module, which obtains its input through the Windows Speech Recognition engine. Windows Speech Recognition takes the speech from an input device, such as a microphone into which a user speaks, and provides these speech signals to Dragonfly. Dragonfly is used to convert this speech into text, which is written into a text file for further processing.

Dragonfly allows easy and intuitive definition of complex command grammars and greatly simplifies processing recognition results [2]. It works on the concept of mapping speech to text, and predefined rules must be created for these mappings. It keeps listening to speech until a pause is encountered, and the mapping rules are applied recursively. The following is such a rule:

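from dragonfly import Dictation, Grammar, MappingRule, Text

# The grammar object is not shown in the original listing; its name here is
# an arbitrary, assumed choice.
grammar = Grammar("shorthand")

recog_rule = MappingRule(
    name="recog",
    mapping={
        # Emit whatever was dictated, followed by a newline.
        "<text>": Text("%(text)s\n"),
    },
    extras=[
        # Bind the dictated words to the name "text" used above.
        Dictation("text"),
    ],
)

grammar.add_rule(recog_rule)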

Here, <text> stands for any words spoken to the system, the Text() action writes the spoken words out as text, “extras” lists the elements that the mapping refers to, and Dictation("text") makes “text” contain the spoken words. Such a rule creates a mapping from the sound to the corresponding text.

2.2 Conversion from Text to Shorthand Pattern

2.2.1 Input

The input source can be text-based, such as a text file, or speech-based, such as live speech or recordings. The speech recognition methodology was presented in the previous section. In this scenario, the input to the text-to-pattern conversion is a text file produced by the speech-to-text conversion.

The experiment covers two cases, and in each a text file forms the input to the conversion. In the first case, single words are spoken into the microphone at 5-s intervals; in the second, sentences are spoken at 5-s intervals. At each interval, the speech-to-text converter converts the previously spoken word or sentence into text and writes it into the text file. As soon as the text is written, the text-to-shorthand converter picks it up from the file and writes the pattern, as explained in Sections 2.2.3 and 2.2.4. The 5-s interval was found to be long enough for the correct words to be written into the text file, so that no errors occurred in either the speech-to-text or the text-to-pattern conversion.
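One way the converter could pick up newly written words is to remember how far into the text file it has already read. The following is a minimal sketch under that assumption; the helper read_new_words and the file name dictation.txt are hypothetical, and the paper does not state how the file is actually monitored.

```python
import os
import tempfile


def read_new_words(path, offset=0):
    """Return the words added to `path` after `offset`, plus the new offset."""
    with open(path, "r", encoding="utf-8") as handle:
        handle.seek(offset)  # skip the part that was already converted
        text = handle.read()
        return text.split(), handle.tell()


# Simulate two dictation intervals with a temporary file.
path = os.path.join(tempfile.mkdtemp(), "dictation.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write("food\n")
first, offset = read_new_words(path)

with open(path, "a", encoding="utf-8") as handle:
    handle.write("water\n")
second, offset = read_new_words(path, offset)

print(first, second)  # ['food'] ['water']
```

A pause between reads, such as the 5-s interval described above, reduces the chance of reading a word that is only partly written.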

2.2.2 Dictionary

The dictionary is the central repository for patterns. It is a mapping between a word and the keyboard pattern. The pattern itself is stored as a vector of key positions on a software keyboard. In the ATOMIK keyboard, each letter is assigned an absolute position. For example, in the above diagram of the ATOMIK layout, b is assigned 1; k is assigned 2; and so on (Figure 4). Punctuation and other action symbols can also be assigned such numbers. The complete set of these mappings is shown below.

b: 1, k: 2, d: 3, g: 4, c: 5, a: 6, n: 7, i: 8, m: 9, q: 10, f: 11, l: 12, e: 13, <space>: 14, s: 15, y: 16, x: 17, j: 18, h: 19, t: 20, o: 21, p: 22, v: 23, r: 24, u: 25, w: 26, z: 27.
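The mapping above translates directly into a lookup table. The sketch below assumes the key order exactly as listed, with a literal space character standing in for <space>; the names KEY_ORDER, POSITION, and word_to_vector are illustrative, and lower-casing the input is an added convenience.

```python
# Keys in the order of their assigned numbers 1..27; position 14 is the space.
KEY_ORDER = "bkdgcanimqfle syxjhtopvruwz"
POSITION = {key: number for number, key in enumerate(KEY_ORDER, start=1)}


def word_to_vector(word):
    """Map each letter of `word` to its key number on the layout."""
    return [POSITION[letter] for letter in word.lower()]


print(word_to_vector("HARD"))    # [19, 6, 24, 3]
print(word_to_vector("HARDLY"))  # [19, 6, 24, 3, 12, 16]
```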

These numbers are mapped to pixel positions in the software keyboard, which are drawn as vertices in the pattern path. Furthermore, the dictionary is optimized for faster reading and accessing. When the system reads the first word from the text file, it maps each letter in the word to a number from the above mapping. For example, if the first word read is the word HARD, the following is stored as a (key: value) mapping:

HARD: ('', [19, 6, 24, 3]).

Here, the value is a two-tuple consisting of a pointer as the first element and the actual mapping as the second. The significance of the pointer becomes clear when the system encounters a word that begins with a string already present in the dictionary. In this paper, a substring means the initial part of a string (word), that is, a prefix. When any string is read, the dictionary is scanned to find its longest such substring. The system then takes a pointer to this substring and stores it as the first element of the value for the current string. The second element is the numeric mapping for the remaining letters. For example, HARD is a substring of the word HARDLY. Hence, the mapping for HARDLY is

HARDLY: (*, [12, 16]).

Here, * is a pointer to the word HARD. This scheme achieves space efficiency: instead of storing the pattern for the word HARD again, a pointer to it is stored. When the pattern is actually required, the pointer is de-referenced to get the pattern, and a recursive method extracts the word’s full pattern. To write this pattern, there must exist a mapping from each letter to a position on the ATOMIK keyboard (Figure 4). Thus, each letter is mapped to an integer corresponding to its position.

This dictionary forms an integral part of the understanding of the natural language input. However, the dictionary is blind to the meaning of the words: it attaches no meaning to them, and the words are not arranged according to their part of speech. As seen in the above mapping, although HARDLY contains a reference to the word HARD, the dictionary itself gives no indication that HARDLY is an adverb and HARD is an adjective. It simply contains mappings to pixel positions.
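The recursive de-referencing can be sketched as follows. This is a sketch under stated assumptions: the pointer is represented by the key of the stored prefix word, None marks an entry that holds its complete vector, and the name full_pattern is hypothetical.

```python
# Each value is (pointer, tail). `pointer` is the key of a stored prefix word,
# or None when the entry holds its complete vector.
dictionary = {
    "hard": (None, [19, 6, 24, 3]),
    "hardly": ("hard", [12, 16]),
}


def full_pattern(word, dictionary):
    """Follow pointers recursively and return the complete key vector."""
    pointer, tail = dictionary[word]
    if pointer is None:
        return list(tail)
    return full_pattern(pointer, dictionary) + tail


print(full_pattern("hardly", dictionary))  # [19, 6, 24, 3, 12, 16]
```

Only the two trailing numbers are stored for HARDLY; the first four are recovered from the entry for HARD.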

2.2.3 Control Logic and Method

When a word is read and its pattern is present in the dictionary, the pattern is obtained and drawn. If the pattern is not present, the word is added to the dictionary. If a substring of the word is already present, a pointer to that substring is stored as the first element of the value for this word’s key, and the remaining part is stored as a new vector. If an extension of the current word is already present, the new, shorter word is added and the entry for the longer word is modified to point to it. For example, if HARDLY is present and the parser encounters HARD, HARD is added and HARDLY is modified.

Step (1) HARDLY: ('', [19, 6, 24, 3, 12, 16]).

Step (2) HARD: ('', [19, 6, 24, 3]).

Step (3) HARDLY: (*, [12, 16]), ‘*’ → HARD.
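The three steps can be reproduced with a small insertion routine. The following is a minimal sketch, not the authors' implementation: add_word is a hypothetical name, pointers are represented by the key of the prefix word (None when there is none), and the key numbering is taken from Section 2.2.2.

```python
KEY_ORDER = "bkdgcanimqfle syxjhtopvruwz"
POSITION = {key: number for number, key in enumerate(KEY_ORDER, start=1)}


def add_word(word, dictionary):
    """Insert `word`, sharing vectors with stored prefixes and extensions."""
    if word in dictionary:
        return
    # Longest stored word that `word` starts with, if there is one.
    prefix = max((w for w in dictionary if word.startswith(w)),
                 key=len, default=None)
    start = len(prefix) if prefix else 0
    dictionary[word] = (prefix, [POSITION[c] for c in word[start:]])
    # Re-point stored extensions of `word` when it is a longer prefix than
    # the one they currently reference.
    for other, (pointer, _) in list(dictionary.items()):
        if other != word and other.startswith(word):
            if pointer is None or len(pointer) < len(word):
                tail = [POSITION[c] for c in other[len(word):]]
                dictionary[other] = (word, tail)


dictionary = {}
add_word("hardly", dictionary)  # Step (1)
add_word("hard", dictionary)    # Steps (2) and (3)
print(dictionary["hard"])    # (None, [19, 6, 24, 3])
print(dictionary["hardly"])  # ('hard', [12, 16])
```

Adding the words in the opposite order gives the same final dictionary, because the prefix search handles the case where HARD is already stored.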

2.2.4 Output

The pattern is drawn using the Pygame module [9] of the Python programming language [13]. Pygame aids in drawing lines onto an output window given the coordinates, the type of line, the start and end points, and so on. A keyboard layout identical to Figure 4 is created on a window using this API. Each hexagon represents one letter on the keyboard, and the center of the hexagon is the point where a vertex of the pattern is drawn. These center points are assigned numbers starting from 1, as shown in Section 2.2.2. The drawing module is given a set of integers obtained from the dictionary (Section 2.2.3). It maps the first integer to the corresponding center point on the keyboard layout, gets its pixel position, and draws a line from there to the next point in the integer set, and so on. For each word, a pattern is drawn on the keyboard and then erased. It is then shown permanently on a bigger output screen while the next word is drawn on the layout. The resulting pattern for a word can look like the one shown in Figure 5.
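The drawing step can be sketched as a lookup from key numbers to pixel centers followed by a single polyline call. In this sketch the pixel coordinates are placeholders chosen for illustration, not the geometry of Figure 4, and the function names are hypothetical; pygame.draw.lines is the only Pygame call used.

```python
def pattern_points(vector, centers):
    """Map key numbers to the pixel centers that the pattern passes through."""
    return [centers[number] for number in vector]


def draw_pattern(surface, points, color=(0, 0, 0), width=2):
    """Draw the pattern as an open polyline on a Pygame surface."""
    import pygame  # imported here so pattern_points works without Pygame

    if len(points) >= 2:  # a polyline needs at least two vertices
        pygame.draw.lines(surface, color, False, points, width)


# Placeholder pixel centers for the four keys of "hard" (assumed values).
centers = {19: (90, 200), 6: (90, 80), 24: (90, 260), 3: (210, 20)}
points = pattern_points([19, 6, 24, 3], centers)
print(points)  # [(90, 200), (90, 80), (90, 260), (210, 20)]
```

With Pygame installed, a surface created by pygame.display.set_mode((400, 320)) can be passed to draw_pattern together with these points, followed by pygame.display.flip() to show the result.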

3 Results

An experiment was conducted in which 50 medium-length English phrases were given as speech input to the Pattern-Based Shorthand system, and the corresponding shorthand patterns were obtained. The words forming the input were spoken into a microphone connected to the computer. The speech recognition system obtained the voice signals through the NaturallySpeaking engine and the Dragonfly API, and converted them into text. This text was read continuously, and the patterns were written on a drawing area provided by Pygame.

The input was given in two broad categories. In the first, random words were spoken into the microphone at 5-s intervals. The speech recognition engine converted each word into an entry in a text file, which the pattern-generating system immediately converted to a pattern and output. Twenty such words are given below, and the patterns for three of them are shown in Figures 6–8:

Food, water, drink, book, plate, speak, write, sound, shirt, matter, posit, paper, mouse, week, harry, modify, storage, arrange, than, painting.

The second category of input consisted of 20 medium-length sentences, with a 5-s gap between sentences. Eight example sentences are given below, and the patterns for some of them are shown in Figures 5, 9, and 10.

  1. Please call your friend and talk to him.

  2. Speak your name clearly.

  3. Live long and prosper.

  4. The great whale jumps out of the water.

  5. Keep calm and carry on.

  6. The cat on a hot tin roof (Figure 5).

  7. To be or not to be (Figure 9).

  8. Lord of the Rings (Figure 10).

Figure 9 Pattern for “To Be Or Not To Be.”

Figure 10 Pattern for “Lord Of The Rings” with a Different Line Thickness.

In Figure 9, the number indicating the row from which a pattern begins is omitted for some patterns, because those patterns are unique and simple. For example, the pattern for the word “be” shown in Figure 9 corresponds to no other English word. The letter B is in row 1 of the keyboard (Figure 4). If the same pattern is drawn starting elsewhere on the keyboard, it does not form an English word: it yields “ct,” “eu,” or “gy,” or, in the reverse direction, “pn” or “vi.” All such words encountered thus far have been recorded in the system so that the row numbers can be omitted for them. Also, for longer words, drawing can be stopped once the pattern is sufficiently distinctive, because the partial pattern remains unique and a human can easily convert it back to English.
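One possible reading of the stopping rule for longer words is that a pattern may end at the shortest prefix that no other known word shares. The sketch below illustrates that reading only; the paper does not define the rule precisely, the function name is hypothetical, and the vocabulary is a small sample of the test words from this section.

```python
def shortest_unique_prefix(word, vocabulary):
    """Return the shortest prefix of `word` shared by no other vocabulary word."""
    others = [w for w in vocabulary if w != word]
    for length in range(1, len(word) + 1):
        prefix = word[:length]
        if not any(other.startswith(prefix) for other in others):
            return prefix
    # `word` is itself a prefix of another entry, so it must be written in full.
    return word


vocabulary = ["storage", "sound", "speak", "shirt", "modify", "matter", "mouse"]
print(shortest_unique_prefix("storage", vocabulary))  # st
print(shortest_unique_prefix("mouse", vocabulary))    # mou
```

Under this reading, the point at which a pattern can stop depends on the vocabulary, so it would have to be recomputed as the dictionary grows.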

4 Future Enhancements

The experiment provides a large database of mappings from English words to patterns. This dictionary greatly aids in enhancing the performance of the conversion. A system can be built for pattern-to-speech conversion, which would involve digital image processing, specialized input (such as a special form of paper for pattern writing), pattern recognition, and contextual learning.

Speech recognition can be enhanced by using classification techniques to provide an error-free input to the conversion system [11].

Other forms of input can be introduced, such as image- and video-based inputs.

5 Conclusion

Rapid handwriting provides a method for fast transcription of dictated words and phrases. Such transcription uses a distinct form of handwriting called shorthand. As the patterns are simply drawn rather than written out like words, the increase in writing speed is natural. Indeed, certain shorthand schemes offer a higher writing rate than writing the words themselves. Shorthand also requires less storage space for the patterns. The computer-based rapid handwriting system presented here shares this advantage and can be used directly to record speech such as courtroom conversations or classroom lectures.


Corresponding author: H. K. Anasuya Devi, Professor and Faculty, Centre for Continuing Education, Indian Institute of Science, Malleshwaram, Bangalore, India, 560012

Bibliography

[1] R. U. Ayres and K. Martinás, On the reappraisal of microeconomics: economic growth and change in a material world, Edward Elgar Publishing, Cheltenham, UK, p. 41, 2005. doi:10.4337/9781845427948

[2] C. T. Butcher, Dragonfly, version 0.6.5. http://code.google.com/p/dragonfly. Accessed May 13, 2013.

[3] R. C. Cassingham, The Dvorak keyboard, Freelance Communications, Arcata, CA (ISBN: 0-935309-10-1), pp. 21–26, 41–43, 1986.

[4] C. M. Karat, C. Halverson, D. Horn and J. Karat, Patterns of entry and correction in large vocabulary continuous speech recognition systems, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’99), ACM, New York, NY, pp. 568–575, 1999. doi:10.1145/302979.303160

[5] P. O. Kristensson and S. Zhai, SHARK2: a large vocabulary shorthand writing system for pen-based computers, in: UIST ’04 – Proceedings of the 17th Annual ACM Symposium on User Interface Software and Technology, Santa Fe, NM, pp. 43–52, 2004. doi:10.1145/1029632.1029640

[6] G. H. Kumar, M. Ravishankar, P. Nagabushan and B. S. Anami, Hidden Markov model-based approach for generation of Pitman shorthand language symbols for consonants and vowels from spoken English, Sadhana 31 (2006), 277–290. doi:10.1007/BF02703382

[7] P. U.-J. Lee and S. Zhai, Top-down learning strategies: can they facilitate stylus keyboard learning?, Int. J. Hum. Comput. Stud. 60 (2004), 585–598. doi:10.1016/j.ijhcs.2003.10.009

[8] T. Mitchell, Machine learning, Tata McGraw Hill, Noida, India, pp. xv, 1997.

[9] P. Shinners, Pygame, version 1.9.2pre. http://www.pygame.org. Accessed May 13, 2013.

[10] Shorthand, Encyclopaedia Britannica, 15th ed., 2010.

[11] M. R. Smith and T. Martinez, Improving classification accuracy by identifying and removing instances that should be misclassified, in: Proceedings of International Joint Conference on Neural Networks (IJCNN 2011), pp. 2690–2697, 2011.

[12] The Microsoft Corporation, Windows Speech Recognition.

[13] G. van Rossum and F. L. Drake, Jr., An introduction to Python, Network Theory Ltd., Bristol, United Kingdom, p. 3, 2011.

[14] S. Zhai, M. Hunter and B. A. Smith, Performance optimization of virtual keyboards, Hum. Comput. Interact. 17 (2002), 110, 229–269. doi:10.1207/S15327051HCI172&3_4

Received: 2013-05-15
Published Online: 2013-06-21
Published in Print: 2013-09-01

©2013 by Walter de Gruyter Berlin Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Downloaded on 14.12.2024 from https://www.degruyter.com/document/doi/10.1515/jisys-2013-0039/html