- Speaking and Hearing: Speech Synthesis and Speech Recognition
- Getting Up and Running with Java Speech Technology
- Politics of Speech-Enabled Software
I was paying for parking recently when I noticed that the ticket machine was speech-enabled. After I inserted my ticket, the machine told me in a tinny voice the amount to pay and then said (a tad impolitely), "Get your ticket." They say that 50% of communication is nonverbal, so the programmers of the parking machine might need to add some of this nonverbal content into the prompts. Still, it’s pretty impressive!
This article presents a very basic speech-enabled payment application. I discuss coding and design issues related to speech technology, and my examples employ speech synthesis. My focus is primarily on the practical elements (above and beyond "Hello World"), rather than theory. As you'll see, the technology raises some interesting design questions of its own.
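Before getting into the speech APIs themselves, it helps to see what a payment application actually hands to a synthesizer: a plain string of prompt text. The following sketch shows one way to build such a prompt from an amount owed. The `PromptBuilder` class, its method name, and its (more polite) wording are my own illustration, not part of any speech API.

```java
// Illustrative only: builds the text that a speech synthesizer would be
// asked to speak. The class and its wording are hypothetical, not part of
// the Java Speech API.
public class PromptBuilder {

    // Convert an amount in cents into a polite, speakable payment prompt.
    public static String paymentPrompt(int cents) {
        int euros = cents / 100;
        int remainder = cents % 100;
        StringBuilder sb = new StringBuilder("The amount to pay is ");
        sb.append(euros).append(euros == 1 ? " euro" : " euros");
        if (remainder > 0) {
            sb.append(" and ").append(remainder)
              .append(remainder == 1 ? " cent" : " cents");
        }
        sb.append(". Please take your ticket.");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(paymentPrompt(350));
        // The amount to pay is 3 euros and 50 cents. Please take your ticket.
    }
}
```

Keeping prompt construction separate from the synthesizer call like this makes it easy to adjust the phrasing (and the politeness!) without touching any speech code.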
Speaking and Hearing: Speech Synthesis and Speech Recognition
Voice capabilities consist of two core speech technologies:
- Speech synthesis produces synthetic speech from text generated by an application, an applet, or a user. Speech synthesis is often referred to as text-to-speech technology.
- Speech recognition provides computers with the ability to listen to spoken language and to determine what has been said. In other words, recognition processes audio input containing speech by converting it to text.
Many organizations run basic voice recognition systems on their customer phone-support channels, both to reduce staffing levels and, perhaps, to make the host organization seem more technically advanced. Other services exist in which text messages can be sent from mobile phones to landlines; the landline phone then uses a text-to-speech service to play the message to the recipient as a voicemail message. Some landline phones also allow for sending text messages, in a sense using the text-to-speech service in reverse.
Just as podcasting is now a mainstream technology, we can expect to hear (pardon the pun!) a lot more about speech-enabled solutions. One area similar to podcasting is that of listening to audio versions of documents; for example, when traveling.
Speech recognition offers even more profound benefits to end users than does speech synthesis. For example, consider situations in which users are physically limited—such as doing tasks that require both hands (surgery, do-it-yourself projects, etc.) while trying to operate some kind of hardware.
Interestingly, the three speech recognition software packages I've tried were either very complex to set up or produced useless results; either way, I didn't have much success. This suggests that speech recognition technology hasn't yet reached the same level of market maturity as speech synthesis, and you might have to spend a significant amount of money for a decent speech recognition solution.
There is a broad web context for speech-enabled software. Emerging standards, such as the Device Independent Authoring Language (DIAL), indicate that the audience for web content is growing fast. This growth is occurring in terms of the following:
- Device types (mobile phones, PDAs, laptops, and even children's toys)
- Accessibility requirements
- Time (people want access to the same web pages at work and at home)
DIAL has some generic requirements that may affect the way in which speech technology is used. Let’s consider this issue briefly.
DIAL is a standard for how web pages should be designed and written to accommodate developments in web access, delivery networks, and device technology. Its major goal is the production of web content that is available any time, any way, and anywhere. To make this pithy requirement set more concrete, let's say that someone with a mobile phone is traveling home from work on a train and wants to see the value of his or her portfolio of shares. DIAL provides mechanisms that allow the web site to present the required data in a format that suits the needs of the user, the target device, and the delivery network. So, in this case, the content might be presented in an audio format, or in a tightly summarized textual fashion because of the small screen.
DIAL provides for a sympathetic way of producing, conveying, and rendering web content. It’s entirely likely that DIAL will make special use of speech synthesis and recognition technologies (and other media, such as video).
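The selection step in the train scenario above can be sketched in ordinary server-side code: the server looks at the media types the device says it can accept (for example, from an HTTP Accept header) and picks a suitable representation. The method, file names, and decision order here are my own illustration; DIAL itself doesn't prescribe this API.

```java
import java.util.List;

// Hypothetical sketch of DIAL-style content selection. The server inspects
// what the requesting device can accept and chooses a representation of the
// portfolio page to deliver. Names and priorities are illustrative only.
public class ContentSelector {

    public static String selectRepresentation(List<String> acceptedTypes) {
        if (acceptedTypes.contains("audio/mpeg")) {
            // Audio-capable device: deliver a spoken summary.
            return "portfolio-summary.mp3";
        }
        if (acceptedTypes.contains("application/xhtml+xml")) {
            // Small-screen device: deliver a tightly summarized text page.
            return "portfolio-summary.xhtml";
        }
        // Fallback: the full page for a conventional browser.
        return "portfolio-full.html";
    }

    public static void main(String[] args) {
        System.out.println(
            selectRepresentation(List.of("audio/mpeg", "text/html")));
    }
}
```

In a real deployment this decision would likely draw on a device-capability profile rather than the Accept header alone, but the principle is the same: one resource, several renderings.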
Listing 1 shows an XHTML2 object definition:
Listing 1 An XHTML2 object.
<object src="http://www.example.com/stocks.mp3" srctype="audio/mpeg"> An audio file representing stocks. </object>
The object in Listing 1 is allowed by DIAL and could be downloaded to a device equipped with an audio/MPEG player. In turn, the player could incorporate a speech synthesizer. The important point here is that there’s an emerging nexus between web content, small devices, and speech synthesis technology. It’s only a matter of time before speech recognition is added to the mix to make the user experience even richer.
Writing Java-Based Voice Software
Overall, Java-based voice synthesis and recognition software isn’t particularly difficult to write. Free toolkits are available that provide pretty impressive results (for synthesis, at least) in a very short time.
The Java Speech API (JSAPI) is a definition of a standard, easy-to-use, cross-platform software interface to state-of-the-art speech technology, providing capabilities for both speech synthesis and speech recognition. The API is decoupled from implementations in order to provide the conditions for a vibrant market for speech technology. In this way, the industry can enjoy the use of a standard, well-researched specification and API, while still adding differentiating product features.
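To give a flavor of what the decoupled design looks like in code, here is a minimal JSAPI synthesis sketch. It compiles against the standard `javax.speech` interfaces, but it will only run if a JSAPI implementation (such as FreeTTS) is on the classpath and registered; which voices and locales are available depends entirely on that engine.

```java
import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

// Minimal JSAPI synthesis sketch. Requires a JSAPI engine (e.g. FreeTTS)
// to be installed and registered; without one, createSynthesizer returns null.
public class HelloSpeech {
    public static void main(String[] args) throws Exception {
        // Ask the Central class for any synthesizer matching the locale.
        Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.ENGLISH));
        synth.allocate();                               // acquire engine resources
        synth.resume();                                 // start the output queue
        synth.speakPlainText("The amount to pay is three euros.", null);
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY); // block until spoken
        synth.deallocate();                             // release the engine
    }
}
```

Note that nothing in this code names a particular vendor: swapping in a different JSAPI engine should require no source changes, which is exactly the market-decoupling point made above.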
Without further ado, let’s get your system set up to run the examples.