Introduction to VoiceXML and Voice Services
VoiceXML is an XML document type for authoring voice-driven audio applications. VoiceXML is being used in applications such as voice portals, where automated voice services can be accessed over the phone, and in non-phone-based voice applications, which are emerging in embedded home appliances and automobiles.
One of the most common voice applications, familiar to many of us, is the Interactive Voice Response (IVR) application: an automated service that you can call to check your bank accounts, the status of a package, or the estimated arrival time of a flight, all without speaking to a human. VoiceXML is changing the way we interact with these and other voice applications.
The remainder of this chapter will gently introduce you to the world of computerized telephone applications. It will show the motivation behind VoiceXML as a clean, enterprise-computing-friendly, open-standards technology that unifies the many aspects of the voice application world.
1.1 Voice services and applications
Before diving into VoiceXML, we first need to understand the current "state of the art" in voice application technology: the "pre-VoiceXML" voice application landscape.
1.1.1 Traditional applications
Voice applications have traditionally been hosted on a machine called an IVR. This machine is typically a special-purpose computer outfitted with telephony hardware, possibly speech-processing hardware and software, and some sort of dialog engine that executes the call-flow logic.
Pre-VoiceXML dialog engines are not necessarily programmable. For example, a voice-mail system is a simple IVR application, but its call flow is more or less "hard-wired": it only permits callers to leave voice messages and users to check them. You do not need to program your voice-mail system to do anything more than this.
As businesses started to realize the potential of providing phone access to customer data, the programmable IVR system was born. This is an IVR system where the call flow is completely programmable. There would typically be a high-level scripting language for defining a call flow and often a low-level API so that application programmers could write more complex applications, for example applications that perform database lookups.
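To make the idea of a programmable call flow concrete, here is a minimal sketch in Python. It is not modeled on any particular vendor's scripting language or API; the names `ACCOUNTS`, `lookup_balance`, and `run_call` are hypothetical, the "database" is an in-memory dictionary, and caller input is simulated as a list of keypresses rather than read from telephony hardware.

```python
# Hypothetical sketch of a programmable call flow with a database lookup.
# None of these names come from a real IVR product; they are illustrative only.

# Stand-in for a back-end customer database.
ACCOUNTS = {"1234": 250.00, "5678": 1023.45}


def lookup_balance(account_id):
    """Return the balance for an account, or None if the account is unknown."""
    return ACCOUNTS.get(account_id)


def run_call(keypresses):
    """Drive one simulated call, given the caller's inputs in order.

    Returns the list of prompts the caller would hear.
    """
    inputs = iter(keypresses)
    prompts = ["Welcome. Press 1 for your balance, or 2 to hang up."]

    choice = next(inputs, None)
    if choice == "1":
        prompts.append("Please enter your account number.")
        account_id = next(inputs, None)
        balance = lookup_balance(account_id)
        if balance is None:
            prompts.append("Sorry, that account was not found.")
        else:
            prompts.append(f"Your balance is {balance:.2f} dollars.")
    elif choice != "2":
        prompts.append("Sorry, that is not a valid choice.")

    prompts.append("Goodbye.")
    return prompts


if __name__ == "__main__":
    # Simulate a caller who presses 1 and then enters account 1234.
    for prompt in run_call(["1", "1234"]):
        print(prompt)
```

In a real programmable IVR, the branching would usually be written in the vendor's call-flow language and the lookup would go through the vendor's API, which is exactly where the difficulties described next arose.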
Programmable IVRs had their problems, however. For one, they were very difficult to program. Since the call-flow scripting languages were usually vendor-specific, each vendor had to reinvent the wheel in designing the language and the tools to help voice application developers write applications in it. Developing software to interact with these programmable IVRs was even more difficult, as the APIs were often very low-level or arcane.
Another serious problem was the proprietary nature of these systems. This made it nearly impossible to move an application from one IVR vendor platform to another. This culture of "platform lock-in" tended to put the equipment, application development, and deployment costs into a range where only the largest call centers could afford to consider sophisticated IVR services.
1.1.2 Emerging voice services
As the IVR platforms evolved, businesses increasingly incorporated this sort of technology into their customer relationship management (CRM) efforts. This trend can be understood if you consider:
Since the 1980s almost every business has implemented its information system using computers.
While the popularity of the Internet as a way for customers to interact with businesses exploded in the 1990s, call center services remained of paramount importance. Consider that there are at least three times as many phone users worldwide as PC users.
If services that are appearing on the Web, such as information retrieval, stock quotes, product purchases, transactions, booking, bidding, and brokering, can be performed over the phone with the same ease and reliability, then interacting with the phone through voice is clearly the simplest interface: it requires no software downloads to the handset!
Also, as devices become smaller with increasing chip-technology densities, display and keypad real estate is being constrained in size. Voice-based interfaces offer a solution.
Unfortunately, not all applications are well suited for voice interfaces. Large documents and multiple views are some of the features that are difficult to achieve with voice. On the other hand, there are certain tasks that voice is better suited for, such as saying the name of the restaurant you are looking for rather than typing it in. The challenge is to build new applications that make good use of voice dialogs.
As of this writing, the voice-service industry is growing rapidly. Most airlines, banks, shipping companies, and other time-sensitive businesses provide some sort of automated telephone support or information service. In addition, new breeds of services are emerging, such as "voice portals," which provide a service analogous to that of a Web portal. Natural dialog systems are also starting to make inroads into the area of customer support. The fact that these services are accessible from wired phones, mobile phones, and other emerging wireless devices makes them ubiquitous.
1.1.3 Enabling technologies
While IVR technology has evolved slowly over the past couple of decades, the voice-application world has seen accelerated growth, mostly due to the coming of age of some core enabling technologies. These include:
Automatic Speech Recognition (ASR)
In the past couple of years, improvements in speech-recognition algorithms, combined with the explosion in available computing power, have made speech recognition a realistically deployable technology. Early problems with speech recognition, including speaker dependence and small vocabularies, have largely been solved.
Text-to-Speech (TTS)
Just as the technology for recognizing human speech has taken a quantum leap, so has the technology for synthesizing human speech. TTS technologies now sound much more lifelike, increasing their intelligibility and their acceptance in the mass-consumer market. The emergence of TTS technologies has enabled the development of much more dynamic voice applications because, unlike traditional IVR applications where every audio response must be pre-recorded, a TTS system can generate speech responses on the fly.
Enterprise Software Integration Technologies
The birth of the Web ushered in a whole new breed of flexible, scalable enterprise application servers. This has made business logic more easily accessible over network connections, facilitating the integration of voice systems with back-end data systems.