I recently came across a project proposal for a Spoken Dialogue System* that lets you carry out a task (say, booking a railway ticket) over the phone. You give the system the necessary information (source and destination stations, date, time, etc.) and it gives you a list of available trains. Once you have decided which train you want to travel by, your call is transferred to a call center, where you speak to a human to carry out the credit card transaction. Basically, in this model, whenever a ‘critical’ operation has to be performed, your call is transferred to a human.
Some questions I had about the project were:
1) If I am going to be redirected to a call center anyway, why would I want to use such an application?
Answer: It saves the call center executive’s time because you have already decided which train you want to book a ticket for. The assumption here is that the information you gave the dialogue system is relayed to the call center executive in real time, and that the executive has had a chance to look at it before or while answering your call. The issues that arise here include what information should be passed to the call center executive, and in what form.
Unless this relaying of information is handled very well, the user may have to repeat a large part of the conversation with the dialogue system to the call center executive, which can only lead to more frustration and a very low probability that the user will want to use the system again.
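To make the "what information, in what form" question concrete, here is a minimal sketch of a handoff record, assuming the dialogue system collects a handful of booking fields. The class name, field names and JSON format are my own assumptions for illustration; the proposal did not specify any of them.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class BookingHandoff:
    """Hypothetical record of what the caller already told the dialogue system."""
    source: str
    destination: str
    travel_date: str       # ISO date string (assumed format)
    chosen_train: str      # the train the caller settled on
    passengers: int = 1


def to_agent_payload(handoff: BookingHandoff) -> str:
    """Serialise the record so it could be shown on the executive's screen."""
    return json.dumps(asdict(handoff), sort_keys=True)


def missing_fields(handoff: BookingHandoff) -> List[str]:
    """Names of fields left empty, i.e. what the executive would still have to ask."""
    return [name for name, value in asdict(handoff).items() if value in ("", None)]
```

If `missing_fields` comes back empty, the executive should be able to go straight to the payment step; anything it lists is something the caller will be asked to repeat.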
2) What I call the ‘Information Overload’ problem. When I ask the dialogue system for a list of trains from one station to another on a particular day, do I get the entire list? Do I get the timings for every train as well? How is this information going to be presented to the user?
Some solutions may be to only list (speak) the top three trains, or to ask the user more questions to narrow down the search results (time of the day), or to only mention trains with tickets available, or to sort results based on the price of the ticket…
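These options can be combined. The sketch below is one possible way to do it, assuming each train is a small record with a departure hour, a fare and a seat count; the `Train` class, the `shortlist` function and its parameters are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Train:
    name: str
    departure_hour: int    # 0-23
    fare: int              # ticket price
    seats_available: int


def shortlist(trains: List[Train],
              earliest_hour: Optional[int] = None,
              latest_hour: Optional[int] = None,
              limit: int = 3) -> List[Train]:
    """Keep trains with tickets, narrow by time of day, sort by fare, speak only a few."""
    # Only mention trains with tickets available.
    candidates = [t for t in trains if t.seats_available > 0]
    # Narrow down using the caller's answer to a follow-up question (time of day).
    if earliest_hour is not None:
        candidates = [t for t in candidates if t.departure_hour >= earliest_hour]
    if latest_hour is not None:
        candidates = [t for t in candidates if t.departure_hour <= latest_hour]
    # Sort by ticket price, cheapest first, and keep the top few.
    candidates.sort(key=lambda t: t.fare)
    return candidates[:limit]
```

Calling `shortlist(trains, earliest_hour=15)` would, for example, return at most three evening trains with seats, cheapest first. Everything that was filtered out is simply never spoken.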
As a user, I probably wouldn’t be very happy with the system if I knew that I was not getting complete information. On the other hand, listening to (and keeping track of) dozens of trains with timings and ticket availability is just not possible. I would rather just log on to the Internet, go to the buggy IRCTC website and book a ticket there.
3) And now, my main question. Why speech?
Why would I use an application like this at all, when I can use a website to do the same thing, with fewer hassles and more confidence that I have been given correct information? Given the state of dialogue systems today (at least the ones being developed in India), it is extremely difficult to carry out a conversation in which everything is recognized and understood correctly by the system (even if the acoustic models have been trained on your voice). A huge chunk of the conversation can end up being confirmation dialogues, or you trying to tell the system that it has misunderstood something and starting the conversation again.
Sure, such an application may be useful for a visually impaired person, but that is not the market being targeted. Why would someone who has a working Internet connection at home want to use this system? Would someone want to book a ticket while traveling, with only a mobile phone and no access to the Internet? Possible, but not very likely.
Where can speech be used, then? When I’m stuck in traffic and want to know when I can catch the next bus home, I would definitely use a system like this. In fact, a very successful SDS, CMU’s Let’s Go system, does exactly this. The kind of people who will stand at a bus stop and make a call to the Let’s Go system are probably the ones who will really benefit from using a speech interface – people who may not be comfortable with using computers, people who may not have access to the Internet.
We need to be able to decide when it makes sense to use speech, and when it doesn’t.
* A Spoken Dialogue System (SDS) is a system that you can have a conversation with to get a particular task done. SDSes, obviously, need to be natural sounding and robust. The main components of an SDS are a speech recognition system, a language understanding module, a dialogue manager, a database and its interface, a language generation module, and a text-to-speech system.
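As a rough illustration of how these components chain together in a single turn, here is a toy sketch. Every function is a stand-in that I made up: the "recogniser" just decodes text bytes, the "understanding" step only handles phrases like "from X to Y", and the timetable is invented data. A real SDS would replace each stub with an actual model or service.

```python
from typing import Dict, List, Optional

# Invented timetable standing in for the backend database.
TIMETABLE: Dict[tuple, List[str]] = {("pune", "mumbai"): ["Train 101", "Train 202"]}


def recognise(audio: bytes) -> str:
    """Speech recognition stub: pretends the audio is already text."""
    return audio.decode("utf-8").lower()


def word_after(words: List[str], keyword: str) -> Optional[str]:
    """Return the word following `keyword`, if there is one."""
    if keyword in words:
        i = words.index(keyword)
        if i + 1 < len(words):
            return words[i + 1]
    return None


def understand(text: str) -> Dict[str, str]:
    """Language understanding stub: fills 'source' and 'destination' slots."""
    words = text.split()
    slots = {"source": word_after(words, "from"), "destination": word_after(words, "to")}
    return {k: v for k, v in slots.items() if v is not None}


def manage(frame: Dict[str, str]) -> dict:
    """Dialogue manager stub: ask for a missing slot, else query the database."""
    for slot in ("source", "destination"):
        if slot not in frame:
            return {"act": "ask", "slot": slot}
    trains = TIMETABLE.get((frame["source"], frame["destination"]), [])
    return {"act": "inform", "trains": trains}


def generate(action: dict) -> str:
    """Language generation stub: turn the manager's action into a sentence."""
    if action["act"] == "ask":
        return f"Which {action['slot']} station?"
    if not action["trains"]:
        return "I could not find any trains."
    return "Available trains: " + ", ".join(action["trains"]) + "."


def synthesise(text: str) -> bytes:
    """Text-to-speech stub: returns bytes in place of audio."""
    return text.encode("utf-8")


def run_turn(audio: bytes) -> bytes:
    """One user turn passed through all the components in order."""
    return synthesise(generate(manage(understand(recognise(audio)))))
```

An error in any one stage (a misrecognised station name, say) flows straight through to the reply, which is why so much of a real conversation ends up being confirmations.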