ISCA Archive

The ISCA archive is a collection of more than 17,000 papers from ISCA workshops and conferences, from the late 1980s to the present. Abstracts of these papers are viewable by everyone; in most cases, full paper access is restricted to ISCA members.

The online version of the archive is available here. I’m extremely lucky to have it all on DVD – a gift given by ISCA this year to all the ISCA-SAC board members :)

What fun it would be to skim through all 17000 of those papers! Would take a few years, maybe, even if one reads 10 papers a day :)

IEEE SLTC newsletter – Oct 09 edition

Link to newsletter

This edition of the IEEE Speech and Language Processing Technical Committee newsletter contains reports from Interspeech, an explanation of the ICASSP paper review process and articles about satellite events around Interspeech like the YRRDS, the Blizzard Challenge and SIGDial. There’s a biography, written by researchers at KTH, of Gunnar Fant, one of the pioneers in speech research, who passed away recently. There’s also a description of ‘Dialogs on dialogs’, a group based at CMU which acts as a platform for students and researchers interested in dialogue systems to interact weekly *. This group has been around for quite some time (and has been quite successful!), with people from all over the world joining the meetings via Skype. Also, as in the past few issues of the newsletter, there are short interviews with famous speech researchers conducted by the Saras Institute and transcribed by volunteers from ISCA-SAC (Thanks Tadesse and Olga!).

A very interesting read indeed.

* It would be wonderful to start something similar here in India (or even internationally) for speech synthesis (or even speech and NLP in general). If anyone is interested in starting a group/joining, please get in touch! I’m basically looking for people I can read and discuss papers with.

Blizzard Challenge 2009 papers

The Blizzard Challenge 2009 papers have been released on the Festvox website.

http://festvox.org/blizzard/blizzard2009.html

I’m going to spend the next few days reading them and will post about what I found interesting.

The Spoken Dialog Challenge

http://accent.speech.cs.cmu.edu/sdc

Prof Alan Black and Prof Maxine Eskenazi wrote in the July 2009 IEEE SLTC newsletter about the Spoken Dialog Challenge, which was followed by a paper at SIGDial (PDF) and a talk on frameworks and challenges for dialogue system evaluation at YRRDS (full proceedings, PDF).

Challenges like this make it possible to compare systems and evaluate them in detail on the same standardized data, not to mention that they are great sources of motivation! I participated in a Blizzard Challenge user study for the first time this year and it’s really great to see a similar challenge for dialogue systems.

The domain for the first challenge (in 2010) is a Bus Information domain, presumably due to the success of Let’s Go!, the system that is deployed in Pittsburgh (I tried it out when I was there, and it seemed to work pretty well). Since this system is actually used by people, over 75,000 dialogues have been collected, which provide plenty of precious training material.

Evaluating a dialogue system, like evaluating a TTS system, is still a research problem, so both objective measures (such as task completion and number of turns taken) and subjective measures (such as user satisfaction, gathered via questionnaires) will be used. Different levels of users (researchers, undergraduates, volunteers and finally the people of Pittsburgh) will test the system.
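To make those measures a little more concrete, here is a minimal sketch of how they might be computed from logged calls. This is only my own illustration, not the challenge's actual evaluation code: the `DialogueRecord` fields, the 1 to 5 questionnaire scale and the `summarize` helper are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class DialogueRecord:
    """One logged call. All field names here are hypothetical."""
    task_completed: bool         # did the caller get the information they wanted?
    num_turns: int               # total number of turns in the call
    satisfaction: Optional[int]  # questionnaire score (assumed 1-5), None if skipped


def summarize(records: List[DialogueRecord]) -> dict:
    """Compute two objective measures and one subjective measure."""
    if not records:
        raise ValueError("need at least one dialogue record")
    # Only callers who filled in the questionnaire count towards satisfaction.
    scores = [r.satisfaction for r in records if r.satisfaction is not None]
    return {
        "task_completion_rate": sum(r.task_completed for r in records) / len(records),
        "mean_turns": mean(r.num_turns for r in records),
        "mean_satisfaction": mean(scores) if scores else None,
    }


if __name__ == "__main__":
    calls = [
        DialogueRecord(task_completed=True, num_turns=8, satisfaction=4),
        DialogueRecord(task_completed=False, num_turns=14, satisfaction=None),
        DialogueRecord(task_completed=True, num_turns=6, satisfaction=5),
    ]
    print(summarize(calls))
```

Note that the subjective score is averaged only over callers who answered the questionnaire, which is one reason the objective and subjective numbers can tell different stories.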

The details of the challenge are likely to be announced in mid-October, which will give participants around 6 months to develop their systems.

Thanks to Sylvie for the heads up via the ISCA-SAC Facebook group!

PS: If you are interested in speech processing, maybe you should check out the ISCA-SAC website or join our Facebook group. ISCA-SAC is also looking for new volunteers, so please get in touch if you are interested in helping out!

Paolo Baggia Google Tech Talk

Paolo Baggia, Director of Standardization, Loquendo (the European leader in voice technologies) gave a very interesting talk on ‘Speech Technologies and Platforms – Present and Future Evolutions’ at Google in 2008, and I came across it recently.

A nice point he makes (slightly paraphrased, via Roberto Pieraccini’s blog) is that we like using ATMs because they are simple to use and fast, never make mistakes and are available everywhere. ATMs are tools, not duplicates of humans. We know how to use them and what to expect from them. This really resonates with the way I have been thinking about speech and AI in general over the last few months.

An interesting idea he talks about is Loquendo’s use of a garbage node for grammar-based speech recognition, which simply discards everything spoken apart from the words or phrases you are looking for. This means only the relevant part of the sentence has to be modeled. I haven’t looked at any grammar-based speech recognition systems yet, but I’m sure this will be very useful in Spoken Dialogue Systems.
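As a rough illustration of the idea (and definitely not Loquendo’s implementation), here is a toy sketch that works on an already-recognized word sequence rather than on audio. In a real recognizer the garbage node would be a filler model inside the grammar; here the names `GARBAGE`, `parse_with_garbage` and `keep_targets` are my own, and the matching is a simple greedy longest match.

```python
GARBAGE = "<garbage>"


def parse_with_garbage(words, phrases):
    """Label a word sequence with target phrases; everything else is garbage.

    Longer phrases are tried first, and consecutive non-matching words
    collapse into a single garbage token.
    """
    targets = [tuple(p.lower().split()) for p in phrases]
    # Drop empty phrases (they would match everywhere) and prefer longer matches.
    targets = sorted((t for t in targets if t), key=len, reverse=True)
    lowered = [w.lower() for w in words]

    parsed = []
    i = 0
    while i < len(lowered):
        for target in targets:
            if tuple(lowered[i:i + len(target)]) == target:
                parsed.append(" ".join(target))
                i += len(target)
                break
        else:
            # No target phrase starts here: absorb the word into the garbage node.
            if not parsed or parsed[-1] != GARBAGE:
                parsed.append(GARBAGE)
            i += 1
    return parsed


def keep_targets(parsed):
    """Discard the garbage tokens, keeping only the phrases we were looking for."""
    return [token for token in parsed if token != GARBAGE]


if __name__ == "__main__":
    utterance = "um i would like to go to new delhi tomorrow morning please".split()
    wanted = ["new delhi", "mumbai", "tomorrow"]
    print(keep_targets(parse_with_garbage(utterance, wanted)))
```

The point of the sketch is that the grammar only needs to list the city names and dates; all the "um, I would like to" around them never has to be modeled explicitly.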

He also talks about Loquendo’s slightly unconventional HMM-NN hybrid ASR system and gives a brief overview of Speaker Identification. Slides accompanying the talk are available here.

Crippled without math

I’ve recently realized that if I want to really understand the algorithms involved in Speech, I need to un-learn and re-learn math. For the last couple of years, I’ve been trying to improve my knowledge of linguistics, and while that has certainly helped, I am now at a stage where I cannot proceed unless my math fundamentals are very very clear. I’m becoming more and more interested in signal processing, for which I definitely need a better understanding of algebra and calculus.

I used to love math in school; it was my favorite subject after computers. Of course, doing 3 horrible math courses in college killed that to a certain extent. The only one I enjoyed was Discrete Math.

So, before I (hopefully) start my PhD next year, I’m going to try and tackle some basics of Linear Algebra and Calculus at least.

Knowing when it makes sense

I came across a project proposal recently involving a Spoken Dialogue System* that allows you to do a certain task (say, book a railway ticket) over the phone. You give the system the necessary information (the source and destination stations, date, time etc.) and it gives you a list of available trains. When you have decided which train you want to travel by, your call is transferred to a call center where you speak to a human to carry out the credit card transaction. Basically, in this model, when there is a ‘critical’ operation to be performed, your call is transferred to a human.

Some questions I had about the project were:

1) If I am going to be redirected to a call center anyway, why would I want to use such an application?

Answer: It saves the call center executive’s time because you have already decided which train you want to book a ticket for. An assumption here is that the call center executive is relayed the information that you have given to the dialogue system in real time and has had a chance to take a look at it before or while answering your call. Issues that will arise here include what information should be passed to the call center executive and in what form.

Unless this information relaying is handled very well, the user may have to repeat a large part of his conversation with the dialogue system with the call center executive, which can only lead to more frustration and a very low probability that the user will want to use the system again.

2) What I call the ‘Information Overload’ problem. When I ask the dialogue system for a list of trains from one station to another on a particular day, do I get the entire list? Do I get the timings for every train as well? How is this information going to be presented to the user?

Some solutions may be to only list (speak) the top three trains, or to ask the user more questions to narrow down the search results (time of the day), or to only mention trains with tickets available, or to sort results based on the price of the ticket…

As a user, I probably wouldn’t be very happy with the system if I know that I am not getting complete information. On the other hand, listening to (and keeping track of) dozens of trains with timings and ticket availability is just not possible. I would rather just log on to the Internet and go to the buggy IRCTC website and book a ticket using it.

3) And now, my main question. Why speech?

Why would I use an application like this at all, when I can use a website to do the same thing, with less hassle and more confidence that I have been given correct information? Given the state dialogue systems are in today (at least the ones that are being developed in India), it is extremely difficult to carry out a conversation where everything is recognized and understood correctly by the system (even if acoustic models have been trained using your voice). A huge chunk of the conversation can end up being confirmation dialogues, or you trying to tell the system that it has misunderstood something and starting the conversation again.

Sure, such an application may be useful for a visually impaired person, but that is not the market being targeted. Why would someone who has a working Internet connection at home want to use this system? Would someone want to book a ticket while s/he is traveling, and only has his or her mobile phone and no access to the Internet? Possible, but not very likely.

Where can speech be used, then? When I’m stuck in traffic and want to know when I can catch the next bus home, I would definitely use a system like this. In fact, a very successful SDS, CMU’s Let’s Go system, does exactly this. The kind of people who will stand at a bus stop and make a call to the Let’s Go system are probably the ones who will really benefit from using a speech interface – people who may not be comfortable with using computers, people who may not have access to the Internet.

We need to be able to decide when it makes sense to use speech, and when it doesn’t.

* A Spoken Dialogue System is a system that you can have a conversation with to get a particular task done. SDSes, obviously, need to be natural-sounding and robust. The main components of an SDS are a speech recognition system, a language understanding module, a dialogue manager, the database and its interface, a language generation module and a text-to-speech system.
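Here is a very small sketch of how those components chain together for a single turn. Everything in it is a stand-in I made up for illustration: the "audio" is already text, the `TIMETABLE` entries are toy data, and the function names (`recognize`, `understand`, `decide`, `generate`, `synthesize`, `handle_turn`) are hypothetical. A real system would put an actual recognizer and synthesizer at the two ends.

```python
import re

# Toy data standing in for the database; the trains are made up.
TIMETABLE = {("pune", "mumbai"): ["Train A at 06:00", "Train B at 17:30"]}


def recognize(audio):
    """Speech recognition stand-in: pretend the audio is already a transcript."""
    return audio.lower().strip()


def understand(text):
    """Language understanding: pull the source and destination out of the text."""
    match = re.search(r"from (\w+) to (\w+)", text)
    if match is None:
        return {}
    return {"source": match.group(1), "destination": match.group(2)}


def decide(frame):
    """Dialogue manager: ask for missing information, else query the database."""
    if "source" not in frame or "destination" not in frame:
        return {"act": "request_route"}
    trains = TIMETABLE.get((frame["source"], frame["destination"]), [])
    return {"act": "inform", "trains": trains, **frame}


def generate(action):
    """Language generation: turn the dialogue manager's decision into a sentence."""
    if action["act"] == "request_route":
        return "Where are you travelling from, and where to?"
    if not action["trains"]:
        return f"Sorry, I found no trains from {action['source']} to {action['destination']}."
    listing = ", ".join(action["trains"])
    return f"Trains from {action['source']} to {action['destination']}: {listing}."


def synthesize(text):
    """Text-to-speech stand-in: a real system would return audio here."""
    return text


def handle_turn(audio):
    """Run one user turn through the whole pipeline."""
    return synthesize(generate(decide(understand(recognize(audio)))))


if __name__ == "__main__":
    print(handle_turn("Trains from Pune to Mumbai"))
    print(handle_turn("Hello"))
```

Each stage only sees the output of the one before it, which is why a recognition error early on tends to snowball through the rest of the conversation.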

New Blog

I’ve decided to start blogging again. This blog is going to be about technical matters that interest me – Speech Processing, Natural Language Processing, linguistics, programming languages, math, User Interfaces, Assistive Technologies and the Internet, among other things.

I also blog here, but it is very random/rarely updated.
