Soft indexing of speech content for search in spoken documents

Chelba, Ciprian; Silva, Jorge; Acero, Alex

Article

Publication date

2007

Abstract

The paper presents the Position Specific Posterior Lattice (PSPL), a novel lossy representation of automatic speech recognition (ASR) lattices that naturally lends itself to efficient indexing and subsequent relevance ranking of spoken documents. The technique explicitly accounts for content uncertainty by using soft hits. Indexing position information allows one to approximate N-gram expected counts and, at the same time, to use more general proximity features in the relevance score calculation. In fact, one can easily port any state-of-the-art text-retrieval algorithm to the scenario of indexing ASR lattices for spoken documents, rather than using the 1-best recognition result. Experiments performed on a collection of lecture recordings (the MIT iCampus database) show that spoken document ranking performance improved by 17-26% relative over the commonly used baseline of indexing the 1-best output of an automatic speech recognizer. The paper also addresses the problem of integrating speech and text content sources for document search, as well as its usefulness from an ad hoc retrieval (keyword search) point of view. In this context, the PSPL formulation is naturally extended to deal with both speech and text content for a given document, and a new relevance ranking framework is proposed for integrating the different sources of information available. Experimental results on the MIT iCampus corpus show a relative improvement of 302% in Mean Average Precision (MAP) when using speech content together with text-only metadata, as opposed to text-only metadata alone (which constitutes about 1% of the amount of data in the transcription of the speech content, measured in number of words).
Further experiments show that even in scenarios where the metadata size is artificially augmented so that it contains more than 10% of the spoken document transcription, the speech content still provides significant gains in MAP over using only the text metadata for relevance ranking. (c) 2006 Elsevier Ltd. All rights reserved.
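
The idea of soft hits and expected counts described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data layout (one dictionary of word posteriors per position), the function names, and the toy posteriors are all assumptions made for illustration, and the bigram estimate simply multiplies posteriors at adjacent positions as one plausible approximation.

```python
from collections import defaultdict

def build_soft_index(docs):
    """Build an inverted index of soft hits.

    docs maps a document id to a list of positions; each position is a
    dict {word: posterior probability}. This layout is hypothetical.
    """
    index = defaultdict(list)
    for doc_id, positions in docs.items():
        for pos, word_posteriors in enumerate(positions):
            for word, posterior in word_posteriors.items():
                # A soft hit records where the word may occur and how likely.
                index[word].append((doc_id, pos, posterior))
    return index

def expected_unigram_count(index, word, doc_id):
    """Expected count of a word in a document: sum of its posteriors."""
    return sum(p for d, _, p in index.get(word, []) if d == doc_id)

def expected_bigram_count(index, w1, w2, doc_id):
    """Approximate expected bigram count from adjacent positions.

    Assumption: posteriors at neighbouring positions are multiplied.
    """
    first = {pos: p for d, pos, p in index.get(w1, []) if d == doc_id}
    second = {pos: p for d, pos, p in index.get(w2, []) if d == doc_id}
    return sum(p * second[pos + 1] for pos, p in first.items() if pos + 1 in second)

# Toy data (invented): two short "spoken documents" with competing hypotheses.
toy_docs = {
    "lecture1": [{"speech": 0.7, "peach": 0.3}, {"recognition": 0.9, "wreck": 0.1}],
    "lecture2": [{"peach": 0.6, "speech": 0.4}, {"cobbler": 1.0}],
}
toy_index = build_soft_index(toy_docs)
unigram = expected_unigram_count(toy_index, "speech", "lecture1")
bigram = expected_bigram_count(toy_index, "speech", "recognition", "lecture1")
```

A 1-best index would keep only the top word at each position, so "speech" would be missing entirely from lecture2; the soft index retains it with posterior 0.4, which a ranking function can then weight accordingly.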

General note

ISI publication

Identifier

URI: https://repositorio.uchile.cl/handle/2250/124726

Quote Item

Computer Speech and Language, Vol. 21, No. 3, July 2007, pp. 458-478

Collections

Journal articles