Subproject 1: Multimodal Recognition and Modeling

Speech as an input modality is crucial to the mobile dialogue scenario in SmartWeb. Because of the high demands placed on speech recognition, with regard to both the intended functional range and the required robustness against background noise and spontaneous speech, server-based speech recognition will be of special importance for application scenarios outside the car.

Current speech recognition systems always work with a finite vocabulary that is, because of technical constraints, also limited in size. Regardless of the vocabulary size, this means that in practice a certain percentage of spoken words lies outside the recognition vocabulary (OOV = out of vocabulary); depending on the application, this can amount to up to 5 percent of all spoken words, even when the vocabulary is maintained both extensively and effectively.
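The OOV percentage mentioned above can be made concrete with a small calculation. The following is a minimal sketch in Python; the function name `oov_rate`, the vocabulary, and the example utterance are illustrative assumptions and are not taken from the SmartWeb system.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens that are not covered by the vocabulary."""
    if not tokens:
        return 0.0
    # Case-insensitive lookup; a real recognizer vocabulary may be case-sensitive.
    vocab = {word.lower() for word in vocabulary}
    misses = sum(1 for token in tokens if token.lower() not in vocab)
    return misses / len(tokens)


# Hypothetical recognizer vocabulary and utterance (assumptions for illustration).
vocabulary = ["who", "won", "the", "match", "in", "yesterday"]
utterance = "who won the match in Kaiserslautern yesterday".split()

rate = oov_rate(utterance, vocabulary)
print(f"OOV rate: {rate:.1%}")
```

In this toy utterance the single uncovered word is the place name, which is also the word that carries most of the information for the query.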

In the SmartWeb scenario it is planned to employ a conventional, application-specific speech recognizer that can nevertheless meet the demands of mobile information retrieval with a potentially unlimited vocabulary. In this setting it is not only possible but very likely that precisely the content-bearing words that determine the information retrieval will not be covered by the system vocabulary.

Within the project, methods will be developed that allow unknown words to be recognized and processed (hybrid recognizer for words and word subunits) by extending the vocabulary with a suitable inventory of word subunits (e.g., speech sounds, syllables) and by extending the language model accordingly. Unknown words will be detected and approximated by a sequence of word subunits, which makes them accessible for further processing. A recognized sound or syllable sequence can be used, for example, for a "phonetic" search on the Internet: possible grapheme representations are derived from the word-subunit representation and can then in turn be used as search terms. Because the pronunciation dictionary is built dynamically and contains words from different languages, the speech recognizer has to be implemented as a multilingual recognizer; that is, a model inventory has to be available for several languages at the same time.
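The step from a recognized sound sequence to possible written forms can be sketched as follows. This is only an illustration under strong assumptions: the table `PHONEME_TO_GRAPHEMES`, the function `grapheme_candidates`, and the example sequence are hypothetical, and the mapping is far simpler than anything a real system would need.

```python
from itertools import product

# Hypothetical, heavily simplified sound-to-spelling table (an assumption).
PHONEME_TO_GRAPHEMES = {
    "f": ["f", "ph"],
    "i": ["i", "ie"],
    "g": ["g"],
    "o": ["o", "oh"],
}


def grapheme_candidates(phonemes, table=PHONEME_TO_GRAPHEMES, limit=50):
    """Enumerate possible spellings for a recognized sound sequence.

    Sounds missing from the table are passed through unchanged.
    At most `limit` candidates are returned to bound the search.
    """
    options = [table.get(p, [p]) for p in phonemes]
    candidates = []
    for combination in product(*options):
        candidates.append("".join(combination))
        if len(candidates) >= limit:
            break
    return candidates


# A sound sequence as a subunit recognizer might output it for an unknown name.
candidates = grapheme_candidates(["f", "i", "g", "o"])
print(candidates)
```

Each candidate spelling could then be submitted as a search term; ranking the candidates, for instance by how often they occur in the retrieved documents, is left open here.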

Multimodal mobile devices that offer visual information during a conversation lead to user behavior that clearly differs from that observed with conventional mobile phones. In quiet surroundings, these devices are usually not held to the ear but at a certain, individually varying distance from the face ("face-to-face"). This behavior and the challenging conditions of the application scenarios (for example, the high noise level in a football stadium) clearly affect the quality and the signal-to-noise ratio (SNR) of the recorded speech signal. Without special adaptation of the speech recognition, a marked decline in recognition performance is to be expected.
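For reference, the SNR is commonly expressed in decibels as ten times the base-10 logarithm of the ratio of signal power to noise power. The sketch below applies this standard definition to two toy sample lists; the function names and the sample values are assumptions for illustration and do not describe how SmartWeb measures its recordings.

```python
import math


def mean_power(samples):
    """Average power of a sampled signal (mean of squared amplitudes)."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_signal = mean_power(signal)
    p_noise = mean_power(noise)
    if p_noise == 0:
        return math.inf   # no noise at all
    if p_signal == 0:
        return -math.inf  # no signal at all
    return 10.0 * math.log10(p_signal / p_noise)


# Toy data: speech at amplitude 1.0, background noise at amplitude 0.1.
speech = [1.0, -1.0] * 4
noise = [0.1, -0.1] * 4
print(f"SNR: {snr_db(speech, noise):.1f} dB")
```

A tenfold difference in amplitude corresponds to a hundredfold difference in power, that is, about 20 dB. Holding the device farther from the mouth lowers the speech amplitude at the microphone and thereby the SNR.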

Access to SmartWeb is gained either in the field via a PDA/smartphone (UMTS) with a server-based word recognizer or, in the mobile car scenario, via a recognizer built into the car. If supported by the PDA/smartphone, the interface can be supplemented with multimodal input, e.g., with a pen. In addition, a camera will monitor the face and its different positions. In all these cases we are confronted with difficult conditions that differ considerably from the laboratory recordings predominantly used up to now: from varying ambient and driving noise in the car scenario, to heavily varying background noise and changing light conditions in the field, to extreme background noise such as the chants of fans in the stadium. For this reason, speech recognition and dialogue processing would perform worse without the support of multimodal recognition and processing. Especially in mobile situations and when dealing with multiple domains, recognizing the relevant context and taking it into account are necessary to improve both the intuitive usability of the system and user satisfaction.

For the development of a mobile multimodal dialogue assistant in SmartWeb that can be used in open domains and across a wide range of topics, spoken language is the central mode of communication for human-computer interaction in the Semantic Web. In communication situations that technically allow the use of the whole range of multimodal functionality, the spoken output of the dialogue assistant has to be synchronized in space and time with the other output modalities, for example music as another form of acoustic output, or visual (text, graphics, pictures, video) and haptic presentations. The synchronization itself is performed by the multimedia presentation component, but the characteristics of the spoken output have to be adapted to the multimodal interaction as well. Within a spoken utterance this can be achieved, for example, through a deictic reference to objects that are presented simultaneously in another modality, or through appropriate prosody. In communication situations where the multimodal functionality is limited, speech output often plays the primary role.

Recent research results (e.g., in SmartKom but also in independent studies) underline the special importance of highly natural-sounding speech output. Users of dialogue systems accept artificial-sounding voices only very reluctantly; they expect speech output that has the voice quality of a natural voice and that comes close to the melodic and rhythmic structure of an utterance produced by a human being. Natural-sounding speech output generally reduces the cognitive load on the user of the dialogue system, which is especially important for use in a car. In contrast to natural speech stimuli, speech that is perceived as synthetic activates additional areas of the brain.

Last modified: Thu Jan 27 15:16:02 CEST