With the upcoming arrival of Society 5.0, a vast quantity of data will be accumulated in cyberspace through ubiquitous IoT devices and high-speed networks. A smart space guide is the key to organizing this raw, messy data and communicating its substance effectively to humans. Such an AI must have collaborative and relationship-building strategies to properly pick up on a visitor’s interests, curiosities and objectives, and to realize personalized guidance through conversational interaction.
The BLENDi (Blended Dialog) project is developing a smart space guide AI agent that can lead visitors through a ‘digital twin’ environment such as a science or art museum. The AI skillfully captures users’ implicit feedback and adaptively develops a unique story. This is achieved by organizing vast amounts of information both in advance and dynamically, in tune with user interests and curiosity.
We have defined evaluation metrics through demonstrative experiments, adopting criteria such as information transmission efficiency, learning effects, and willingness to revisit the museum as the performance indicators for our AI guide. This service provides solutions to enhance museum functionality and optimize operating costs.
The original idea behind BLENDi goes back to a spoken dialogue system named HANASHI-JOZU – meaning good conversationalist – which was developed by the Perceptual Computing Laboratory at Waseda University. The requirements for a “well-spoken” system were organized as follows:
Information transmission systems can be broadly classified into two types: push-type and pull-type. Push-type systems, e.g. radio, enable the passive consumption of information and have the advantage of easy information accessibility. Their disadvantage lies in their inflexibility: they do not allow a user to skip content or consume it out of order, and they require users to sit through content they may find uninteresting.
Conversely, we have pull-type systems, e.g. question-and-answer dialogue systems. These allow users to obtain the desired knowledge through active information acquisition. Pull-type systems have the opposite problem: users must keep asking questions to obtain the desired information, and posing such inquiries becomes burdensome. It is also difficult for the user to obtain a large quantity of information from a pull-type system.
The Perceptual Computing Laboratory at Waseda University has addressed these problems by developing a spoken dialogue system that enables users to access information by frequently switching between push and pull modes.
This system facilitates dialogue based on pre-planned scenarios – the “primary plan” and the “subplan”. The primary plan is generated by summarizing and colloquizing the target document, and its function is to explain the main points of specific source materials. The subplan, in contrast, ‘boosts’ the primary plan with supplementary material corresponding to user questions. As long as the user listens passively, the system communicates information in accordance with the primary plan. However, the user is free to ask a question at any point, after which the system branches into the subplan, provides information that addresses the request, and then resumes the primary plan. By preparing such scenarios in advance, the system achieves highly responsive, smooth information transfer via spoken dialogue [1, 2].
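The push/pull switching loop described above can be sketched as follows. This is a minimal sketch, not the system's actual architecture: `get_user_event` and the event-keyed `subplans` dictionary are hypothetical stand-ins for the speech-understanding front end and the prepared subplan scenarios.

```python
from collections import deque

def run_guide(primary_plan, subplans, get_user_event):
    """Walk through the primary plan (push mode); whenever the user asks a
    question, divert into the matching subplan (pull mode), then resume."""
    queue = deque(primary_plan)
    transcript = []
    while queue:
        utterance = queue.popleft()
        transcript.append(utterance)
        event = get_user_event(utterance)  # None means passive listening
        if event is not None and event in subplans:
            # Insert the subplan's supplementary material, then fall back
            # to the primary plan on the next loop iteration.
            transcript.extend(subplans[event])
    return transcript
```

A passive listener who never triggers an event simply receives the primary plan in order; a single question splices the corresponding subplan into the flow.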
Automatic primary plan generation: the requirements for primary plans – i.e. scenarios that explain the key information points of a specific source – include matching user interest, coherence, and non-redundancy. In this study, we formulated personalized primary plan generation as an integer linear programming problem over the sentences of each museum exhibit’s description: the objective function balances a high degree of user interest in each selected sentence against a low degree of similarity between selected sentences, subject to constraints on the discourse structure of each document and the total speaking time [4, 5].
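As a toy illustration of this objective – not the paper's actual formulation – a brute-force search over sentence subsets shows the interest-versus-redundancy trade-off under a speaking-time budget. The discourse-structure constraints are omitted here, and all scores and durations are invented inputs.

```python
from itertools import combinations

def select_sentences(interest, similarity, duration, time_budget, lam=1.0):
    """Pick the subset of sentence indices maximizing
    total interest - lam * pairwise similarity among selected sentences,
    subject to the total duration fitting the time budget.
    Exhaustive search: a toy stand-in for the ILP solver."""
    n = len(interest)
    best, best_score = [], float("-inf")
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            if sum(duration[i] for i in subset) > time_budget:
                continue  # would exceed the speaking-time budget
            score = sum(interest[i] for i in subset)
            score -= lam * sum(similarity[i][j]
                               for i, j in combinations(subset, 2))
            if score > best_score:
                best, best_score = list(subset), score
    return best, best_score
```

The real problem is NP-hard, so exhaustive search only works at toy scale; the project's later QUBO reformulation addresses exactly this scaling issue.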
Evaluation index: We defined the efficiency of information transmission (EoIT) as the harmonic mean of the percentage of interesting sentences included (coverage rate) and the percentage of un-engaging sentences excluded (exclusion rate). Using the created dataset, we confirmed that the EoIT of interest-oriented summaries was higher than that of summaries based on general importance [3, 4]. We also confirmed that personalized summaries were significantly more stimulating to users.
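The EoIT definition above is a direct harmonic mean, analogous to an F-score with the exclusion rate playing the role of precision's counterpart:

```python
def eoit(coverage_rate, exclusion_rate):
    """Efficiency of information transmission: harmonic mean of
    coverage rate (share of interesting sentences included) and
    exclusion rate (share of un-engaging sentences left out)."""
    if coverage_rate + exclusion_rate == 0:
        return 0.0
    return 2 * coverage_rate * exclusion_rate / (coverage_rate + exclusion_rate)
```

As with any harmonic mean, a summary must do well on both rates to score well: including everything drives the exclusion rate to zero and EoIT collapses with it.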
Swift personalization: This integer linear programming problem is NP-hard, so the time needed to find an optimal solution grows rapidly with the scale of the problem. We therefore developed a method to obtain semi-optimal solutions at high speed, taking advantage of quantum-inspired computing technology. Specifically, we formulated the primary plan generation problem as a quadratic unconstrained binary optimization (QUBO) problem and confirmed that a Digital Annealer – a simulated annealing-based Ising machine – could solve the QUBO problem efficiently and without constraint violations.
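A plain software simulated-annealing loop over a QUBO matrix illustrates the family of heuristic a Digital Annealer implements in hardware. This is a generic sketch, not the project's solver: the cooling schedule and parameters are arbitrary, and in a real reformulation the time-budget and discourse constraints would appear as penalty terms folded into the Q matrix.

```python
import math
import random

def qubo_energy(x, Q):
    """Energy of binary assignment x under QUBO matrix Q: sum of Q[i][j]*x[i]*x[j]."""
    n = len(x)
    return sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

def anneal_qubo(Q, steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Minimize a QUBO by single-bit-flip simulated annealing with a
    geometric cooling schedule (sketch only, not tuned)."""
    rng = random.Random(seed)
    n = len(Q)
    x = [rng.randint(0, 1) for _ in range(n)]
    e = qubo_energy(x, Q)
    best, best_e = x[:], e
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / steps)
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one bit
        e_new = qubo_energy(x, Q)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            e = e_new
            if e < best_e:
                best, best_e = x[:], e
        else:
            x[i] ^= 1  # reject: undo the flip
    return best, best_e
```

Unlike exhaustive search, each sweep costs only polynomial time, which is what makes semi-optimal solutions reachable quickly as the sentence count grows.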
We categorized user intentions that increase or decrease the amount of information conveyed by the system. We defined “Question”, “Request Supplement”, and “Repeat Request” as the user intentions that increase the quantity of information to transmit. “Disinterest” and “Already Known” were used to classify intentions that decrease the quantity of information to transmit. “Wait Request” was used to avoid speech overlap.
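The taxonomy above can be summarized as a lookup from intention label to its effect on the amount of transmitted information. The action strings are hypothetical glosses added for illustration; only the six labels and their increase/decrease grouping come from the text.

```python
# Hypothetical mapping from recognized user intention to a guide-side action
# (labels from the project's taxonomy; action descriptions are illustrative).
INTENT_ACTIONS = {
    "Question":           "insert matching subplan content",
    "Request Supplement": "insert supplementary detail",
    "Repeat Request":     "re-deliver the last utterance",
    "Disinterest":        "drop remaining detail on this topic",
    "Already Known":      "skip utterances covering known facts",
    "Wait Request":       "pause output to avoid speech overlap",
}

def information_delta(intent):
    """Sign of the change in transmitted information for each intention."""
    if intent in ("Question", "Request Supplement", "Repeat Request"):
        return +1
    if intent in ("Disinterest", "Already Known"):
        return -1
    return 0  # "Wait Request" only controls timing, not quantity
```

Such a table makes the dialogue policy explicit: every recognized intention either grows, shrinks, or merely delays the information stream.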
Short utterances – each about 1.5 seconds or less – obtained from recorded audio of the system’s interaction with the user, were annotated with the above user intentions. We then developed a model to identify user speech intention based on the prosodic and linguistic information contained within the user’s utterance, as well as the contextual information of the immediately preceding system utterance. Experiments using the constructed dataset confirmed the effectiveness of our proposed method.
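The fusion of the three information sources can be sketched as a linear scorer over concatenated feature vectors. This is a hypothetical stand-in for the trained model: the feature contents and weights are invented, and the real system presumably learns them from the annotated dataset.

```python
def classify_intent(prosodic, linguistic, context, weights):
    """Score each intention label with a linear model over the concatenation
    of prosodic, linguistic, and preceding-system-utterance features,
    and return the best-scoring label. (Illustrative stand-in only.)"""
    feats = prosodic + linguistic + context  # simple feature concatenation
    scores = {label: sum(w * f for w, f in zip(ws, feats))
              for label, ws in weights.items()}
    return max(scores, key=scores.get)
```

The design point it illustrates is that the preceding system utterance enters the classifier as ordinary features, so the same short acoustic event can be read differently depending on what the guide just said.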
The roles of utterances in the discourse structure were classified into “Nucleus”, “Preamble” and “Supplement”. By training the duration and acoustic models with these role labels as auxiliary features, we realized an effective speaking style that foregrounds the nucleus utterances. In a comprehension test based on a news article, the user group that received oral explanations via the proposed method scored higher than the group that received explanations from a conventional speech synthesizer. In addition, a method to identify sentiment labels through sentence sequence labeling was examined for the purpose of controlling the sentence-level emotion parameters of the speech synthesis system.