Recent advances in video conferencing have greatly improved remote video communication with features such as live captions and noise cancellation. However, there are various situations where dynamic visual augmentation would be useful to better convey complex and nuanced information. For example, when you’re debating what to order at a Japanese restaurant, your friends can share visuals to help you feel more confident about ordering Sukiyaki. Or when you talk about your last family trip to San Francisco, you can show a photo from your personal album.
In “Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals,” presented at ACM CHI 2023, we introduce a system that uses verbal cues to augment synchronous video communication with real-time visuals. We fine-tuned a large language model to proactively suggest relevant visuals in open-vocabulary conversations, using a dataset we curated for this purpose. We have open-sourced Visual Captions as part of the ARChat project, which is designed for rapid prototyping of augmented communication with real-time transcription.
![]()
Visual Captions facilitates verbal communication with real-time visuals. The system is robust even against typical errors that often appear in real-time speech-to-text transcription. For example, when the transcription model mistook the word “pier” for “pair,” Visual Captions still recommended images of the Santa Monica Pier.
A design space to enhance verbal communication with dynamic visuals
We invited 10 internal participants spanning a range of technical and non-technical backgrounds, including software engineers, researchers, UX designers, visual artists, students, and more, to discuss their specific needs and desires for a potential real-time visual augmentation service. In two sessions, we presented low-fidelity prototypes of the envisioned system, followed by video demonstrations of existing text-to-image systems. These discussions informed an eight-dimensional design space for visual augmentation of real-time conversations, denoted below as D1 through D8.
Visual augmentations can be synchronous or asynchronous with the speech (D1: Temporal), can be used both to express and to understand speech content (D2: Topic), and can draw on a wide range of visual content, visual types, and visual sources (D3: Visual). Such visual augmentation may also vary with the scale of the meeting (D4: Scale) and whether the meeting takes place in a co-located or remote setting (D5: Space). These factors also influence whether visuals should be displayed privately, shared between participants, or shown publicly to everyone (D6: Privacy). Participants also identified different ways in which they would like to interact with the system while having conversations (D7: Initiative). For example, people suggested different levels of “proactivity,” indicating the degree to which they would like the model to take the initiative. Finally, participants imagined different methods of interaction, such as using speech or gestures for input (D8: Interaction).
![]()
A design space to enhance verbal communication with dynamic visuals.
Informed by this initial feedback, we designed Visual Captions to focus on generating synchronous visuals of semantically relevant visual content, type, and source. While participants in these initial exploratory sessions engaged in one-to-one remote conversations, deployments of Visual Captions in the wild will often involve one-to-many scenarios (e.g., an individual presenting to an audience) and many-to-many scenarios (e.g., a discussion among multiple people in a meeting).
Because the visuals that best complement a conversation depend heavily on the context of the discussion, we needed a training set built for this purpose. Thus, we collected a dataset of 1595 quadruples of (1) language, (2) visual content, (3) visual type, and (4) visual source across a variety of contexts, including everyday conversations, lectures, and travel guides. For example, “I’d like to see it!” corresponds to the visual content “a smiling face”, the visual type “emoji”, and the visual source “public search”. “Did he tell you about our trip to Mexico?” corresponds to the visual content “photo from a trip to Mexico”, the visual type “photo”, and the visual source “personal album”. We publicly released this VC1.5K dataset to the research community.
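To make the structure of these quadruples concrete, here is a minimal sketch of how one VC1.5K-style record might be represented; the field names are illustrative and not the released schema.

```python
# Illustrative structure of one quadruple; field names are this sketch's own,
# not the schema of the released VC1.5K dataset.
example_quadruple = {
    "language": "Did he tell you about our trip to Mexico?",  # (1) spoken sentence
    "visual_content": "photo from a trip to Mexico",          # (2) what to show
    "visual_type": "photo",                                    # (3) kind of visual
    "visual_source": "personal album",                         # (4) where to find it
}
```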
Visual intent prediction model
To predict which visuals could complement a conversation, we built a visual intent prediction model by fine-tuning a large language model on the VC1.5K dataset. For training, we parsed each visual intent into the format “<Visual Type> of <Visual Content> from <Visual Source>”.
{"prompt": "<Previous Two Sentences> →", "completion": "<Visual Type 1> of "<Visual Type 1> from "<Visual Source 1>; <Visual Type 2> of "<Visual Type 2> from "<Visual Source 2>; ... \𝑛"}
Using this format, the system can handle open-vocabulary conversations and contextually predict the visual content, visual source, and visual type. Anecdotally, we have found that it outperforms keyword-based approaches, which fail on open-vocabulary examples such as “Your Aunt Amy will be visiting this Saturday” and cannot suggest relevant visual types or visual sources.
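On the output side, the predicted completion string has to be turned back into structured intents before any visuals can be retrieved. Below is a minimal parsing sketch, assuming well-formed “<type> of <content> from <source>” segments; the function name and error handling are this sketch's own, not part of the released system.

```python
import re

def parse_visual_intents(completion: str) -> list[dict]:
    """Parse a completion such as
    'photo of a trip to Mexico from personal album; emoji of a smiling face from public search'
    into structured visual intents. Malformed segments are simply skipped."""
    intents = []
    for segment in completion.strip().split(";"):
        match = re.match(r"\s*(.+?) of (.+?) from (.+)\s*$", segment)
        if match:
            visual_type, visual_content, visual_source = match.groups()
            intents.append({
                "visual_type": visual_type.strip(),
                "visual_content": visual_content.strip(),
                "visual_source": visual_source.strip(),
            })
    return intents
```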
![]()
Examples of visual intent predictions by our model.
We used 1276 (80%) examples from the VC1.5K dataset to fine-tune the large language model and the remaining 319 (20%) examples as test data. We measured the performance of the fine-tuned model with a token accuracy metric, i.e., the percentage of tokens in a batch that were correctly predicted by the model. During training, our model reached 97% training token accuracy and 87% validation token accuracy.
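For reference, here is a minimal sketch of how such a token accuracy metric can be computed over padded token-ID sequences; the padding convention is an assumption of this sketch rather than a detail from the paper.

```python
import numpy as np

def token_accuracy(pred_ids: np.ndarray, target_ids: np.ndarray, pad_id: int = 0) -> float:
    """Fraction of non-padding target tokens predicted exactly.

    Both arrays have shape [batch, seq_len]; positions equal to pad_id in the
    targets are excluded from the metric.
    """
    mask = target_ids != pad_id
    correct = (pred_ids == target_ids) & mask
    return float(correct.sum() / mask.sum())
```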
Performance
To evaluate the utility of the trained Visual Captions model, we asked 89 participants to perform 846 tasks. They were asked to rate six qualitative statements on a 7-point Likert scale from 1 (Strongly Disagree) to 7 (Strongly Agree). The majority of participants preferred having visuals in live conversations (Q1, 83% of ratings ≥ 5, Somewhat Agree). Moreover, they found the displayed visuals to be useful and informative (Q2, 82% ≥ 5), of high quality (Q3, 82% ≥ 5), and relevant to the original speech (Q4, 84% ≥ 5). Participants also found the predicted visual type (Q5, 87% ≥ 5) and visual source (Q6, 86% ≥ 5) to be accurate given the context of the respective conversation.
![]()
Technical evaluation results of the visual prediction model, as rated by the study participants.
Using this visual intent prediction model, we developed Visual Captions on the ARChat platform, which adds interactive widgets directly to the camera streams of video conferencing platforms such as Google Meet. As shown in the system workflow below, Visual Captions automatically captures the user’s speech, retrieves the most recent sentences, feeds them into the visual intent prediction model every 100 ms, retrieves relevant visuals, and then suggests those visuals in real time.
![]()
The system workflow of Visual Captions.
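To make this workflow concrete, here is a highly simplified sketch of the main loop; the callables (speech transcription, the intent model, retrieval, and the widget UI) are hypothetical stand-ins for the actual ARChat components.

```python
import time
from typing import Callable, List

def run_visual_captions_loop(
    get_recent_transcript: Callable[[], str],             # last few transcribed sentences
    predict_visual_intents: Callable[[str], List[dict]],  # wrapper around the fine-tuned LLM
    retrieve_visuals: Callable[[dict], List[str]],        # image search / personal album lookup
    show_visuals: Callable[[List[str]], None],            # render suggestions in the UI
    interval_s: float = 0.1,                              # run prediction roughly every 100 ms
) -> None:
    """Simplified main loop of a Visual Captions-style system (sketch only)."""
    last_transcript = ""
    while True:
        transcript = get_recent_transcript()
        # Only re-run prediction when the transcript has actually changed.
        if transcript and transcript != last_transcript:
            intents = predict_visual_intents(transcript)
            visuals = [img for intent in intents for img in retrieve_visuals(intent)]
            if visuals:
                show_visuals(visuals)
            last_transcript = transcript
        time.sleep(interval_s)
```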
Visual Captions provides three levels of proactivity when suggesting visuals (see the sketch after this list):
- Auto-display (high proactivity). The system autonomously searches for and displays visuals publicly to all meeting participants. No user interaction is required.
- Auto-suggest (medium proactivity). Suggested visuals are shown in a private scrolling view. The user then clicks a visual to display it publicly. In this mode, the system proactively recommends visuals, but the user decides when and what to show.
- On-demand-suggest (low proactivity). The system suggests visuals only when the user presses the spacebar.
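A minimal sketch of how these three modes might be dispatched in code; the enum and helper functions use this sketch's own naming and are not the actual ARChat implementation.

```python
from enum import Enum, auto
from typing import List

def display_publicly(visuals: List[str]) -> None:
    """Placeholder: share the visuals with all meeting participants."""
    print("Showing to everyone:", visuals)

def show_private_scroll(visuals: List[str]) -> None:
    """Placeholder: show suggestions privately; the user clicks one to share it."""
    print("Private suggestions:", visuals)

class Proactivity(Enum):
    AUTO_DISPLAY = auto()       # high: display publicly with no user interaction
    AUTO_SUGGEST = auto()       # medium: suggest privately, user clicks to share
    ON_DEMAND_SUGGEST = auto()  # low: suggest only when the user presses the spacebar

def handle_suggestions(mode: Proactivity, visuals: List[str],
                       space_pressed: bool = False) -> None:
    """Route suggested visuals according to the chosen proactivity level."""
    if mode is Proactivity.AUTO_DISPLAY:
        display_publicly(visuals)
    elif mode is Proactivity.AUTO_SUGGEST:
        show_private_scroll(visuals)
    elif mode is Proactivity.ON_DEMAND_SUGGEST and space_pressed:
        show_private_scroll(visuals)
```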
User studies: quantitative and qualitative evaluation
We evaluated Visual Captions in two user studies: a controlled lab study (n = 26) and an in-the-wild deployment study (n = 10). Participants found that real-time visuals facilitated live conversations by helping to explain unfamiliar concepts, resolve language ambiguities, and make conversations more engaging. Participants also reported different preferences for interacting with the system in situ, and noted that different levels of proactivity were preferred in different social scenarios.
![]()
Participants’ Task Load Index and Likert scale ratings (from 1, Strongly Disagree, to 7, Strongly Agree) of conversations without Visual Captions (“No VC”) and with the three Visual Captions modes: auto-display, auto-suggest, and on-demand-suggest.
Conclusions and future directions
This work proposes Visual Captions, a system for augmenting verbal communication with real-time visuals, trained on a dataset of 1595 visual intents collected from 246 participants and covering 15 topic categories. We publicly release the training dataset, VC1.5K, to the research community to support further research in this space. We have also built Visual Captions into ARChat, which facilitates video conferencing in Google Meet by transcribing meetings and augmenting camera video streams.
Visual Captions represents a significant step toward enhancing verbal communication with on-the-fly visuals. By understanding the importance of visual cues in everyday conversations, we can create more effective communication tools and improve the way people communicate.
Acknowledgments
This work is a collaboration across multiple teams at Google. Key contributors to the project include Xingyu “Bruce” Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Peggy Chi, Alex Olwal, and Ruofei Du.
We would like to extend our thanks to those on the ARChat team who provided assistance, including Jason Mayes, Max Spear, Na Li, Jun Zhang, Jin Jin, Yuan Ren, Adarsh Kowdle, Ping Yu, Darcy Philippon, and Ezgi Ozteljan. We would also like to thank the many people with whom we had insightful discussions and who provided feedback on the manuscript, including Eric Turner, Yinda Zhang, Feitong Tan, Danhang Tang, and Shahram Izadi. Finally, we thank our CHI reviewers for their insightful feedback.