Prompt Highlighter: Interactive Control for
Multi-Modal LLMs

Yuechen Zhang, Shengju Qian, Bohao Peng, Shu Liu, Jiaya Jia
1The Chinese University of Hong Kong, 2SmartMore
Control text generation by highlighting your prompt! Prompt Highlighter is a training-free inference pipeline that enables token-level user interactions for customized generation. Our method is compatible with both LLMs and VLMs.

This example is based on the original LLaVA-v1.5 13B, without any training.

Abstract

This study targets a critical aspect of multi-modal LLMs' (LLMs & VLMs) inference: explicit controllable text generation. Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation, yet bring less explainability and a heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats can improve outputs, designing specific and precise prompts for each task is challenging and often ineffective. To tackle this issue, we introduce a novel inference method, Prompt Highlighter, which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by classifier-free diffusion guidance, we form regular and unconditional context pairs based on highlighted tokens, demonstrating that autoregressive generation can be guided in a classifier-free way. Notably, we find that, during inference, guiding the models with highlighted tokens through the attention weights leads to more desirable outputs. Our approach is compatible with current LLMs and VLMs, achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5, our method secures 69.5 on the MMBench test set and 1552.5 on MME-perception.
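The classifier-free guidance idea from the abstract can be sketched as combining next-token logits from a regular (conditional) pass and an "unconditional" pass in which the highlighted context is masked. The following is a minimal sketch, not the paper's exact formulation; the guidance scale `gamma` is an assumed hyperparameter name.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, gamma=1.5):
    """Classifier-free guidance applied to autoregressive next-token logits.

    cond_logits:   logits computed from the full (regular) context
    uncond_logits: logits computed with highlighted tokens masked out
    gamma:         guidance scale; gamma = 1 recovers the conditional logits,
                   larger gamma pushes generation toward the highlighted context
    """
    return uncond_logits + gamma * (cond_logits - uncond_logits)

# Toy 3-token vocabulary example
cond = np.array([2.0, 0.5, -1.0])
uncond = np.array([1.0, 1.0, 0.0])
guided = cfg_logits(cond, uncond, gamma=2.0)
```

With `gamma = 1` the guided logits equal the conditional logits exactly, so the unguided baseline is recovered as a special case.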



An abstract pipeline of Prompt Highlighter. Users can control the focus of generation by marking out specific image regions or text spans. A token-level mask m is then created to guide the language model's inference.
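One simple way such a token-level mask m can steer attention is by biasing the pre-softmax attention scores at highlighted key positions. The sketch below illustrates this idea under an assumed multiplicative weight `alpha`; it is illustrative only and not the paper's exact reweighting rule.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reweighted_attention(scores, m, alpha=2.0):
    """Upweight highlighted key positions in an attention distribution.

    scores: raw attention scores over context tokens, shape (..., n)
    m:      binary highlight mask of shape (n,); 1 marks highlighted tokens
    alpha:  alpha > 1 shifts probability mass toward highlighted tokens;
            alpha = 1 leaves the distribution unchanged
    """
    # Adding log(alpha) before softmax multiplies those weights by alpha
    return softmax(scores + np.log(alpha) * m)

scores = np.array([1.0, 1.0, 1.0])  # uniform raw scores over 3 tokens
m = np.array([0.0, 1.0, 0.0])       # middle token is highlighted
attn = reweighted_attention(scores, m, alpha=2.0)
```

Because the bias is added in log-space, the highlighted token's weight is exactly doubled relative to the others before renormalization.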



Application 1: Multi-Modal Partial Context Highlighter

Click image to get customized text generation results.

User: Describe this image.



User: Write a dialog based on this image.



User: Please give me a detailed plan to eat healthy and to lose weight.



Application 2: Control the Focus Degree

Drag the slider to control the "compactness" of the generated text.

User: Write a summary of A Midsummer Night's Dream, make it compact.
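A slider like this can be wired to the guidance strength behind the scenes. The mapping below is purely hypothetical (the demo's actual mapping is not specified on this page): it linearly interpolates a slider position in [0, 1] to a guidance scale, where the low end recovers the unguided baseline.

```python
def slider_to_scale(t, lo=1.0, hi=3.0):
    """Map a slider position t in [0, 1] to a guidance scale.

    t = 0 gives the unguided baseline (scale = lo = 1);
    t = 1 gives the strongest focus on the highlighted tokens.
    The range [lo, hi] is an assumed choice, not from the paper.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("slider position must be in [0, 1]")
    return lo + t * (hi - lo)
```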






Application 3: Faithful Descriptive Text Generation

Highlighting all input contexts generates faithful and better-aligned text. We further present text-to-image generation results from DALL·E 3 to show the effectiveness of our method.


We observe that highlighting image contexts yields faithful text while reducing model hallucination. Our training-free inference pipeline also achieves better results on common VL comprehension benchmarks.

| Method | MME-perception | MMBench-dev | MMBench-test |
| --- | --- | --- | --- |
| Baseline (LLaVA-v1.5-13B) | 1531.3 | 67.7 | 67.0 |
| Prompt Highlighter | 1552.5 | 69.7 | 69.5 |


Application 4: Interactive Conversation

Here we present an example of multi-round interactive conversation powered by Prompt Highlighter. The multi-round interactive conversation pipeline is illustrated at the top, and we compare vanilla inference (left) with Prompt Highlighter (right) in a multi-round conversation. In this example, the user highlights different contexts in the two rounds.
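Supporting per-round highlights requires the token-level mask to grow with the conversation. A minimal sketch of such bookkeeping, with assumed data representations (flat 0/1 mask, per-round highlight indices), is:

```python
def extend_highlight_mask(prev_mask, new_tokens, new_highlights):
    """Extend a token-level highlight mask across conversation rounds.

    prev_mask:      list of 0/1 flags for every token accumulated so far
    new_tokens:     tokens appended in the current round
    new_highlights: set of indices (within new_tokens) the user highlighted
                    in this round
    """
    return prev_mask + [
        1 if i in new_highlights else 0 for i in range(len(new_tokens))
    ]

# Round 1: user highlighted token 0 of a 2-token turn
mask = extend_highlight_mask([], ["red", "car"], {0})
# Round 2: user highlights a different span of the new turn
mask = extend_highlight_mask(mask, ["make", "it", "blue"], {2})
```

Each round simply appends flags for the new tokens, so earlier highlights remain in effect while new ones can target different contexts.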


More Showcases


Interactive Text Generation with LLMs



Interactive Text Generation with VLMs



Faithful Text Generation, then T2I Generation Results





BibTeX

@article{zhang2023prompt,
      title={Prompt Highlighter: Interactive Control for Multi-Modal LLMs},
      author={Yuechen Zhang and Shengju Qian and Bohao Peng and Shu Liu and Jiaya Jia},
      year={2023},
      journal={arXiv preprint arXiv:2312.04302},
}