“Volcano erupting with lava” on madebyai.ai

Create a Visual Chatbot on AWS EC2 with LLaVA-1.5

Get started with multimodal conversational models using the open-source LLaVA-1.5 model, Hugging Face Transformers, and Runhouse.

Denis K

Engineer @ Runhouse

Matt Kandler

Engineer @ 🏃‍♀️Runhouse🏠

November 28, 2023

The full Python code for this tutorial, including standing up the necessary infrastructure, is publicly available in this GitHub repo for you to try for yourself.

Introduction

Multimodal conversational models represent a leap forward from text-only AI, combining the strengths of Large Language Models and Reinforcement Learning from Human Feedback (RLHF) to tackle problems that require language plus additional modalities (e.g. images and text). The visual capabilities of GPT-4V(ision) have been a recent highlight, yielding a sophisticated model proficient in both language and image comprehension within the same context. Though GPT-4V is undeniably advanced, the proprietary nature of such closed-source models often restricts the scope for research and innovation. The medical industry, for example, often requires FDA approval for new technologies, and relying on a model that produces varying results from month to month presents obvious reproducibility challenges during audits.

Fortunately, the landscape is evolving with the introduction of open-source alternatives, democratizing access to vision-language models. Deploying these models is not trivial, especially on self-hosted hardware.

Runhouse is an open-source platform that makes it easy to deploy and run your machine learning application on any cloud or self-hosted infrastructure. In this tutorial, we will guide you step-by-step on how to create your own vision chat assistant that leverages the innovative LLaVA-1.5 (Large Language and Vision Assistant) multimodal model, as described in the Visual Instruction Tuning paper. After a brief overview of the LLaVA-1.5 model, we'll delve into the implementation code to construct a vision chat assistant, utilizing resources from the official code repository. Runhouse will allow us to stand up the necessary infrastructure and deploy the visual chatbot application in just 4 lines of Python code (!!).

What is LLaVA-1.5?

The LLaVA model was introduced in the paper Visual Instruction Tuning, and then further improved in Improved Baselines with Visual Instruction Tuning (also referred to as LLaVA-1.5).

The core idea behind it is to extract visual embeddings from an image and treat them the same way as embeddings coming from textual language tokens, by feeding them to a Large Language Model (Vicuna). To choose the “right” embeddings, the model uses a pre-trained CLIP visual encoder to extract the visual embeddings and then projects them into the word embedding space of the language model. The projection is accomplished by a vision-language connector, originally a simple linear layer in the first paper and later replaced with a more expressive Multilayer Perceptron (MLP) in Improved Baselines with Visual Instruction Tuning. The architecture of the model is depicted below.
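Schematically, the data flow described above can be sketched in pseudocode (the names here are illustrative, not the repository's actual identifiers):

```
image_feats   = clip_vision_encoder(image)         # patch features from the frozen CLIP encoder
image_embeds  = mlp_projector(image_feats)         # project into the LLM's word embedding space
text_embeds   = embed_tokens(tokenize(prompt))     # ordinary text token embeddings
answer        = language_model(concat(image_embeds, text_embeds))
```

Only `mlp_projector` is trained from scratch; the vision encoder and language model start from pre-trained weights.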

LLaVA-1.5 model architecture diagram
Source: Improved Baselines with Visual Instruction Tuning

One of the advantages of the method is that by using a pre-trained vision encoder and a pre-trained language model, only the lightweight vision-language connector must be learned from scratch.

One of the resulting model versions, LLaVA-1.5 13B, achieved state-of-the-art results on 11 benchmarks with only simple modifications to the original LLaVA. It uses only public data, completes training in about one day on a single 8×A100 node, and surpasses methods that use billion-scale data (source).

Because the LLaVA model is so lightweight to train and fine-tune, novel domain-specific assistants can be created in no time. One such example is LLaVA-Med; more on it in a subsequent post.

LLaVA-1.5 Visual Chatbot Implementation

The full Python code is available in this GitHub repo so that you can try it yourself.

Creating a multimodal chatbot using the code provided in the official repository is relatively straightforward. The repository provides standardized chat templates that can be used to parse the inputs into the right format. Following the same format used during training and fine-tuning is crucial for the quality of the answers generated by the model. The exact template depends on the specific variant of the language model used. The template for LLaVA-1.5 with a pre-trained Vicuna language model looks like this:
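The shape of the template is roughly as follows (reconstructed from the conversation templates in the official LLaVA repository; the exact wording may differ slightly between versions):

```
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.

USER: <im_start><image><im_end>
<first user message> ASSISTANT: <model answer></s>
USER: <follow-up message> ASSISTANT: ...
```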

The first few lines are the general system prompt used by the model. The special tokens <im_start>, <image>, and <im_end> are used to indicate where embeddings representing the image will be placed. The chatbot can be defined in just one simple Python class. Let’s highlight the outline of the class:
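A sketch of that outline is shown below. Method bodies are elided here, and the `model_id` default is an assumed Hugging Face model name; in the full code (see the linked repo) the class also inherits from `rh.Module` so it can be deployed remotely with Runhouse.

```python
class LlavaModel:  # in the repo: class LlavaModel(rh.Module)
    def __init__(self, model_id="liuhaotian/llava-v1.5-13b"):
        self.model_id = model_id
        # Populated lazily by load_models
        self.model, self.tokenizer, self.image_processor = None, None, None
        self.conv = None          # current conversation thread
        self.image_tensor = None  # processed image for the current chat

    def load_models(self):
        """Load the language model, tokenizer, and image processor (8-bit via BitsAndBytes)."""

    def load_image(self, img_path):
        """Load an image from a local path or URL and convert it to a tensor."""

    def generate_answer(self, **kwargs):
        """Generate the model's next answer in the current conversation."""

    def get_conv_text(self):
        """Return the raw text of the conversation so far."""

    def start_new_chat(self, img_path, prompt):
        """Start a new conversation about an image."""

    def continue_chat(self, prompt):
        """Continue the existing conversation about the image."""
```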

Let’s walk through the methods of the class defined above.

  • load_models: loads the language model, the tokenizer, and the image processor with the specified parameters for quantization using the BitsAndBytes library. The code is built upon the from_pretrained method used by Hugging Face transformers models. BitsAndBytes allows quantizing the model to 8-bit or 4-bit for reduced GPU memory requirements. In our case, we are using the 8-bit quantization, which fits into one NVIDIA A10G GPU.
  • load_image: loads the image from either a local path or a URL and converts it to a tensor using the image processor. Note that it would need to be adjusted for different quantization settings.
  • generate_answer: returns the model's generated answer, potentially continuing the current conversation about the provided image. Again, the generate method of the LLaVA-1.5 model is similar to that of Hugging Face transformers models. Note that, as of this writing, work on integrating LLaVA-1.5 into the HF transformers main branch is still in progress (see this and this).
  • get_conv_text: returns the raw text of the conversation so far.
  • start_new_chat: one of the two main methods of the model, used to start a new chat. It creates a new conversation thread given (1) the image and (2) the initial textual prompt, and takes care of setting up the conversation using the templates defined in the repository, following the format discussed in the previous section.
  • continue_chat: the second main method, it continues an existing conversation about an image.

Now let’s look at the Runhouse-specific code in detail.

The LlavaModel class inherits from the Runhouse Module class. Modules represent classes that can be sent to and used on remote clusters and environments. They support remote execution of methods as well as setting public and private remote attributes.
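The cluster definition looks along these lines (a sketch based on the Runhouse API as of late 2023; treat the exact call and argument names as assumptions):

```python
import runhouse as rh

# Request an on-demand cluster with a single NVIDIA A10G GPU.
# Runhouse provisions it from a configured cloud provider when first used.
gpu = rh.ondemand_cluster(name="rh-a10x", instance_type="A10G:1")
```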

This command defines a new on-demand cluster named rh-a10x utilizing 1 NVIDIA A10G GPU. A prerequisite to this command is setting up at least one cloud provider using the following documentation. For the purpose of this tutorial, we’ve used AWS.

Now that we’ve created our multimodal chatbot and defined an on-demand cluster to deploy to, we’ll walk through running the program and what an example output might look like.
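The deployment itself can be sketched as follows, assuming a cluster handle `gpu` (e.g. from rh.ondemand_cluster) and the LlavaModel class from above; the instance name "llava" is illustrative:

```python
# Deploy the chatbot to the GPU cluster, reusing an existing remote
# instance with this name if one is already running there.
remote_llava = LlavaModel().get_or_to(gpu, name="llava")
```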

The get_or_to method is an alternative to the simpler to function that allows us to deploy a LlavaModel instance to the gpu cluster defined above. It provides a way to reuse an existing instance if one is found with the specified name.

Running our Visual Chatbot

Now that our model is deployed to an on-demand cluster, we’re able to run the conversational UI using the start_new_chat method. We’ll be asking questions about a matcha hot dog image, to test the model’s ability to understand an unnatural image.

Hot dog with matcha generated by Stable Diffusion
"A hot dog made of matcha powder." generated by Stable Diffusion
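Driving the chat then looks roughly like this, assuming `remote_llava` is the deployed LlavaModel instance; the URL and prompts are illustrative placeholders, not the repo's exact values:

```python
image_path = "https://example.com/matcha-hot-dog.png"  # hypothetical public link to the image
prompt = "What is unusual about this image?"

# Start a new conversation about the image and print the model's answer.
print(remote_llava.start_new_chat(img_path=image_path, prompt=prompt))

# Ask a follow-up question in the same conversation thread.
print(remote_llava.continue_chat(prompt="Would you recommend eating it?"))
```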

The image_path corresponds to a publicly available link to the hot dog and the prompt is an initial question to ask our model. In addition, we’ll run the continue_chat method to demonstrate that this is an actual chat interface by asking a follow up question about the image.

After running this the first time to set up the infrastructure, you'll see the cluster being provisioned, followed by the model's responses. (Notice how logs and stdout are streamed back to you as if the application were running locally.)

If you want to try it for yourself, this tutorial is hosted in this public GitHub repo.

Conclusion

Visual chat models are a major step forward from text-only AI that introduce vision capabilities to conversations. For certain applications, self-hosting is crucial for auditability, reproducibility, and controlling the accuracy and performance of the application. This can be a hard requirement for certain medical and financial use cases. In addition, deploying with Runhouse can help reduce training and inference costs by automatically selecting cloud providers based on price and availability.

In a subsequent post we’ll explore use cases leveraging LLaVA-Med and other potential medical field machine learning applications.

Stay up to speed 🏃‍♀️📩

Subscribe to our newsletter to receive updates about upcoming Runhouse features and announcements.