In the last few weeks, major players like OpenAI, Google, and Meta all introduced multimodal versions of their services. Multimodal AI refers to systems that can process and generate multiple types of data—such as text, images, and audio—simultaneously, mimicking human perception and enabling AI to understand and interact with the world in a more holistic manner. But what does that actually mean?
What is Multimodal AI?
It means that everything you know about AI is about to change.
For the last year or so, everyone has been obsessed with generative AI, which enables you to create new content such as text, visuals, or audio from a simple text description.
The idea behind multimodal AI systems is to take that a step further: not only handling different types of inputs and outputs, but analyzing them together and generating a relevant output, such as a caption for an image or a detailed analysis combining both data types.
But more than just generating images or even video from text, multimodal AI involves the ability to use these data types as input -- for example, giving the AI access to your phone camera and having it tell you things about what it sees. Google's demo included someone walking around their office space with a phone camera, asking the AI questions about what it was seeing, and then asking where she'd left her glasses. The AI was able to tell her exactly where to find them.
As you might guess, this capability significantly enhances the AI's ability not only to provide more contextually rich and accurate responses, but to act in a more "human-like" way, making it easier to integrate into human lives.
Why is Multimodal AI Important?
Again, the significance of multimodal AI lies in its potential to revolutionize various applications by providing more intuitive and, more importantly, more human-like interactions.
Multimodal AI can improve accessibility tools, enhance the user experience with virtual assistants, and enable more sophisticated content creation and analysis. In healthcare, it could analyze medical images and patient records simultaneously, leading to better diagnostics. In education, it could offer personalized learning experiences by understanding and responding to different types of input from students. A customer service AI could watch a video of a user's difficulties and tell them what's wrong and how to fix it.
OpenAI, Google, and Meta jump on the multimodal bandwagon
Like everything else in generative AI, it's been a competition where one of the major players develops and announces new features (whether they're ready or not -- I'm looking at you, Sora), and the others bend over backwards to top it. (Which is great for consumers, because the companies are pushing each other to create new features quickly. Competition at its best.)
OpenAI
The first to make announcements was OpenAI -- after rescheduling their launch event to the day before Google's presentation. The major announcement was GPT-4o, with the "o" standing for "omni," as in "all things".
GPT-4o is OpenAI's latest flagship model, providing enhanced multimodal capabilities that support text, image, and audio inputs. This update also focuses on improving natural language understanding and generation.
The key features of the announcement were seamless integration of text and images and enhanced natural language processing and generation, but GPT-4o also enables better coding assistance and code review, among other functions.
Developers can also integrate GPT-4o into their applications (at half the cost per request of GPT-4) through the OpenAI API.
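To make that concrete, here's a minimal sketch of what calling GPT-4o with mixed text-and-image input looks like through the OpenAI Python SDK. The prompt, image URL, and token limit are placeholders of my own; treat this as an illustration rather than OpenAI's official sample.

```python
# A minimal sketch: sending text plus an image to GPT-4o through the
# Chat Completions API. The prompt and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is happening in this picture?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/office-photo.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```

The interesting part is that the image and the question travel in the same message, so the model reasons over both at once instead of treating them as separate requests.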
But of course OpenAI's announcements went beyond the model itself. Another major one was ChatGPT's ability to do real-time, interactive analysis of Excel spreadsheets.
In addition, GPT-4o is now available in 50 different languages, and supports real-time conversations in which it can even understand and express emotion.
Google
Not to be outdone (if at all possible), Google devoted almost two hours of its I/O keynote to new features for its own chatbot, Gemini, and there was a lot to unpack. To start, they're positioning Gemini as a powerful multimodal AI assistant designed to integrate seamlessly with Google's suite of services, but there is so much more.
For example:
- Project Astra, Google's vision for the future of AI assistants.
- Audio Overviews for NotebookLM, which uses a collection of uploaded materials to create a verbal discussion personalized for the user.
- Grounding with Google Search, where the Gemini model can get up-to-date information from the internet, is now generally available in Vertex AI Studio.
- Imagen 3, which "understands natural language and intent behind your prompts and incorporates small details from longer prompts. This helps it generate an incredible level of detail, producing photorealistic, lifelike images with far fewer distracting visual artifacts than our prior models," Google's Molly McHugh-Johnson explained. Imagen 3 is also better at rendering text. You can sign up to join the waitlist, and Imagen 3 will be available in Vertex AI this summer.
- Veo creates high-quality 1080p resolution videos that can go beyond a minute, in a wide range of cinematic and visual styles. You can sign up for the waitlist, or wait until it's available with YouTube Shorts.
- Music AI Sandbox is a suite of music AI tools that let you create new music or transfer styles from other pieces.
- Or you can use VideoFX to turn an idea into a video clip, including a Storyboard mode.
- Gemini Advanced now has a 1 million token context window, courtesy of Gemini 1.5 Pro, and can make sense of 1,500-page PDFs (see the sketch after this list for what that looks like through the Gemini API).
- Gemini Live (part of Gemini Advanced, Google's $20/month subscription) is a new, mobile-first conversational experience.
- Gemini Advanced subscribers will soon be able to create Gems, similar to OpenAI's GPTs, which are specialized versions of Gemini.
- AI Overviews in Search are rolling out to everyone in the U.S. beginning this week with more countries coming soon. This includes multi-step reasoning, such as “find the best yoga or pilates studios in Boston and show details on their intro offers and walking time from Beacon Hill.”
- And of course all of this will eventually be connected to your Google services such as Gmail and Google Docs.
- Thankfully, open source developers are not left out of the equation. Google also announced PaliGemma, a vision-language open model optimized for visual Q&A and image captioning, as well as Gemma 2, with a 27B-parameter instance that reportedly outperforms models twice its size and runs on a single TPU host.
- Gemini Nano will be built into Chrome 126.
- On the responsible AI front, Google is enhancing red teaming, where you try to break your own models. They're also expanding SynthID to text and video and will be open sourcing SynthID text watermarking through their updated Responsible Generative AI toolkit in the next few months.
- Then there's Gemini 1.5 Flash, a lighter, faster model that's more than suitable for most applications.
- Illuminate can generate an audio conversation between two AI-generated voices, providing an overview of the key insights from research papers. You can sign up to try it today at labs.google.
- Finally, I want to give Google props for coming up with what may be the best developer prize ever. The prize for the Gemini API Developer Competition is an electrically retrofitted custom 1981 DeLorean. These people know their tribe.
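As promised above, here's a rough sketch of what tapping into that 1 million token context window can look like, using the google-generativeai Python SDK. The API key, file name, model string, and prompt are all placeholder choices of mine, and the File API's supported formats and quotas vary, so check Google's current documentation before relying on any of it.

```python
# A hedged sketch (not Google's official sample) of asking Gemini 1.5 Pro
# questions about a very large document via its long context window.
# The API key, file name, and prompt below are placeholders, and I'm
# assuming the File API accepts PDFs in your region and SDK version.
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")

# Upload the document through the File API, then pass it in with a text prompt.
report = genai.upload_file("annual-report.pdf")  # imagine a 1,500-page PDF

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [report, "Summarize the key findings and list the figures the report cites."]
)

print(response.text)
```

The appeal of the huge context window is that the whole document goes in as-is; there's no chunking or retrieval pipeline to build before you can ask questions about it.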
Maybe it's just as well that OpenAI announced first, because there's an awful lot to digest here.
Meta
Of course, both GPT-4o and Gemini are proprietary models. But there's hope for us open source practitioners as well. Yesterday Meta introduced Chameleon, its own multimodal model. Researchers have trained both a 7-billion- and a 34-billion-parameter version of Chameleon, and they claim that "On visual question answering (VQA) and image captioning benchmarks, Chameleon-34B achieves state-of-the-art performance, outperforming models like Flamingo, IDEFICS and Llava-1.5."
What's significant here is that if Meta follows the path it took with the Llama LLM, it will open source the model, providing developers and users with an alternative to GPT-4o and Gemini.
Conclusion
The advancements in multimodal AI by OpenAI, Google, and Meta mark a significant step forward in making AI interactions more natural and human-like. These developments not only enhance the user experience across various applications but also open new possibilities for innovation in multiple fields. As these technologies become more accessible, we can expect to see a transformative impact on how we interact with AI on a daily basis. Whether it's through OpenAI's GPT-4o, Google's Gemini, or Meta's Chameleon, open source or commercial, the future of AI is undoubtedly multimodal, and that is going to change everything.