
Google’s Gemini Embedding 2 lets AI understand text, images and video together

Google has launched Gemini Embedding 2, a multimodal AI model that can process text, images, audio and video together. Here’s what it does and how developers can use it.

Published By: Shubham Arora | Published: Mar 11, 2026, 12:57 PM (IST)


Google has launched a new AI model called Gemini Embedding 2, designed to help AI systems handle different types of information, such as text, images, audio and video, within a single system. The model is currently available in public preview through the Gemini API and Vertex AI, Google said in a blog post announcing the launch.

What Gemini Embedding 2 does

Gemini Embedding 2 is Google’s first embedding model that works across multiple media formats at once. Instead of processing text, images or videos separately, the model maps all of them into the same “embedding space”. In simple terms, it helps an AI system understand that a concept shown in a photo, mentioned in text, or spoken in audio can refer to the same thing.

Earlier embedding models were usually designed for text only. With Gemini Embedding 2, the system can look at different types of content together. For example, it can process a document that contains text and images at the same time.

Google says this approach simplifies how AI systems search, organise and analyse data. It can help with tasks such as semantic search, Retrieval-Augmented Generation (RAG), sentiment analysis and clustering large datasets.
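The semantic search use case comes down to comparing vectors in that shared embedding space: items whose embeddings point in similar directions are treated as related, whatever their original format. Here is a minimal sketch using toy hand-written vectors in place of real model output (the values and item names are illustrative, not actual Gemini Embedding 2 output):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings standing in for model output.
# In a shared embedding space, a photo of a dog and the text
# "a dog playing" should land close together, regardless of modality.
corpus = {
    "photo_of_dog.png": [0.9, 0.1, 0.0, 0.1],
    "text: 'a dog playing'": [0.8, 0.2, 0.1, 0.0],
    "text: 'stock market report'": [0.0, 0.1, 0.9, 0.3],
}
query = [0.85, 0.15, 0.05, 0.05]  # pretend embedding of the query "dog"

# Rank corpus items by similarity to the query; the dog photo wins
# even though the query was text, which is the point of one shared space.
best = max(corpus, key=lambda k: cosine_similarity(query, corpus[k]))
print(best)
```

The same similarity scores could drive RAG retrieval or clustering; only the downstream use of the ranked neighbours changes.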

Supported inputs and capabilities

The model supports multiple formats of input. For text, Gemini Embedding 2 can process up to 8,192 tokens in a single request. It can also analyse images, videos and audio directly.

According to Google’s documentation, the model can handle up to six images at a time in PNG or JPEG formats. For video, it supports clips up to 120 seconds in MP4 or MOV format. The model can also process audio directly, without needing to convert it into text first.
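Since the preview enforces these limits server-side, it can be useful to check them client-side before sending a request. The sketch below encodes only the limits listed above (six PNG/JPEG images, 120-second MP4/MOV clips, six-page PDFs); the function and field names are illustrative, not part of any official SDK:

```python
# Limits as described in Google's documentation for the preview.
MAX_IMAGES = 6
IMAGE_FORMATS = {"png", "jpeg", "jpg"}
MAX_VIDEO_SECONDS = 120
VIDEO_FORMATS = {"mp4", "mov"}
MAX_PDF_PAGES = 6

def validate_request(images=(), video=None, pdf_pages=0):
    """Return a list of problems; an empty list means the inputs
    fit the documented preview limits."""
    errors = []
    if len(images) > MAX_IMAGES:
        errors.append(f"too many images: {len(images)} > {MAX_IMAGES}")
    for name in images:
        ext = name.rsplit(".", 1)[-1].lower()
        if ext not in IMAGE_FORMATS:
            errors.append(f"unsupported image format: {name}")
    if video is not None:
        fmt, seconds = video  # (container format, duration in seconds)
        if fmt.lower() not in VIDEO_FORMATS:
            errors.append(f"unsupported video format: {fmt}")
        if seconds > MAX_VIDEO_SECONDS:
            errors.append(f"video too long: {seconds}s > {MAX_VIDEO_SECONDS}s")
    if pdf_pages > MAX_PDF_PAGES:
        errors.append(f"PDF too long: {pdf_pages} pages > {MAX_PDF_PAGES}")
    return errors

# A GIF image and a 150-second clip both violate the limits above.
print(validate_request(images=["a.png", "b.gif"], video=("mp4", 150)))
```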

In addition, the model can embed documents such as PDFs up to six pages long. Developers can also send mixed inputs in a single request. For example, a request can include both text and images together.
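A mixed request is essentially a list of typed content parts. The sketch below builds such a payload as a plain dictionary; the wire format, model id and field names (`inline_data`, `mime_type`) are assumptions modelled loosely on the existing Gemini API convention, not the confirmed Gemini Embedding 2 schema:

```python
import base64

def make_part(kind, value):
    """Build one content part for a mixed request.
    The payload shape here is illustrative only."""
    if kind == "text":
        return {"text": value}
    if kind == "image":
        # Binary image data is base64-encoded for transport.
        return {"inline_data": {"mime_type": "image/png",
                                "data": base64.b64encode(value).decode("ascii")}}
    raise ValueError(f"unsupported part kind: {kind}")

# One request mixing text and an image, as the article describes.
request = {
    "model": "gemini-embedding-2",  # illustrative model id
    "contents": [
        make_part("text", "A chart of quarterly revenue"),
        make_part("image", b"\x89PNG...fake bytes..."),
    ],
}
print(len(request["contents"]))
```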

How developers can use it

Gemini Embedding 2 is currently available in public preview through Google’s Gemini API and Vertex AI platform. Developers can use the model in applications that involve tasks such as semantic search, classification or recommendations.

Gemini Embedding 2 allows developers to change the size of the embeddings depending on their needs. This helps manage storage costs and performance when working with large datasets.
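One common way such resizable embeddings are implemented (for example, via Matryoshka-style training in Google's earlier text embedding models) is to keep only the first k dimensions of the full vector and renormalize. Whether Gemini Embedding 2 works exactly this way is an assumption here; the sketch just shows the storage trade-off:

```python
import math

def truncate_embedding(vec, dim):
    """Shrink an embedding by keeping the first `dim` values and
    renormalizing to unit length (Matryoshka-style truncation)."""
    if dim > len(vec):
        raise ValueError("requested dim larger than embedding")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.6, 0.8, 0.0, 0.0]          # pretend 4-d model output
small = truncate_embedding(full, 2)   # half the storage per vector
print(small)
```

Smaller vectors mean cheaper storage and faster similarity comparisons, at some cost in retrieval quality, which is why letting developers pick the size per workload is useful.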
