Microsoft expands AI beyond text with new image, voice and transcription models

Microsoft has launched new AI models for images, audio, and speech-to-text. Here’s how they work and where they will be used.

Published By: Shubham Arora | Published: Apr 04, 2026, 07:00 AM (IST)

Microsoft introduces new AI models focused on images, audio and speech.

techlusive.in | News

Microsoft has introduced a new set of AI models that aren’t just focused on text anymore. The company has announced three models focused on image generation, voice output, and speech-to-text transcription. These are being positioned as part of Microsoft’s broader push to expand its AI ecosystem, especially for developers and enterprise users.

The new models are available through Microsoft’s Foundry platform and are also expected to show up across products like Copilot, Bing, and PowerPoint over time.

What the new AI models include

Microsoft has introduced three models: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. Each of these models is meant for a different kind of use.

The transcription model is built to turn speech into text. It supports multiple languages and can be used for things like meeting notes, captions, or voice-based tools. Microsoft says it works well across commonly used languages, based on its own testing.

Then there’s the voice model, which can generate audio. It is designed to sound natural and can keep the tone consistent even in longer clips. It can generate audio quite quickly too, with longer clips ready in just a few seconds.

The image model is the second generation of Microsoft’s in-house tool. It focuses on improving output quality, including better lighting, textures, and clearer text within images.

Where these models will be used

These models are not limited to developers. Microsoft also plans to bring them into its existing products.

Some of these features will likely show up inside Copilot, mainly for audio and content tasks. The image model is also coming to Bing and PowerPoint, which people already use for making visuals.

It looks like Microsoft is trying to bring these tools into the apps people already use, instead of keeping them as separate features.

Microsoft expands AI beyond text

So far, most AI tools have focused heavily on text-based interactions. Microsoft has already been pushing Copilot as a productivity tool across Office and its cloud services.

With these new additions, the focus is clearly moving towards using AI across different formats: text, audio, and images. Whether it’s creating audio, generating images, or turning speech into text, these capabilities are now part of the same setup, which should make them easier to use together.