Microsoft expands AI beyond text with new image, voice and transcription models

Microsoft has launched new AI models for images, audio, and speech-to-text. Here’s how they work and where they will be used.

Published By: Shubham Arora | Published: Apr 04, 2026, 07:00 AM (IST) | Edited: Apr 04, 2026, 07:00 AM (IST)

Microsoft has introduced a new set of AI models that aren't just focused on text anymore. The company has announced three models focused on image generation, voice output, and speech-to-text transcription. These are being positioned as part of Microsoft's broader push to expand its AI ecosystem, especially for developers and enterprise users.

The new models are available through Microsoft's Foundry platform and are also expected to show up across products like Copilot, Bing, and PowerPoint over time.

What the new AI models include

Microsoft has introduced three models -- MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. Each of these models is meant for a different kind of use.

The transcription model is built to turn speech into text. It supports multiple languages and can be used for things like meeting notes, captions, or voice-based tools. Microsoft says it works well across commonly used languages, based on its own testing.

Then there's the voice model, which can generate audio. It is designed to sound natural and can keep the tone consistent even in longer clips. It can generate audio quite quickly too, with longer clips ready in just a few seconds.

The image model is the second generation of Microsoft's in-house tool. It focuses on improving output quality, including better lighting, textures, and clearer text within images.

Where these models will be used

These models are not limited to developers. Microsoft also plans to bring them into its existing ecosystem.

Some of these features will likely show up inside Copilot, mainly for audio and content tasks. The image model is also coming to Bing and PowerPoint, which many people already use to create visuals.

It looks like Microsoft is trying to bring these tools into the apps people already use, instead of keeping them as separate features.