Microsoft expands AI beyond text with new image, voice and transcription models
Microsoft has launched new AI models for images, audio, and speech-to-text. Here’s how they work and where they will be used.
Published By: Shubham Arora | Published: Apr 04, 2026, 07:00 AM (IST) | Edited: Apr 04, 2026, 07:00 AM (IST)
Microsoft has introduced a new set of AI models that go beyond text. The company has announced three models for image generation, voice output, and speech-to-text transcription. They are being positioned as part of Microsoft's broader push to expand its AI ecosystem, especially for developers and enterprise users.
The new models are available through Microsoft's Foundry platform and are also expected to show up across products like Copilot, Bing, and PowerPoint over time.
What the new AI models include
Microsoft has introduced three models -- MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. Each of these models is meant for a different kind of use.
The transcription model is built to turn speech into text. It supports multiple languages and can be used for things like meeting notes, captions, or voice-based tools. Microsoft says it works well across commonly used languages, based on its own testing.
Then there's the voice model, which can generate audio. It is designed to sound natural and can keep the tone consistent even in longer clips. It can generate audio quite quickly too, with longer clips ready in just a few seconds.
The image model is the second generation of Microsoft's in-house image generation tool. It focuses on improving output quality, including better lighting, textures, and clearer text within images.
Where these models will be used
These models are not limited to developers. Microsoft also plans to bring them into its existing products.
Some of these features will likely show up inside Copilot, mainly for audio and content tasks. The image model is also coming to Bing and PowerPoint, which people already use for making visuals.
It looks like Microsoft is trying to bring these tools into the apps people already use, instead of keeping them as separate features.
Microsoft expands AI beyond text
So far, most AI tools have focused heavily on text-based interactions. Microsoft has already been pushing Copilot as a productivity tool across Office and its cloud services.
With these new additions, the focus is clearly moving towards using AI across different formats: text, audio, and images. Whether it's creating audio, generating images, or turning speech into text, these capabilities now sit on the same platform, Foundry, rather than in separate tools, which should make them easier to use together.