close
Categories
News

Ideogram 4.0: 9.3B Open-Weight Image Model With 2K JSON Layout and Local Inference

Ideogram 4.0 is a 9.3B open-weight text-to-image model trained from scratch for design-grade output: native 2K resolution, structured JSON prompts with bounding boxes and colour palettes, and weights you can download, fine-tune, and run locally—while the hosted app, API, and MCP stay on the same model.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph LR
  JSON[JSON prompt validated] --> ENC[Qwen3-VL-8B text encoder frozen]
  ENC --> DIT[9.3B single-stream DiT 34 layers]
  NOISE[Flow-matching noise] --> DIT
  DIT --> SAM[Euler sampler asymmetric CFG]
  SAM --> VAE[KL VAE decode frozen]
  VAE --> IMG[Up to 2048 px RGB]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff

  class DIT,IMG agent
  class ENC,VAE,SAM hook
JSON prompt through frozen VLM encoder, trainable DiT, and VAE decode to 2K pixels

Only the 9.3B DiT is trained; encoder and VAE stay frozen at inference.

What shipped on 3 June 2026

  • Open weights on Hugging Face (ideogram-ai/ideogram-4-nf4, ideogram-4-fp8)—gated; accept the Ideogram 4 Non-Commercial Model Agreement to download
  • Inference code Apache 2.0 on github.com/ideogram-oss/ideogram4nf4 fits a single 24 GB GPU; fp8 for broader hardware
  • Every Ideogram plan plus the API and MCP for agent workflows—same visual model, different surfaces
  • Post-training editing stack in the product: prompt edit, native transparency, layerised text, extend, reframe, upscale, remix, magic fill

Launch reel: open weights, JSON prompting, and product surfaces.

Architecture (technical blog)

ComponentDetail
Trainable core34-layer single-stream DiT (~9.3B params); text + image latent tokens share attention with QK-RMSNorm and 3D MRoPE
Text encoderQwen3-VL-8B-Instruct (text-only); hidden states from 13 intermediate layers concatenated—not a single final layer
DecoderFrozen KL VAE, 8× spatial compression, 128 latent channels
SamplerEuler flow matching with asymmetric CFG (unconditional pass drops text tokens entirely)
Resolution256–2048 px per side, flexible aspect ratios; up to 2048 text tokens
PresetsV4_TURBO_12, V4_DEFAULT_20, V4_QUALITY_48 (quality tail lowers guidance near t=0)
Layout boxes, hex palettes, and typed text blocks for poster-grade generation

Training and inference share one schema—the pipeline rejects prompts that do not parse.

Structured JSON prompting

Training and inference both use the same JSON caption schema. The reference pipeline validates every input and rejects non-conforming JSON—plain strings are expanded via an optional magic prompt (hosted API with IDEOGRAM_API_KEY, or local LLM) into the structured format.

  • Bounding boxes[y_min, x_min, y_max, x_max] in 0–1000 normalised coords (origin top-left)
  • Colour palettes — up to 16 hex colours per image, 5 per element
  • Typed texttext elements carry the literal string plus a styling description for multi-font posters
  • Composable elementsobj and text entries under compositional_deconstruction
python run_inference.py \
  --prompt "campaign poster with clean type" \
  --output out.png \
  --quantization nf4 \
  --magic-prompt-key "$IDEOGRAM_API_KEY"

Benchmark claims (treat as directional)

AxisIdeogram 4.0 (reported)Notes from technical blog
Layout control0.69 mIoU7Bench bounding-box adherence
Text rendering0.97 OCR accuracyX-Omni English; leads open weights on param efficiency chart
Spatial reasoning0.76SpatialGenEval (spatial + basic splits)
Prompt alignment0.89Prism-bench alignment track
Designer ELO#2 overall, #1 open4,366 pairwise votes; GPT Image 2 ranked #1 closed

Independent smoke tests on single prompts can disagree with vendor charts—run your own evals for brand, type, and layout tasks that matter to you.

Three ways in (API, MCP, app)

SurfaceUse case
APIEmbed generation, editing, upscaling in products (ideogram-4.0 model id per developer docs)
MCPAgents create or revise visuals inside existing toolchains
AppHands-on iteration with editing controls
Open weightsLocal inference, research, fine-tuning experiments (non-commercial license unless separately licensed)

Partner integrations mentioned at launch include ComfyUI, fal, Replicate, Krea, Leonardo, Cloudflare, and others—check each host for rollout status.

License reality check

Code: Apache 2.0. Weights: Ideogram 4 Non-Commercial Model Agreement (gated on Hugging Face). You can inspect, run locally, and fine-tune for research and non-commercial work; commercial deployment needs a separate commercial path via Ideogram. That split is common in “open weight” image releases—open code ≠ unrestricted commercial weights.

Builder takeaway

QuestionAnswer
Why not only API?Weights + JSON schema let you build ComfyUI graphs, custom fine-tunes, and on-prem design pipelines
Minimum GPU?~24 GB with nf4 checkpoint
Fastest first image?ideogram.ai app or magic-prompt CLI above
Posters / type-heavy work?Lean into JSON text elements + bboxes; validate schema before generate
vs FLUX / Hunyuan?Smaller 9.3B DiT with VLM encoder stack; Ideogram claims text/layout lead among open weights at this size

Research supplement

Technical architectural details cited above were verified from the Ideogram 4 NF4 model card on Hugging Face, which provides primary documentation on the DiT layer count (34 layers), the Qwen3-VL-8B-Instruct text encoder, the 13-layer multi-scale feature extraction approach, and dual-branch classifier-free guidance. This model card is the authoritative primary source for architecture specifics not typically reproduced in secondary coverage.

---

References

Categories
News

Generative UI in 2026: Controlled, A2UI Declarative, and Open-Ended Patterns on AG-UI

Generative UI lets agents render real interface widgets—not only chat text—so a user who asks for a table gets a table, a budget gets cards, and tool output streams inline over a standard agent↔frontend wire.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  USER[User in React app] <-->|SSE AG-UI events| RUN[CopilotKit runtime proxy]
  RUN <-->|tool calls state deltas| AGENT[Agent backend ADK LangGraph etc]
  AGENT --> MCP[MCP tools and data]
  AGENT --> A2A[A2A other agents]
  AGENT --> A2UI[A2UI schema ops on AG-UI stream]
  A2UI --> RENDER[Catalog maps JSON to components]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  classDef decision fill:#444,color:#fff

  class USER,AGENT agent
  class RUN,A2UI hook
  class RENDER decision
Controlled pre-built components, declarative A2UI schema, and open-ended sandboxed HTML

Pick the pattern on purpose—most drift into Controlled because the framework default does.

The protocol stack (three jobs)

ProtocolRoleTypical transport
MCP (Model Context Protocol)Connects agents to tools and datastdio / HTTP per MCP server
A2A (Agent-to-Agent)Connects agents to other agentsGoogle-led agent coordination
AG-UI (Agent–User Interaction)Connects agents to user-facing appsSSE (also WebSockets in spec)

AG-UI is an open, event-based protocol (MIT, ag-ui-protocol) born from CopilotKit’s agent↔UI work with LangGraph and CrewAI. During a run the backend emits typed events—text chunks, TOOL_CALL_START/END, STATE_DELTA patches—over a single HTTP POST plus SSE stream. State can flow both ways on the same channel: user edits surface to the agent; agent mutations surface to the UI without a second model call when you wire shared state.

A2UI (Apache 2.0, a2ui.org) is Google’s declarative spec for agents emitting UI as JSON schema. It rides on AG-UI; CopilotKit ships production renderers. v0.9 moves to a prompt–generate–validate loop: catalog rules live in the system prompt, the model generates freely, validators catch errors, and the agent self-corrects before the client sees bad JSON.

Roughly 400 tokens per registered component versus flat cost with a declarative catalog

Past about 15 render tools, declarative A2UI usually pays for itself in context window alone.

Three patterns (not three frameworks)

Most teams confuse “Generative UI” with whichever pattern their framework defaults to. In practice there are three architectural choices on a control→flexibility spectrum:

PatternWho owns layoutAgent seesBest when
ControlledYour design system (pre-built React components)One tool per component (~400 tokens each)≤10 pixel-perfect flows
Declarative (A2UI)Catalog + schema; agent fills dataOne tool returning a2ui_operationsLong tail of cards, forms, dashboards
Open-endedModel (HTML or MCP App surface)Sandboxed iframe or MCP Apps middlewareOne-shot throwaway visuals

Pattern 1 — Controlled (frontend owns UI)

You register a React component against a tool name (e.g. CopilotKit’s frontend action hook). The runtime advertises the tool over AG-UI; when the agent calls it, args stream in as props and the component renders inline. No Python tool required for the happy path—design tokens stay yours.

Token tax: every registered component sits in context before the user speaks. ~400 tokens per tool description × 25 components ≈ 10,000 tokens per turn. Past ~15 tools, descriptions overlap (“pie chart” vs “donut chart”) and mispicks rise. Fix descriptions around user intent (“compare proportions of a whole”) not widget names.

Shared state exception: when the agent must pin a metric or append a table row and other panes must update without another LLM call, add an agent-side tool that writes session state; the UI subscribes via the shared-state hook while chat still uses the same frontend tool name.

Pattern 2 — Declarative (A2UI schema)

The agent returns an ordered list of operations—typically create_surface, update_components, update_data_model—with a catalogId your frontend registered. One function can power dozens of card types; token cost stays flat as the library grows.

def search_flights(flights: list[Flight]) -> dict:
    return {
        "a2ui_operations": [
            {"type": "create_surface", "surfaceId": SURFACE_ID, "catalogId": CATALOG_ID},
            {"type": "update_components", "surfaceId": SURFACE_ID, "components": FLIGHT_SCHEMA},
            {"type": "update_data_model", "surfaceId": SURFACE_ID, "data": {"flights": flights}},
        ]
    }

Fixed vs dynamic schema: in fixed mode you author flights.json and the agent only supplies data; in dynamic mode a secondary LLM drafts the component tree per turn but still emits the same a2ui_operations envelope. The catalog is the contract—Zod (or JSON Schema) for allowed components; renderers map types to React. A common production bug: CATALOG_ID on the agent and catalogId in createCatalog on the client differ by one character—UI silently falls back to the basic catalog with no console error.

Trade-off: the model owns layout within the catalog; runs vary. Not for legal copy or marketing surfaces that need pixel lock.

Pattern 3 — Open-ended (MCP Apps and sandboxed HTML)

  • MCP Apps — MCP servers expose UI surfaces (e.g. diagram canvases). CopilotKit’s MCP Apps middleware attaches servers without hand-rolling the client protocol.
  • Sandboxed HTML — the runtime injects an HTML render tool; the agent returns markup inside an iframe with sandbox allowing scripts + forms, never allow-same-origin.

Open-ended shines for disposable answers (“visualise this API response”) and fails as a primary product surface—brand and layout drift run to run even with style rules in the system prompt. Typical iframe failure: buttons dead because sandbox flags omit allow-forms.

How to choose (decision tree)

  • Designer shipped pixel-perfect mocks for this flow? → Controlled
  • Dozens of card types or widgets? → Declarative
  • One-shot chart the user will never see again? → Open-ended
  • Unsure? → default Declarative; promote top 3 flows to Controlled; never default Open-ended
  • Already shipping >15 render tools? → you are in Controlled territory; start A2UI this week

Open templates (awesome-llm-apps)

The generative_ui_agents folder ships runnable references across all three patterns:

FolderPattern emphasis
generative-ui-starter-projectControlled hooks + fixed/dynamic A2UI (flights catalog)
ai-financial-coach-agentControlled budget cards
ai-dashboard-canvas-agentControlled + shared state
ai-deep-research-agentStreaming research cards
mcp-apps-generative-ui-showcaseMCP Apps (travel booking UI in chat)
ai-mcp-app-builderAgent writes new MCP app in E2B sandbox
ai-shadcn-component-generatorComponent generation utilities

Builder takeaway

QuestionAnswer
Wire format?AG-UI over SSE; A2UI payloads as tool results / custom events
Prototype fast?Controlled frontend tools—watch token count
Scale UI variants?A2UI declarative + matched catalog IDs
Demo wow factor?Open-ended HTML or MCP Apps—keep off the main nav
Docs to read first?docs.ag-ui.com, A2UI v0.9 guide, CopilotKit generative UI docs

Research supplement

Web search and web fetch were not available during this analysis (permissions not granted). No externally-sourced supplementary research could be added. Claims above are derived from the reference links provided by the author, the article title, and training-data knowledge of AG-UI and A2UI as of mid-2026. Readers should verify AG-UI adoption metrics (GitHub stars, npm downloads), A2UI v0.9 schema specifics, and any performance benchmarks directly via the linked primary sources.

---

References

Categories
News

Gemma 4 12B: Encoder-Free Multimodal AI for Laptops (Apache 2.0, 256K Context)

Gemma 4 12B is Google DeepMind’s new encoder-free multimodal open model: text, image, video, and native audio flow through one decoder-only transformer under Apache 2.0, aimed at laptops with about 16 GB VRAM or unified memory and reasoning that approaches the larger 26B MoE sibling.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  IMG[48x48 image patches] --> LLM[Gemma 4 12B decoder]
  AUD[16 kHz audio frames] --> LLM
  TXT[Text tokens] --> LLM
  LLM --> OUT[Text plus tool calls]

  ENC[Separate vision and audio encoders] -.->|not used in 12B| LLM

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class LLM agent
  class ENC hook
Diagram of text, image patches, and audio frames feeding one decoder

Vision and audio enter the same transformer as text—no separate vision or audio encoder stacks.

What encoder-free changes

Earlier Gemma 4 sizes (E2B/E4B and 31B) attach dedicated vision transformers (~150M–550M) and audio conformers (~300M). Gemma 4 12B Unified removes those stacks:

  • Vision (~35M embedder) — 48×48 pixel patches projected with one matmul plus factorised X/Y positional lookups; the LLM backbone does the heavy visual reasoning.
  • Audio — no conformer encoder; 16 kHz audio is sliced into 40 ms frames (640 floats) and linearly projected into token space.
  • Fine-tuning — LoRA or full tuning updates vision, audio, and text in one pass (Hugging Face, Unsloth) instead of co-tuning frozen encoders.

Google positions this as lower multimodal latency, a smaller memory footprint than medium models with separate encoders, and the first mid-sized Gemma with onboard audio (audio was previously limited to small edge variants).

Laptop-class deployment with quantised weights and local inference

Google targets ~16 GB VRAM or unified memory with Q4 weights around 6.7 GB plus KV cache.

Model specs and memory

PropertyGemma 4 12B Unified
Total parameters~12B (11.95B listed on Hugging Face)
Layers48 (hybrid local + global attention; final layer global)
Context256K tokens
ModalitiesText, image, video (frames), audio (E2B/E4B/12B only)
Languages140+ pre-training; 35+ out of the box
Weight load (BF16, weights only)~26.7 GB per Google sizing table
Q4 quantised load~6.7 GB (weights only; KV cache extra)
Practical laptop bar16 GB VRAM/unified memory (Google launch guidance)

Benchmark snapshot (instruction-tuned)

Figures below come from the official Gemma 4 12B model card (June 2026), comparing instruction-tuned variants:

Benchmark12B Unified26B MoE31B DenseE4B
MMLU Pro77.2%82.6%85.2%69.4%
LiveCodeBench v672.0%77.1%80.0%52.0%
GPQA Diamond78.8%82.3%84.3%58.6%
MMMU Pro (vision)69.1%73.8%76.9%52.6%
Tau2 agentic avg69.0%68.2%76.9%42.2%
FLEURS WER (audio, lower better)0.0690.08

On several agentic and multimodal scores, 12B sits close to the 26B active-MoE line while using less than half the memory footprint of the full 26B weight load—matching Google’s “laptop-class agent” positioning.

Latency tricks: MTP and LiteRT-LM

Gemma 4 12B ships with a multi-token prediction (MTP) draft model for speculative decoding—higher tokens/sec without changing output quality. For local apps, Google highlights:

  • Google AI Edge Gallery on macOS (offline on Apple Silicon, sandboxed Python in-chat)
  • Google AI Edge Eloquent — voice-edit style input with Gemma 12B
  • litert-lm serve — OpenAI-compatible local API for Continue, Aider, OpenCode, etc.
  • Ollama, LM Studio, llama.cpp, MLX, vLLM, SGLang — community inference paths
pip install -U transformers torch accelerate

# Instruction-tuned checkpoint
MODEL_ID = "google/gemma-4-12B-it"

# Local OpenAI-style server (LiteRT-LM)
litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
  gemma-4-12B-it.litertlm gemma4-12b
litert-lm serve

Multimodal usage notes

  • Put images before text; put audio after text for best results.
  • Visual token budgets: 70, 140, 280, 560, 1120 — trade speed vs OCR/detail.
  • Audio clips up to 30 s; video up to 60 s at ~1 FPS (per model card).
  • Enable chain-of-thought with <|think|> in the system prompt; omit prior thoughts from chat history on follow-up turns.

Where it sits in the Gemma 4 family

SizeBest for
E2B / E4BPhones, browsers, 128K context, encoders + audio
12B UnifiedLaptops — encoder-free multimodal + audio + 256K
26B A4B MoEHigh throughput; ~3.8B active per token
31B DenseTop open leaderboard tier; vision without native audio

Gemma 4 downloads crossed 150 million (Google, June 2026). Weights and notebooks live on Hugging Face and Kaggle; production deploy paths include Model Garden, Cloud Run, and GKE.

Builder takeaway

QuestionAnswer
Why 12B vs E4B?Native audio at medium scale + stronger reasoning/coding without 26B memory
Why 12B vs 31B?Fits consumer GPU; encoder-free multimodal path; Apache 2.0 freedom
Fastest try?Ollama/LM Studio or google/gemma-4-12B-it in Transformers
Agents?Function calling, Gemma Skills repo, local litert-lm serve

Research supplement

The following context supplements the article based on publicly available documentation at time of writing.

  • Encoder-free multimodal precedent: The encoder-free approach for multimodal models was explored in models such as Fuyu-8B (Adept AI, 2023), which processed image patches directly without a vision encoder. Gemma 4 12B continues this architectural direction at a larger scale and with a substantially longer context window.
  • Official model card: Technical specifications, benchmark results (MMMU, VQA, text benchmarks), and quantisation guidance are published on the Hugging Face model card for google/gemma-4-12B.
  • Developer guide: Google's developer blog publishes a detailed integration guide at the Gemma 4 12B developer guide, covering recommended inference runtimes, hardware requirements, and example code.
  • Core Gemma documentation: The canonical Gemma documentation at ai.google.dev/gemma/docs/core covers model family architecture, safety evaluations, and terms of use across the Gemma 4 line.
---

References

Categories
News

Supertonic 3: 99M On-Device TTS With ONNX, 31 Languages, and 1,200+ Chars/sec

Supertonic 3 is Supertone’s open-weight text-to-speech stack that runs entirely on your device through ONNX Runtime—no GPU, no cloud API, and no per-character bill. The current checkpoint is a 99M-parameter multilingual model (31 languages, 44.1 kHz WAV) with throughput that can exceed 1,200 characters per second on Apple Silicon CPU and far higher on WebGPU or discrete GPUs.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  TEXT[Input text plus tags] --> ONNX[ONNX Runtime on device]
  ONNX --> WAV[44.1 kHz audio]
  WAV --> APP[Apps bots e-readers browsers]
  CLOUD[Cloud TTS API] -.->|optional| APP

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class ONNX agent
  class CLOUD hook
Comparison of API-based speech versus local ONNX inference

Supertonic keeps audio generation on your hardware—useful for privacy and offline agents.

Why on-device TTS matters again

Most production TTS still assumes a trade-off: cloud APIs buy quality and normalisation at the cost of latency, privacy, and ongoing spend. Supertonic targets the opposite pole—edge-native inference with a model small enough to ship in desktop agents, mobile apps, Raspberry Pi-class hardware, and browser demos (Hugging Face Spaces / WebGPU).

DimensionSupertonic 3 (official)
Parameters99M open weights (OpenRAIL-M)
Languages31 ISO codes + na language-agnostic mode
RuntimeONNX across Python, Node, WebGPU, Java, C++, C#, Go, Swift, Rust, Flutter, iOS
Output44.1 kHz 16-bit WAV
First-run download~400 MB model cache via Hugging Face (pip install supertonic)
Code licenseMIT (Python package and examples)
Characters per second on long text benchmarks from Supertone docs

On M4 Pro CPU, Supertonic 3 reports 1,263 characters per second on long inputs—far above typical cloud APIs.

Throughput and real-time factor

Supertone’s published benchmarks (measured from Seoul for cloud APIs) use characters per second and real-time factor (RTF)—audio duration divided by synthesis time. Lower RTF means faster-than-real-time generation; the inverse (×RT) is “how many seconds of speech per second of wall clock.”

SystemLong text (~266 chars) chars/sec
Supertonic — M4 Pro CPU1,263
Supertonic — M4 Pro WebGPU2,509
Supertonic — RTX 409012,164
ElevenLabs Flash v2.5 (API)287
OpenAI TTS-1 (API)82
Gemini 2.5 Flash TTS (API)24
Kokoro (open, M4 Pro ONNX)117

On long inputs, M4 Pro WebGPU RTF hits about 0.006—roughly 167× faster than real-time (1 ÷ 0.006). CPU long-text RTF ~0.012 still implies ~83× real-time. Independent site benchmarks also report ~5× real-time on a 16-thread CPU versus 800M-parameter GPU baselines on comparable sample sets.

Expression tags and text normalisation

Ten inline expression tags (for example <laugh>, <breath>, <sigh>) add prosody without reference audio. Supertone documents strong handling of numbers, currency, dates, phone numbers, and technical units in a “natural text handling” suite—areas where several cloud TTS APIs still misread abbreviations without preprocessing.

Voice Builder and preset voices

Built-in styles M1–M5 and F1–F5 ship with the model cache. Voice Builder exports zero-shot cloned voices as JSON for both Supertonic 2 and 3 profiles. Managed voices and commercial presets also live in Supertone Play / API for teams that do not want to self-host weights.

Drop-in local server (OpenAI-compatible)

pip install 'supertonic[serve]'
supertonic serve --host 127.0.0.1 --port 7788

curl -X POST http://127.0.0.1:7788/v1/audio/speech \
  -H 'content-type: application/json' \
  -d '{"model":"supertonic-3","input":"Hello from local TTS.","voice":"M1"}' \
  -o output.wav

The same engine exposes native POST /v1/tts with steps, speed, and lang knobs—handy for n8n, Home Assistant, Electron, or robotics stacks that already speak HTTP but should not send audio to the public internet.

Supertonic 2 vs 3

Social posts sometimes cite a 66M figure—that aligns with the older Supertonic 2 line (five languages), preserved on the release/supertonic-2 branch. Supertonic 3 (April–May 2026) expands to 31 languages, improved reading accuracy, and the 99M checkpoint documented on supertonic3.github.io.

Builder takeaway

QuestionAnswer
When to choose Supertonic?Offline agents, privacy-sensitive apps, high-volume narration, local dev without API keys
When to stay on cloud?Premium managed voices, strict SLAs, or compliance workflows tied to vendor APIs
Fastest path?pip install supertonicTTS(auto_download=True)lang="en" + voice M1
Quality knob?total_steps 5–12 (default 8); lower steps for speed

Research supplement

No additional reputable sources beyond the author's listed references were successfully retrieved during research for this article. The technical data cited in this piece (parameter count, language list, expression tags, licensing terms, download figures) derives from the official Hugging Face model card, which is already listed as reference #4. Readers seeking independent benchmarks or naturalness evaluations should search for community evaluations comparing Supertonic 3 against Kokoro-82M, Piper TTS, and MeloTTS, as no such independent evaluations were found in sources reviewed at time of publication.

References

Categories
News

Codex Sites: Build and Host Internal Web Apps from a Single @Sites Prompt

Codex Sites is OpenAI’s managed hosting layer inside the Codex app: turn a prompt—or an existing repo—into a workspace-private web app with authentication, optional D1 database and R2 file storage, and a shareable production URL—without wiring your own deploy pipeline.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  PROMPT[Prompt or repo in Codex] --> BUILD[Sites plugin builds Worker output]
  BUILD --> SAVE[Save version optional review]
  SAVE --> DEPLOY[Deploy to production URL]
  DEPLOY --> SHARE[Workspace users via URL]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class BUILD agent
  class DEPLOY hook
End-to-end flow from idea in Codex to authenticated private URL

Sites adds managed hosting so teams ship internal apps without a separate deploy stack.

What Sites changes for teams

Software work is rarely “just code”—it is dashboards, review rooms, planners, and internal tools that never earned a sprint. Sites targets that gap: non-developers and developers alike can describe an interactive experience in Codex, iterate in-thread, then host it on OpenAI infrastructure. OpenAI’s June 2026 launch positions Sites alongside role-specific plugins and annotations as Codex expands beyond pure engineering into knowledge work.

CapabilityWhat you get
HostingOpenAI-managed; output is Cloudflare Worker–compatible ES modules
SharingURL inside your ChatGPT Business or Enterprise workspace
AuthWorkspace identity for internal apps; optional external IdP for auth-enabled projects
PersistenceD1 (SQLite-style relational data), R2 (object/file bytes)
Static assetsSupported as part of the deployed site bundle
TriggerInstall the Sites plugin; prompt with @Sites in a new thread
Two-step publish: reviewable save, then production URL

Every Sites URL is production—save a version first when you need review.

Save versus deploy (critical workflow)

Every deployed Sites URL is a production deployment. OpenAI splits publishing into two steps so teams do not accidentally ship half-baked builds:

  • Save a version — Codex builds the site, links it to the source Git commit, and leaves a reviewable candidate.
  • Deploy a version — Publishes a saved build and returns the live URL only when you intend people to use it.

Project metadata lives in .openai/hosting.json (project ID, D1/R2 binding names). Secrets and environment variables belong in the Sites panel, not in that JSON file or committed .env values.

Access control modes

ModeWho can open the site
admins_onlyOwner + workspace admins (default posture for new sites)
workspace_allAll active workspace members
customNamed users or groups you select

Example use cases from OpenAI’s showcase

  • Onboarding hub — first-week progress, meetings, and tasks
  • Enablement library — searchable learning paths and updates
  • Pulse dashboard — executive KPIs with targets and health signals
  • Sparkboard — idea intake with voting, comments, and rankings (D1 + workspace auth)
  • Launch calendar — planning filters and risk flags
  • Scenario planner — interactive assumptions instead of spreadsheet tabs

Availability and admin setup

Sites is in preview for ChatGPT Business and Enterprise workspaces, with broader plan rollout planned. Enterprise admins must enable Sites via RBAC in ChatGPT admin settings; Business workspaces ship with Sites enabled by default. OpenAI notes rollout through the Codex app (CLI/desktop also support plugins; hosting itself is OpenAI-operated).

Keeping data fresh

For live third-party data, you can supply API keys or use thread automations: scheduled plugin runs fetch data, Codex updates the app, and redeploys. That pattern suits reporting views that must track Slack, docs, or warehouse exports without manual copy-paste.

@Sites Build a project request dashboard for my operations team.
Let members submit requests, see owners, update status, and filter the list.
Require workspace sign-in and persist request data between visits.

@Sites Deploy this project. Check Sites compatibility and return the deployment URL.

How Sites fits the Codex platform

Sites is not a consumer website builder bolt-on—it is a Codex plugin plus managed host in the same product surface as cloud sandboxes, GitHub integration, and role plugins (sales, data analytics, design, and others). OpenAI reports more than 5 million weekly Codex users, with non-developers growing faster than engineers—Sites is the hosted output layer for that audience.

Builder takeaway

QuestionAnswer
When to use Sites?Internal tools, dashboards, and shared workspaces—not public marketing sites without reviewing auth docs
First step?Add Sites plugin → new thread → @Sites + clear audience and data needs
Main risk?Deploying before review—always save/version first on sensitive data
Pricing?Free during preview; future pricing TBD per OpenAI

Research supplement

Web search was unavailable during drafting. The four reference links supplied by the author — the Codex Sites developer documentation, the Codex for every role blog post, the build-and-deploy-internal-apps use case page, and the Sites showcase — are the primary authoritative sources for this feature and are already cited inline. No additional independently verified sources could be confirmed without live search access.

---

References

Categories
News

MAI-Voice-2 Flash: Ultra-Low-Latency Speech for Voice Agents (Preview)

MAI-Voice-2-Flash is the efficiency tier Microsoft announced alongside MAI-Voice-2 at Build 2026—aimed at ultra-low-latency voice agents where cost per character and time-to-first-audio matter more than maximum expressiveness.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  AGENT[Real-time voice agent] --> CHOICE{Latency SLA}
  CHOICE -->|Rich prosody| V2[MAI-Voice-2 GA]
  CHOICE -->|Sub-second turns| FLASH[MAI-Voice-2-Flash soon]
  FLASH --> RT[Contact centre bots]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class V2 agent
  class FLASH hook
Low latency voice agent stack using efficient TTS tier

Voice-2-Flash is announced for latency-sensitive agents; full Voice-2 is available now on Azure Speech.

Status at Build 2026

MAI-Voice-2 is generally available on Azure Speech today. MAI-Voice-2-Flash is listed as coming soon in the seven-model family announcement and keynote—positioned for the 2026 wave of real-time voice agents that cannot wait on full-fidelity synthesis.

Flash versus full Voice-2

DimensionMAI-Voice-2MAI-Voice-2-Flash (announced)
AvailabilityFoundry / Azure Speech nowPreview roadmap — not GA at launch post
StrengthEmotion range, cloning, 15-language depthSpeed and cost for agent loops
Typical callerAudiobooks, branded assistantsSub-second interactive agents

How to plan architecture now

  • Ship production on MAI-Voice-2 while Flash is pending; abstract provider behind your TTS interface.
  • Pair with MAI-Transcribe-1.5 for the inbound leg of full-duplex agents.
  • Keep consent and voice-licence checks in your app layer—Flash inherits the same policy model.

Microsoft’s model page shows Voice-2 at about $0.22 per 1M characters on the marketing site; Flash pricing will likely undercut that once GA—watch Foundry catalog updates after Build.

# Today: Azure Speech + MAI-Voice-2 in Foundry
# Watch: Foundry catalog for MAI-Voice-2-Flash GA and agent latency SLOs

Research supplement

Web search and page fetch were not available during this run. No external sources could be retrieved or verified beyond the reference URLs supplied by the author. The RESEARCH_SUPPLEMENT is left empty accordingly; the author's own reference links (the Microsoft AI news pages for the seven-model launch and the MAI-Voice-2 expressive speech announcement) are the authoritative primary sources and are already cited in the article.

---

References

Categories
News

MAI-Voice-2: Multilingual Expressive TTS with Zero-Shot Voice Cloning

MAI-Voice-2 is Microsoft AI’s expressive text-to-speech upgrade—15+ languages, emotion and role tags, zero-shot voice cloning from a short reference clip, and consent guardrails for production brand voices on Azure Foundry.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  TEXT[Script plus emotion tags] --> V2[MAI-Voice-2]
  CLIP[5 to 60s reference audio] --> V2
  V2 --> SPEECH[Multilingual speech]
  SPEECH --> PROD[Assistants audiobooks support]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class V2 agent
  class CLIP hook
Short reference clip to branded multilingual speech with emotion tags

Voice-2 adds 15-language coverage, emotion tags, and consent-gated voice cloning on Foundry.

Capabilities versus MAI-Voice-1

FeatureMAI-Voice-2
Human preference vs Voice-172% side-by-side wins (Microsoft internal)
Languages15 core locales plus extended list (US/AU English, EU languages, Hindi, Korean, Chinese, Turkish, Russian, Thai, etc.)
Emotion controlTags such as whispered, excited, embarrassed
Code-switchingHindi–English, Spanish–English mid-sentence
Long-form stabilityPodcasts, lectures, audiobooks without speaker drift
Foundry pricing (Tech Community)From $22 / 1M characters via Azure Speech

Voice cloning and safety

Developers upload 5–60 seconds of authorised reference audio—no per-voice fine-tune. Production synthesis requires licensed/consented voices; Microsoft blocks unlicensed cloning at the system level. Access to cloning features is gated through an application workflow on Foundry.

Product integrations

  • Azure Foundry and Azure Speech SDK
  • Rollout into VS Code experiences and Dynamics 365 Contact Center
  • DuoAI demo combining Voice-2, Transcribe-1.5, and Image-2.5 in multi-agent dialogue

Typical workloads

  • Branded assistants and IVR with consistent persona
  • Game, podcast, and AR character dialogue
  • Accessibility narration and education characters
  • Creator workflows: text → audio in your own licensed voice

Research supplement

Web search and page fetch were unavailable during production of this post (permissions not granted). No additional verified external sources could be retrieved beyond the reference links provided by the author. The RESEARCH_SUPPLEMENT is therefore empty for this publication; it can be updated once benchmark results, independent evaluations, or additional primary sources become available.

---

References

Categories
News

MAI-Transcribe-1.5: 43-Language Speech-to-Text with Keyword Biasing

MAI-Transcribe-1.5 is Microsoft AI’s multilingual speech-to-text refresh—43 languages, best-in-class FLEURS word-error-rate claims, keyword biasing for enterprise vocabulary, and batch speeds Microsoft positions as up to 5× faster than comparable Gemini and OpenAI transcription on long audio.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  AUDIO[Noisy long-form audio] --> STT[MAI-Transcribe-1.5]
  KW[Keyword list] --> STT
  STT --> TXT[Domain-aware transcript]
  TXT --> APPS[Teams Copilot Foundry]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class STT agent
  class KW hook
Audio to transcript flow with domain keyword biasing

Transcribe-1.5 expands from 25 to 43 languages while claiming best-in-class FLEURS WER.

Accuracy and coverage

MetricMAI-Transcribe-1.5 (Microsoft claims)
Languages43 (up from 25 on Transcribe-1)
FLEURSBest-in-class WER across supported set
Artificial Analysis open bench2.4% WER, #3 overall
Long audio speed~1 hour transcribed in under 15 seconds (batch)

Keyword biasing in practice

Enterprises pass lists of product names, clinician terms, or internal acronyms. The model uses context to decide when to bias—not blind string replacement—cutting WER on FLEURS by up to 30% in Microsoft’s tests. That matters for meetings where a generic STT model garbles uncommon names.

Where it ships

  • Microsoft Foundry and Azure Speech APIs
  • Integration into Copilot, Teams, GitHub, Dynamics 365 Contact Center
  • MAI Playground and DuoAI demo alongside Voice and Image models

Pricing and roadmap

Foundry pricing remains about $0.36 per audio hour on the Transcribe line. Microsoft lists upcoming features: diarization, a native streaming API for live agents, and deeper per-language tuning.

Builder takeaway

Use caseWhy 1.5
Call-centre QASpeed + domain keywords
Global meetings43-language FLEURS-leading claim
Batch media archivesSub-15s per hour audio (Microsoft batch figures)

Research supplement

No live web search or fetch was available for this article run. The following notes flag areas where additional sourced research would strengthen the piece; any editor or author with access to these resources should verify before citing.

  • Keyword/contextual biasing in ASR: The technique has a substantial academic literature. Google's 2018 paper "Contextual Speech Recognition with Difficult Negative Training Examples" and subsequent work on shallow fusion and deep contextual biasing are primary sources. Deepgram and AssemblyAI have published developer blog posts on their implementations that would provide useful comparison points.
  • Microsoft MAI model portfolio: The canonical source is Microsoft's Build 2026 announcement of seven new MAI models, which situates MAI-Transcribe-1.5 within a broader first-party model strategy.
  • Competitive ASR benchmarks: The official MAI-Transcribe-1.5 model page is the authoritative source for any accuracy figures. Independent benchmarks on OpenASR leaderboards or Hugging Face Open ASR Leaderboard would provide third-party validation.

References

Categories
News

MAI-Image-2.5 Flash: Low-Cost Production Image Generation on Foundry

MAI-Image-2.5-Flash is the production-speed sibling of Microsoft’s fidelity-first image model—same text-to-image and editing surface on Foundry, with lower per-token pricing and a diffusion stack tuned for scalable workloads at Build 2026.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  JOB[High volume image job] --> ROUTE{Latency budget}
  ROUTE -->|Quality first| FULL[MAI-Image-2.5]
  ROUTE -->|Cost and speed| FLASH[MAI-Image-2.5-Flash]
  FLASH --> CDN[Apps and batch pipelines]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class FULL agent
  class FLASH hook
Comparison of fidelity tier versus flash tier for image workloads

Flash trades a small Arena score gap for roughly 4x lower output token pricing on Foundry.

Flash versus full 2.5

DimensionMAI-Image-2.5MAI-Image-2.5-Flash
GoalMaximum fidelity and edit controlFast, cost-efficient production
Arena positioningNo. 2 image edit (Jun 2026)Ultra-efficient tier; strong price-to-score
Foundry text input$5 / 1M tokens$1.75 / 1M tokens
Foundry image input$8 / 1M tokens$1.75 / 1M tokens
Foundry image output$47 / 1M tokens$19.50 / 1M tokens

Capabilities retained on Flash

Foundry lists Flash as a diffusion-based model supporting both text-to-image and image-to-image editing with controllable prompts—not a separate API surface, but a cheaper compute profile for the same modality family announced alongside MAI-Image-2.5 at Build.

When to pick Flash

  • Thumbnail or catalog generation at scale
  • A/B creative variants where throughput beats last-point Arena score
  • Agent loops that call image tools repeatedly (pair with token budgets)

Use full MAI-Image-2.5 when identity-sensitive edits, hero marketing assets, or maximum text-in-image quality are non-negotiable.

# Foundry Model Catalog → MAI-Image-2.5-Flash
# Deploy like other Foundry image models; compare PTU vs pay-as-you-go for steady load

Research supplement

Web search and fetch were not available in this session. No independently verified sources could be retrieved. The analysis draws on the article title, reference URL titles, and general knowledge of the Azure AI Foundry ecosystem and image generation market as of mid-2026.

For primary source verification, the following are the canonical references to consult:

References

Categories
News

MAI-Image-2.5: Arena #2 Image Editing and Foundry Pricing Explained

MAI-Image-2.5 is Microsoft AI’s high-fidelity image generation and editing model—ranking No. 2 on Arena’s image-edit leaderboard (June 2026) and shipping in PowerPoint, OneDrive, Foundry, and OpenRouter with transparent per-token pricing.

%%{init: {"theme": "base", "themeVariables": {"background": "transparent", "lineColor": "#000000"}}}%%
graph TD
  PROMPT[Text or image prompt] --> GEN[MAI-Image-2.5]
  GEN --> T2I[Text-to-image]
  GEN --> EDIT[Localized edits]
  EDIT --> ID[Identity preserved]
  T2I --> OUT[Design-ready asset]

  classDef agent fill:#8B0000,color:#fff
  classDef hook fill:#189AB4,color:#fff
  class GEN agent
  class EDIT hook
Text-to-image and localized edit workflow with identity preservation

MAI-Image-2.5 targets maximum fidelity for design-ready generation and precise edits.

What changed versus MAI-Image-2

  • Text rendering — sharper product and slide copy in generated frames.
  • Scene reasoning — lighting, scale, and perspective when inserting objects.
  • Localized edits — replace objects or text without repainting the whole image.
  • Face consistency — identity held across pose and expression changes.

Arena and leaderboard position

LeaderboardMAI-Image-2.5 position (Microsoft, Jun 2026)
Text-to-image (overall)No. 3 family score
Image editing (single-image)No. 2 — ahead of Nano Banana 2.x on cited Arena run
vs prior MAI-Image-2~+75 Arena points overall; largest gains in text rendering (+107) and cartoon/anime (+90)

Microsoft product integration

PowerPoint uses MAI-Image-2.5 for presentation visuals; OneDrive is rolling out distraction removal and background cleanup with scene preservation. Developers access the same weights through Microsoft Foundry and the MAI Playground.

Foundry pricing (fidelity tier)

Token typePrice (USD per 1M tokens)
Text input$5
Image input$8
Image output$47

For high-volume pipelines, pair this tier with MAI-Image-2.5-Flash when latency and cost matter more than maximum Arena score.

Safety notes

Layered prompt/output filtering ships with the model. Like all generative image systems, outputs can be plausible but wrong—review before medical, legal, financial, or news use.

Research supplement

Web access was unavailable during drafting. The following research directions are recommended for supplementing this article before or after publication:

  • Arena leaderboard source: Identify which organisation runs the image editing arena referenced in the #2 ranking (likely WildVision Arena, GenAI-Bench, or a Microsoft-adjacent evaluation). Linking to the live leaderboard would give readers a way to verify the claim and track future ranking changes.
  • Azure AI Foundry pricing page: The official pricing page for MAI-Image-2.5 on Azure (ai.azure.com/catalog/models/MAI-Image-2.5) should be cited directly with current figures. Pricing changes frequently and inline quotes from a live article are more reliable than archived values.
  • Build 2026 announcement blog: The seven-model drop is documented at microsoft.ai/news/building-a-hillclimbing-machine-launching-seven-new-mai-models/ — cross-referencing the other six models and their domains would situate MAI-Image-2.5 within the broader family launch.
---

References