Forem

The 7-Layer Memory Architecture Behind Modern AI Agents

Mahmoud Zalt — Sat, 23 May 2026 04:17:32 +0000

How do you make an AI agent actually remember?

It is the question that inevitably surfaces once an AI system moves out of prototyping and into long-running production. Why does it forget a core constraint after a week? Why does it re-introduce itself every morning? Why does it pick the wrong tool even though it was corrected three days ago?
At Sistava, where you can hire autonomous AI employees, we had to solve this problem to survive. We run a workforce of around 1,000 AI employees in production, operating continuously across live environments for over two months. At this scale, standard context strategies fail. These systems don't get a polite session reset; they face a massive real-world hurdle: facts change over time.
If a user utilizes Gmail today and switches to Outlook next month, an agent needs to track both. It has to know which one is current, exactly when the switch happened, and it cannot act like the old truth is still valid. Standard vector database similarity scores do not understand chronological decay or truth overrides. Mix old and new context, and the agent confidently fabricates or forgets the one detail that mattered.
Once we scaled this workforce, the obvious answer - pick a vector store, dump text chunks in, and hope for the best - broke completely. Memory in a long-running agent isn't a single database. It requires at least seven distinct layers running in parallel across multiple database types.

The Architectural Split (The CoALA Framework)
The academic literature has already recognized these limitations. The seminal CoALA paper (Princeton, 2023) formalized the episodic, semantic, and procedural split from cognitive science for language model agents. It outlines modular components: working memory as a short-term scratchpad, plus long-term episodic for experiences, semantic for facts, and procedural for skills.
In a production environment, each of these layers requires its own write rules, its own lifecycle, and its own read path. They cannot run as a loose stack; they must be isolated so they do not contaminate one another.

1. Working Memory
This is the active, per-turn scratchpad holding the immediate plan-so-far, the raw tool output that just came back, or transient chain-of-thought reasoning. It lives entirely within the LLM's native context window or as an in-memory variable in the runtime environment.
The Production Lesson: Do not let working memory leak. Transient scratchwork must never accidentally flush into long-term storage, or the agent will begin writing unverified thoughts into its historical knowledge base. Enforce a hard wall - working memory has no persistent backing store. It lives, it dies, it is gone.
2. Conversation Memory
This tracks the immediate message history so the agent doesn't have to re-derive the active thread context on every turn. Most modern agent frameworks ship a checkpointer that auto-loads thread history from a Postgres backend on invocation.
The Production Lesson: Run a summarizer middleware that triggers when the live conversation crosses a strict token threshold. It compresses older turns into a single structural system message while keeping the recent tail intact, maintaining a dense, cost-efficient context window.
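The threshold-triggered compaction described here can be sketched roughly as follows. Everything named in it (`ChatMessage`, `estimateTokens`, `compactHistory`, the four-characters-per-token estimate, and the injected `summarize` callback) is an assumption for illustration; a real deployment would likely use the model's tokenizer and an LLM call to produce the summary.

```typescript
// Hypothetical message shape; real agent frameworks define their own.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Rough token estimate (assumption: about 4 characters per token).
function estimateTokens(messages: ChatMessage[]): number {
  return messages.reduce((sum, m) => sum + Math.ceil(m.content.length / 4), 0);
}

// Once the history exceeds `maxTokens`, compress everything except the most
// recent `keepTail` messages into a single system message.
function compactHistory(
  history: ChatMessage[],
  maxTokens: number,
  keepTail: number,
  summarize: (older: ChatMessage[]) => string
): ChatMessage[] {
  if (estimateTokens(history) <= maxTokens || history.length <= keepTail) {
    return history; // under the threshold, or nothing old enough to compress
  }
  const cut = history.length - keepTail;
  const older = history.slice(0, cut);
  const tail = history.slice(cut);
  const summary: ChatMessage = {
    role: "system",
    content: `Summary of earlier conversation: ${summarize(older)}`,
  };
  return [summary, ...tail];
}
```

Passing `summarize` in as a parameter keeps the compaction rule testable without a model call.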
3. Episodic Memory
A time-indexed log of past execution loops, historical runs, and specifically, the failures ("Last Tuesday the webhook timed out, so I routed through the fallback queue"). It provides chronological continuity.
The Production Lesson: A vector store alone fails here because similarity scoring doesn't understand time. Store raw transcripts alongside LLM-generated execution summaries, keyed explicitly by thread_id and timestamp. Use a background cron job to truncate older episodes to summaries only, rather than forcing the agent to handle eviction at runtime.
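As a minimal sketch of that layout, the record below keeps a raw transcript next to its summary and is keyed by thread and timestamp. The `Episode` shape, the function names, and the age cutoff are hypothetical; the in-memory array stands in for whatever store actually holds the episodes.

```typescript
// Hypothetical episode record, keyed by thread_id and timestamp.
interface Episode {
  threadId: string;
  timestamp: number; // epoch milliseconds
  summary: string; // LLM-generated execution summary
  transcript: string | null; // raw transcript; null once truncated
}

// Body of the background job: drop raw transcripts older than `maxAgeMs`
// while keeping the summaries, so chronological continuity survives.
function truncateOldEpisodes(
  episodes: Episode[],
  now: number,
  maxAgeMs: number
): Episode[] {
  return episodes.map((e) =>
    now - e.timestamp > maxAgeMs ? { ...e, transcript: null } : e
  );
}

// Read path: episodes for one thread inside a time window, newest first.
function episodesBetween(
  episodes: Episode[],
  threadId: string,
  from: number,
  to: number
): Episode[] {
  return episodes
    .filter((e) => e.threadId === threadId && e.timestamp >= from && e.timestamp <= to)
    .sort((a, b) => b.timestamp - a.timestamp);
}
```

Because truncation runs in a scheduled job, the agent's runtime read path never has to decide what to evict.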

4. Semantic Memory
This stores slow-changing, deterministic facts about the user, the business, or integrated tools ("The core platform is called Atlas", "The manager prefers brief markdown reports"). It is edited in place, never blindly appended.
The Production Lesson: Split this layer into two distinct substrates: a human-editable markdown file (the "Sovereign Notebook") and an LLM-extracted graph. If they disagree, the notebook explicitly wins. This gives operators a clear way to intervene: if an extracted fact is noisy, a manual entry in the notebook overrides it.
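The precedence rule can be sketched as a simple merge in which notebook entries are applied last. The `SemanticFact` shape, the `resolveFacts` name, and the idea of keying facts by a string are assumptions made for this example, not a description of the production system.

```typescript
// Hypothetical fact shape: a key, a value, and the substrate that produced it.
interface SemanticFact {
  key: string;
  value: string;
  source: "notebook" | "graph";
}

// Merge both substrates. On a key collision the notebook entry always wins,
// because it is written after the graph entry and replaces it.
function resolveFacts(
  notebook: Map<string, string>,
  graph: Map<string, string>
): SemanticFact[] {
  const merged = new Map<string, SemanticFact>();
  graph.forEach((value, key) => {
    merged.set(key, { key, value, source: "graph" });
  });
  notebook.forEach((value, key) => {
    merged.set(key, { key, value, source: "notebook" }); // overrides the graph
  });
  return Array.from(merged.values());
}
```

Keeping the `source` on each resolved fact also makes it visible, when debugging, which substrate an answer came from.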

5. Knowledge Graph
While semantic memory holds raw facts, the knowledge graph maps the structural edges between entities - who did what, which event caused what, or which entity is a duplicate of another. A vector store treats text chunks like isolated islands; a graph database (such as Neo4j, Memgraph, or KuzuDB) connects them. It allows an agent to walk contextually from a specific customer entity straight to the exact email thread where a pricing tier was modified without re-reading thousands of irrelevant chunks.

Handling Changing Realities: Temporal Edges
The non-obvious requirement of the graph layer is temporal awareness. To handle shifting user preferences or infrastructure changes over months of runtime, you must stop deleting or overwriting data when state changes.
Instead, every extracted fact in the semantic and graph layers needs a valid_at and invalid_at timestamp:
(User) -[USES_TOOL {valid_at: "2024-01-01", invalid_at: "2026-02-15"}]-> (Gmail)
(User) -[USES_TOOL {valid_at: "2026-02-16", invalid_at: null}]-> (Outlook)
When today's session contradicts yesterday's state, the ingestion pipeline invalidates the old edge instead of erasing it. This preserves a clean, immutable audit trail, allowing the LLM to logically reason about when a preference shifted or an infrastructure stack was updated.
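To make the invalidate-instead-of-erase rule concrete, here is a small in-memory sketch. The `TemporalEdge` shape and both function names are hypothetical, and two simplifications are assumed: the relation is single-valued (one current tool at a time), and validity is treated as a half-open interval, so the old edge ends on the same date the new one begins rather than the day before as in the example edges.

```typescript
// Hypothetical temporal edge mirroring the valid_at / invalid_at properties.
interface TemporalEdge {
  subject: string;
  relation: string;
  object: string;
  validAt: string; // ISO date (YYYY-MM-DD), so string comparison is chronological
  invalidAt: string | null; // null means "still current"
}

// Record a new fact. Any currently valid edge with the same subject and
// relation but a different object is closed out (never deleted), then the
// new edge is appended unless it is already current.
function assertFact(
  edges: TemporalEdge[],
  subject: string,
  relation: string,
  object: string,
  at: string
): TemporalEdge[] {
  const closed = edges.map((e) =>
    e.subject === subject &&
    e.relation === relation &&
    e.invalidAt === null &&
    e.object !== object
      ? { ...e, invalidAt: at }
      : e
  );
  const alreadyCurrent = closed.some(
    (e) =>
      e.subject === subject &&
      e.relation === relation &&
      e.object === object &&
      e.invalidAt === null
  );
  if (alreadyCurrent) return closed;
  return [...closed, { subject, relation, object, validAt: at, invalidAt: null }];
}

// Point-in-time read: which edges were true on `date`?
function edgesValidOn(edges: TemporalEdge[], date: string): TemporalEdge[] {
  return edges.filter(
    (e) => e.validAt <= date && (e.invalidAt === null || date < e.invalidAt)
  );
}
```

Asking `edgesValidOn` about a date in 2025 returns the Gmail edge, while a date after the switch returns only Outlook, and the full edge list still records when the change happened.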
The Build vs. Buy Lesson: Do not write this temporal logic yourself. Utilize open-source libraries that sit on top of your graph DB to handle the LLM-driven extraction, deduplication, and contradiction detection. Writing relationship-inference engines from scratch can easily burn six months of development time.

6. Procedural Memory
Procedural memory stores execution mechanics and behavioral habits, not world facts. It dictates how an agent performs tasks ("When checking a raw CSV dataset, first validate header consistency").
This data lives in structured skill files (typically markdown documents) that the agent loads on demand based on task routing. Some are explicitly authored by engineers; others are written by the agent itself during asynchronous self-reflection steps.
The Production Lesson: Keep semantic and procedural data separated. A fact like "The client uses Slack" is semantic and belongs in the notebook. A rule like "When notifying via a webhook, format payload fields as snake_case" is procedural and belongs in a skill file.
7. Checkpoints
Operating underneath all other layers is a highly serializable, low-latency snapshot of the exact execution state of an agent workflow. This is not thread history; it is the active node in the graph, the pending tool payloads, and the unwritten output stream.
It is the difference between a background container crashing and losing a forty-minute execution loop, or surviving a pod restart and picking up cleanly at minute thirty-three. Utilizing a durable execution engine like Temporal gives you deterministic checkpointing at every activity boundary out of the box.
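The resume-after-crash behaviour can be illustrated with a toy step runner. This is not how Temporal or any particular engine implements checkpointing; `Checkpoint`, `InMemoryCheckpointStore`, and `runWithCheckpoints` are hypothetical names, and the in-memory map stands in for a durable store.

```typescript
// Hypothetical checkpoint: index of the next step to run plus accumulated state.
interface Checkpoint {
  nextStep: number;
  state: number[];
}

// Stand-in for a durable store (Postgres, a KV store, or a workflow engine).
class InMemoryCheckpointStore {
  private saved = new Map<string, string>();

  save(runId: string, cp: Checkpoint): void {
    this.saved.set(runId, JSON.stringify(cp)); // serialized, as a real store would
  }

  load(runId: string): Checkpoint | undefined {
    const raw = this.saved.get(runId);
    return raw === undefined ? undefined : (JSON.parse(raw) as Checkpoint);
  }
}

// Run steps in order, persisting after each one. On restart, completed steps
// are skipped because `nextStep` was saved after they finished.
function runWithCheckpoints(
  runId: string,
  steps: Array<() => number>,
  store: InMemoryCheckpointStore
): number[] {
  const cp: Checkpoint = store.load(runId) ?? { nextStep: 0, state: [] };
  for (let i = cp.nextStep; i < steps.length; i++) {
    const step = steps[i];
    if (step === undefined) break;
    cp.state.push(step());
    cp.nextStep = i + 1;
    store.save(runId, cp);
  }
  return cp.state;
}
```

If a step throws partway through, calling `runWithCheckpoints` again with the same `runId` re-runs only the failed step and those after it.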

Infrastructure Matrix & Preventing Contamination
To maintain performance, these layers require separate storage shapes, read patterns, and write triggers:
Layer	Storage Shape	Write Trigger	Read Pattern
Working	In-memory scratchpad	Per-turn execution	Native context window injection
Conversation	Append-only log + summarizer	Every incoming message	Auto-loaded on invocation
Episodic	Time-indexed transcript + JSON summaries	Post-message background worker	Recency-weighted semantic retrieval
Semantic (Notebook)	Single editable Markdown file	Explicit agent tool writes	Full text injected to prompt
Semantic (Facts)	Graph DB (Neo4j class)	Auto-extracted post-message	Entity-anchored sub-graph matching
Knowledge Graph	Graph DB with temporal properties	Unified extraction loop with facts	Contextual edge-walking between nodes
Procedural	Markdown skill files	Human authorship or reflection	Dynamically loaded based on task
Checkpoints	KV Store / Postgres / Workflow engine	Every single execution step	Instantly restored on worker restart
Preventing Contamination
Naming the layers is straightforward; wiring them without cross-contamination is where production pipelines fail.
Episodic leaking into Semantic: If every line of a historical brainstorming session gets extracted as a hard "fact," the agent will interpret a transient hypothetical idea as absolute truth. Enforce strict LLM confidence thresholds or run your fact extraction pipelines on summarized episodes rather than raw chat logs.
Conversation leaking into the Graph: Active conversation is full of throwaway syntax and short pleasantries. Ingesting every message verbatim fills a graph database with garbage nodes. Enforce length-gated ingestion filters to skip processing short, transactional messages.
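A length gate of this kind can be very small. The sketch below is one possible shape; the function name `shouldIngest` and the eight-word floor are assumed defaults for illustration, not recommended values.

```typescript
// Hypothetical gate: only messages with enough substance reach the
// (expensive) graph extraction pipeline.
function shouldIngest(message: string, minWords: number = 8): boolean {
  const words = message
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0);
  return words.length >= minWords;
}
```

A word count is a crude proxy for substance; it is cheap enough to run on every message before any LLM call is made.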

Managing Upstream LLM Costs
An advanced knowledge graph ingestion pipeline requires between five and nine discrete LLM calls per message (handling entity extraction, graph deduplication, relationship inference, contradiction testing, and entity summary updates), alongside multiple embedding calls. Multiplied across thousands of active conversations running concurrently, background memory costs can quickly eclipse primary agent execution costs.
To keep this sustainable at scale, bake in kill switches and per-tenant gates from day one. Every layer running unattended in the background must have a configuration-level flag or a feature toggle. When an upstream model update or unexpected schema change causes an extraction loop to degrade or spin out of control, you need a way to stop the financial bleeding instantly without triggering an emergency production redeployment.
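One way to picture these gates is a single check that every background job calls before doing any work. The flag names, the `MemoryLayer` values, and the per-tenant daily call budget are all assumptions made for this sketch; in practice the flags would be read from configuration that can change without a redeploy.

```typescript
type MemoryLayer = "episodic" | "semantic" | "graph";

// Hypothetical configuration-level flags.
interface MemoryFlags {
  globalKill: boolean; // stops every background layer at once
  disabledLayers: MemoryLayer[]; // per-layer kill switches
  tenantDailyCallBudget: Record<string, number | undefined>; // per-tenant gate
  defaultDailyCallBudget: number; // used when a tenant has no explicit budget
}

// Returns true only if no kill switch is set and the tenant is under budget.
function mayRunBackgroundJob(
  flags: MemoryFlags,
  layer: MemoryLayer,
  tenantId: string,
  callsUsedToday: number
): boolean {
  if (flags.globalKill) return false;
  if (flags.disabledLayers.indexOf(layer) !== -1) return false;
  const budget = flags.tenantDailyCallBudget[tenantId] ?? flags.defaultDailyCallBudget;
  return callsUsedToday < budget;
}
```

Checking the global switch first means one flag flip is enough to stop all unattended spending.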
The Rebuild Blueprint
If you are starting over building an agent memory infrastructure today, this is the recommended development order:
1. Map the Concerns First: Do not select an orchestration framework based on hype. Map how your system will handle these seven distinct concerns before writing application logic.
2. Postgres for Foundations: Use Postgres for conversation history and step-level checkpointing. Boring, ACID-compliant storage is exactly what you want here.
3. Path-Routed KV for Filesystems: Implement a simple key-value store for notebooks and skill files, allowing the agent to interact with its procedural knowledge using clean, standard filesystem tool calls.
4. Native Graph + Temporal Constraints: Deploy a native graph database (Neo4j, Memgraph, or KuzuDB) paired with an off-the-shelf library that manages temporal constraints natively.
5. Tight Vector Tooling: Use a highly optimized vector store (pgvector, Qdrant, or Weaviate) specifically to index static external knowledge documents like Notion workspaces, Slack history, or uploaded manuals.

Ultimately, separating transient reasoning from immutable history and structured relational facts is what transforms a fragile chatbot into a reliable system. By treating memory as a multi-layered infrastructure concern, you build an environment where an agent's capability doesn't degrade over time; it compounds.
For more details continue reading at https://sistava.com/en/insights/ai-agent-memory.
Building Agent Memory Yourself?
The seven layers, the wiring, and the cost ceilings are a lot to get right on the first run.
If you want this exact architecture adapted to your tech stack, check out our support options at Sista AI. If you would rather talk engineer-to-engineer, I take a few of these architectural deep dives personally. You can reach me directly at zalt.me.

I Imagined Hermes Agent Running an Entire Smart City — And It Changed How I See AI

rishi — Sat, 23 May 2026 04:15:26 +0000

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

Most people still think of AI as a chatbot.

But while exploring Hermes Agent, I realized something much bigger:

We are entering an era where AI systems won’t just respond.
They’ll reason, plan, analyze, and take action.

As a Generative AI student who loves building real-world projects, this idea instantly fascinated me.

And it completely changed how I started thinking about one of my own concepts:
Trafiq AI.

From Chatbots to Autonomous Systems

For the last few years, most AI projects have followed a simple pattern:

Input → Response.

But Hermes Agent feels different.

Instead of behaving like a traditional assistant, it introduces something far more powerful:

planning,
tool usage,
reasoning,
and multi-step execution.

That shift may sound technical.

But honestly?

It changes everything.

Because once AI systems can reason through problems step-by-step, they stop feeling like simple software tools and start behaving more like intelligent systems.

The Moment Trafiq AI Started Making Sense

Recently, I worked on a concept called Trafiq AI — an AI-driven smart traffic system focused on:

congestion analysis,
route optimization,
predictive traffic monitoring,
and intelligent transportation insights.

At first, I imagined it as a dashboard.

But after exploring Hermes Agent, I started imagining something much more advanced.

What if the system could actually think through traffic problems?

What if an AI agent could:

monitor live congestion,
detect unusual traffic patterns,
prioritize emergency vehicles,
reroute traffic dynamically,
and generate real-time recommendations automatically?

That’s when I realized:

Agentic AI systems may become the operating layer behind future smart cities.

And honestly, that idea feels insane in the best possible way.

Why Hermes Agent Feels Important

The biggest thing that impressed me about Hermes Agent is accessibility.

Usually, advanced AI systems feel locked behind massive infrastructure and enterprise ecosystems.

But open-source agentic systems change that dynamic completely.

Now students and independent developers can experiment with:

autonomous workflows,
AI research systems,
intelligent assistants,
automation pipelines,
and decision-making agents

without needing huge resources.

That democratization matters a lot.

Because innovation becomes faster when more people can build.

AI Is Quietly Entering a New Phase

I think we are slowly moving beyond the “AI chatbot era.”

The next phase feels more like:

AI systems coordinating tasks,
using tools intelligently,
reasoning through workflows,
and collaborating with humans.

That’s a much bigger shift than most people realize.

And platforms like Hermes Agent are giving developers an early look at that future.

What Excites Me as a Student Developer

As someone passionate about Generative AI, hackathons, and building practical systems, this future feels incredibly motivating.

A few years ago, building intelligent multi-step systems like this would have sounded unrealistic for students.

Now it’s becoming possible with open ecosystems and modern AI tooling.

That’s powerful.

Because the next breakthrough idea might not come from a giant company.

It could come from:

a student,
a small developer team,
or someone experimenting late at night with open-source AI agents.

Honestly, that possibility is what excites me most.

Final Thoughts

Hermes Agent didn’t just make me think about better AI tools.

It made me think about a future where AI systems actively help run complex environments, assist decision-making, and solve real-world problems dynamically.

From smart kitchens to intelligent traffic systems like Trafiq AI, the future of AI feels less about simple conversations and more about intelligent action.

And after exploring agentic systems, one thing feels clear:

We are only at the beginning of what autonomous AI can become.

#hermesagentchallenge #devchallenge #agents #ai

AULA — The AI tutor that fits in a browser tab, built for the students the internet leaves behind

Juan Pablo Enriquez Ortiz — Sat, 23 May 2026 04:13:16 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

AULA is a complete AI tutoring platform that runs Google's Gemma 4 entirely inside the browser — no server, no account, no internet required after the first 1.5 GB download. It is designed for the 65+ million Latin American students living in areas where reliable internet is the exception, not the norm.

The premise is simple: if Gemma 4 can run on a Raspberry Pi 5, it can run on a teacher's laptop in rural Boyacá, Colombia. With WebGPU and MediaPipe, this is now possible — and AULA is what that looks like as a finished product.

The problem AULA solves

In Latin America, ~40% of students live with unreliable, capped, or non-existent connectivity. ChatGPT, Gemini, Khan Academy's AI tutor — all require a stable connection. The very tools that could close the global education gap are inaccessible exactly where they are needed most.

AULA flips this: the AI runs on the student's device, not on a server thousands of miles away.

What AULA does — offline (100% local, Gemma 4 E2B)

After loading once, these features work with WiFi off, in airplane mode, in a rural school with no signal:

🎓 Conversational tutor — chat with Gemma 4 in natural language. Full LaTeX rendering for math and science. ~15 tokens/sec on a mid-range laptop GPU.
🧮 Scientific calculator that teaches — visual keypad with trig functions, exponents, roots. Gemma 4 doesn't just solve. It explains the why.
🎙️ Voice tutoring (bidirectional) — ask by speaking, listen to the response. Optional hands-free mode chains them together.
🦉 Socratic mode — Gemma 4 stops giving answers and only asks guiding questions. Pedagogy-first.
🤔 "Explain it simpler" — three escalating reformulation levels on demand.
💡 Conceptual error detection — Gemma 4 diagnoses which concept the student misunderstood, not just "wrong, try again".
📚 Persistent study sessions in IndexedDB. No cloud sync ever.
♿ Accessibility first — high contrast, large text, easy reading mode (for dyslexia), auto-read responses.
🌍 Spanish ↔ English — full i18n. System prompts translate, not just the labels.
🏆 Local gamification — XP, levels, streak, achievements. All in the browser.

What AULA does — Cloud Boost (optional, Gemma 4 26B-A4B)

For features that require strict structured output (which is beyond what a 2B-parameter model can do reliably), AULA routes through the user's own free Google AI Studio API key:

✍️ Handwritten whiteboard — draw equations with finger or mouse, Gemma 4 reads and solves.
📷 Photo OCR + reasoning — point camera at a printed exercise, get a step-by-step solution.
♾️ Infinite adaptive practice — exercises that never repeat, with difficulty calibrated dynamically.
🎯 Interactive student quiz — self-assessment with scoring and per-error conceptual review.
👩‍🏫 Teacher mode with PDF export — generate quizzes, export student/teacher PDFs ready to print.
🎨 SVG illustrations — Gemma 4 generates educational diagrams.
🗺️ Mermaid mind maps — concept diagrams rendered interactively, downloadable as PNG/SVG.

Critical: Cloud Boost is always opt-in. AULA never sends data without an explicit API key configured by the user. The core educational experience never requires the internet.

Demo

🎥 Watch the 2-minute walkthrough: https://youtu.be/d0jN8Kw_Cz4

🔗 Live demo: https://aula.run (or local: pnpm dev -p 3100 after cloning)

Key screenshots

Chat tutor running 100% locally with full LaTeX rendering

Mermaid mind maps generated by Gemma 4 — click to enlarge, download as PNG

SVG illustrations — educational diagrams generated by Gemma 4

Scientific calculator that explains, powered locally

Teacher mode with PDF export — ready for classroom

Accessibility built-in: high contrast mode

Code

🔗 Repository: https://github.com/jpablortiz96/aula

The repo includes a comprehensive README with architecture diagrams, hardware benchmarks across devices (Raspberry Pi 5 to RTX 3050 to MacBook M3), full tech stack documentation, and a roadmap for v1.1 through v3.0.

License: MIT

How I Used Gemma 4

AULA uses a dual-engine architecture with intentional model selection for each tier:

Model	Variant	Where it runs	What it powers
Gemma 4 E2B-IT	~1.5 GB (q4f16 quantized)	Browser, via MediaPipe + WebGPU	All offline features
Gemma 4 26B-A4B-IT	Cloud (MoE)	Gemini API	Structured-output features

Why Gemma 4 E2B for local

The E2B variant is the only Gemma 4 model that fits realistically on consumer hardware while preserving the multimodal capability path. It runs at:

~15 tokens/sec on an NVIDIA RTX 3050 laptop
~20-25 tokens/sec on a MacBook M3
~7 tokens/sec on a Raspberry Pi 5 (CPU fallback)

This range covers every realistic device a Latin American student or teacher might have access to, from an $80 SBC to a school laptop. The 31B Dense model would never fit in a browser tab; the 26B MoE requires server-grade resources. E2B is the only viable choice for the rural offline use case, and that's exactly why I picked it.

Why Gemma 4 26B-A4B for cloud-enhanced features

Some features in AULA require strict structured output: JSON for quiz exercises, syntactically-valid Mermaid for mind maps, coherent SVG for illustrations. Small models are unreliable for this — they're brilliant at conversation but tend to add prose around JSON, produce malformed SVG, or break Mermaid syntax.

Rather than fight this limitation or hide it, AULA makes the routing explicit and visible to the user. Every screen shows which engine answered: green badge for local, blue badge for cloud. The 26B-A4B variant gives me near-31B quality at substantially lower latency thanks to its mixture-of-experts architecture — ideal for short structured outputs.
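The routing rule reads roughly like the sketch below. This is not code from the AULA repository: the feature list, the function names, and the "unavailable" outcome for a missing key are assumptions that follow from the opt-in rule stated earlier (nothing goes to the cloud without a user-configured API key).

```typescript
// Hypothetical feature and engine names, chosen for illustration.
type AulaFeature = "chat" | "calculator" | "quiz" | "mindmap" | "illustration";
type Engine = "local" | "cloud" | "unavailable";

// Features that need strict structured output (JSON, Mermaid, SVG).
const STRUCTURED_FEATURES: AulaFeature[] = ["quiz", "mindmap", "illustration"];

// Conversational features always stay local. Structured-output features go
// to the cloud only when the user has configured an API key.
function pickEngine(feature: AulaFeature, apiKey: string | undefined): Engine {
  const needsStructuredOutput = STRUCTURED_FEATURES.indexOf(feature) !== -1;
  if (!needsStructuredOutput) return "local";
  return apiKey !== undefined && apiKey.trim() !== "" ? "cloud" : "unavailable";
}

// Badge shown to the user: green for local, blue for cloud.
function badgeFor(engine: Engine): "green" | "blue" | null {
  if (engine === "local") return "green";
  if (engine === "cloud") return "blue";
  return null; // nothing answered, so there is no badge to show
}
```

Keeping the decision in one pure function makes it easy to verify that no structured-output request leaves the device without a key.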

Technical challenges I solved

1. transformers.js was not viable on NVIDIA Optimus laptops.
My first prototype used transformers.js + WebGPU. On an RTX 3050, I got 2 tokens/sec because dispatch was routing through the iGPU. Migrating to MediaPipe's WebGPU delegate unlocked 14-16 tokens/sec on the same hardware — a 7x improvement. MediaPipe is Google's official runtime for Gemma 4 on edge, and the difference is real.

2. Concurrency on LlmInference is exclusive.
A single MediaPipe LlmInference instance processes one prompt at a time. When /chat and /practice competed for the same singleton, the model locked up with the error "Previous invocation or loading is still ongoing." I implemented a FIFO queue with abort propagation across navigations, plus a forceReset() recovery path.

3. Gemma 4 26B does not support streamGenerateContent reliably.
This took an afternoon of DevTools debugging to identify: calling :streamGenerateContent returned 400, while :generateContent (no streaming) worked perfectly. The fix was creating a separate cloudNoStream.ts helper for Practice, Illustrator, and Mermaid — features that don't benefit from streaming anyway since the user is waiting for one complete response.

4. Easy Reading Mode is more than a CSS toggle.
For students with dyslexia or reading difficulties, AULA changes both the visual presentation (letter spacing, line height, max-width) and the system prompt sent to Gemma 4 ("Short sentences. Simple vocabulary. One idea per line."). This is the kind of accessibility that AI uniquely enables — the model adapts its output style, not just the typography.

What Gemma 4 unlocked that wasn't possible 18 months ago

Browser-native inference at this quality was genuinely impossible until WebGPU stabilized. AULA is only buildable in 2026. The combination of Gemma 4's open weights, WebGPU's GPU access, and MediaPipe's optimized runtime is what makes a Pi-friendly AI tutor a real thing, not a thought experiment.

For 65 million students in Latin America who have been excluded from the AI revolution, this matters more than I can describe in this post.

Tech stack: Next.js 15, TypeScript strict, Tailwind v4, MediaPipe LLM Inference, WebGPU, Gemini API (REST + SSE), Zustand, IndexedDB, jsPDF, Mermaid, tesseract.js, Web Speech API.

Built solo in 11 days for the DEV.to Gemma 4 Challenge.

AULA is open source under MIT. Fork it, run it in your school, contribute to it. If you're a teacher in a low-connectivity region and want help deploying AULA, open an issue on GitHub.

🇨🇴 Made in LATAM, for the students the world forgot.

One backend, four products: why we bet on platform-per-brand

MD RASHEDUL ISLAM — Sat, 23 May 2026 04:06:24 +0000

by Rashed

We shipped an auth bandaid at 2am. Cookies wouldn't flow between platform.ginilab.com and our gateway, which was running under a different registrable domain. Browsers blocked them, correctly. The bandaid that unblocked the demo was a 5-minute bearer token held in a Zustand store on the frontend, attached by hand to every request. It worked.

Within 24 hours we'd shipped four PRs of cookie-domain workarounds. Then someone asked the obvious question: "why isn't api.ginilab.com just another hostname on the same gateway?" It was. We had gone deep into a problem that a single DNS record solved.

That bug — cookie-domain mismatch in a multi-brand platform — is the one-paragraph version of why this post exists.

One caveat before we go further. v3 is in staging. It's not yet processing live payments. 300+ restaurants run on our legacy PHP/MySQL stack today, and the cohort migration hasn't started. What follows is the architecture we bet on and the pain we hit getting here, not a victory lap. If you want a "we scaled to a billion requests" story, this isn't it. If you want an honest mid-migration account from a small team, read on.

What "platform-per-brand" actually means

Ginilab is one backend platform that runs four products. Tomafood is restaurant ordering, with 300+ restaurants live on the legacy stack and the full rewrite to v3 in progress. CloudPOS is a POS for non-food retail. iSchool is school management. Ecommerce is generic ecommerce. Tomafood is the full rewrite. The other three consume shared services via REST or SDK. They aren't separate codebases. They aren't separate backends. They're different products mounted on the same platform.

This is unusual. Most SaaS teams either build one product and stay there, or build separate platforms per product when product two arrives. The shared-platform-across-products shape is the third path, and it has a tax: every architectural decision has to assume more than one consumer, more than one brand, more than one domain. The tax shows up early and never goes away.

We took the tax on purpose. We knew CloudPOS and iSchool were coming. Without that, this would have been overengineering.

[Diagram 1: architecture sketch. Four products on top, shared multi-hostname gateway in the middle, shared services below keyed by business_id + app_id, Tomafood-only restaurant-service off to the side.]

We rejected the obvious answer twice

The obvious answer when a second product appears is to fork. Take the codebase that works for product one, copy it, change the domain, run a second backend. Engineers know how to do this. It feels safe.

We rejected it twice. The first time was when CloudPOS came online and the temptation was to fork Tomafood's auth service and run it as a second backend behind the POS product. The second time was when iSchool was scoped and the temptation flipped: extract microservices per product, one stack per vertical. Both options were wrong for the same underlying reason. A customer who orders on Tomafood and later signs up for CloudPOS should be the same identity. Forking the auth service means reconciling those identities later. Per-product microservices means reconciling them four times.

The version that doesn't require reconciliation is one auth service, multi-tenant by design, with every shared service carrying both who the business is and which product they're using.

The design rule that makes it work

Every shared service in the platform — auth, addresses, payments, notifications, gateway — carries two identifiers on every query:

business_id is the specific business (UUID).
app_id is which product they're using: tomafood, cloudpos, ischool, and so on.

Not restaurant_id. There is no restaurant_id column anywhere in a shared service. restaurant_id is a Tomafood-only concept that lives only in the Tomafood product service.

The pair flows through JWT claims and is enforced at the repository layer. We say this in the CLAUDE.md at the root of the repo about as bluntly as we can:

Shared services NEVER use restaurant_id — always business_id + app_id.
Repository enforces WHERE business_id = ? on every query.
JWT claims include businessId + appId.

In practice that means a row in auth_db.users doesn't know what a restaurant is. It knows it belongs to a business, and the business runs on an app. A row in restaurant_db.recipes does know what a restaurant is, because restaurant_db belongs to Tomafood and restaurant_id is meaningful there.

The boundary is consistent. Shared services see businesses. The Tomafood product service sees restaurants. That sentence took us a long time to write down, and longer to enforce.
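A minimal in-memory sketch of that repository rule follows. It is not the actual Ginilab code: `TenantScope`, `AddressRow`, and `ScopedAddressRepository` are hypothetical names, the array stands in for a database table, and filtering on `appId` as well as `businessId` is an assumption based on both identifiers being carried on every query.

```typescript
// Hypothetical scope, mirroring the businessId + appId JWT claims.
interface TenantScope {
  businessId: string;
  appId: string;
}

// Hypothetical row in a shared service; note there is no restaurant_id.
interface AddressRow {
  id: string;
  businessId: string;
  appId: string;
  line1: string;
}

// Every read takes a scope. There is deliberately no unscoped query method,
// which is the in-memory analogue of "WHERE business_id = ?" on every query.
class ScopedAddressRepository {
  private readonly rows: AddressRow[];

  constructor(rows: AddressRow[]) {
    this.rows = rows;
  }

  list(scope: TenantScope): AddressRow[] {
    return this.rows.filter(
      (r) => r.businessId === scope.businessId && r.appId === scope.appId
    );
  }

  findById(scope: TenantScope, id: string): AddressRow | undefined {
    return this.list(scope).find((r) => r.id === id);
  }
}
```

Because `findById` goes through `list`, asking for another business's row by its id returns `undefined` rather than leaking it.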

[Diagram 2: decision tree. Shared service? Then business_id + app_id. restaurant-service? Then restaurant_id is fine. Neither? Then you're in the wrong file.]

The multi-brand pain, made concrete

The cookie story from the opener is what happens when "multi-brand" stops being an abstract design rule and becomes a Tuesday. Each restaurant on Tomafood can run on its own white-label domain — their brand, their registrable domain. The platform has its own brand. The gateway has to accept cookies from all of them.

The first version of the cookie-domain helper was a security hole. It checked host.includes('ginilab.com') to decide whether to set the cookie's domain attribute. A lookalike host like ginilab.com.evil.example would have passed that check. The second version checks suffix-with-leading-dot:

// packages/shared/src/cookies/pick-domain.ts
// Returns the cookie Domain attribute for a request origin, or undefined
// when no known brand domain matches.
export function pickCookieDomain(
  origin: string | undefined
): string | undefined {
  if (!origin) return undefined;
  const host = new URL(origin).hostname;
  // Lookalike-domain defence: must end with the LEADING dot,
  // not just contain the string.
  if (host.endsWith('.ginilab.com')) return '.ginilab.com';
  // ... plus one branch per brand registrable domain
  return undefined;
}

A handful of lines. They exist because we have more than one brand on one platform. If we'd had one brand, this would have been a hardcoded constant. If we'd had four separate backends, each one would have hardcoded its own constant, and the bug would live in four places.
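To see why the leading dot matters, it helps to compare three checks on the same hostnames. The function names below are made up for this illustration, and `evilginilab.com` is a hypothetical second lookalike added to show that a suffix check without the dot is also insufficient.

```typescript
// The flawed first version: substring match.
function matchesBySubstring(host: string): boolean {
  return host.includes("ginilab.com");
}

// A tempting half-fix: suffix match without the leading dot.
function matchesBySuffixNoDot(host: string): boolean {
  return host.endsWith("ginilab.com");
}

// The check used in the helper: suffix match WITH the leading dot.
function matchesByDottedSuffix(host: string): boolean {
  return host.endsWith(".ginilab.com");
}

const sampleHosts = [
  "platform.ginilab.com", // legitimate subdomain
  "ginilab.com.evil.example", // lookalike: brand appears as a prefix
  "evilginilab.com", // lookalike: brand appears as a bare suffix
];

for (const host of sampleHosts) {
  console.log(
    host,
    matchesBySubstring(host),
    matchesBySuffixNoDot(host),
    matchesByDottedSuffix(host)
  );
}
```

Only the dotted-suffix check accepts the real subdomain while rejecting both lookalikes. Note that the bare apex `ginilab.com` does not match the dotted suffix either; whether it needs its own branch depends on the deployment.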

This is the smallest, ugliest example of the platform-per-brand tax. There are larger ones. They all have the same shape: a thing that would be a constant in a single-brand world becomes a function in a multi-brand one. The function is the cost.

What we picked, what we rejected, why

We picked one backend platform, multi-product, multi-brand. Shared services keyed by business_id + app_id. The Tomafood-only product service keeps restaurant_id. JWT carries both identifiers. The same gateway is exposed under per-brand hostnames so cookies flow.

We rejected one codebase per product — four backends, four auth services, four databases. This is the standard SaaS path and most teams' default. We rejected it because the products share customers and the reconciliation cost compounds. A user who shops on Ecommerce and orders on Tomafood and whose kid is on iSchool is one human. Four backends would turn that human into four accounts with four passwords and four address books, held together by sync code. We would be writing and maintaining that sync code for years.

We rejected microservices-per-product from day one. Per-vertical stacks, one platform-org per product. We rejected this because we're a small team and the operational surface scales with services rather than users. Splitting before a second consumer exists for any given surface is premature. Our restaurant-service today is a deliberate monolith — it contains menu, orders, kitchen, tables, drivers, reservations, and reviews in one deployable. We will split a surface out the moment a second consumer (CloudPOS, iSchool) needs that surface, and not before.

We gave up the freedom to ship a product-specific schema change without thinking about other products. Every shared schema change has to consider all current and plausible future consumers. That slows down week-to-week work. The bet is that it speeds us up over the lifespan of the platform.

What we got is more boring than it sounds: one auth service, one identity model, one set of secrets to rotate, and a single place to fix every helper.

The takeaway

Platform-per-brand is a bet on product-multiplication. We made it because we knew CloudPOS and iSchool were coming. If you only ever ship one product, this is pure overhead. Every shared-service decision costs more than it would in a single-product codebase, and you get none of the payoff. If you'll ship two, the difference is one team versus four. If you'll ship four, there is no version of this where fork-and-clone stays survivable.

Two questions worth sitting with before betting the same way. Do you know what product two looks like? Does it share customers with product one? If both answers are yes, the platform shape pays off. If either is no, it's overhead.

We'll come back to specific pieces of this in later posts — the idempotency middleware on money paths, the multi-zone CDN purge, the Valkey vs Redis pricing fight, the strategic monolith. Each is its own story. This post is the foundation. Every later decision in the series only makes sense because the platform shape was already chosen.

AI's tech debt is invisible — even to AI. I solved it at the architecture layer.

Aming — Sat, 23 May 2026 03:58:23 +0000

TL;DR — AI repeats your patterns badly, ignores existing services, and forgets every cross-session lesson you taught it. This isn't laziness — it's a new kind of tech debt: invisible, systemic, and architectural. Project memory hints don't scale. Bigger context windows don't help. The fix is structural: pin a graph projection of your codebase to every commit, let AI read it before writing, surface "graph stale" prompts when source drifts. Real commit receipts from my own OSS project aming-claw inline. Architects, change my mind in the comments.

What is AI tech debt?

Let me define this precisely, because it's a different beast from the tech debt you already know.

Dimension	Traditional tech debt	AI tech debt
Who creates it	Engineers (knowingly)	AI (unknowingly)
Awareness	Conscious tradeoff	AI doesn't know it's accruing
Fix lifecycle	Fix once, done	Every new session repeats it
Visibility	`git log` shows it	Invisible across sessions
Scale	Team-bounded	Systemic, AI-generated

The core asymmetry: the more your team uses AI for coding, the more invisible debt accrues — and you have no tool that sees it.

5 symptoms (diagnose yourself)

Run this checklist against your team:

❌ AI re-implemented a service that already exists
❌ AI shipped code using a pattern completely inconsistent with everything around it
❌ AI didn't see the implementation sitting in the next file over
❌ Every new session repeats the same mistakes you corrected last time
❌ AI treats a familiar codebase as if it were brand new

Three or more? You're accruing AI tech debt. The bigger your team and the more AI you use, the faster it compounds.

A real case study: my toolboxclient stateService

I'm the maintainer of toolboxclient (open-source cross-platform AI agent runtime, 274+ stars). I asked AI to add a stateService.

The directory server/services/ already contained, in clear sight:

TOOLBOXCLIENT/server/services/
├── fingerPrintService.js
├── memoryService.js
├── providerModelService.js
├── proxyService.js
├── taskService.js
├── toolServiceManager.js
├── walletService.js
└── webSocketService.js

Roughly a dozen services, all sharing the same HTTP pattern.

What AI shipped (commit 68487cc, 2026-03-19):

// AI's version: WebSocket-based StateClient with Proxy
class StateClient {
  constructor(agentName) {
    // 🚨 WebSocket, not HTTP — inconsistent with every other service in the folder
    this.ws = new WebSocket(...)
    this._data = {}
    this.state = this._createProxy()
  }

  _createProxy() {
    // Proxy traps to broadcast via WebSocket
    return new Proxy(this._data, { ... })
  }
}

It used WebSocket instead of HTTP. It used a Proxy-based intercept-and-broadcast pattern unlike anything else in the codebase. It built a parallel architecture next to an established one.

This wasn't a code bug. It was a pattern bug. AI literally couldn't see the existing services.

The first fix: project memory

My first instinct: add a hint to project memory.

use existing HTTP services, don't add WebSocket

AI refactored cleanly (commit bbdf82c, 2026-03-21):

feat: stateService Phase A+B — HTTP CRUD + SSE broadcast

Phase A: /api/state/* routes (read, write, session CRUD, language pref)
Phase B: SSE subscribe endpoint with topic filtering + EventBus broadcast

74/74 tests pass. No breaking changes — additive only.

WebSocket gone. HTTP CRUD + SSE matching the existing pattern. Clean fix.

For about ten seconds, I thought I'd solved it.

Why project memory hints don't scale

Then I realized something uncomfortable:

This catch only worked because I noticed.

The next AI session would start with zero memory of this lesson.
Every context window starts as a blank slate.

This is the systemic nature of AI tech debt:

AI can't see existing patterns when it writes
I see it → I fix it once → the fix doesn't propagate to future sessions
Manual project memory maintenance puts the work back on me, not AI
This doesn't scale — and the failure mode is silent

The first insight

I stopped trying to fix prompts and started looking at the structural problem:

AI agents don't need bigger context windows.

They need a persistent structural record of the project that survives across sessions.

Context windows are short-term memory. What's missing is long-term, project-level memory — something any AI session can read before writing.

This is the insight that turned into aming-claw.

Building aming-claw (and falling into the next trap)

The idea: give every AI session a queryable graph of the project. Files, modules, functions, patterns — all of it, machine-readable, persistent.

Scan the codebase → build a graph of all entities and relations
Expose it through an MCP server that any agent can query
AI reads the graph before writing
Graph persists across sessions

I built it. It worked. Then it broke — at a higher layer.

I had implemented the graph with:

Mutable nodes — agents could edit graph state directly
A patch pipeline — 5-stage mutation flow (propose → validate → review → apply → snapshot)
A graph editor UI — humans could also edit the graph

Within a few weeks, the graph drifted from the actual code.

Why? Because I had created a second source of truth:

The real source of truth was source code
But I also let the graph be directly mutated
The two sources inevitably diverged

Same trap. Higher layer.

The real architectural insight

After hitting the same trap twice, the answer crystallized:

Wrong: the graph is something you edit.

Right: the graph is a projection of the commit.

In concrete terms:

Every commit can correspond to one graph

git commit (modifies source / hints / config)
     ↓
system detects: HEAD ≠ graph's bound commit
     ↓ ⚠️ "graph stale" prompt
user decides when to reconcile
     ↓ user-triggered
fixed_algorithm(source + hints + config)
     ↓
new graph snapshot ←→ new commit hash

4 key invariants

#	Invariant	What it guarantees
1	Fixed algorithm	Same input → same graph (deterministic, no randomness)
2	1:1 binding	Every commit hash maps to exactly one graph snapshot
3	User-triggered	Reconciliation is explicit, not a background git hook
4	Stale prompt	System surfaces drift in dashboard / CLI; user triggers when ready
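
The four invariants can be sketched in a few lines. This is an illustration under stated assumptions, not aming-claw's actual code: `ProjectInputs`, `GraphSnapshot`, `buildGraph`, `isStale`, and `reconcile` are hypothetical names, and the "fixed algorithm" here is just a sorted import list standing in for the real scanner.

```typescript
// Hypothetical input: file path -> files it imports
// (a stand-in for source + hints + config).
type ProjectInputs = Record<string, string[]>;

interface GraphSnapshot {
  commit: string;            // the commit hash this snapshot is bound to
  nodes: string[];
  edges: [string, string][]; // [from, to]
}

// Invariant 1: a pure function of its inputs. Sorting removes any dependence
// on object key order, so the same inputs always yield the same graph.
function buildGraph(commit: string, inputs: ProjectInputs): GraphSnapshot {
  const nodes = Object.keys(inputs).sort();
  const edges: [string, string][] = [];
  for (const from of nodes) {
    for (const to of [...(inputs[from] ?? [])].sort()) {
      edges.push([from, to]);
    }
  }
  return { commit, nodes, edges };
}

// Invariant 4: the graph is stale whenever HEAD differs from the bound commit.
function isStale(head: string, snapshot: GraphSnapshot | undefined): boolean {
  return snapshot === undefined || snapshot.commit !== head;
}

// Invariants 2 and 3: snapshots are stored per commit hash,
// and nothing rebuilds unless the caller asks for it.
function reconcile(
  store: Map<string, GraphSnapshot>,
  head: string,
  inputs: ProjectInputs
): GraphSnapshot {
  const snapshot = buildGraph(head, inputs);
  store.set(head, snapshot);
  return snapshot;
}
```

Because `buildGraph` has no hidden state, replaying it on the same commit gives the same snapshot, which is what makes the consistency check possible.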

Why not a git hook?

A reasonable question: why not auto-rebuild the graph on every commit via a git hook?

Three reasons I deliberately didn't:

Reconciliation is expensive (full codebase scan + algorithm)
Surprise auto-builds destabilize state — user should control when state changes
Batching commits before a single reconcile is often what users want

The system shows a graph stale indicator in dashboard and CLI. Users reconcile when they're ready. This is a deliberate design choice, not a limitation.

How modification and rollback work

Operation	Implementation
Modify the graph	Modify source / hints / config → trigger reconcile
Roll back the graph	`git revert` → trigger reconcile
Verify consistency	Same commit → same graph (replayable)

Logic lives in code. The graph is a read-only projection.

How this solves AI tech debt

Returning to the original problem: AI repeats patterns badly because it can't see the codebase.

The architectural fix:

Every AI session starts by querying the graph (via MCP)
The graph records the full structure — files, functions, modules, patterns
AI sees, for example, the existing HTTP service pattern in server/services/
AI reuses the pattern instead of shipping a parallel WebSocket implementation
After AI makes changes → user commits → system flags graph as stale → user reconciles → next session sees updated patterns

Cross-session knowledge transfer happens through the graph, not the prompt.

This is what "solved at the architecture layer" means: it's not a smarter prompt, it's a different topology of state.

Coming up: the algorithm itself

This post covered why the projection model works. The next post covers how the algorithm builds the graph:

in-degree=0 entry detection
DFS 3-color marking
Tarjan SCC for cyclic clusters
6-signal layer scoring
Cross-language fact pipeline (Python + TypeScript)

Follow me here to catch the next one.

Change my mind

I claim this architectural pattern solves AI tech debt: every commit corresponds to one graph + user-triggered reconcile + stale-state prompt.

Your turn. Two architectural choices:

Treat project state as a single source of truth, commit-bound
Or maintain a separate memory store that AI writes to

Which is more robust? Which scales better? Where would you attack my approach?

Calibrated invitation: I want senior engineers and AI infra people to push back with specifics. "What about X?" or "Have you considered Y?" lands better than "this won't work." If you've shipped something adjacent, tell me — I want to compare designs.

Why ROAS 300% Can Still Mean Losses — Gross Margin in 5 Ecommerce Verticals

toshihiro shishido — Sat, 23 May 2026 03:55:12 +0000

"ROAS 300%, so we're profitable." I've seen this line in dozens of internal EC reports, and in maybe half of them the business was actually losing cash. The trap is gross margin. For a 30%-margin product, ROAS 300% is below breakeven, which sits at 333%. Same ROAS, different margin, opposite conclusion.

This post walks through why ROAS alone is a misleading profitability signal, what gross margin actually is, where typical EC verticals land (15–75%), and the 3-step method I use to measure it from real data.

TL;DR

Gross margin = (revenue − COGS) ÷ revenue × 100. Business decisions run on gross profit, not revenue
EC gross margins span 15–75% by vertical (cosmetics 60–75%, electronics 15–25%)
Breakeven revenue = fixed costs ÷ gross margin. Double the margin and required revenue is halved
Breakeven ROAS = 1 ÷ gross margin × 100. Judging profitability on ROAS alone is dangerous
Measure your own gross margin in 3 steps — define COGS, take a sales-weighted average, validate against industry benchmarks

1. Why ROAS Without Gross Margin Is Misleading

ROAS 300% means "$3 of revenue per $1 of ad spend." That's revenue, not profit. Plug in different gross margins and the conclusion flips.

30% margin → gross profit of $0.90 against $1.00 ad spend = a $0.10 loss per $1 ad
50% margin → gross profit of $1.50 against $1.00 ad spend = a $0.50 profit per $1 ad
70% margin → gross profit of $2.10 against $1.00 ad spend = a $1.10 profit per $1 ad

The same ROAS produces three different business outcomes depending on the underlying gross margin. Reading ROAS in isolation is the most common source of overspending on ads in low-margin verticals.
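
The arithmetic behind the three cases can be checked with a short calculation. The function name `grossProfitPerAdDollar` is illustrative, and both ROAS and margin are passed as fractions (3.0 for 300%, 0.30 for 30%). Like the list above, it ignores fixed costs and other SG&A.

```typescript
// Gross profit left over after ad spend, per $1 of ad spend.
// roas: revenue per ad dollar (3.0 = 300%); margin: gross margin as a fraction.
function grossProfitPerAdDollar(roas: number, margin: number): number {
  return roas * margin - 1;
}

for (const margin of [0.3, 0.5, 0.7]) {
  const net = grossProfitPerAdDollar(3.0, margin);
  const label = net >= 0 ? "profit" : "loss";
  console.log(
    `${(margin * 100).toFixed(0)}% margin: ${label} of $${Math.abs(net).toFixed(2)} per $1 of ad spend`
  );
}
```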

2. What Gross Margin Actually Is

Gross margin shows how many cents of every revenue dollar remain as gross profit, after subtracting the cost of goods sold.

Gross margin (%) = (revenue − COGS) ÷ revenue × 100

For EC, the standard COGS bucket includes purchase cost of goods (or manufacturing cost), inbound shipping, direct packaging materials, and payment processing fees. SG&A (ad spend, payroll, fulfillment outsourcing, office rent) sits outside gross margin — it goes into operating margin further downstream. The most common mistake is dumping ad spend into COGS, which artificially depresses gross margin.

3. Five EC Vertical Benchmarks

EC gross margins span 15–75% across verticals. The product structure is fundamentally different even though everything gets labeled "ecommerce."

The numbers are reference ranges — in-house brands vs. resellers, full-price vs. sale-driven operations move them up or down. The important point is that each vertical has its own correct range. A consumer-electronics EC chasing 60% margin is unrealistic; a cosmetics EC running at 30% probably has something miscounted.

Benchmarks are reference points, not targets. The actual decision is whether your own margin sits within the band that the vertical's product economics allow.

4. Breakeven Falls Out Once Margin Is Locked

Once gross margin is locked, two breakeven numbers fall out immediately. In both formulas, gross margin is entered as a fraction (0.30 for 30%).

Breakeven revenue = fixed costs ÷ gross margin
Breakeven ROAS    = 1 ÷ gross margin × 100

Breakeven ROAS by margin:

20% margin → 500% breakeven ROAS
30% margin → 333% breakeven ROAS
40% margin → 250% breakeven ROAS
50% margin → 200% breakeven ROAS
60% margin → 167% breakeven ROAS
70% margin → 143% breakeven ROAS

A consumer-electronics EC at 20% margin needs ROAS 500% just to break even. A cosmetics EC at 70% margin only needs 143%. The same "ROAS 300%" headline number is a guaranteed loss for one and a strong profit for the other. Every ad-budget decision starts from confirming gross margin first.
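
The two breakeven formulas translate directly into code. This is a minimal sketch with illustrative function names; margin is a fraction (0.30 for 30%), and ROAS is returned and compared as a percentage to match the table above.

```typescript
function assertMargin(margin: number): void {
  if (!(margin > 0 && margin <= 1)) {
    throw new RangeError("margin must be a fraction in (0, 1]");
  }
}

// Breakeven ROAS as a percentage: 1 / margin * 100.
function breakevenRoasPercent(margin: number): number {
  assertMargin(margin);
  return (1 / margin) * 100;
}

// Revenue needed to cover fixed costs at a given gross margin.
function breakevenRevenue(fixedCosts: number, margin: number): number {
  assertMargin(margin);
  return fixedCosts / margin;
}

// True when the observed ROAS covers ad spend out of gross profit.
function coversAdSpend(roasPercent: number, margin: number): boolean {
  return roasPercent >= breakevenRoasPercent(margin);
}
```

With these, `coversAdSpend(300, 0.2)` is false and `coversAdSpend(300, 0.7)` is true, which matches the electronics and cosmetics cases.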

5. Three Levers to Improve Margin

Margin improvement has three levers, in priority order — pricing > product mix > COGS negotiation.

Pricing is the fastest lever. A 3% price increase with constant unit volume and unchanged COGS adds 3% of prior revenue straight to gross profit; at a 30% margin, that lifts margin to roughly 32%. Even with some churn, price elasticity above −1.0 (demand doesn't drop sharply on price increases) makes the lift net-positive on total gross profit.

Product mix moves the sales-weighted average margin by lifting the share of high-margin SKUs. Cross-sell flows that attach a high-margin item, subscriptions anchored on high-margin repeat goods, and bundles built around the higher-margin SKU are the standard plays.

COGS negotiation sits on the supplier side — unit-price negotiation, fulfillment efficiency, packaging optimization. The effect is slow, capped by supplier relationships, and best run on an annual review cycle. Bigger purchase lots trade margin against inventory risk, so this is only sensible once AOV and repeat rate are stable.

6. Measuring Your Gross Margin in 3 Steps

The formula is simple, but producing your own number and running operations against it is separate work. Here is a 3-step method to get a current number into operations.

Step 1 — Define COGS

Fix the COGS bucket internally to the four standard items (purchase cost + inbound shipping + direct packaging + payment fees). SG&A stays out.

Step 2 — Take a sales-weighted average across SKUs

With multiple SKUs, compute the per-SKU margin and weight by revenue, not by unit count. Revenue weighting captures high-AOV products correctly.

Sales-weighted average margin = Σ (SKU i gross margin × SKU i revenue) ÷ Σ (SKU i revenue)

Reconcile GA4 e-commerce events (the purchase event's value parameter) against your internal sales system once a month. GA4 alone won't give you margin (COGS isn't in GA4) — the reconciliation step is the unavoidable part.
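
A sales-weighted average weights each SKU's margin by its revenue, which works out to total gross profit divided by total revenue. A minimal sketch, assuming a hypothetical `SkuSales` record whose revenue and COGS have already been reconciled per SKU:

```typescript
interface SkuSales {
  sku: string;
  revenue: number; // total revenue for the period
  cogs: number;    // total COGS for the same period
}

// Revenue-weighted average gross margin, returned as a fraction.
function salesWeightedMargin(rows: SkuSales[]): number {
  let weighted = 0;
  let totalRevenue = 0;
  for (const { revenue, cogs } of rows) {
    if (revenue <= 0) continue; // skip SKUs with no sales in the period
    const margin = (revenue - cogs) / revenue;
    weighted += margin * revenue; // per-SKU margin weighted by revenue
    totalRevenue += revenue;
  }
  if (totalRevenue === 0) throw new RangeError("no revenue in the period");
  return weighted / totalRevenue;
}
```

As a made-up example, a SKU with $1,000 revenue at 60% margin and one with $3,000 revenue at 20% margin average to 30% when weighted by revenue, not the 40% a simple average would suggest.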

Step 3 — Validate against the industry benchmark

Compare to the §3 vertical ranges. Within ±10 percentage points is normal; bigger gaps need investigation.

Below industry average — high purchase cost, heavy discounting, excessive inventory loss
Above industry average — brand-led pricing, in-house manufacturing, restrained discounting

Once the gap is explainable, gross margin is locked, and breakeven revenue and breakeven ROAS fall out.

Wrap-up

Gross margin is upstream of every other profitability lever. EC verticals span 15–75%, so the same ROAS produces opposite conclusions depending on the underlying margin. Reading ROAS without anchoring to margin is the most common source of overspending in low-margin verticals.

The 3-step measurement — define COGS, weight by sales, validate against benchmarks — is the entry point. Once gross margin is locked, the rest of the financial decisions fall out almost mechanically.

How do you currently anchor your ad-budget decisions — pure ROAS, breakeven ROAS by margin, or something blended with LTV?

Originally posted on RevenueScope.


You Don’t Need to Try Every AI Tool to Keep Up

Dechive — Sat, 23 May 2026 03:50:18 +0000

A practical note on AI tool anxiety, productivity pressure, and choosing better standards.

Every developer feed has started to feel like a speedrun.

Someone built an app with AI over the weekend.
Someone launched a small SaaS.
Someone connected a new model to an agent workflow.
Someone tested the latest coding assistant and already has a thread about it.

Then the quiet question appears:

Am I falling behind?

It does not always feel like failure.

Sometimes it feels like absence.

I am not necessarily doing something wrong.
I am just not doing enough.

Not building enough.
Not testing enough.
Not automating enough.
Not using the newest tools quickly enough.

In the age of AI, that feeling can become exhausting.

But before we accept it as truth, we should ask a better question:

What standard am I using to decide that I am behind?

Two types of AI anxiety

I think AI anxiety often shows up in two forms.

1. Productivity anxiety

This is the feeling that everyone else is producing more with AI.

They are writing faster, coding faster, launching faster, publishing faster, and turning small ideas into visible projects faster than before.

The feed keeps showing a version of:

I built this with AI.

So if I am not building something too, it can feel like I am wasting time.

2. Tool anxiety

This is the feeling that every new model, framework, agent, editor, or workflow needs to be tested immediately.

A new model comes out.
A new AI coding tool gets attention.
A new automation pattern spreads.
A new “best workflow” appears.

Someone has already tried it.
Someone has already compared it.
Someone has already connected it to five other tools.

So the question becomes:

If I am not using all of this, am I falling behind?

Both anxieties feel real.

But both depend on comparison.

Trying a tool is not the same as keeping up

Here is the mistake I keep noticing:

We confuse trying a tool with moving forward.

But these are different things.

Trying a tool quickly is not the same as understanding it.

Understanding a tool is not the same as using it well.

Using a tool well is not the same as building something meaningful with it.

The first person to test a new model is not automatically the person who understands it best.

The person who connects many tools together is not automatically solving a better problem.

The person who launches faster is not always moving in a better direction.

In the AI era, activity can easily disguise itself as progress.

That does not mean we should ignore new tools.

Experimentation matters.
Curiosity matters.
Trying new models can reveal what is changing.

But a tool is not a direction.

A model is not a goal.

A workflow is not a standard.

The feed is not a good standard

The feed is good at showing motion.

It is not always good at showing meaning.

It shows:

Who launched something
Who tried the newest model
Who built a workflow
Who automated a task
Who shipped faster
Who got attention

But it does not always show:

Whether the tool actually solved a real problem
Whether the workflow is maintainable
Whether the output was useful
Whether the project will survive next week
Whether the person building it even needed it

That is why using the feed as a standard is dangerous.

The feed can always move the finish line.

After you try one tool, another one appears.
After you launch one project, someone launches three.
After you automate one workflow, someone shows a better one.

If the standard stays outside of you, no tool will be enough.

A better checklist before trying a new AI tool

Before trying a new AI tool, I want to ask better questions.

Not because tools are bad.

But because attention is limited.

Here is the checklist I want to use.

1. Why do I want to use this?

Is it curiosity?

Is it connected to a real problem?

Or am I only reacting because everyone else seems to be using it?

2. What problem does it solve?

A tool should be connected to a problem.

If I cannot name the problem, I am probably just collecting tools.

3. What would count as a useful result?

Before using the tool, I should know what “better” means.

Does it save time?
Does it improve quality?
Does it reduce friction?
Does it help me understand something?
Does it help me build something I actually care about?

4. What will I stop doing if this works?

This question is important.

If a tool does not change anything about how I work, maybe it is not actually useful yet.

A useful tool should replace, improve, or clarify something.

5. Am I curious, or am I anxious?

Curiosity and anxiety can look similar.

Both can make us test tools.
Both can make us write notes.
Both can make us post screenshots.

But they feel different internally.

Curiosity builds judgment.

Anxiety borrows direction.

A practical example

Instead of saying:

I need to try this new AI coding tool because everyone is talking about it.

I want to say:

I want to test this tool because I spend too much time refactoring repeated UI patterns, and I want to see if it can reduce that friction without lowering code quality.

That second sentence has a standard.

It has a problem.
It has a reason.
It has something to verify.

The goal is not just to use the tool.

The goal is to find out whether the tool helps with a real task.

That difference matters.

Keeping up does not mean using everything

Keeping up with AI does not mean using every new model, framework, agent, or workflow.

It means building the judgment to decide what is worth using.

It means knowing why we are trying something before we mistake the act of trying for progress.

It means knowing what we are building before we confuse output with direction.

AI can make us faster.

But speed only helps when we know what it is serving.

Without an internal standard, every new tool becomes a demand.

Every launch becomes a comparison.

Every post becomes evidence that we are late.

With a standard, a tool can become just a tool again.

Something to test.
Something to use.
Something to ignore.
Something to return to later.

Maybe falling behind in the age of AI is not always about using fewer tools.

Maybe it is often about borrowing too many standards from the feed.

Originally published on Dechive — an archive for verifying AI-generated answers before we trust them.

https://dechive.dev/en/archive/am-i-falling-behind-in-ai-era

NovelPilot: A Novel Writing Agent Powered by Gemma 4

Doraking — Sat, 23 May 2026 03:50:06 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

Most AI story generators work like this:

prompt in → wall of text out

That is useful, but it does not feel like a real writing process.

When people write fiction, they do not only generate paragraphs. They plan the premise, design characters, build the world, structure the plot, manage foreshadowing, write scenes, edit style, check continuity, and prepare the final piece for readers.

So I built NovelPilot.

NovelPilot is a Gemma 4-powered AI writing room that turns one prompt into a complete story creation pipeline.

One prompt goes in.

Nine agents start working.

A finished story comes out.

What I built

NovelPilot is a web app that helps users create short fiction through a structured multi-agent workflow.

The user starts with a simple prompt, such as:

Write a melancholic sci-fi mystery set in modern Tokyo. A graduate student who lost his memory investigates a disappearance in a quantum computing lab.

Then NovelPilot launches a sequence of specialized AI agents:

Premise Architect
Character Director
World Builder
Plot Strategist
Chapter Architect
Prose Writer
Style Editor
Continuity Detective
Publisher Agent

Each agent performs a specific part of the writing process.

The result is not just a generated story. It is a full creative package:

Story concept
Character profiles
Worldbuilding notes
Plot structure
Chapter outline
Chapter 1 draft
Style editor report
Foreshadowing tracker
Continuity detective report
Title ideas
Publication summary
Browser reading mode
Polished PDF export

NovelPilot is designed to demonstrate Gemma 4 as a multi-agent creative reasoning engine, not just a text completion model.

Demo

Live demo: https://novelpilot.vercel.app

How to try it:

Open the live demo.
Click Run Judge Demo.
Watch the nine-agent pipeline complete.
Read the finished novel in the browser.
Review the Foreshadowing Tracker and Continuity Detective.
Download the final story as a polished PDF.

The Judge Demo works without an API key, so reviewers can test the full experience immediately.

For live generation, NovelPilot supports Gemma 4 through a provider abstraction, with OpenRouter as the recommended provider.

Sample prompt and output

Here is the sample prompt I used to test NovelPilot.

The protagonist is Ren Kanzaki, a 24-year-old graduate student working in a quantum computing laboratory. A few days ago, he lost part of his memory. He cannot remember what he was researching, why his professor suddenly disappeared, or why his own name appears in an old experimental log.

The story begins on a rainy night in Tokyo. Ren enters the university research building after midnight and finds an old experiment log hidden inside a locked drawer. On the final page, he sees the sentence:

“Ren Kanzaki will be removed from the observation target as of today.”

The story should focus on quiet tension, memory gaps, emotional unease, and the unsettling atmosphere of the laboratory. Avoid flashy action. Let the mystery emerge through scenery, silence, dialogue, and small contradictions.

Main theme:
If memories disappear, can a person still remain the same self?

Main characters:
- Ren Kanzaki: A graduate student who lost part of his memory. Calm and intelligent, but emotionally repressed.
- Mio Shiraishi: Ren’s labmate. She knows something about Ren’s memory loss but refuses to tell him the truth.
- Professor Kuon: The missing professor. He was researching quantum memory transfer.
- Associate Professor Kurosaki: The person currently managing the laboratory. He seems helpful, but some of his statements contradict the records.

Tone:
Intellectual, quiet, melancholic, slightly literary, and mysterious.

Language: en
Genre: sci-fi
Tone: melancholic
Target Length: short-story

I also exported the generated story as a polished PDF.

Sample output PDF: Download the generated novel PDF

This PDF was generated directly from NovelPilot’s finished reader view.

Code

GitHub repo: https://github.com/dorakingx/novelpilot

Tech stack:

Next.js App Router
TypeScript
Tailwind CSS
shadcn/ui-style components
Gemma 4 provider abstraction
OpenRouter-compatible live mode
Mock mode for the zero-setup judge demo
Browser-based polished PDF export
Vercel deployment

The app has two main modes:

Mode	Purpose
Demo / Mock Mode	Lets judges try the full workflow without an API key
Live Mode	Uses Gemma 4 through the configured provider

The provider layer is intentionally isolated in lib/gemma.ts, so the model provider can be changed without rewriting the app.

How I used Gemma 4

Gemma 4 is the reasoning engine behind the multi-agent writing pipeline.

NovelPilot uses Gemma 4 for:

structured story concept generation
character design
worldbuilding
plot planning
chapter outlining
prose drafting
style editing
foreshadowing tracking
continuity auditing
publisher copy generation

Each agent receives the accumulated story bible and previous structured outputs.

This means Gemma 4 is not just generating paragraphs. It acts as the structural memory and reasoning layer for the whole novel creation process.

The important design decision was to make every agent return structured data whenever possible. That allows the UI to render the model output as real product features: timelines, cards, reports, trackers, reader views, and exports.
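
The accumulated-context pattern can be sketched as a simple sequential loop. This is a hedged illustration, not NovelPilot's actual code: `Agent`, `StoryBible`, and `runPipeline` are hypothetical names, and each `run` function stands in for a real Gemma 4 call.

```typescript
// Structured outputs collected so far, keyed by agent name.
type StoryBible = Record<string, unknown>;

interface Agent {
  name: string;
  // Receives the user prompt plus everything earlier agents produced.
  run(prompt: string, bible: Readonly<StoryBible>): Promise<unknown>;
}

async function runPipeline(prompt: string, agents: Agent[]): Promise<StoryBible> {
  let bible: StoryBible = {};
  for (const agent of agents) {
    // Agents run strictly in order, so later ones can rely on earlier output.
    const output = await agent.run(prompt, bible);
    // Copy instead of mutating, so each agent sees a stable snapshot.
    bible = { ...bible, [agent.name]: output };
  }
  return bible;
}
```

Keeping each output under its own key is one way to let a UI render every agent's result as a separate card or report.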

Why I chose this Gemma 4 model

For the live version, NovelPilot is designed to use a Gemma 4 model through OpenRouter.

I chose this approach because the app needs strong reasoning and structured generation across multiple steps. The model must follow JSON schemas, preserve context from earlier agents, and reason about story structure, character consistency, and foreshadowing.

NovelPilot focuses especially on:

long-context creative reasoning
structured JSON generation
story memory across multiple steps
continuity checking
literary planning and drafting

Gemma 4 is a good fit because the project is not only asking the model to write a paragraph. It asks the model to behave as a coordinated writing room.

What makes NovelPilot different

Most AI writing tools generate text.

NovelPilot generates a writing process.

The user does not only receive a draft. They see how the story is built:

Prompt
  ↓
Premise
  ↓
Characters
  ↓
World
  ↓
Plot
  ↓
Chapter outline
  ↓
Draft
  ↓
Style edit
  ↓
Continuity audit
  ↓
Publisher package
  ↓
Reader view
  ↓
PDF export

This makes the output easier to inspect, revise, and trust.

Key feature: Foreshadowing Tracker

One of my favorite parts is the Foreshadowing Tracker.

Instead of only writing a draft, NovelPilot tracks story threads like this:

{
  "item": "The cracked silver watch",
  "introducedIn": "Chapter 1",
  "status": "unresolved",
  "suggestedPayoff": "It reveals the exact time the protagonist's memory was overwritten.",
  "payoffChapter": "Chapter 3",
  "emotionalPurpose": "Connects guilt, identity, and lost time."
}

This makes the output more useful for writers.

It also shows why a structured model workflow matters. The app is not only asking Gemma 4 to write prose. It is asking Gemma 4 to reason about narrative structure.
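
One way to consume items like this is to validate them before rendering. The sketch below is an assumption about how that could look, not code from the repo: the field names follow the sample item, while `ThreadStatus` (in particular the "resolved" value), `isForeshadowingItem`, and `openThreads` are hypothetical.

```typescript
// "unresolved" appears in the sample; "resolved" is an assumed counterpart.
type ThreadStatus = "unresolved" | "resolved";

interface ForeshadowingItem {
  item: string;
  introducedIn: string;
  status: ThreadStatus;
  suggestedPayoff: string;
  payoffChapter: string;
  emotionalPurpose: string;
}

const STRING_FIELDS = [
  "item",
  "introducedIn",
  "suggestedPayoff",
  "payoffChapter",
  "emotionalPurpose",
] as const;

// Runtime check for model output that has been parsed from JSON.
function isForeshadowingItem(value: unknown): value is ForeshadowingItem {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const status = record["status"];
  const statusOk = status === "unresolved" || status === "resolved";
  return statusOk && STRING_FIELDS.every((key) => typeof record[key] === "string");
}

// Threads that still need a payoff.
function openThreads(items: ForeshadowingItem[]): ForeshadowingItem[] {
  return items.filter((thread) => thread.status === "unresolved");
}
```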

Key feature: Continuity Detective

The Continuity Detective checks the generated story for structural problems.

It returns issues with:

category
severity
evidence
suggested fix

Example structure:

{
  "category": "foreshadowing",
  "severity": "high",
  "issue": "The experiment log is introduced as important but has no planned payoff.",
  "evidence": "The log appears in Chapter 1 and is referenced in the outline, but no chapter resolves its origin.",
  "suggestedFix": "Reveal in the final chapter that the log was written by an earlier version of the protagonist."
}

This was important to me because many AI writing tools can generate plausible fiction, but fewer tools help the user understand whether the story actually holds together.
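A report in this shape is easy to triage in code. The sketch below sorts issues so the most severe come first. Only the "high" severity level appears in the example above, so the "medium" and "low" levels, the ordering table, and the sample issues are assumptions.

```python
# Assumed severity scale; only "high" is confirmed by the example report.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def triage(issues: list[dict]) -> list[dict]:
    """Sort continuity issues from most to least severe; unknown levels go last."""
    return sorted(
        issues,
        key=lambda issue: SEVERITY_ORDER.get(issue["severity"], len(SEVERITY_ORDER)),
    )


# Hypothetical issues using the field names from the example structure.
issues = [
    {"category": "timeline", "severity": "low", "issue": "Season is never stated."},
    {"category": "foreshadowing", "severity": "high", "issue": "Log has no payoff."},
    {"category": "character", "severity": "medium", "issue": "Eye color changes."},
]

for entry in triage(issues):
    print(entry["severity"], "-", entry["issue"])
```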

Final reader experience

After all agents finish, NovelPilot automatically transitions into a Completed Novel Reader.

The user can read the finished story directly in the browser.

They can also go back to the Agent Workspace to inspect:

agent outputs
story bible
foreshadowing tracker
continuity report
publisher package

The final reader is not a one-way screen. Users can freely move between the production workflow and the finished novel.

PDF export

I also added polished PDF export.

Instead of relying on the browser’s default print layout, NovelPilot generates a designed A4-style manuscript PDF.

The PDF includes:

cover page
novel title
metadata
chapter title
formatted manuscript body
optional story notes

This makes the app feel closer to a complete writing product, not just a demo.

UI/UX design

I wanted the app to feel like an AI creative studio.

The flow has three stages:

1. Prompt Launcher

The first screen is focused.

The user only sees:

prompt input
language
genre
tone
target length
Generate Story
Run Judge Demo

This keeps the experience simple.

2. Agent Workspace

After generation starts, the app transitions into the agent workspace.

This screen shows:

active agent timeline
story bible
foreshadowing tracker
manuscript preview
continuity detective
export tools

3. Completed Novel Reader

When all agents finish, the app opens the final reading screen.

The user can read the story, download a PDF, or go back to review the agent outputs.

Technical architecture

The core architecture is simple:

app/page.tsx
  Main app phase control:
  launcher → workspace → reader

lib/useStoryProject.ts
  Client-side orchestration of the pipeline

app/api/generate-agent/route.ts
  Runs one agent per request

lib/gemma.ts
  Provider abstraction for Gemma 4 / OpenRouter / mock mode

lib/prompts.ts
  Prompt templates for each writing agent

lib/agents.ts
  Merges structured agent outputs into the Story Bible

lib/types.ts
  Shared TypeScript types

components/
  Prompt launcher, agent workspace, reader, trackers, reports, export panels

The app uses a state-first architecture because this is a hackathon project. I intentionally avoided authentication, databases, and user accounts so the core experience stays fast and easy to judge.

Agent workflow

Here is the high-level pipeline:

User Prompt
  ↓
Premise Architect
  ↓
Character Director
  ↓
World Builder
  ↓
Plot Strategist
  ↓
Chapter Architect
  ↓
Prose Writer
  ↓
Style Editor
  ↓
Continuity Detective
  ↓
Publisher Agent
  ↓
Completed Novel Reader + PDF Export

Each step builds on the previous one.

For example, the Character Director does not work from the original prompt alone. It receives the premise and theme created by the Premise Architect.

The Plot Strategist receives the concept, characters, and worldbuilding.

The Continuity Detective receives the story bible, chapter outline, draft, and previous reports.

This makes the app feel like an actual production pipeline rather than a single model call.
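The hand-off between agents can be sketched as a loop over the pipeline, with each step reading what earlier steps produced. This is a simplified Python illustration, not the app's TypeScript orchestration: the stub agents stand in for model calls, and passing the whole accumulated dictionary to every agent is an assumption (the real app selects specific inputs per agent, as described above).

```python
from typing import Callable

StoryBible = dict[str, str]
Agent = Callable[[StoryBible], str]

# Agent names taken from the pipeline diagram above.
PIPELINE = [
    "Premise Architect",
    "Character Director",
    "World Builder",
    "Plot Strategist",
    "Chapter Architect",
    "Prose Writer",
    "Style Editor",
    "Continuity Detective",
    "Publisher Agent",
]


def make_stub_agent(name: str) -> Agent:
    """Stand-in for a model call: reports which earlier outputs it could see."""
    def run(bible: StoryBible) -> str:
        seen = ", ".join(bible)  # keys of everything produced so far
        return f"{name} output (context: {seen})"
    return run


def run_pipeline(prompt: str) -> StoryBible:
    """Run every agent in order, merging each output into the shared story bible."""
    bible: StoryBible = {"User Prompt": prompt}
    for name in PIPELINE:
        agent = make_stub_agent(name)
        bible[name] = agent(bible)  # later agents see this result
    return bible


result = run_pipeline("A clockmaker forgets one hour every day.")
print(result["Character Director"])
```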

What I learned

The biggest lesson was that structured outputs are more powerful than plain prose outputs for creative tools.

A single prose response is hard to inspect.

But structured outputs can become:

timelines
cards
story bibles
trackers
reports
reader views
exports

I also learned that judge experience matters.

That is why I added Run Judge Demo. Reviewers can experience the full product without configuring an API key.

Another lesson was that a creative AI product should not end at “generation complete.” It should end with something the user can actually consume. That is why I added the final reader and PDF export.

Challenges

The biggest challenge was balancing autonomy and control.

If the app is too automatic, it feels like the user has no creative role.

If the app asks for too much input, it stops feeling agentic.

So I designed NovelPilot around this principle:

The AI agents do the heavy lifting, but the user can always review, regenerate, edit, read, and export.

Another challenge was making the final output feel complete. The Completed Novel Reader and PDF export helped turn the generated draft into something closer to a finished product.

What’s next

I would like to add:

full multi-chapter generation
persistent projects
local storage
streaming agent output
genre-specific prompt packs
vertical Japanese reading mode
richer PDF themes
user-editable story bible
side-by-side draft revision

Final thoughts

NovelPilot is my attempt to make AI fiction generation feel less like a chatbot and more like a writing room.

The core idea is simple:

One prompt. Nine agents. A complete story pipeline.

Gemma 4 is the reasoning engine behind the process. It plans, writes, edits, tracks foreshadowing, checks continuity, and packages the final story.

That is what makes NovelPilot more than a story generator.

It is an AI-powered novel production studio.

BoxAgnts is an Out-Of-The-Box Secure AI Agent ToolBox in a WASM SandBox

Guyoung Studio — Sat, 23 May 2026 03:49:57 +0000

BoxAgnts is an open-source AI Agent ToolBox built with Rust and designed to work out of the box. It runs custom tools and skills inside a WebAssembly sandbox, giving a runtime that balances security and flexibility, so it can serve as an efficient and trustworthy personal AI assistant across a wide range of tasks.

Core Architecture

🎯 AI Agent ToolBox

BoxAgnts is a fully-featured AI Agent toolkit providing:

Multi-model support: Compatible with major AI model providers including OpenAI, Anthropic, CodeX, Google, Deepseek, MiniMax, OpenCode
Tool system: Built-in file operations, web access, code execution, and many other tools
Skill system: Create specialized AI skills through simple configuration

🛡️ WebAssembly SandBox

Build a secure runtime environment using WebAssembly technology:

Isolated execution: All custom tools and skills run in a WASM sandbox
Security control: Fine-grained permission management and network access control
Cross-platform: Compile once, run everywhere
High performance: Based on Wasmtime runtime, near-native performance

✨ Out of the Box

Out-of-the-box experience:

Zero-configuration startup: Download and run, no complex configuration
Web interface: Built-in beautiful Dashboard for visual management of all features
Built-in extensions: Pre-configured with commonly used tools and skills, ready to use
Quick start: Simple API and intuitive workflow

Key Features

🤖 AI Chat and Agents

Chat with multiple AI models
Create and manage custom Agents
Save and manage chat history
Support for streaming responses

🔧 Tool Execution

File read/write and editing
Shell command execution
Web content scraping
Code review and analysis

📦 Skill System

Quickly create specialized skills
Skill combination and reuse
Built-in skills including code review, weather query, front-end component generation, etc.

⏰ Automatic Tasks Cron

Create and manage scheduled tasks
Support for standard Cron expressions
Task execution logs and status tracking
Flexible task configuration and triggering methods

🌐 Web Service

Custom website deployment
Static file serving
API endpoint management

Quick Start

Download Executable

Download the latest compressed package from the Releases page, extract and run.

Start Service

# Start service
boxagnts

# Specify workspace directory
boxagnts --workspace-dir /path/to/workspace

# Specify port
boxagnts --workspace-dir /path/to/workspace --port 30002

Suggestion: BoxAgnts supports multiple workspaces, each with its own configuration file and data directory. Rather than running in the default (current) directory, pass an explicit workspace with --workspace-dir.
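If you keep several workspaces, a small launcher script can make the pairing of directory and port explicit. The sketch below only builds the argument lists, using the --workspace-dir and --port flags from the help output. The function name, the example paths, and the idea of running one instance per workspace on its own port are assumptions.

```python
def boxagnts_command(workspace_dir: str, port: int = 30001) -> list[str]:
    """Build the argument list for one BoxAgnts instance."""
    return ["boxagnts", "--workspace-dir", workspace_dir, "--port", str(port)]


# Hypothetical workspace layout: one directory and one port per workspace.
workspaces = {
    "/srv/boxagnts/personal": 30001,
    "/srv/boxagnts/work": 30002,
}

for directory, port in workspaces.items():
    print(" ".join(boxagnts_command(directory, port)))
```

Each returned list can be passed to subprocess.Popen to start the corresponding instance.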

Command line arguments:

BoxAgnts is an open-source AI Agent ToolBox built with Rust.

Usage: boxagnts [OPTIONS]

Options:
      --port <PORT>          Port to run the web server on [default: 30001]
      --host <HOST>          Host to bind to (0.0.0.0 for all interfaces) [default: 127.0.0.1]
      --workspace-dir <DIR>  Set workspace dir, default current dir
      --app-dir <DIR>        Set app dir, default Boxagnts executable file dir
      --admin-user <USERNAME>  Set admin username
      --admin-pass <PASSWORD>  Set admin password
  -h, --help                 Print help
  -V, --version              Print version

Access Dashboard

Open your browser and visit http://127.0.0.1:30001

Configure Model

Add AI models and API Keys in the settings page

Project Structure and Source Code Compilation

This project is based on code from the claurst project.

Directory Structure

boxagnts/
├── boxagnts/                 # Rust backend core code
│   ├── api/                 # AI model API (multi-provider support)
│   ├── core/                # Core types, constants, and basic functions
│   ├── gateway/             # API gateway (includes Cron task scheduling)
│   ├── mcp/                 # MCP protocol implementation (optional)
│   ├── server/              # Web server and Dashboard interface
│   ├── tools/               # Tool system and built-in tools
│   ├── tools-manager/       # Tool manager
│   ├── query/               # Query orchestration
│   ├── wasm-sandbox/        # WebAssembly sandbox runtime
│   ├── wasm-tools/          # WASM tool wrappers
│   └── workspace/           # Workspace and configuration management
├── boxagnts-dashboard-web/  # Vue 3 frontend source code
│   ├── src/
│   │   ├── api/            # API interface wrappers
│   │   ├── components/     # Vue components
│   │   ├── composables/    # Composables
│   │   ├── stores/         # Pinia state management
│   │   ├── views/          # Page components
│   │   └── router/         # Router configuration
│   └── package.json        # Frontend dependencies
├── app/                     # Application resources
│   ├── dashboard-web/      # Compiled web interface static assets
│   └── extensions/         # Extensions (tools/skills)
└── Cargo.toml              # Rust workspace configuration

Backend Code Analysis

The backend is written in Rust on the Tokio async runtime. The main modules are:

api/: Wraps APIs from multiple AI providers including OpenAI, Anthropic, Google, Azure, Bedrock, providing unified interface calling and message format conversion
core/: Defines core data types, constants, error handling, and system prompts
gateway/: API gateway layer, handles HTTP requests, includes Cron task scheduling system (cron/ subdirectory), supporting scheduled task creation, management, and execution
server/: Web server, providing Dashboard REST API and WebSocket support
tools/: Tool system, implements execution framework for built-in tools and skills
wasm-sandbox/: WebAssembly sandbox based on Wasmtime, implementing secure code execution environment
workspace/: Workspace management, handles configuration, authentication, and history storage

Frontend Code Analysis

The frontend uses Vue 3 + TypeScript + Vuetify technology stack:

Uses Pinia for state management (stores/ directory)
Uses Vue Router for routing management (router/ directory)
Main pages: Chat, Agents, Cron tasks, Files, Skills, Tools, Sites, Settings, etc.
Supports Markdown rendering, code editor (CodeMirror), charts (Chart.js), etc.
Communicates with backend via REST API and WebSocket

Source Code Compilation Method

Environment Requirements

Rust 1.75+ (Install: https://www.rust-lang.org/tools/install)
Node.js 18+ (Install: https://nodejs.org/)
npm or pnpm

Compile Backend

# Enter project root directory
cd boxagnts-pub

# Compile Debug version
cargo build

# Compile Release version (optimize for size and performance)
cargo build --release

# Compiled executable is located at target/release/boxagnts

Compile Frontend

# Enter frontend directory
cd boxagnts-dashboard-web

# Install dependencies
npm install

# Start development mode (hot reload)
npm run dev

# Compile production version
npm run build

# Compiled static files will be output to app/dashboard-web/

Complete Build Process

# 1. Compile frontend
cd boxagnts-dashboard-web
npm install
npm run build

# 2. Compile backend
cd ..
cargo build --release

# 3. Run
./target/release/boxagnts

License

MIT

Repository: https://github.com/guyoung/boxagnts

Gemma 4 deep dive: why a 1.5 GB model scores 37.5% on competition mathematics, how the MoE routing actually works, and which model fits your hardware. Full breakdown inside.

Prakhar Shukla — Sat, 23 May 2026 03:49:50 +0000

Gemma 4 Challenge: Write about Gemma 4 Submission

Prakhar Shukla

May 17

Gemma 4: From Raspberry Pi to Research Workstation — One Architecture, No Quality Compromise

#devchallenge #gemmachallenge #gemma

BeeLlama v0.2.0: 164 tok/s on a 27B model, one RTX 3090

Thousand Miles AI — Sat, 23 May 2026 03:39:57 +0000

Speculative decoding has been the rumored 3-5x throughput multiplier for about 18 months. The numbers have stayed muddled because most of the public benchmarks ride on H100s with batch sizes greater than one, where the speedup gets folded into pricing tables nobody outside a serving team reads. What teams running a single workstation actually measure has been harder to find.

The BeeLlama v0.2.0 release pins down a specific point on that map. The setup is small enough to reproduce in a weekend: one RTX 3090, 32 GB of DDR4, a Ryzen 7 5700X3D, and llama.cpp build b9275 as the baseline. The two target models are Qwen 3.6 27B at Q5_K_S and Gemma 4 31B at the same quantization. The drafter for each is a Q4_K_M DFlash variant. The benchmark prompts and configs are pinned in the README and the GGUFs are on Hugging Face under Apache 2.0.

The Qwen row is the easier of the two to read. Baseline llama.cpp turns out 37.2 tokens per second on a ~1K-token completion task. BeeLlama's DFlash path runs the same prompt at a 163.9 tok/s median, with a best run of 181.9. That is a 4.40x median multiplier on a card that costs around $700 used. The Gemma 4 31B row reports an even larger ratio: 36.1 tok/s baseline against 177.8 tok/s median, a 4.93x multiplier on a model that is 15% larger than the Qwen. The pattern — bigger model, slightly more speedup — is consistent with what speculative decoding theory predicts, because the per-token cost is dominated by the target model's verification step and the drafter is much cheaper to run in either case.
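The multipliers are simply the ratio of the two throughput figures, which is easy to check. The sketch below recomputes them from the rounded numbers quoted above; the dictionary and function names are illustrative. Since the inputs are rounded, the results can differ from the reported multipliers in the last digit.

```python
# (baseline tok/s, DFlash median tok/s) as quoted in the paragraph above.
BENCHMARKS = {
    "Qwen 3.6 27B": (37.2, 163.9),
    "Gemma 4 31B": (36.1, 177.8),
}


def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Throughput multiplier of the accelerated path over the baseline."""
    return accelerated_tps / baseline_tps


for model, (baseline, accelerated) in BENCHMARKS.items():
    print(f"{model}: {speedup(baseline, accelerated):.2f}x")
```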

What the speedup actually costs is hidden in the acceptance numbers, and this is where the BeeLlama table earns its space. The Qwen row reports "67.7% / 89.2%" for the DFlash run. Read those as the two diagnostic rates that matter for speculative decoding economics: the fraction of drafter-proposed tokens that the target validates, and the fraction of drafted sequences that the target accepts without falling back. When the first number drops below about 50%, the drafter's compute starts costing more than it saves. When the second number drops below about 60%, the per-sequence overhead of the verifier path begins to dominate. Both Qwen and Gemma sit comfortably above those thresholds in BeeLlama's report, which is why the median speedups are close to the best-case numbers rather than spread across an order of magnitude.
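To see why the per-token acceptance rate matters so much, it helps to look at the idealized estimate from the speculative decoding literature. If each drafted token is accepted independently with probability α and the drafter proposes γ tokens per step, the expected number of tokens produced per target-model verification is (1 − α^(γ+1)) / (1 − α). This is not BeeLlama's own accounting, the independence assumption is a simplification, and the draft length of 8 below is a hypothetical value chosen only for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per verification pass under i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    return (1 - acceptance_rate ** (draft_length + 1)) / (1 - acceptance_rate)


DRAFT_LENGTH = 8  # hypothetical; the article does not state the draft length

for rate in (0.3, 0.5, 0.677, 0.9):
    tokens = expected_tokens_per_step(rate, DRAFT_LENGTH)
    print(f"acceptance {rate:.3f}: ~{tokens:.2f} tokens per verification")
```

The curve is steep: each additional point of acceptance is worth more than the last, so a drafter that matches the target well pays off disproportionately.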

Prefill stays near the llama.cpp baseline in every row of the table. That is the expected shape: prefill is already parallelizable across the prompt's tokens, so the speculative path has nothing to add. The 4-5x speedup is a decode-phase number. Practitioners who serve a workload of short prompts and long generations — agentic loops, chat completions, code suggestion streams — will see something near the headline. Workloads dominated by long prompts and short answers, like RAG with a 32K-token context and a one-sentence reply, will see almost no benefit because most of the wall-clock time is prefill.
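A rough wall-clock model shows how quickly the headline number shrinks when prefill dominates. The sketch below uses the Qwen decode rates quoted earlier. The prefill rate of 1,000 tok/s and both workload shapes are hypothetical placeholders, so the output illustrates the trend rather than measured results.

```python
def wall_clock_seconds(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, decode_tps: float) -> float:
    """Total request time: prefill the prompt, then decode the output."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def end_to_end_speedup(prompt_tokens: int, output_tokens: int, prefill_tps: float,
                       base_decode_tps: float, fast_decode_tps: float) -> float:
    """Speedup of the whole request when only the decode phase gets faster."""
    baseline = wall_clock_seconds(prompt_tokens, output_tokens, prefill_tps, base_decode_tps)
    accelerated = wall_clock_seconds(prompt_tokens, output_tokens, prefill_tps, fast_decode_tps)
    return baseline / accelerated


PREFILL_TPS = 1000.0            # hypothetical prefill rate, same for both paths
BASE_TPS, FAST_TPS = 37.2, 163.9  # Qwen decode rates quoted in the article

chat = end_to_end_speedup(200, 1000, PREFILL_TPS, BASE_TPS, FAST_TPS)
rag = end_to_end_speedup(32_000, 30, PREFILL_TPS, BASE_TPS, FAST_TPS)
print(f"short prompt, long answer: {chat:.2f}x")
print(f"32K prompt, short answer:  {rag:.2f}x")
```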

A few caveats sit underneath the table and should travel with it. The reasoning-on configuration is excluded from the chat benchmark in the README, and the changelog notes a stricter fallback to full logits when "grammar, sampler state, or reasoning requires it." Reasoning models stream tokens with more entropy at each step, which reduces drafter acceptance rates and pushes the speedup back toward 2-3x. The 3090's 24 GB of VRAM is also doing real work in these numbers: holding the Q5_K_S target, the Q4_K_M drafter, and the K/V cache for both at the same time. A 12 GB card running the same models with the same quantizations would either spill to system memory or refuse to load, and the latency in either case would erase the win.

The lesson is small and useful. Speculative decoding is not a free 5x: it is a 5x conditional on the drafter being trained well enough that its top-1 predictions match the target's most of the time, and conditional on the workload being decode-heavy. BeeLlama v0.2.0 ships both halves: the DFlash drafters trained against current open weights, and the verifier path tightened enough that the published acceptance rates hold. For a learner who has read the original speculative decoding paper but never seen the technique applied to a model they could run themselves, the README plus the GGUFs are a complete worked example. Clone the repo, pull either GGUF pair, and the throughput numbers should reproduce on comparable hardware.

Repo and quickstart guides: https://github.com/Anbeeld/beellama.cpp

🐧 Resize VM Disk Ubuntu LVM — Common Mistakes and How to Fix Them

Python-T Point — Sat, 23 May 2026 03:37:33 +0000

Two virtual machines, running the same Ubuntu version and application stack, hit disk exhaustion. One was back online with expanded storage in under five minutes. The other remained down for hours, requiring a full rebuild. The difference wasn’t hardware, cloud provider, or administrator skill. It came down to one architectural decision at setup: LVM versus raw partitions. When you need to resize a VM disk on Ubuntu in production, Logical Volume Manager (LVM) turns what could be an outage into a routine operational task.

📑 Table of Contents

🧠 LVM — Why Flexibility Matters
🪛 Hypervisor — Extend the Virtual Disk
🔍 Mechanism: How the Kernel Sees Resized Disks
⚠️ Gotcha: Partition Table Limits
🔧 LVM — Extend the Logical Volume
⚙️ Mechanism: Logical Extents and Metadata
✅ Verification: Check LV Size
🗂 Filesystem — Grow the Root Partition
🟩 Final Thoughts
❓ Frequently Asked Questions
Can I resize the disk without LVM?
Do I need to unmount the filesystem to resize it?
What if I have multiple logical volumes and want to allocate space selectively?
📚 References & Further Reading

🧠 LVM — Why Flexibility Matters

LVM abstracts physical storage into a layered model: disks become Physical Volumes (PVs), which are grouped into Volume Groups (VGs), and from those, Logical Volumes (LVs) are carved out as usable block devices. This abstraction enables online growth: a volume can be extended without unmounting its filesystem or repartitioning the disk. Shrinking is possible too, but it generally requires taking the filesystem offline. When the underlying virtual disk is expanded, LVM integrates the additional space by mapping the new Physical Extents (PEs) to Logical Extents (LEs). The kernel’s device-mapper layer handles I/O translation between the LV and the backing physical storage. Then, a filesystem resize updates internal metadata to use the larger block device. Without LVM, resizing requires adjusting partition boundaries with fdisk or parted, often demanding downtime and introducing risk if the root partition is involved. With LVM, the process is non-disruptive. The full stack (hypervisor → virtual disk → PV → VG → LV → filesystem) enables safe, incremental growth, and each layer must be updated in sequence.

$ sudo pvs
  PV         VG        Fmt  Attr PSize  PFree
  /dev/sda5  ubuntu-vg lvm2 a--  29.51g    0
$ sudo vgs
  VG        #PV #LV #SN Attr   VSize  VFree
  ubuntu-vg   1   2   0 wz--n- 29.51g    0
$ sudo lvs
  LV     VG        Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   ubuntu-vg -wi-ao---- 27.51g
  swap_1 ubuntu-vg -wi-ao----  2.00g

These commands confirm a single PV feeding a VG with two LVs. Resizing the root filesystem starts after expanding the virtual disk.

🪛 Hypervisor — Extend the Virtual Disk

The first step in resizing a VM disk on Ubuntu with LVM is increasing the virtual disk size at the hypervisor or cloud-provider level, whether on VMware, KVM/QEMU, VirtualBox, AWS EC2, or GCP. On a local hypervisor this modifies the disk image (e.g., .qcow2, .vmdk) to report a larger capacity. The guest OS detects the change via a block device rescan, exposing unallocated space at the end of the disk. For KVM/QEMU with libvirt, use qemu-img resize while the VM is shut down, or virsh blockresize while it is running:

$ virsh domblklist ubuntu-vm
Target Source
vda /var/lib/libvirt/images/ubuntu-vm.qcow2
# Offline (VM shut down): grow the image file directly
$ qemu-img resize /var/lib/libvirt/images/ubuntu-vm.qcow2 +10G
Image resized.
# Online (VM running): resize through libvirt instead of touching the image file
$ virsh blockresize ubuntu-vm vda --size 40G
Block device 'vda' is resized to 40 GiB.

Inside the guest, trigger a rescan:

$ echo 1 | sudo tee /sys/block/vda/device/rescan
1
$ lsblk | grep vda
vda 252:0 0 40G 0 disk
├─vda1 252:1 0 1G 0 part /boot
└─vda2 252:2 0 29.5G 0 part
  ├─ubuntu--vg-root 251:0 0 27.5G 0 lvm /
  └─ubuntu--vg-swap_1 251:1 0 2G 0 lvm [SWAP]

The disk (vda) is now 40G, but the LVM structures still use only ~29.5G. The ~10G of new space is unallocated.

🔍 Mechanism: How the Kernel Sees Resized Disks

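Writing 1 to a disk's device/rescan file in sysfs triggers the kernel to issue a READ CAPACITY SCSI command to the virtual device. The hypervisor returns the updated size, and the kernel adjusts the block device’s bd_inode->i_size. This propagates through sysfs and is reflected in lsblk. Note that the rescan attribute is provided by the SCSI layer, so it exists for SCSI and SATA disks (sdX). A virtio-blk disk such as vda typically has no such file and instead picks up the new size automatically from a configuration-change notification when the hypervisor resizes it. Either way, modern kernels support online capacity resizing and no reboot is required.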

⚠️ Gotcha: Partition Table Limits

MS-DOS partition tables cannot address disks larger than 2TB. For disks approaching or exceeding that size, use GPT. Also, ensure the extended partition (vda2) covers the full disk. If not, it must be resized. With LVM typically layered on a single large partition, run growpart to extend it:

$ sudo growpart /dev/vda 2
CHANGED: partition=2 start=2099200 old: size=62496768 end=64595968 new: size=83875807 end=85975007

This expands partition 2 to consume all available space, allowing pvresize to utilize the full disk.
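The 2TB limit mentioned in this gotcha follows from the MS-DOS (MBR) partition table storing sector addresses as 32-bit numbers. With the traditional 512-byte sector, the arithmetic looks like this. The helper name needs_gpt is hypothetical, and the check is a simplification that only compares total disk size against the addressable range.

```python
SECTOR_BYTES = 512       # traditional logical sector size
MAX_SECTORS = 2 ** 32    # MBR stores sector addresses in 32-bit fields

MBR_LIMIT_BYTES = SECTOR_BYTES * MAX_SECTORS


def needs_gpt(disk_bytes: int) -> bool:
    """True when a disk is too large for an MBR table to address fully."""
    return disk_bytes > MBR_LIMIT_BYTES


print(f"MBR limit: {MBR_LIMIT_BYTES / 2**40:.0f} TiB")
print("40 GiB disk needs GPT:", needs_gpt(40 * 2**30))
print("3 TiB disk needs GPT:", needs_gpt(3 * 2**40))
```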

🔧 LVM — Extend the Logical Volume

Now that the physical disk and partition are larger, update the LVM metadata to recognize the new capacity. Resize the physical volume:

$ sudo pvresize /dev/vda2
Physical volume "/dev/vda2" changed
1 physical volume(s) resized or updated / 0 physical volume(s) not resized
$ sudo vgs
  VG        #PV #LV #SN Attr   VSize  VFree
  ubuntu-vg   1   2   0 wz--n- 39.51g 10.00g

pvresize scans the backing device and updates the PV's usable size. The volume group now has 10GB of free space. Extend the logical volume to use all available extents:

$ sudo lvextend -l +100%FREE /dev/ubuntu-vg/root
  Size of logical volume ubuntu-vg/root changed from 27.51 GiB (7042 extents) to 37.51 GiB (9602 extents).
  Logical volume ubuntu-vg/root successfully resized.

The -l +100%FREE flag allocates all unassigned extents in the VG. Using extents instead of byte sizes ensures precision, as LVM manages space in fixed 4MB units by default.
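The extent counts in the lvextend output can be checked against the reported sizes with a little arithmetic. The sketch below assumes the default 4 MiB extent size mentioned above; the function name is illustrative.

```python
EXTENT_MIB = 4  # default LVM extent size, as noted above


def extents_to_gib(extents: int, extent_mib: int = EXTENT_MIB) -> float:
    """Convert a count of LVM extents to GiB."""
    return extents * extent_mib / 1024


before = extents_to_gib(7042)   # extent count before lvextend
after = extents_to_gib(9602)    # extent count after lvextend

print(f"before: {before:.2f} GiB")
print(f"after:  {after:.2f} GiB")
print(f"added:  {after - before:.2f} GiB ({9602 - 7042} extents)")
```

The 2,560 added extents account for exactly the 10 GiB that was free in the volume group.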

⚙️ Mechanism: Logical Extents and Metadata

Each PV is divided into Physical Extents (PEs), usually 4MB. When extending an LV, LVM assigns free PEs to Logical Extents (LEs) and updates its metadata, which is stored as plain text in the metadata area on each PV, with a backup copy kept in /etc/lvm/backup/. The device-mapper driver maps LEs to PEs at runtime, transparently to the filesystem.

✅ Verification: Check LV Size

$ sudo lvs
  LV     VG        Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  root   ubuntu-vg -wi-ao---- 37.51g
  swap_1 ubuntu-vg -wi-ao----  2.00g

The LV is now 37.51G. But the filesystem still operates within the old boundary.

🗂 Filesystem — Grow the Root Partition

The final step is resizing the filesystem to fill the expanded block device. For ext4, which Ubuntu uses by default:

$ sudo resize2fs /dev/ubuntu-vg/root
resize2fs 1.46.5 (30-Dec-)
Filesystem at /dev/ubuntu-vg/root is mounted on /; on-line resizing required
old_desc_blocks = 4, new_desc_blocks = 5
The filesystem on /dev/ubuntu-vg/root is now 9833408 (4k) blocks long.

resize2fs performs several operations:

Expands block group descriptors to cover new regions
Allocates additional inode tables
Updates the superblock with the new block count

For XFS:

$ sudo xfs_growfs /
meta-data=/dev/mapper/ubuntu--vg-root isize=512    agcount=4, agsize=1802752 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=7211008, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=3521, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 7211008 to 9833408

The xfs_growfs command takes the mount point as its argument and grows the data section, adding allocation groups as needed, while the filesystem stays mounted. Verify the result:

$ df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-root 37G 12G 24G 35% /

The system now uses the full 37G. The resize is complete.
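Because each layer has to be updated in order, it can help to write the sequence down once as data. The sketch below assembles the four guest-side commands used in this walkthrough into an ordered plan, without executing anything. The function name, its parameters, and the simple disk-plus-number partition naming (fine for /dev/vda2 or /dev/sda5, not for NVMe-style names) are assumptions.

```python
def resize_plan(disk: str, partition_number: int, vg: str, lv: str,
                filesystem: str = "ext4", mount_point: str = "/") -> list[list[str]]:
    """Return the guest-side resize commands in the order they must run."""
    partition = f"{disk}{partition_number}"   # e.g. /dev/vda + 2 -> /dev/vda2
    lv_path = f"/dev/{vg}/{lv}"
    grow_filesystem = {
        "ext4": ["resize2fs", lv_path],       # ext4 is grown via the block device
        "xfs": ["xfs_growfs", mount_point],   # XFS is grown via the mount point
    }
    if filesystem not in grow_filesystem:
        raise ValueError(f"unsupported filesystem: {filesystem}")
    return [
        ["growpart", disk, str(partition_number)],      # 1. grow the partition
        ["pvresize", partition],                        # 2. grow the physical volume
        ["lvextend", "-l", "+100%FREE", lv_path],       # 3. grow the logical volume
        grow_filesystem[filesystem],                    # 4. grow the filesystem
    ]


for step in resize_plan("/dev/vda", 2, "ubuntu-vg", "root"):
    print("sudo " + " ".join(step))
```

Printing the plan first and running it by hand keeps a human in the loop, which seems prudent for anything that touches partition tables.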

You don’t need downtime to grow a disk—if you built it right the first time.

🟩 Final Thoughts

The ability to resize a VM disk online on Ubuntu with LVM isn’t a convenience; it’s a resilience feature. Disk exhaustion will happen. The presence of LVM determines whether the response is routine or critical. LVM introduces minimal overhead and maximum flexibility. It doesn’t replace monitoring, but it removes urgency from capacity alerts. Resizing can occur during normal hours, with no coordination, no outage. But this flexibility must be designed in. Retrofitting LVM onto a system without it requires downtime, data migration, and complex partitioning changes. So always deploy production Ubuntu VMs with LVM enabled, even for small instances. You’re not planning for current size. You’re protecting against future growth.

❓ Frequently Asked Questions

Can I resize the disk without LVM?

Yes, but it’s significantly more complex and risky. You’d need to use parted or fdisk to delete and recreate the partition with a larger size, then resize the filesystem. This usually requires unmounting the partition or booting from external media, leading to downtime. LVM avoids this by design.

Do I need to unmount the filesystem to resize it?

For ext2/3/4 and XFS, you can grow the filesystem while mounted. This is called online resizing. Shrinking is different: an ext filesystem must be unmounted before it can be shrunk, and XFS cannot be shrunk at all. Always ensure you have backups before any resize operation.

What if I have multiple logical volumes and want to allocate space selectively?

You can use lvextend with specific sizes instead of +100%FREE. For example: lvextend -L +5G /dev/ubuntu-vg/var grows only the var volume by 5GB, leaving free space for other LVs. Use vgs to monitor available space.

📚 References & Further Reading

Ubuntu Server Guide — storage configuration including LVM and filesystems: ubuntu.com
Linux man pages for key tools — definitive syntax for pvresize, lvextend, resize2fs: man7.org
