DEV Community

I'm a beginner who let AI do too much of the thinking — and now I want to actually learn

edris ed — Mon, 18 May 2026 13:35:21 +0000

I'll be honest with you: my programming knowledge is still pretty limited. I'm not someone who's been coding for years and just got lazy. I'm someone who started learning, discovered AI tools early on, and — without fully realizing it — let them take over the parts that were supposed to make me grow.

Every time I hit a wall, I asked the AI. Every time I had a vague idea, I asked the AI to turn it into code. It worked, in a way. Things got built. But I didn't grow. The AI was the director, and I was just copy-pasting.

Now I'm at a point where I want to change that. My goal isn't to stop using AI — I think it's an incredible tool. My goal is to reach a level where I'm the one directing it, not the other way around. Where I understand what I'm asking for, can evaluate what it gives me, and catch it when it's wrong.

If you've been through something like this — whether as a beginner or even later in your career — I'd really love to hear your story. What helped you build real understanding? Were there resources, habits, or mindset shifts that made the difference?

And if you're also in this situation right now, let's talk. Maybe we can figure it out together.

I built a PDF parser that actually preserves table structure for RAG — here's why it matters

Gunjan Tailor — Mon, 18 May 2026 13:35:15 +0000

Every RAG tutorial shows the same pipeline:

PDF → extract text → split every 512 tokens → embed → store → query

It works fine for blog posts. It completely falls apart for anything structured.

The problem nobody talks about

Take a financial report. It has a revenue table:

Region	Q2 Revenue	Q3 Revenue	Change
Europe	38.1%	45.2%	+7.1pp
Asia	29.3%	41.7%	+12.4pp
Americas	n/a	52.1%	—

After blind chunking, your LLM receives:

"45.2%  Q3  Europe  38.1%  Q2  Europe  41.7%  Q3  Asia   29.3%"

Numbers with no column headers, no caption, no context. Ask it "which region grew the most?" and you get an approximate guess — not an answer.
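
To make the failure concrete, here is a minimal sketch of blind chunking applied to the table above. It assumes character-based chunks of 40 characters instead of 512 tokens, purely to keep the output small; `blind_chunks` is a hypothetical helper, not part of any library.

```python
# The revenue table from the example, flattened to tab-separated text
# the way a naive extractor might emit it.
TABLE_TEXT = (
    "Region\tQ2 Revenue\tQ3 Revenue\tChange\n"
    "Europe\t38.1%\t45.2%\t+7.1pp\n"
    "Asia\t29.3%\t41.7%\t+12.4pp\n"
    "Americas\tn/a\t52.1%\t\u2014\n"
)


def blind_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size pieces with no regard for structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = blind_chunks(TABLE_TEXT, size=40)

# Only the chunk that happens to contain the header row knows what the
# columns mean; the others carry values without labels.
chunks_with_header = [c for c in chunks if "Q3 Revenue" in c]

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

Each chunk is embedded and retrieved on its own, so a query that lands on the second or third chunk gets the numbers without the header that gives them meaning.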

The same problem happens with:

Legal contracts (clause split mid-sentence)
API docs (code example separated from its description)
Research papers (figure caption disconnected from its analysis)

This isn't a retrieval problem. It's an ingestion problem.

What I built

I spent the last few months building DOCNEST — a document normalization engine that reads structure before touching content.

Instead of chunks, every heading becomes a navigable §section. Every table is preserved as structured JSON. Every section gets a one-sentence summary and a keyword index — computed once at ingest.

The output is a .udf file (Unified Document Format) — a self-contained portable knowledge base.
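
As a rough illustration of why a structured table helps, the sketch below stores the revenue table as JSON-compatible data and answers "which region grew the most?" without any model. The field names (`columns`, `rows`) and both helpers are assumptions made for this example; they are not the actual .udf schema.

```python
import json
from typing import Optional

# Hypothetical structured form of the revenue table (not the real .udf layout).
table = {
    "columns": ["Region", "Q2 Revenue", "Q3 Revenue", "Change"],
    "rows": [
        ["Europe", "38.1%", "45.2%", "+7.1pp"],
        ["Asia", "29.3%", "41.7%", "+12.4pp"],
        ["Americas", "n/a", "52.1%", "\u2014"],
    ],
}


def parse_pp(cell: str) -> Optional[float]:
    """Parse a change such as '+12.4pp' into 12.4; return None if not numeric."""
    if not cell.endswith("pp"):
        return None
    try:
        return float(cell[:-2])
    except ValueError:
        return None


def largest_change(tbl: dict) -> tuple[str, float]:
    """Return (region, change in percentage points) for the biggest change."""
    region_idx = tbl["columns"].index("Region")
    change_idx = tbl["columns"].index("Change")
    parsed = [(row[region_idx], parse_pp(row[change_idx])) for row in tbl["rows"]]
    numeric = [(region, change) for region, change in parsed if change is not None]
    return max(numeric, key=lambda pair: pair[1])


print(largest_change(table))
print(json.dumps(table["columns"]))
```

Because the header travels with every row, the Americas row with no change value is skipped explicitly instead of being guessed at.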

from docnest.parsers.pymupdf_pdf import PyMuPDFParser
from docnest.normalizer import SectionNormaliser
from docnest.writer import UDFWriter
from docnest.reader import UDFIndex

# Parse → normalise → save (no API key needed)
raw = PyMuPDFParser().parse("report.pdf")
doc = SectionNormaliser().normalise(raw)
UDFWriter().write(doc, "report.udf")

# Query
idx = UDFIndex.load("report.udf")
result = idx.query(
    "Which region had the highest Q3 growth?",
    llm_provider="groq",
    llm_model="llama-3.3-70b-versatile",
    llm_api_key="gsk_...",  # free at console.groq.com
)
print(result.answer)      # "Asia grew the most at +12.4pp"
print(result.layer_used)  # 1 — answered from index, 0 LLM tokens used

The five-layer query engine

The part I'm most proud of is how queries are resolved:

Layer	Mechanism	Tokens	When it fires
0	Pre-computed (summary, key numbers)	0	Direct match
1	BM25 + cosine → navigate to §section	0	Strong keyword match
2	Section-scoped LLM	~300	Needs interpretation
3	Multi-section synthesis	~900	Cross-section reasoning
4	Full document fallback	~4000	Nothing else worked

Layers 0 and 1 answer roughly 70% of real-world questions with zero LLM tokens. You pay for compute only when the question genuinely requires it.
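
The escalation idea can be sketched as a cascade: try the cheapest layer first and move on only when it returns nothing. Everything below is a simplified assumption for illustration (three layers instead of five, toy lookups, a placeholder instead of a model call) and is not DOCNEST's implementation.

```python
from typing import Callable, Optional

# Each layer takes a question and returns an answer, or None to escalate.
Layer = Callable[[str], Optional[str]]

# Hypothetical stand-ins for pre-computed answers and a keyword index.
PRECOMPUTED = {"what is this document about?": "A quarterly revenue report."}
SECTION_KEYWORDS = {"revenue": "See the section 'Revenue by region'."}


def layer_precomputed(question: str) -> Optional[str]:
    return PRECOMPUTED.get(question.strip().lower())


def layer_keyword(question: str) -> Optional[str]:
    words = set(question.lower().replace("?", "").split())
    for keyword, pointer in SECTION_KEYWORDS.items():
        if keyword in words:
            return pointer
    return None


def layer_llm(question: str) -> Optional[str]:
    # Placeholder for a model call; a real layer would spend tokens here.
    return f"(LLM answer for: {question})"


def resolve(question: str, layers: list[Layer]) -> tuple[Optional[str], Optional[int]]:
    """Try layers in order; return (answer, index of the layer that answered)."""
    for index, layer in enumerate(layers):
        answer = layer(question)
        if answer is not None:
            return answer, index
    return None, None


layers = [layer_precomputed, layer_keyword, layer_llm]
print(resolve("Which region had the highest revenue?", layers))
```

The ordering is the design choice: cheap layers run first, and a layer signals "not me" by returning `None` so the next one gets a turn.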

How it handles large PDFs

Docling (the ML-quality PDF parser) loads full models into RAM. A 600-page PDF would exhaust memory on most machines.

DOCNEST solves this with automatic page chunking:

from docnest.parsers.pdf import DoclingPDFParser

# Auto-chunks PDFs > 30 pages — peak RAM = one chunk, not the whole file
raw = DoclingPDFParser().parse("600-page-annual-report.pdf")

# Or tune explicitly
raw = DoclingPDFParser(chunk_pages=10).parse("report.pdf")  # low RAM
raw = DoclingPDFParser(chunk_pages=50).parse("report.pdf")  # high RAM

PyMuPDF splits the PDF into N-page temp files. Docling processes each chunk at full ML quality. Sections are merged. The output is identical to processing the whole file at once.
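
The splitting step boils down to computing page ranges. A minimal sketch, assuming zero-based inclusive ranges; `page_ranges` is a hypothetical helper and the post does not say how DOCNEST computes its chunks internally.

```python
def page_ranges(total_pages: int, chunk_pages: int) -> list[tuple[int, int]]:
    """Return zero-based, inclusive (first, last) page ranges covering the file."""
    if total_pages < 0 or chunk_pages <= 0:
        raise ValueError("total_pages must be >= 0 and chunk_pages must be > 0")
    return [
        (start, min(start + chunk_pages, total_pages) - 1)
        for start in range(0, total_pages, chunk_pages)
    ]


ranges = page_ranges(600, 30)
print(len(ranges), ranges[0], ranges[-1])
```

PyMuPDF's `Document.insert_pdf` accepts `from_page` and `to_page`, so ranges like these could drive the temp-file split, with only one chunk resident in memory at a time.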

Accuracy on a real document

I ran 25 questions against a 500-page open-source nutrition textbook using PyMuPDF + Groq's free tier:

Basic facts (calories, macronutrients): 5/5
Macronutrient detail (fiber, glycemic index): 5/5
Micronutrients (vitamins, minerals): 4/5
Hard synthesis (BMR, omega-3, antioxidants): 5/5
Edge cases (hallucination, tables, out-of-scope): 5/5

24/25 (96%) — the one failure was a table-only page where the text parser extracted no content (switch to DoclingPDFParser for those).

Try it

pip install docnest-ai pymupdf

GitHub: https://github.com/tailorgunjan93/docnest
PyPI: https://pypi.org/project/docnest-ai

It supports PDF (Docling + PyMuPDF), DOCX, XLSX, HTML, and Markdown. LLM providers: Groq, OpenAI, Ollama, Anthropic, Google, Mistral and more. Vector backends: numpy (default), FAISS, ChromaDB.

I'm building this in the open. If you've hit this table-structure problem in your own RAG pipeline, I'd genuinely like to hear what broke.

Your Agent Is Becoming the Crown Jewel: SOC, Reviews, and Governance for the Dynamic-Consent Era

Anton Staykov — Mon, 18 May 2026 13:34:00 +0000

The previous article in this series argued that the combination of incremental and dynamic user consent and Microsoft Entra Agent ID gives interactive AI agents something genuinely new: the ability to earn their access in the wild, scope by scope, prompted by the humans and other agents they work alongside. Aria, the example agent, started with two delegated permissions and grew into a productive contributor across SharePoint, ServiceNow, and the Finance API in roughly a quarter — without its creators pre-declaring any of it.

That was the optimistic half. This is the other half.

By the end of that quarter, Aria is — by any reasonable measurement — the most over-privileged identity in the tenant. No one noticed, because there was nothing to notice. Every grant was legitimate, contextual, and user-approved. The risk did not arrive in a single bad decision. It arrived as a hundred reasonable yeses.

A different kind of over-permissioning

Classic over-permissioning is an event. Someone hands a service account Directory.ReadWrite.All because the deployment was due Friday, an auditor flags it months later, a ticket is opened. Slow, but the control loop exists, and it is built around discrete moments of poor judgment.

Permission accumulation through dynamic consent is structurally different. There is no single bad decision to find. The permission graph grows monotonically — one narrow, well-justified scope at a time — because the mechanism that makes the agent useful is the same mechanism that makes it dangerous. Nothing in the platform prunes that graph by default, and nothing in most organizations does either: access-review tooling was designed around human role changes, not around agents whose role is to absorb new capabilities.

Why agents become the target

A compromised agent identity is qualitatively worse than a compromised user account, and the reasons are worth stating plainly.

A user holds permissions scattered across teams, sick days, role changes, and eventual departures. Their access constantly churns, and the blast radius of any single compromise is naturally bounded by the messiness of human work.

An agent does none of that. It persists. It centralizes. Every scope a hundred different users granted to it is reachable through one set of tokens, one blueprint, one set of credentials issued by that blueprint. Add the realistic threat surface of a modern agent — token theft, blueprint compromise, prompt injection used as a lateral-movement primitive — and the picture becomes uncomfortable: the most attractive principal in the tenant is also the one whose authority grew quietly enough to escape notice.

What the SOC must change

Most security operations centers treat sign-in logs as the primary identity signal. For agents under dynamic consent, that is no longer sufficient. The consent log itself becomes a first-class detection surface.

Three signal families deserve attention:

Scope-acquisition velocity. A productive agent acquires new scopes in bursts that follow human work. An agent that suddenly requests broad scopes — especially ones approaching admin-consent thresholds — outside its normal pattern is worth waking someone up for.
Grant-versus-use gap. Scopes that were granted but are never exercised are dead weight at best, pre-positioned capability for an attacker at worst. Track them, and feed the gap into automated revocation.
Introduction chains. When agent A pulls agent B into a workflow and B requests new scopes as a result, that chain is part of the audit story. SOC tooling needs to render it as a graph, not as isolated events.

None of these are exotic. They are sign-in analytics one layer up the stack.
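
As a sketch of how small the first two signals are in code, the example below computes a grant-versus-use gap and a simple acquisition count from consent data. The record shapes, the 30-day grace period, and the 7-day window are assumptions for illustration; the timestamps are made up and the scope names serve only as labels.

```python
from datetime import datetime, timedelta

NOW = datetime(2026, 5, 18)

# Illustrative data: when each scope was granted, and when it was last used.
granted_at = {
    "Sites.Read.All": NOW - timedelta(days=80),
    "Files.ReadWrite.All": NOW - timedelta(days=45),
    "Mail.Send": NOW - timedelta(days=2),
    "User.Read": NOW - timedelta(days=1),
}
last_used_at = {
    "Sites.Read.All": NOW - timedelta(days=3),
}


def grant_use_gap(granted: dict, used: dict, now: datetime, grace: timedelta) -> list[str]:
    """Scopes granted at least `grace` ago that have never been exercised."""
    return sorted(
        scope
        for scope, when in granted.items()
        if scope not in used and now - when >= grace
    )


def grants_in_window(granted: dict, now: datetime, window: timedelta) -> int:
    """Number of scopes acquired inside the trailing window (a velocity proxy)."""
    return sum(1 for when in granted.values() if now - when <= window)


print(grant_use_gap(granted_at, last_used_at, NOW, timedelta(days=30)))
print(grants_in_window(granted_at, NOW, timedelta(days=7)))
```

The grace period keeps freshly granted scopes out of the revocation queue; the window count only becomes a detection once it is compared against the agent's own baseline.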

What identity governance must change

Access reviews built for humans assume a relatively stable role. The reviewer is asked, in effect, "does this person still need what they had last quarter?" That question does not work for an agent whose entire purpose is to absorb new capabilities continuously.

Three adjustments are required.

Reviews keyed to recent use, not recent grant. The relevant question is no longer "should the agent have this scope?" but "did the agent actually exercise this scope in the last N days, and was the use consistent with the original justification?" Scopes that fail both halves of that test should expire automatically.

Owners and sponsors as the accountable humans. Microsoft Entra Agent ID separates technical owners from business sponsors precisely so that someone with operational context and someone with business context can both be on the hook. Wire those roles into the review workflow. An agent without a current sponsor should not be holding sensitive delegated permissions.

Blueprint-level Conditional Access as the choke point. Because policies applied to a blueprint propagate to every agent identity created from it, the blueprint is the right place to enforce the constraints that should never be negotiable — geographic boundaries, sensitive-resource exclusions, step-up requirements for specific scope families. Treat the blueprint the way you treat a privileged-access workstation: small, hardened, watched.

A governance posture that grows with the agent

Three principles are worth taking back to the architecture board.

Consent is telemetry. Treat every dynamic consent event as a security signal of equal weight to a sign-in. Pipe it into the same analytics and the same review workflows.

Least privilege is a verb, not a noun. A static least-privilege list cannot survive contact with an agent that earns its access. The control objective is no longer to define the minimum scope set — it is to continuously prune toward it.

Grow with the agent; do not be the hurdle. The organizations that succeed will be the ones whose governance moves at the same cadence as the agent's learning. Quarterly reviews and annual recertifications were already too slow for humans. They are unworkable for agents.

Aria is going to keep growing. So will every other interactive agent in the tenant. The question for identity and security architects is not whether to allow it — that decision has already been made by the people on the other side of the chat window. The question is whether the controls, the detections, and the operating model are ready for what dynamic consent has quietly enabled.

If they are not yet, that is the work for this year.

Load PostgreSQL into Apache Iceberg with Sling

Fritz Larco — Mon, 18 May 2026 13:31:11 +0000

Introduction

Apache Iceberg is the table format that turns a pile of Parquet files in object storage into something that behaves like a warehouse table. You get schema evolution, hidden partitioning, time travel, and consistent reads from whichever engine you point at the table. PostgreSQL is where most operational data starts. Moving it into Iceberg gives you an analytics copy that DuckDB, Spark, Trino, Snowflake, and Athena can all read without anyone needing to agree on a single warehouse vendor first.

Sling speaks the Iceberg REST catalog directly. From the configuration side an Iceberg target is just another database connection: point Sling at the catalog URL and the underlying object store, then declare your streams. No JVM, no Spark, no manual manifest writing.

This guide replicates a Postgres schema into Iceberg using Sling. The catalog is Cloudflare R2's managed Iceberg REST catalog and the storage layer underneath is R2. Every CLI line, row count, and timing below comes from an actual run against those endpoints.

Installing Sling

Sling is a single binary. Pick whichever install fits:

# macOS / Linux
curl -fsSL https://slingdata.io/install.sh | bash

# Windows
irm https://slingdata.io/install.ps1 | iex

# Python
pip install sling

Confirm:

sling --version

Full install notes are in the Sling CLI Getting Started Guide.

Configuring the Postgres Source

Sling reads connection details from ~/.sling/env.yaml, environment variables, or sling conns set. A read-only user is enough:

CREATE USER sling WITH PASSWORD '<password>';
GRANT CONNECT ON DATABASE mydb TO sling;
GRANT USAGE ON SCHEMA public TO sling;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sling;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO sling;

Then register the connection:

sling conns set POSTGRES type=postgres host=host.ip user=sling \
  database=mydb password=mypass port=5432

Or in ~/.sling/env.yaml:

connections:
  POSTGRES:
    type: postgres
    host: host.ip
    user: sling
    password: mypass
    port: 5432
    database: mydb

If your Postgres requires SSL, append sslmode: require. Test it:

sling conns test POSTGRES

The Postgres connection docs cover SSL, IAM, and the rest.

Configuring the Iceberg Target

Sling treats Iceberg as a database-class target. The connection captures two things: the catalog, which stores table metadata, and the warehouse, which stores the actual Parquet data files. Sling supports REST, AWS Glue, and SQL catalogs. This guide uses REST.

For Cloudflare R2's Iceberg catalog you need the catalog URL, an API token, the warehouse identifier (account-id + bucket name), and S3-compatible credentials for the R2 bucket underneath. All four come from the R2 dashboard.

connections:
  ICEBERG:
    type: iceberg
    catalog_type: rest
    rest_uri: https://catalog.cloudflarestorage.com/<accountid>/<bucket>
    rest_token: <r2_catalog_api_token>
    rest_warehouse: <accountid>_<bucket>
    s3_access_key_id: <r2_access_key_id>
    s3_secret_access_key: <r2_secret_access_key>

For a self-hosted Lakekeeper or Nessie catalog, the shape is the same; only the rest_uri and rest_warehouse change. For AWS Glue, set catalog_type: glue and glue_warehouse: s3://my-bucket/warehouse. The Iceberg connection docs walk through each catalog type.

Test it:

sling conns test ICEBERG

A Full-Refresh Replication

For this run the Postgres source has three tables in a demo_postgres_iceberg schema:

users — 8,000 rows
orders — 35,000 rows
events — 60,000 rows, with an occurred_at timestamp

The replication file:

# replication.yaml
source: POSTGRES
target: ICEBERG

defaults:
  mode: full-refresh
  object: demo_postgres_iceberg.{stream_table}

streams:
  demo_postgres_iceberg.users:
  demo_postgres_iceberg.orders:
  demo_postgres_iceberg.events:
    mode: incremental
    primary_key: [event_id]
    update_key: occurred_at

A few notes:

object: follows the usual <namespace>.<table> shape. Sling creates the Iceberg namespace if it doesn't already exist in the catalog.
{stream_table} is a runtime variable. Sling substitutes the source table name so you don't repeat yourself.
The third stream switches to mode: incremental with an update_key. That's the only diff between a one-shot bulk load and an ongoing append flow.
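
To picture what the {stream_table} substitution does, here is a small Python sketch. `resolve_object` is a hypothetical helper written for this illustration and is not Sling's code; it only mimics the substitution described in the notes.

```python
STREAMS = [
    "demo_postgres_iceberg.users",
    "demo_postgres_iceberg.orders",
    "demo_postgres_iceberg.events",
]
OBJECT_TEMPLATE = "demo_postgres_iceberg.{stream_table}"


def resolve_object(template: str, stream_name: str) -> str:
    """Replace {stream_table} with the table part of a 'schema.table' name."""
    table = stream_name.rsplit(".", 1)[-1]
    return template.replace("{stream_table}", table)


targets = {stream: resolve_object(OBJECT_TEMPLATE, stream) for stream in STREAMS}
for stream, target in targets.items():
    print(stream, "->", target)
```

One template in defaults covers every stream, which is why the replication file stays short as tables are added.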

Run it:

sling run -r replication.yaml

Real output, trimmed:

INF Sling CLI | https://slingdata.io
WRN for mode 'incremental' with iceberg target, primary-key is ineffective,
    incremental merge is not yet supported (only appends)
INF Sling Replication [3 streams] | POSTGRES -> ICEBERG

INF [1 / 3] running stream demo_postgres_iceberg.users
INF created table "demo_postgres_iceberg"."users"
INF streaming data (direct insert)
INF inserted 8000 rows into "demo_postgres_iceberg"."users" in 11 secs [713 r/s] [519 kB]

INF [2 / 3] running stream demo_postgres_iceberg.orders
INF created table "demo_postgres_iceberg"."orders"
INF inserted 35000 rows into "demo_postgres_iceberg"."orders" in 9 secs [3,721 r/s] [2.1 MB]

INF [3 / 3] running stream demo_postgres_iceberg.events
INF getting checkpoint value (occurred_at)
INF writing to target database [mode: incremental]
INF created table "demo_postgres_iceberg"."events"
INF inserted 60000 rows into "demo_postgres_iceberg"."events" in 7 secs [8,190 r/s] [4.5 MB]

INF Sling Replication Completed in 29s | POSTGRES -> ICEBERG | 3 Successes | 0 Failures

103,000 rows across three tables, 29 seconds end-to-end. The warning at the top deserves a real answer; see the section on incremental modes further down.

Verification

Sling can query Iceberg tables directly through its DuckDB-backed reader. Tables are addressed as iceberg_catalog.<namespace>.<table>:

sling conns exec ICEBERG \
  "select 'users' as t, count(*) as c
     from iceberg_catalog.demo_postgres_iceberg.users
   union all
   select 'orders', count(*) from iceberg_catalog.demo_postgres_iceberg.orders
   union all
   select 'events', count(*) from iceberg_catalog.demo_postgres_iceberg.events"

+--------+-------+
| T      |     C |
+--------+-------+
| users  |  8000 |
| orders | 35000 |
| events | 60000 |
+--------+-------+

Row counts match the source. A sample of users confirms columns and types survived the trip:

sling conns exec ICEBERG \
  "select user_id, email, country, signup_at
     from iceberg_catalog.demo_postgres_iceberg.users
    order by user_id limit 5"

+---------+-------------------+---------+-------------------------------+
| USER_ID | EMAIL             | COUNTRY | SIGNUP_AT                     |
+---------+-------------------+---------+-------------------------------+
|       1 | user1@example.com | BR      | 2025-01-01 00:14:00 -0300 -03 |
|       2 | user2@example.com | DE      | 2025-01-01 00:28:00 -0300 -03 |
|       3 | user3@example.com | FR      | 2025-01-01 00:42:00 -0300 -03 |
|       4 | user4@example.com | JP      | 2025-01-01 00:56:00 -0300 -03 |
|       5 | user5@example.com | UK      | 2025-01-01 01:10:00 -0300 -03 |
+---------+-------------------+---------+-------------------------------+

Postgres jsonb lands as a structured column too. Sampling events:

+----------+---------+------------+----------------------+----------------------+
| EVENT_ID | USER_ID | EVENT_TYPE | PAYLOAD              | OCCURRED_AT          |
+----------+---------+------------+----------------------+----------------------+
|    60001 |       2 | click      | {"v": 1, "utm": "x"} | 2026-05-11 ...       |
|    60002 |       3 | signup     | {"v": 2, "utm": "x"} | 2026-05-11 ...       |
|    60003 |       4 | purchase   | {"v": 3, "utm": "x"} | 2026-05-11 ...       |
+----------+---------+------------+----------------------+----------------------+

Any other Iceberg reader sees the same data: DuckDB with the iceberg extension, Spark, Trino, Athena, Snowflake's catalog-linked databases. That portability is the reason for the catalog in the first place.

Running an Incremental Append

After the bulk load, the day-to-day shape is: every few minutes (or hours, or once a day), pick up the new rows since the last run and append them to the Iceberg table. Sling's incremental mode does this. The state (the last seen value of the update_key) is tracked by Sling itself, so you don't need to manage a state file the way you would for a file-based target.

Insert 2,500 new events on the source (a stand-in for fresh activity):

insert into demo_postgres_iceberg.events (event_id, user_id, event_type, payload, occurred_at)
select 60000 + n, 1 + (n % 8000), 'click',
       jsonb_build_object('utm','x','v', n % 100),
       now() - (n * interval '1 second')
  from generate_series(1, 2500) g(n);

Run a single-stream replication that touches only events:

# replication-incremental.yaml
source: POSTGRES
target: ICEBERG

defaults:
  object: demo_postgres_iceberg.{stream_table}

streams:
  demo_postgres_iceberg.events:
    mode: incremental
    update_key: occurred_at

sling run -r replication-incremental.yaml

INF Sling Replication | POSTGRES -> ICEBERG | demo_postgres_iceberg.events
INF getting checkpoint value (occurred_at)
INF reading from source database
INF writing to target database [mode: incremental]
INF streaming data (direct insert)
INF inserted 2500 rows into "demo_postgres_iceberg"."events" in 8 secs [294 r/s] [178 kB]
INF execution succeeded

Sling read the saved checkpoint, pulled only rows newer than the last occurred_at it saw, and appended exactly the 2,500 new rows. A readback confirms the new total:

sling conns exec ICEBERG \
  "select min(occurred_at), max(occurred_at), count(*)
     from iceberg_catalog.demo_postgres_iceberg.events"

+-------------------------------+--------------------------------------+--------+
| MIN_OCCURRED_AT               | MAX_OCCURRED_AT                      | COUNT  |
+-------------------------------+--------------------------------------+--------+
| 2025-03-01 00:00:40 -0300 -03 | 2026-05-11 08:42:59.533692 -0300 -03 |  62500 |
+-------------------------------+--------------------------------------+--------+

60,000 + 2,500 = 62,500. The new high-water mark on occurred_at is the timestamp of the freshest insert. The next scheduled run will start from there.
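
The high-water-mark logic behind an append-only incremental run can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the checkpoint is taken as the maximum update_key already in the target, rows are plain dicts, and `incremental_append` is a hypothetical function, not how Sling is implemented.

```python
from datetime import datetime, timedelta


def incremental_append(source_rows: list[dict], target_rows: list[dict], update_key: str) -> int:
    """Append source rows newer than the target's high-water mark; return the count."""
    checkpoint = max((row[update_key] for row in target_rows), default=None)
    new_rows = [
        row for row in source_rows
        if checkpoint is None or row[update_key] > checkpoint
    ]
    target_rows.extend(new_rows)
    return len(new_rows)


base = datetime(2026, 5, 11, 8, 0, 0)
source = [{"event_id": i, "occurred_at": base + timedelta(minutes=i)} for i in range(1, 6)]
target = list(source[:3])  # the first three rows were loaded earlier

first_run = incremental_append(source, target, "occurred_at")
second_run = incremental_append(source, target, "occurred_at")
print(first_run, second_run, len(target))
```

One consequence of this design is worth keeping in mind: a row that arrives late with an update_key older than the checkpoint is never picked up, so the update_key should only move forward on the source.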

Append-incremental vs merge-incremental

That warning Sling printed on the first run matters:

WRN for mode 'incremental' with iceberg target, primary-key is ineffective,
    incremental merge is not yet supported (only appends)

For database targets like Postgres or Snowflake, Sling's incremental mode is a merge: a row whose primary_key already exists in the target gets updated in place. For an Iceberg target today, incremental means append only. New rows go in, existing rows stay as-is, and a primary_key declared on the stream is parsed but not enforced.

That is fine when your source is append-only: events, immutable transactions, log data. It is the wrong default if your source has mutable rows you need reflected on the lake side. Until merge lands, two patterns work:

Snapshot replays. Run mode: full-refresh on a cadence that matches your freshness budget. Iceberg's snapshot model means readers always see a consistent table; the old snapshot is replaced atomically. For tables in the low millions this is faster than it sounds.
CDC-style append plus downstream resolution. Append every Postgres change to Iceberg as-is (using a logical-replication tool or trigger-based capture) and resolve the latest-state view at read time with something like qualify row_number() over (partition by pk order by event_ts desc) = 1. A bit more work at query time, very cheap at write time.

Track the Iceberg connector docs for when full merge mode ships.
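
The read-time resolution in the second pattern is the same idea as the row_number() query: keep the newest version of each key. A minimal Python sketch, assuming each appended change carries a primary key and an ordering column; the column names `pk` and `event_ts` mirror the query fragment above and the rows are made-up sample data.

```python
def latest_state(change_log: list[dict], pk: str, order_by: str) -> list[dict]:
    """Keep, for each primary key, the row with the greatest order_by value."""
    latest: dict = {}
    for row in change_log:
        current = latest.get(row[pk])
        if current is None or row[order_by] > current[order_by]:
            latest[row[pk]] = row
    return sorted(latest.values(), key=lambda row: row[pk])


# Append-only log: user 1 changed country twice, user 2 once.
change_log = [
    {"pk": 1, "country": "BR", "event_ts": 100},
    {"pk": 2, "country": "DE", "event_ts": 101},
    {"pk": 1, "country": "FR", "event_ts": 150},
    {"pk": 1, "country": "JP", "event_ts": 120},
]

current = latest_state(change_log, pk="pk", order_by="event_ts")
print(current)
```

The append path never rewrites data; the cost of deduplication moves to the reader, which is the trade-off described above.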

Common tweaks

Choose the right catalog. REST is the most portable: the same connection shape works for Cloudflare R2, Lakekeeper, Nessie, Polaris, and any other REST-compatible catalog. Glue is the simplest in AWS-native shops. SQL catalog is fine for local dev. Avoid wiring a different catalog per environment if you can help it; the table layout doesn't care, but the metadata location does.
Namespace organization. Treat namespaces (demo_postgres_iceberg in the examples above) the way you treat warehouse schemas: one per source system, or one per data domain. Don't dump everything into default.
Filter at the source. Use a sql: block per stream to project columns or filter rows before they leave Postgres. Smaller Parquet files, smaller manifests, cheaper queries downstream.
Time travel for free. Every replication produces a new Iceberg snapshot. Readers can time-travel to a previous snapshot, which is useful for "what did this table look like before yesterday's run?" without storing your own backups.
Maintain the table. Like any Iceberg table, periodic compaction and snapshot expiration keep the file count and metadata size from growing without bound. Set this up on a separate schedule from the replication itself.

Where to go next

The same pattern works for any of Sling's 30+ database sources into Iceberg: MySQL, SQL Server, Snowflake, BigQuery, MongoDB, and the rest. Swap the source and leave the target alone.

If the underlying R2 storage is what brought you here, the Postgres → R2 as Parquet walkthrough shows the same source landing as raw Parquet files instead of an Iceberg table, which is useful when downstream readers don't need a catalog. For a deeper comparison of file-format targets, see Postgres → S3 as Parquet and Postgres → DuckDB.

For team workflows with scheduling, alerting, and audit trails on top of the same CLI, look at the Sling Platform.

Questions go to Discord or GitHub Issues.

I Was Wrong About AI. Here Is the Moment That Changed It.

Kiell Tampubolon — Mon, 18 May 2026 13:30:36 +0000

The debugging tool flagged a staggering 150 issues in my code almost instantly. I was astonished by how many mistakes I had made and how far I still had to go. This moment revealed the complexity of AI that I had underestimated all along.

The Moment It Broke

That breaking point was unexpected. I was toying with a fancy AI algorithm, thinking it was about to churn out perfect code. I had linked it with my code editor, set everything up, and watched with excitement as it typed out solutions based on prompts I fed it. After a few successful iterations, I made a huge mistake. I forgot to properly validate the inputs.

One afternoon, I threw in a random input to test its limits. The console displayed the message that will haunt me: "Runtime Error: Unexpected Token."

For a moment, I was frozen. I had ignored something critical: AI is a tool, not a solution in itself. More often than not, I’d been treating it as some kind of oracle instead of evaluating how it actually understood my requests. I should’ve known better. Sure, AI can assist in many ways, but nothing beats core programming principles.

What I Discovered

When I finally debugged the mess, I took a moment to reflect. I realized that I had neglected proper coding best practices in favor of a shiny new toy. AI works wonderfully when it augments your existing understanding and workflow. Here’s a snippet demonstrating the change I made:

// Original flawed function without validation
function aiSuggestion(prompt) {
  return aiModel.generate(prompt);
}

// Improved function with validation
function aiSuggestion(prompt) {
  if (typeof prompt !== 'string' || !prompt) {
    throw new Error('Invalid input: Please provide a valid string.');
  }
  return aiModel.generate(prompt);
}

Just adding that validation step made a world of difference. I could trust my AI’s feedback more because I was guiding it with better input. It was a simple tweak, but it became pivotal in my project’s success. I still smiled when my AI echoed back code that was far more functional than my initial attempts, but now I was equipped with the knowledge that I needed to do my part first.

The Bigger Principle

This led to a bigger realization: tools are just extensions of ourselves. They don't replace the fundamentals. If you're a developer working with AI, you have to take responsibility for your code. Forgetting that turns you into a passive user, and honestly, nobody wants to be that. It’s like trying to build a house without knowing how to lay bricks; the walls might look good for a while, but they’re bound to crumble eventually.

When I look back, what annoys me most is that I didn’t question my assumptions sooner. I could’ve saved time, energy, and probably a few hairs on my head. Relying solely on AI to deliver the goods is tempting, but it leads to dangerous shortcuts. Proper coding practices don’t just lead to better outcomes; they prevent mistakes down the road.

In hindsight, I’d tell past-me to challenge the narrative that tech can solve everything. AI should elevate our skills, not replace them. So here’s my burning question: Are AI and automation a developer's best support system, or do they create a dangerous dependency that might weaken our core skills? What’s your take on this?

Stop Building Fragile Scrapers — Build Actors Instead

SIÁN Agency — Mon, 18 May 2026 13:30:00 +0000

TL;DR — A "scraper" is a script that ran once. An "actor" is a unit of work with an input contract, an output schema, observability, and a billing model. Same code, completely different operational surface. We migrated our Bayut property pipeline from the first to the second this quarter and the support load dropped 70%.

I get sent a lot of scraper repos to "review" — usually after they've broken in production. They look surprisingly similar:

One Python file, 300–600 lines.
A main() that loops over URLs.
requests.get() plus BeautifulSoup plus a try/except: pass that swallows everything.
Output written to a CSV called output.csv in the working directory.
A cron job that triggers it nightly. Sometimes a Slack webhook on failure that stopped working six months ago.

This is what I call a script that ran once. The fact that it ran in production doesn't make it production code.

The teardown is always the same.

The five failure modes you inherit when you ship a script

No input contract. The script reads URLs from a hardcoded list or a file path that only exists on your laptop. New requirement → edit the file → redeploy → hope.
No output schema. Whatever fields happened to be present this run get written. When the source site adds a column, the CSV silently widens. When the source site removes a column, downstream breaks at parse time, three hops away from the cause.
No observability. "Did it run last night?" is answered by SSH-ing to the box and ls -la output.csv. Run history is the file's mtime. Failure mode is "the file is older than expected."
No retries with backoff. A 503 from the target site at 02:14 kills the run. There is no second attempt. The next run is in 24 hours.
No billing surface. The cost of running it is your time and your server. There is no per-unit price, so there is no signal that the unit economics are bad until you check the AWS bill.

A script is fine for "I need this data once." It is not fine for "we need this data nightly for the next two years." But teams keep shipping the former to fulfill the latter.
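
For reference, the missing retry behaviour from the fourth failure mode looks roughly like this. It is a sketch under assumptions: `TransientError` is a hypothetical marker for failures worth retrying (such as a 503), delays double on each attempt, and the sleep function is injectable so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. an HTTP 503."""


def retry_with_backoff(
    task: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on TransientError with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return task()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure loudly
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky_fetch() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("503 from target site")
    return "ok"


delays: list[float] = []
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Note what the sketch does not do: it never swallows the final error. When the attempts run out, the exception propagates, which is what lets a runtime mark the run as failed.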

What an actor is

Strip the marketing word and an actor is just: a containerised job with a declared input schema, a declared output schema, and a runtime that handles scheduling, retries, logs, persistent storage, and billing. Apify is one implementation — there are others. The shape matters more than the vendor.

When we rebuilt our Bayut property scraper as an actor, four things changed at the level of code:

// 1. Input is validated against a schema before main() runs.
//    Bad input fails fast with a useful error, not silent miss.
const input = await Actor.getInput(); // INPUT_SCHEMA.json enforces shape

// 2. Output goes to a typed dataset. New fields require a schema
//    change — not a silent CSV widening.
await Dataset.pushData({
  listingId, price, currency, address, lat, lng, scrapedAt
});

// 3. Failures retry with backoff at the platform level.
//    Our code throws; the runtime decides what to do.
throw new ScrapeFailure('listing-blocked', { url, status: 429 });

// 4. Logs are structured, queryable, and indexed by run.
log.warning('rate-limit', { url, retryAfter: 60 });

That's it. Same Playwright, same selectors, same scraping logic. The difference is that all the boring infrastructure — input validation, output typing, retries, logs, scheduling, billing — is no longer your problem.

Result

For Bayut specifically, three months after the migration:

Mean time to detect a breakage went from ~36 hours (next-day stakeholder complaint) to under 15 minutes (failed runs alert with the offending URL and HTTP status).
Support tickets dropped 70%. Most of the volume was "the data is missing" — invisible failures from the cron-script era. With per-run datasets, failed runs surface themselves.
Cost per 1000 listings went down, not up. Concurrency at the runtime level is cheaper than spinning up your own queue.

The migration itself took about a week. Most of the time was not the scraping logic — that was already there. It was deciding what the input schema should be, what the output schema should be, and which fields were "nice to have" vs "the dataset is broken without this."

The replacement pattern

If you're sitting on a script-shaped scraper right now, the migration order is:

Write the input schema. Force every run to declare what it's scraping.
Write the output schema. Force every row to validate before it gets persisted.
Move retries from try/except: pass to the runtime.
Replace print() with structured logs.
Containerise. Whatever runs in python main.py should run in docker run.
Pick a runtime — Apify, your own k8s cron, whatever. The schema work is portable.

You do steps 1–5 inside your existing repo. You haven't committed to a vendor yet. By the time you reach step 6, the actor exists — the runtime is just a deployment target.
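Steps 2 and 4 can be sketched inside an existing Python repo with nothing but the standard library. This is a minimal sketch, not the starter mentioned below: the field names come from the dataset example earlier in the post, while the field types, `validate_row`, and `log_event` are assumptions made for illustration.

```python
import json

# Field names come from the dataset example above; the types are assumptions.
OUTPUT_SCHEMA = {
    "listingId": str,
    "price": (int, float),
    "currency": str,
    "address": str,
    "lat": float,
    "lng": float,
    "scrapedAt": str,
}


def validate_row(row: dict) -> list:
    """Return a list of schema violations; an empty list means the row is valid."""
    errors = []
    for field, expected in OUTPUT_SCHEMA.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            errors.append(f"wrong type for {field}: {type(row[field]).__name__}")
    for field in row:
        if field not in OUTPUT_SCHEMA:
            errors.append(f"unexpected field: {field}")  # no silent widening
    return errors


def log_event(level: str, event: str, **fields) -> str:
    """Format one structured log line as JSON (a stand-in for print())."""
    return json.dumps({"level": level, "event": event, **fields}, sort_keys=True)
```

A row is persisted only when `validate_row` returns an empty list. Anything else becomes a logged, visible failure instead of a wider or narrower CSV.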

We packaged this migration shape into a starter we use for every new client engagement: the same six steps that produced the Bayut property scraper above.

Which of the five failure modes is currently shipping in your stack? Drop it in the comments — I'll point at the smallest change that fixes it.

Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.

I built a protocol that reduces AI prompts by 70% — here's the proof

edwin realpe preciado — Mon, 18 May 2026 13:29:09 +0000

The claim

Most developers know that AI prompts are
inconsistent. You write 80 words describing
a component, the AI generates something
close but not quite right, you iterate,
you waste time.

I've been working on a different approach:
instead of writing better prompts, what if
you had a structured protocol that eliminates
the ambiguity entirely?

That's NEXUS — a minimalist Human-AI
communication protocol. And instead of
just claiming it works, I built a library
of 25 real examples showing the before
and after.

The comparison

Here's a real example — a webhook handler:

Without NEXUS (87 words of natural language):

"Create a POST endpoint in Express to receive
Stripe webhooks. It should read the body as
raw buffer, verify the webhook signature using
stripe.webhooks.constructEvent() with the secret
from environment variables, return 400 if the
signature is invalid, and call
WebhookService.handleStripe() with the event.
Respond with received: true if everything works."

Result: variable. Depends on the model,
the day, the context.

With NEXUS (8 lines):

@Express
Controller WebhookController
  Router ApiV1
    Endpoint POST /webhooks/stripe
      !! "The webhook signature must be valid"
      => WebhookService.handleStripe()
      !error:400 -> /error/invalid-signature

Result: deterministic. Every time.

The numbers across 25 examples:

Average reduction: ~70% less text
Ambiguity: zero
The AI knows exactly what to build

Why it works

Natural language compresses intent into
sentences that humans parse easily but
AI models resolve inconsistently.

NEXUS makes intent explicit:

!! preconditions fire before the action
!error:code handles failures after
=> is the action — nothing implied

The model doesn't decide what to do.
The protocol tells it.
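To make the three markers concrete, here is a small Python sketch that sorts blueprint lines by their leading marker. It is not the nxlang parser and makes no claim about how NEXUS is actually implemented; `classify_line` and the category names are hypothetical, inferred only from the three markers listed above.

```python
def classify_line(line: str) -> tuple:
    """Classify one blueprint line by its leading marker (illustrative only)."""
    text = line.strip()
    if text.startswith("!!"):
        return ("precondition", text[2:].strip())
    if text.startswith("!error:"):
        return ("error_handler", text[len("!error:"):].strip())
    if text.startswith("=>"):
        return ("action", text[2:].strip())
    # Everything else (Controller, Router, Endpoint, ...) is treated as structure.
    return ("declaration", text)
```

Each line carries exactly one role, decided by its prefix, which is the sense in which nothing is implied.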

See the 25 examples

I built a full library showing the
three-panel comparison for every example:

Natural language prompt (what you'd write today)
NEXUS blueprint (8-16 lines)
Generated code (the output)

Cards, navbars, forms, REST APIs,
authentication flows — all with the
before/after numbers.

👉 nexuslang.dev/examples

The library is open source:
💻 github.com/open-souse/Nexus
📦 npm install nxlang

Curious what examples you'd want to see
next — what's the most painful component
to describe to an AI in natural language?

OpenHuman: The Open-Source AI Assistant That Wants to Become Your Second Brain

Shubham Kumar Sinha — Mon, 18 May 2026 13:26:18 +0000

Artificial Intelligence is rapidly moving beyond simple chatbots. Today’s users want AI systems that can remember context, understand workflows, connect with tools, and actually help with day-to-day productivity. This is where the OpenHuman GitHub repository enters the picture.
Built by TinyHumans AI, OpenHuman is an open-source “personal AI super intelligence” designed to act more like a persistent digital companion than a temporary chatbot. Unlike traditional AI tools that forget everything after each session, OpenHuman focuses heavily on memory, personalization, privacy, and deep integration with your digital life.

What is OpenHuman?

OpenHuman is a local-first AI assistant that combines:

Long-term memory
AI agent capabilities
Tool integrations
Voice interactions
Desktop automation
Personalized context understanding

The project positions itself as a private AI runtime that learns about you continuously and becomes smarter over time. According to the official repository, OpenHuman is designed to “integrate with you in your daily life.”
What makes the project especially interesting is its ambition to move AI from “question-answering” into a real-world personal operating system for productivity.

Key Features of OpenHuman

1. Massive Memory System

One of OpenHuman’s biggest highlights is its memory architecture.
The platform claims support for up to 1 billion tokens of memory, allowing the AI to remember emails, documents, notes, workflows, meetings, and user preferences over time.
Instead of relying only on temporary chat history, OpenHuman creates a structured memory tree that continuously evolves.
This allows the assistant to:

Recall previous conversations
Understand recurring tasks
Maintain long-term user context
Learn workflows automatically

2. 118+ Integrations
OpenHuman supports over 118 third-party integrations including:

Gmail
GitHub
Slack
Notion
Google Calendar
Stripe
Jira
Linear
Google Drive

and many more.

The integrations work through OAuth connections and expose tools directly to the AI assistant.
This means the assistant can:

Read schedules
Understand projects
Summarize updates
Organize workflows
Provide proactive suggestions

3. Local-First Privacy
Privacy is becoming one of the most important discussions in AI.
OpenHuman focuses heavily on local execution and on-device memory storage. According to project documentation, data is stored locally using SQLite and can also sync into an Obsidian-compatible markdown vault.
This approach gives users more ownership over:

Personal data
AI memory
Documents
Workflows

Unlike cloud-only AI platforms, OpenHuman aims to reduce dependency on external servers.

4. Obsidian-Style Knowledge Base
Another standout feature is its Obsidian-compatible memory vault.
The system converts connected information into markdown knowledge chunks that can be browsed, edited, and organized like a personal wiki.
This creates a fascinating bridge between:

AI assistants
Personal knowledge management
Second-brain systems

For users already working with tools like Obsidian, this integration can be extremely valuable.

5. Voice and Desktop Presence
OpenHuman is not just a text chatbot.
The project includes:

Voice interactions
Speech-to-text
Text-to-speech
Animated desktop mascot
Background AI processing

The assistant is designed to feel more “alive” and persistent rather than appearing only when manually opened.

Technical Stack

OpenHuman uses a modern desktop architecture built with:

Rust
React
Tauri
TypeScript
QuickJS sandbox runtime

The main application combines a Rust-powered backend with a React-based UI, while integrations and “skills” run in isolated environments.

This architecture provides:

Better performance
Lower memory usage
Cross-platform compatibility
Stronger security isolation

Why OpenHuman is Gaining Attention

OpenHuman has recently gained significant traction in the open-source AI community.
Several reasons explain this momentum:

AI Fatigue with Traditional Chatbots

Many users are frustrated by AI systems that:

Forget context
Require repeated prompting
Lack personalization
Depend heavily on cloud services

OpenHuman directly targets these pain points.

Rise of AI Agents

The industry is rapidly shifting from static chatbots toward AI agents capable of:

Taking actions
Managing workflows
Using tools
Automating tasks

OpenHuman positions itself as part of this “agentic AI” movement.

Local AI Movement

There is increasing interest in:

Offline AI
Private AI
Self-hosted AI systems
Local LLMs

OpenHuman aligns strongly with this trend by enabling local execution and persistent user-owned memory.

Current Limitations

Despite the excitement, OpenHuman is still in early beta.
The developers openly mention:

Rough edges
Ongoing development
Potential bugs
Frequent updates

Users should expect instability while the project matures.
Additionally, because the platform handles sensitive data and deep integrations, security and permission management will remain critical areas to watch.

The Bigger Vision Behind OpenHuman

OpenHuman represents a broader shift happening in AI.
Instead of AI being:

A website
A chatbot
A prompt box

the future may look more like:

Persistent AI companions
Personal operating systems
Digital memory assistants
Autonomous workflow agents

Projects like OpenHuman are exploring what happens when AI becomes deeply integrated into daily life rather than isolated to short conversations.

Final Thoughts

OpenHuman is one of the most ambitious open-source AI assistant projects currently gaining momentum in the developer ecosystem.
Its combination of:

Long-term memory
Local-first privacy
Deep integrations
Personalized workflows
Open-source architecture

makes it stand out in an increasingly crowded AI landscape.
While the project is still early in development, it offers a compelling glimpse into the future of personal AI systems.
For developers, productivity enthusiasts, and AI researchers, OpenHuman is definitely a project worth watching.

Useful Links:

OpenHuman GitHub Repository

OpenHuman Official Website

OpenHuman Documentation

TinyHumans AI GitHub Organization

My Linkedin

The Agent's Word Is Not Enough: External Validation in the Agentic Governance Stack

Anthony Johnson II — Mon, 18 May 2026 13:25:35 +0000

This article was originally published on EthereaLogic.ai.

The first two articles in this series established a distinction that anchors the whole governance framework: the layers that live in documents tell the agent what to do, and the layers that run as code are what make the system trustworthy. Documents explain the rules. Hooks make rules physically impossible to violate inside the harness. But hooks intercept actions, not claims. The hook in GovForge blocks a direct push to main. It says nothing about whether the code the agent wrote is correct, whether the tests the agent ran actually covered the changed behavior, or whether a dependency in the last install carries a known CVE the agent did not flag. Those questions live outside the hook's jurisdiction — and outside the agent's own reporting as well.

On April 20, 2026, in the post-PR #258 sync record, the GovForge project's primary agent reported 1,361 passing tests from its local validation run: 1,159 backend and 202 frontend, all green. The most recently completed CI run on main at that moment reported 1,152 passing backend tests. The backend discrepancy was 7 tests, and the agent had not misreported anything. The tests it ran locally were real. They passed. CI's lower count reflects GOVFORGE_RUN_LLM_TESTS=0, which GovForge's CI configuration sets explicitly to disable LLM-integration tests that require a local Ollama endpoint — unsuitable for a clean CI runner that has no GPU or local model dependency. The agent's count was accurate for its local development environment. It was not accurate for the environment that governs a merge to main. CI is the layer that knows the difference.

That is what the external-validation layer is for. This is the third and final article in the EthereaLogic series on the agentic governance stack. It goes inside Layer 5 — the one that runs independently of the agent, from a clean environment with no access to the agent's session state, and treats the agent's self-report as a starting point, not a conclusion.

Layer 5 runs in an environment the agent did not configure, with tools the agent does not control, producing reports the agent cannot overwrite. The three-job shape — quality, static analysis, dependency scanning — independently covers the three principal failure modes an agent can produce without triggering a hook.

The Failure Mode Hooks Cannot Close

Hooks intercept actions. They stop an agent from taking a destructive step inside the harness — committing to a protected branch, deleting a file outside the permitted root, constructing a shell pipeline that smuggles a protected operation through a nested command. The GovForge pre-tool-use.js guard covers all of that. What it cannot cover is everything that happens once the allowed action lands.

An agent can write a test suite that passes because it tests the wrong behavior. An agent can run a dependency install and report success without checking whether any installed package has a known vulnerability. An agent can complete a typecheck under a configuration that silences the errors the updated code introduced. None of these are destructive operations in the sense the hook is designed to block. All of them are failure modes that a clean external CI run — starting from a fresh checkout with an authoritative environment definition — is positioned to surface before the PR merges.

The failure mode hooks cannot close is not about what the agent does wrong. It is about what the agent cannot see. The agent's session state is its own. Its test runner runs in its own process. Its dependency install uses its own cache. Its environment variables are its own. None of that maps cleanly onto what CI sees, because CI starts over from nothing every time. The divergence the April 20 sync record surfaced — 1,159 local backend vs. 1,152 CI backend — is not a failure. It is a correct representation of two different environments answering the same question differently. The external-validation layer's contribution is precisely that: it answers the question from outside.

What External Validation Actually Is

External validation, in the context of this governance stack, means a CI suite that runs in a clean runner environment, on a fresh checkout, with no access to the agent's session state, and produces reports the agent cannot overwrite or amend.

Each of those properties matters independently. Clean runner means no implicit carry-over from the agent's local environment — no agent-generated environment variables, no state from prior sessions, and no packages except those explicitly cached in the workflow definition itself. Fresh checkout means CI sees exactly the committed code, not the agent's working tree. No session-state access means CI does not know what the agent ran locally, what the agent reported, or what the agent believes to be true. Reports the agent cannot overwrite is what makes the external-validation layer irreversible: Codacy's analysis, Snyk's dependency scan, and Codecov's coverage upload are generated by tools the agent did not write and does not control, attached to the commit or PR as artifacts that exist independently of anything the agent says.

The tool configuration that produces those properties is not complex, but it has a shape that has emerged consistently across the four production projects in the development directory: a quality job, a static-analysis job, and a dependency-scanning job, each running independently with blocking behavior assigned deliberately. The shape is the point. Any one job can miss a failure mode the other two catch; all three together cover the principal failure modes an agent can introduce without triggering a hook.

The Three-Job Shape

The GovForge ci.yml is 79 lines and contains three jobs: lint-and-test, codacy, snyk. It is the reference shape for the pattern.

The lint-and-test job runs on Python 3.11 and 3.12 in a matrix, which means every push produces two job instances — each running the full gate sequence: marker scan, Ruff lint, mypy typecheck, pytest with coverage, frontend tests, and frontend build. The matrix is load-bearing: an agent can inadvertently introduce a type annotation or syntax form that is valid in one Python version and invalid in another, and the matrix catches the divergence before the PR merges. The job sets GOVFORGE_RUN_LLM_TESTS: "0" at the env level, which is the environment variable whose value explains the April 20 test delta. That variable is the CI configuration's way of stating that LLM-integration tests are out of scope for a clean runner: they require a local Ollama endpoint (http://localhost:11434), and a clean CI runner has no GPU or locally running model to satisfy that dependency. The agent runs them locally because local development benefits from the full test surface. CI does not run them because CI is not local development.
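One way such an environment gate could be wired on the pytest side is sketched below. This is not GovForge's actual conftest.py: the variable name and the llm marker come from the article, but `llm_tests_enabled`, the hook body, and the choice to treat an unset variable as "enabled" are assumptions.

```python
import os

LLM_ENV_VAR = "GOVFORGE_RUN_LLM_TESTS"


def llm_tests_enabled(environ=None) -> bool:
    """Return False only when the variable is explicitly set to "0".

    Treating an unset variable as enabled is an assumption of this sketch.
    """
    env = os.environ if environ is None else environ
    return env.get(LLM_ENV_VAR, "1") != "0"


def pytest_collection_modifyitems(config, items):
    """pytest collection hook: skip tests marked `llm` when the gate is off."""
    if llm_tests_enabled():
        return
    import pytest  # imported lazily so the helper above has no hard dependency

    skip_llm = pytest.mark.skip(reason=f"{LLM_ENV_VAR}=0")
    for item in items:
        if "llm" in item.keywords:
            item.add_marker(skip_llm)
```

With a gate like this, the same suite collects the same tests in both environments, and only the skip count differs.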

The codacy job runs Codacy's analysis CLI from a SHA-pinned action. Codacy is a static-analysis platform that inspects code for quality issues, security patterns, complexity violations, and duplication — patterns that pass linting and typechecking but signal structural problems. It applies its own rule set, not the project's. An agent that writes code that passes Ruff and mypy can still produce code that Codacy flags as a cyclomatic complexity violation or a security anti-pattern. The codacy job has no continue-on-error flag, which means a Codacy block fails the overall CI status.

The snyk job runs Snyk's dependency scanner from a SHA-pinned action, with continue-on-error: true. The continue-on-error flag is not a concession; it is a deliberate design choice. Snyk operates against a live vulnerability database that is updated continuously. A Snyk finding on a push may reflect a CVE disclosed hours ago against a dependency that has not yet shipped a patched version. Blocking the merge on a finding with no available fix produces a CI configuration that generates blocked PRs with no actionable resolution path. continue-on-error: true means the scan executes in CI and its output is visible in the workflow logs without blocking the merge; the finding is produced independently of the agent's self-report and is the operator's responsibility to triage. AetheriaForge and DriftSentinel carry the same Snyk configuration for the same reason.
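The blocking behavior described for the three jobs can be modeled in a few lines of Python. This is a simplified model for illustration, not GitHub's implementation: `JobResult`, `overall_status`, and `findings_to_triage` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobResult:
    name: str
    succeeded: bool
    continue_on_error: bool = False


def overall_status(jobs) -> str:
    """A failed job blocks unless it is marked continue-on-error."""
    blocking = [j.name for j in jobs if not j.succeeded and not j.continue_on_error]
    return "failure" if blocking else "success"


def findings_to_triage(jobs) -> list:
    """Non-blocking failures still need an operator to look at them."""
    return [j.name for j in jobs if not j.succeeded and j.continue_on_error]
```

In this model a failing snyk job leaves the overall status green but still shows up in the triage list, while a failing codacy job turns the status red.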

AetheriaForge and DriftSentinel add a fourth element: Codecov upload via codecov/codecov-action@75cd11691c0faa626561e295848008c8a7dddffe # v5, configured with fail_ci_if_error: true. Codecov is a coverage-tracking service. The upload produces a coverage report attached to the PR that is independent of the agent's local coverage output. fail_ci_if_error: true means that if the upload fails — network error, invalid token, malformed report — CI fails rather than silently omitting the coverage signal. The Codecov report is not a gate on a coverage percentage floor in these projects, but it makes coverage trends visible across PRs and does so from outside the agent's session. The agent's local pytest run also produces coverage output; the Codecov report is the one that is persisted, diffed against prior runs, and attached to the PR as an independent artifact.

ADWS Pro implements the same governance intent with a different job layout: a test job that runs the quality and coverage gates with local Codacy-equivalent and Codecov-equivalent checks inline, a separate security job for the local Snyk-equivalent vulnerability gate, a post-merge-signal job that writes the CI outcome (passed or regressed) to a named artifact (adws-post-merge-outcome), and an sbom job that generates a software bill of materials on every push. A separate drift-sentinel.yml workflow (50 lines) adds PR drift detection via drift_report.json. The ADWS Pro CI surface across both workflow files totals 163 lines.

The Environment Gate and the Test Delta

The April 20, 2026 sync record is the clearest available example of why the external-validation layer matters even when the agent is operating in complete good faith.

The agent's local count — 1,361 passing tests — was a correct measurement of the GovForge test suite running in the local development environment. Every test the agent ran passed against real code. The sync record documents the measurement in detail: 1,160 backend tests collected, 1,159 passed, 1 skipped; 202 frontend tests passed across 33 test files; make validate exited 0. The agent reported what it measured. The sync record documents the claim with precision.

CI's count — 1,152 passing backend tests, 8 skipped — reflects a different environment definition. The GOVFORGE_RUN_LLM_TESTS=0 environment variable, declared at the job level in ci.yml, disables the LLM-integration test suite. That suite has 7 tests marked @pytest.mark.llm that require a locally running Ollama endpoint at http://localhost:11434. Those tests exercise real production code paths through GovForge's model-routing layer, but a clean CI runner has no GPU or local model process to satisfy the endpoint check in conftest.py. The CI configuration excludes them deliberately. The agent's local development environment, where Ollama is running, does not.

The result is a documented, reproducible, and fully explained divergence between the agent's self-report and CI's independent count — 7 backend tests' difference. Neither number is wrong. Both are correct descriptions of different environments applying different criteria to the same question. The external-validation layer's role is not to catch the agent lying. It is to answer the question from the environment that governs whether code ships. The agent's local environment is useful evidence. It is not authoritative evidence. CI is.
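The arithmetic behind the delta can be checked directly. The counts below are the ones quoted from the sync record and the reference CI run; the helper name `backend_delta` is made up for this sketch.

```python
COLLECTED_BACKEND = 1160
LOCAL = {"passed": 1159, "skipped": 1}
CI = {"passed": 1152, "skipped": 8}
LOCAL_FRONTEND_PASSED = 202
LLM_MARKED_TESTS = 7


def backend_delta(local, ci):
    """Tests that pass locally but are skipped in CI, counted two ways."""
    return local["passed"] - ci["passed"], ci["skipped"] - local["skipped"]


passed_delta, skipped_delta = backend_delta(LOCAL, CI)
print(passed_delta, skipped_delta)                   # 7 7
print(LOCAL["passed"] + LOCAL_FRONTEND_PASSED)       # 1361
```

Both environments account for all 1,160 collected backend tests. The 7 tests that move from "passed" to "skipped" match the size of the LLM-integration suite.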

The agent's local count and CI's count are both correct. Both accurately describe their respective environments. Only CI's count governs whether the branch merges — and CI's environment is defined by the workflow file, not by the agent's session.

This distinction is the single most important property of the external-validation layer, and it is the one most likely to be papered over in an agentic deployment that has only the first four layers. A team that treats an agent's self-reported test pass as a merge signal without independent CI confirmation is implicitly trusting that the agent's environment matches the CI environment, that the agent's test configuration matches the CI configuration, and that the agent's dependency state matches what a fresh install would produce. All three assumptions are wrong on a long enough timeline, and all three are corrected by the time an external CI run finishes.

SHA-Pinning and the Infrastructure CI Runs On

The external-validation layer depends on the CI infrastructure itself being trustworthy. If the actions that CI invokes are mutable — that is, if the identifier used to reference them can resolve to different code on different days — then CI is not actually independent. It is dependent on whatever the action maintainer most recently published under a given tag.

This is not a theoretical risk. GitHub's own documentation on hardening workflows for third-party actions names mutable version tags as a documented attack vector. A tag like v4 points to the latest commit on the v4 release line; if the action maintainer pushes a new commit to that line, every workflow referencing @v4 begins running the new code on its next invocation. The workflow author may not know. CI does not inherently warn that the referenced tag now resolves to different code. The behavior change is silent.

SHA-pinning closes this class of risk entirely. A reference like actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 resolves to exactly one commit, permanently. If the action maintainer moves the v6.0.2 tag to a different commit, the SHA-pinned workflow is unaffected: it still resolves to the commit that was current at the time the workflow was authored. The comment annotation (# v6.0.2) serves the human reader; the SHA serves the runtime. Both are required, in the same way that a well-written hook has a clear stderr message for the agent and an exit code 2 for the harness.

Representative SHA-pinned actions from across the four production projects include:

# Checkout — GovForge and ADWS Pro
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

# Checkout — AetheriaForge and DriftSentinel
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4.3.1

# Python setup — GovForge
- uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0

# uv setup — GovForge
- uses: astral-sh/setup-uv@08807647e7069bb48b6ef5acd8ec9567f424441b # v8.1.0

# Bun setup — GovForge
- uses: oven-sh/setup-bun@0c5077e51419868618aeaa5fe8019c62421857d6 # v2

# Codecov — AetheriaForge and DriftSentinel
- uses: codecov/codecov-action@75cd11691c0faa626561e295848008c8a7dddffe # v5

# Codacy — GovForge, AetheriaForge, DriftSentinel
- uses: codacy/codacy-analysis-cli-action@d43360362776a6789b47b99ae8973510854e2d3d # master

# Snyk — GovForge, AetheriaForge, DriftSentinel
- uses: snyk/actions/python@9adf32b1121593767fc3c057af55b55db032dc04 # master

# PyPI publish — DriftSentinel and AetheriaForge
- uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0

The two earlier-stage projects, spec-driven-docs-system and sdlc_app, use floating version-tag references (actions/checkout@v6, actions/checkout@v4, actions/setup-node@v4) for their standard setup actions. spec-driven-docs-system SHA-pins the non-standard gitleaks action (gitleaks/gitleaks-action@ff98106e4c7b2bc287b24eaf42907196329070c7 # v2.3.9) while leaving the standard actions on floating tags. sdlc_app pins nothing. Both gaps are documented as known rather than deliberate: the production-project standard has not yet been backported to either earlier-stage project.

SHA-pinning is the practice that most visibly distinguishes a CI configuration that has been audited from one that has been copied from a tutorial. Most tutorials use version tags because version tags are easier to read and maintain. That ease is the same property that makes them mutable. SHAs trade legibility for integrity. The # v6.0.2 comment restores most of the legibility without giving up the integrity. For an agentic project where CI is the independent verifier, allowing the verifier to silently change its behavior is the same class of problem as allowing the agent to modify its own test suite. The SHA is not legibility overhead. It is integrity.
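An audit for floating references can be approximated with a short text scan. The sketch below is an assumption-laden illustration, not tooling from any of the projects discussed: it uses regular expressions rather than a YAML parser, and `unpinned_actions` is a hypothetical helper name.

```python
import re

# Matches "uses: owner/repo[/path]@ref" lines in workflow text.
USES_RE = re.compile(r"^\s*-?\s*uses:\s*([^\s@#]+)@([^\s#]+)", re.MULTILINE)
# A full-length commit SHA is 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list:
    """Return "action@ref" strings whose ref is not a full 40-character SHA."""
    return [
        f"{action}@{ref}"
        for action, ref in USES_RE.findall(workflow_text)
        if not SHA_RE.match(ref)
    ]
```

Run against a workflow file's text, an empty result means every referenced action is pinned. Local actions referenced by path carry no `@ref` and are ignored by this scan.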

The Rule That Makes the Layer Load-Bearing

External validation collapses into theater if the agent's self-report can substitute for CI's report when the two disagree.

The rule that prevents this is stated in the first article in this series and worth repeating precisely: if the agent claims tests pass, CI confirms it; if CI disagrees, the claim is unverified. There is no third case. A PR does not merge because the agent says it should. A PR merges because CI says the gates passed. The distinction is operationally significant: when an agent reports that a branch is ready to merge, the response is not to merge it but to wait for CI.

This rule has to be enforced at the workflow level, not the instruction level. An instruction that says "wait for CI before merging" is a document-layer directive with document-layer enforceability: the agent reads it and acts on it correctly, or it does not. The recommended enforcement mechanism is branch protection — a GitHub branch protection rule that requires all CI checks to pass before a PR can be merged, with no administrative override available to the agent. When configured, that setting exists outside the agent's control and outside the operator's day-to-day attention; if CI fails, the merge button is unavailable. The rule becomes structural rather than advisory.

This independence holds only if the workflow definition and required status checks are themselves protected from agent modification. An agent with repository write access could propose a change to .github/workflows/ci.yml in a PR, but it cannot merge that PR if branch protection requires the existing CI checks to pass first — and it cannot bypass the checks by renaming or removing them without a merge that the required checks themselves would block. That circularity is the structural guarantee.

For agentic workflows specifically, branch protection is the intended analog of the hook: both transform a written policy into a structural barrier. The hook prevents the agent from committing to main directly. Branch protection prevents anyone — agent or operator — from merging a PR without a clean CI run. Together they close the full path from agent action to main: the hook closes the direct-push path; branch protection closes the PR-merge path. Neither is sufficient without the other, and the external-validation layer is what branch protection is designed to enforce. Whether that enforcement is currently wired is a per-project configuration decision; the pattern described here is the target state.
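The rule at the top of this section is small enough to write down as code. This Python sketch is illustrative only: the function names are hypothetical, and the "success" string is an assumed stand-in for whatever value the CI system reports for a passing run, with None meaning CI has not finished.

```python
from typing import Optional


def claim_status(ci_conclusion: Optional[str]) -> str:
    """Status of an agent's "tests pass" claim; only CI can confirm it."""
    return "confirmed" if ci_conclusion == "success" else "unverified"


def may_merge(agent_claims_ready: bool, ci_conclusion: Optional[str]) -> bool:
    """Decide whether a PR may merge.

    agent_claims_ready is accepted but deliberately ignored: the agent's
    word is never an input to the decision.
    """
    return ci_conclusion == "success"
```

Branch protection plays the role of `may_merge` in practice. The value of expressing it this way is seeing that the agent's claim does not appear anywhere in the return expression.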

Facts

The following are measured facts drawn from the development directory and the local workflow configurations of the projects referenced, verified on May 17, 2026. They should be read within the scope of those projects.

Across the six active projects in the development directory, 465 total lines of GitHub Actions workflow YAML are in place across the primary CI workflows: GovForge ci.yml (79 lines, 3 jobs), AetheriaForge ci.yml (72 lines, 3 jobs), DriftSentinel ci.yml (74 lines, 3 jobs), ADWS Pro ci.yml (113 lines, 4 jobs), spec-driven-docs-system ci.yml (97 lines, 3 jobs), and sdlc_app ci.yml (30 lines, 1 job). Additional workflow files — project-sync.yml in GovForge, drift-sentinel.yml in ADWS Pro (50 lines), and separate publish.yml files in AetheriaForge and DriftSentinel — are not included in this count.
The three-job production shape (quality via lint-and-test, static analysis via codacy, dependency scanning via snyk) is present in GovForge, AetheriaForge, and DriftSentinel. ADWS Pro implements the same governance intent with a different layout — test, security, post-merge-signal, and sbom jobs, with quality and coverage checks inline in test and the vulnerability gate in security — plus a separate drift-sentinel.yml workflow. spec-driven-docs-system carries a three-job shape with different tools (smoke, security, isolated-install). sdlc_app carries a single validate job.
The April 20, 2026 GovForge sync record (anchor commit cabee9e72ca57b860bc1a967ec8d40fe9b37cda5) documents the agent-local vs. CI test count divergence: 1,361 passing tests locally (1,159 backend and 202 frontend, with 1 further backend test skipped) vs. 1,152 passing backend tests in the reference CI run (with 8 skipped across 1,160 collected). The 7-test backend delta (local 1,159 vs. CI 1,152) is the LLM-integration test suite, disabled in CI via GOVFORGE_RUN_LLM_TESTS: "0" declared at the lint-and-test job level. The CI run for PR #258 (commit cabee9e, which added 4 frontend tests) was in progress at the moment the sync record was finalized; its CI counts were not yet confirmed at that timestamp.
All GitHub Actions invoked in ADWS Pro, GovForge, AetheriaForge, and DriftSentinel workflow files are pinned to specific commit SHAs rather than version tags. Representative SHA-pinned references are shown in the SHA-Pinning section above; the full set of pinned actions in each workflow file exceeds what is listed there. The # <version> comment annotation appears alongside each SHA to preserve human readability.
spec-driven-docs-system and sdlc_app use floating version-tag references for standard GitHub actions (actions/checkout@v6, actions/checkout@v4, actions/setup-node@v4, actions/setup-python@v6). spec-driven-docs-system SHA-pins the non-standard gitleaks action (gitleaks/gitleaks-action@ff98106e4c7b2bc287b24eaf42907196329070c7 # v2.3.9) while leaving the standard actions on floating tags. Both gaps are documented as known rather than deliberate.
Snyk is configured with continue-on-error: true in GovForge, AetheriaForge, and DriftSentinel — the dependency scan executes and its output is visible in the workflow logs, but a Snyk finding does not block the merge. Codecov is configured with fail_ci_if_error: true in AetheriaForge and DriftSentinel — a coverage upload error is CI-blocking. The codacy job in all three projects has no continue-on-error flag, meaning a Codacy analysis failure causes the check to fail.
The GovForge lint-and-test job declares GOVFORGE_RUN_LLM_TESTS: "0" at the env level and runs a Python 3.11 / 3.12 matrix. No equivalent LLM-test gate appears in the AetheriaForge, DriftSentinel, or ADWS Pro CI configurations at the time of writing — AetheriaForge and DriftSentinel also run a Python 3.11 / 3.12 matrix but do not use an environment-gated test suite.
DriftSentinel runs 416 tests under pytest as of the measurements in the first article in this series (verified April 30, 2026).
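
The count reconciliation in the April 20 sync record can be checked with a few lines of arithmetic. This is only an illustrative sketch using the figures quoted above. The variable names are hypothetical, nothing here is read from the GovForge repository, and it assumes the single locally skipped test is a backend test.

```python
# Figures quoted from the April 20, 2026 sync record; names are illustrative.
local_backend_passed = 1159
local_backend_skipped = 1   # assumption: the one local skip is a backend test
local_frontend_passed = 202

ci_backend_passed = 1152
ci_backend_skipped = 8
ci_backend_collected = 1160

# Total the agent reported locally (backend plus frontend passes).
local_total_passed = local_backend_passed + local_frontend_passed

# Both environments should collect the same set of backend tests.
local_backend_collected = local_backend_passed + local_backend_skipped

# Tests that pass locally but are skipped in CI (the environment-gated suite).
backend_delta = local_backend_passed - ci_backend_passed

print(local_total_passed)       # 1361
print(local_backend_collected)  # 1160
print(backend_delta)            # 7
```

The same 7 shows up as the difference in skip counts (8 in CI, 1 locally), which is what one would expect if the gated suite is skipped rather than deselected.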

Interpretation

The following are engineering judgments drawn from operating the external-validation layer on these projects. They should be read as claims about the author's experience, not universal prescriptions.

The self-reporting problem is structural, not behavioral. The April 20 test delta is not a case where the agent made an error. It is a case where the agent's environment and CI's environment differ in a defined, documented, and deliberate way. The agent's count is true in the agent's environment. CI's count is true in CI's environment. The difference between the two is load-bearing — it reflects a decision about what should and should not gate a merge to main. Without CI, that decision has no enforcement mechanism. The agent cannot know what CI knows, because CI's environment is not the agent's environment and is designed not to be.

The three-job shape is the minimum, not the target. Lint-and-test, static analysis, and dependency scanning together cover the three principal failure modes an agent can produce without triggering a hook: incorrect behavior, code-quality regressions, and supply-chain vulnerabilities. Any one job alone misses the other two. A team that runs only tests will ship code that passes tests and fails static analysis. A team that runs only Codacy will have no coverage signal and no dependency exposure. The three jobs are the minimum surface for an external-validation layer that can plausibly verify an agent's self-report across the dimensions that matter most in a production context.

The continue-on-error: true decision for Snyk is an operational judgment, not a governance gap. Snyk reports against a live vulnerability database. A CVE can be disclosed and Snyk's database updated within hours of a merge. Blocking a merge on a finding with no available fix produces a situation where the project cannot merge until someone patches a transitive dependency that the project does not control. The right response is to surface the finding in CI output and make it the operator's responsibility to triage. Treating continue-on-error: true as a gap misunderstands the tradeoff; treating it as equivalent to not running Snyk misunderstands the value. The scan runs, the output exists in CI logs, and that output is produced independently of the agent's self-report regardless of whether it blocks the merge.

SHA-pinning is the practice that distinguishes a configured CI pipeline from a tutorial copy. The cost of SHA-pinning an action is seconds per action: look up the SHA for the version you want, substitute it in the workflow file, annotate the comment. The benefit is that the CI pipeline's behavior is frozen at the version you chose, permanently, regardless of what the action maintainer does next. For an agentic project where CI is the independent verifier, allowing the verifier to silently change its behavior is the same class of problem as allowing the agent to modify its own test suite. The SHA is not legibility overhead. It is integrity.

The external-validation layer makes one assumption the rest of the stack does not. Every other layer in the governance stack works with the agent: documents guide it, hooks constrain it, agent specialization shapes it. The external-validation layer does not work with the agent at all. It assumes the agent's self-report is not authoritative, and it provides the authoritative answer from outside. That assumption is the one most agentic coding deployments quietly omit, because the agent's self-report is usually right and building a layer that assumes it might not be feels like friction. It is not friction. It is the layer in the stack least exposed to the influence of an adversarial subagent, a misconfigured local environment, a stale cache, an undeclared environment variable, a mutable action tag, or a newly disclosed CVE — and the one whose reports exist independently of whatever the agent says about them. It is the layer that makes the output of the whole stack verifiable.

Practical Implications for Teams Considering the Pattern

If your team has hooks and no external validation, the next step is to wire a CI workflow with at least a quality job, a static-analysis job, and a dependency-scanning job. Each job should run independently, with blocking behavior assigned deliberately: quality and static analysis are good candidates for merge-blocking required checks, while dependency scanning may be better configured as advisory — surfacing findings in CI output without blocking merges on CVEs that have no available fix yet. The recommended enforcement mechanism for the blocking jobs is branch protection configured at the repository level — required status checks that block the merge button until those checks pass — rather than an agent instruction that relies on the agent reading it correctly. An instruction to wait for CI is a document; branch protection is the structural control.

When wiring the workflow, SHA-pin every action you reference. This step is the one most teams defer because it feels like premature hardening. It is not premature. The cost is minutes per repository. The benefit is that your CI infrastructure does not silently change behavior because an action maintainer updated a tag. For a project that relies on CI to independently verify agent output, CI's own stability is not a detail. A workflow that SHA-pins its own actions and then uses those actions to verify agent-produced code is consistent end-to-end. A workflow that uses floating tags is consistent except for the part that matters most.
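
One way to keep SHA-pinning from regressing is a small check that flags any `uses:` reference not pinned to a full 40-character commit SHA. The sketch below is a hypothetical helper, not part of any of the projects described here. It scans workflow text with a regular expression instead of parsing YAML, so treat it as a starting point under that assumption.

```python
import re

# Matches a step reference such as `uses: owner/repo@ref` or `- uses: owner/repo@ref`.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
# A full commit SHA is exactly 40 hexadecimal characters.
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def find_unpinned_actions(workflow_text: str) -> list[str]:
    """Return `uses:` references that are not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        reference = match.group(1).strip("'\"")
        # Local actions (./path) and docker:// images have no git ref to pin this way.
        if reference.startswith("./") or reference.startswith("docker://"):
            continue
        _, _, ref = reference.partition("@")
        if not FULL_SHA_RE.match(ref):
            unpinned.append(reference)
    return unpinned


sample = """
steps:
  - uses: actions/checkout@v4
  - uses: gitleaks/gitleaks-action@ff98106e4c7b2bc287b24eaf42907196329070c7 # v2.3.9
"""
print(find_unpinned_actions(sample))  # ['actions/checkout@v4']
```

Run against each file under .github/workflows, a non-empty result is a reasonable signal to fail a pre-merge check.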

Choose your error-handling flags deliberately. Snyk with continue-on-error: true and Codecov with fail_ci_if_error: true are not inconsistent. They reflect different judgments about what should block a merge and what should surface as a report. The choice is not "block or ignore" but "block or surface." A blocking Snyk finding with no available fix produces a stalled project; a non-blocking Snyk scan still produces independent CI output about the dependency surface regardless of what the agent reported.
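
The "block or surface" distinction can be made concrete with a toy model. The sketch below is not how GitHub evaluates checks internally; it is a simplified illustration in which each job either passed or failed, and a failed job with continue_on_error set is reported without blocking. The JobResult class and evaluate function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class JobResult:
    name: str
    passed: bool
    continue_on_error: bool = False  # mirrors the workflow flag of the same name


def evaluate(jobs: list[JobResult]) -> tuple[bool, list[str]]:
    """Return (merge_allowed, surfaced_findings) for a set of job results."""
    blocking_failures = [j.name for j in jobs if not j.passed and not j.continue_on_error]
    surfaced = [j.name for j in jobs if not j.passed and j.continue_on_error]
    return (not blocking_failures, surfaced)


results = [
    JobResult("lint-and-test", passed=True),
    JobResult("codacy", passed=True),
    JobResult("snyk", passed=False, continue_on_error=True),
]
merge_allowed, surfaced = evaluate(results)
print(merge_allowed, surfaced)  # True ['snyk']
```

The point of returning the surfaced list separately is that an advisory finding still exists as output someone has to triage; it just does not decide the merge.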

If your team has CI but still treats the agent's self-report as sufficient before CI completes, the operational habit to build is: the agent's count does not close the question, CI's count does. This habit is mechanical in principle and harder in practice than it sounds, because the agent's self-report arrives earlier — usually before CI has finished — and it is usually right. The times it diverges from CI are exactly the times the external-validation layer earns its place in the stack. Those times are not rare; they are the scheduled condition of every project that has environment-gated tests, matrix builds, or a dependency surface that drifts faster than local installs.

If you are starting a new project, wire the three-job shape on the first commit alongside the hook and the governance documents. The reference workflow is 79 lines. The SHA-pinning adds one annotation comment per action. The branch protection rule is a repository setting, not an agent instruction. A project that ships its first commit with a working hook, a working CI pipeline, and SHA-pinned actions has answered the three questions engineering leaders actually ask about agentic coding — governance, error rates, and security vulnerabilities — from its first day of operation. Retrofitting this layer onto a project that has been running without it requires re-auditing every previous agent-produced output that was merged on the agent's word alone. Starting governed is the lower-cost path, and it is only available at the beginning.

The five-layer governance stack is complete when all five layers are in place. The external-validation layer is the last one, and it is the one that makes the whole stack verifiable from outside. Without it, the stack is better than documentation alone — the hooks hold, the agents are specialized, the constitution governs the directives. But the output is still self-reported. The external-validation layer changes "the agent says it passed" to "CI confirms it passed." That distinction is what regulated businesses need before they can ship agentic output into a production environment with confidence.

Get the templates

The CI workflow configurations described in this article — the three-job GovForge reference shape with SHA-pinned actions, the branch protection rule guidance, and the Snyk/Codecov configuration patterns — are available as part of the agentic governance starter kit at etherealogic.ai/agentic-governance-stack-templates. The starter kit includes the document-foundation templates from the first article, the protected-branch hook from the second, and the CI workflow from this one.

References

Anthropic Claude Code documentation — Claude Hooks specification and Settings reference.
GitHub — "Security hardening for GitHub Actions" — recommends pinning third-party actions to a full commit SHA to defend against mutable-tag supply-chain risk.
AGENTS.md open standard — agentsmd/agents.md, governed by the Linux Foundation's Agentic AI Foundation.
Codacy analysis CLI action — codacy/codacy-analysis-cli-action.
Snyk GitHub Actions — snyk/actions.
Codecov GitHub Action — codecov/codecov-action.
First article in this series — CLAUDE.md Is Not Enough: The Governance Stack for Agentic Development.
Second article in this series — Exit Code 2: How Claude Hooks Turn Agentic Rules Into Runtime Barriers.

This is the third and final article in the EthereaLogic series on the agentic governance stack. The full five-layer stack — navigation files, constitutional governance, agent specialization, runtime enforcement, and external validation — is available as a drop-in starter kit at etherealogic.ai/agentic-governance-stack-templates.

Why I think Component-Driven Development needs a rethink in the Signal era

Alex — Mon, 18 May 2026 13:23:59 +0000

Component-Driven Development assumed a render model that signal-based Angular has quietly left behind. The tooling has not caught up, and I am not sure simply patching it will be enough.

The moment that made me write this

I was building ng-prism, an Angular-native component showcase tool I maintain, and I needed to update a component's inputs from a controls panel. The naive version looked roughly like this:

const ref = vcr.createComponent(ButtonComponent);
const instance = ref.instance;

instance.variant = 'primary';
instance.label  = 'Save';
ref.changeDetectorRef.detectChanges();

It works on a pre-signal component. On a signal-based component it quietly breaks the component. variant and label are no longer plain properties. They are InputSignal objects, callable as variant(). Angular stores the InputSignal as a plain class field; there is no defineProperty setter or proxy to intercept the write. So the assignment just overwrites the field reference with a string, and Angular never finds out. The next render then blows up the first time the template evaluates variant(), because variant is no longer a function. I verified this against @angular/core 21.2.0: createInputSignal() returns a plain function with a [SIGNAL] symbol attached, no setter trap in sight.

The fix is one line:

ref.setInput('variant', 'primary');
ref.setInput('label',   'Save');

That is the version that lives in ng-prism today, in the renderer effect that drives every showcase:

// packages/ng-prism/src/app/renderer/prism-renderer.component.ts
// Simplified; real version also handles content projection and unknown-input warnings.
effect(() => {
  const inputs = this.rendererService.inputValues();
  const ref    = this.componentRef;
  if (!ref) return;

  performance.mark('prism:rerender:start');
  for (const [key, value] of Object.entries(inputs)) {
    ref.setInput(key, value);
  }
  // setInput() already marks dirty and schedules CD. The explicit
  // detectChanges() here forces it to run synchronously so the
  // performance.mark below wraps the actual render, not just the
  // dirty-marking. Also keeps timings predictable under zoneless.
  ref.changeDetectorRef.detectChanges();
  performance.mark('prism:rerender:end');
});

This is a tiny detail, but it is the symptom of a much bigger thing. The way Component-Driven Development (CDD) thinks about a component (props in, render out, args table on the side) was modelled on a world where component inputs were plain properties. That world is gone in Angular. And once you stop assuming it, a lot of the tooling we have built over the last decade starts to look like it is solving the wrong problem.

The implicit model behind classic CDD

Look at what almost every CDD tool, Storybook included, has agreed on for years:

A component is a pure-ish function of its inputs.
Inputs are configured as a flat dictionary, the “args”.
A “story” is a specific value of that dictionary.
Changing args triggers a discrete re-render cycle.
Addons (a11y, viewport, knobs, controls) hook into that cycle.

In the pre-Hooks era this was the right abstraction. React class components had setState. Angular had @Input() decorators and ngOnChanges. Vue had the Options API. All of them had a clear, discrete moment where a property got assigned, the framework noticed, and the component re-rendered top-down. Args tables map onto that perfectly. Whatever knob you turn becomes a property assignment, which becomes an ngOnChanges call, which becomes a render.

This is also why Storybook’s args model felt so natural for so long. Args are properties. Properties are state. State drives render. Story = scenario = property snapshot.

It was a beautiful, simple model. It is also, for Angular today, the wrong abstraction.

What Signals actually changed

I do not want to retread the “signals are great” territory. What matters for CDD is more specific.

Inputs are no longer properties. An InputSignal is a callable getter. From outside the component, the only correct way to push a value into it is ComponentRef.setInput(name, value). There is no property to assign anymore. Tools that still build their abstractions on “assign this prop” are not even wrong yet; they just produce broken components.

Components are nodes in a reactive graph, not pure functions of inputs. Half of a real-world signal-based component is computed() derived state. Those derived signals are part of the component’s behaviour. They are not inputs, but they are also not opaque internal state. An args table cannot represent them. A story shapes inputs, not derived state.

Lifecycle work is increasingly split between explicit hooks and reactive subscriptions. ngOnInit and ngOnDestroy still exist and are still idiomatic for a lot of cases (subscriptions, teardown, one-shot setup). What has changed is that ngOnChanges is largely irrelevant for signal inputs, and a growing share of the work that used to live in lifecycle hooks now lives inside effect() or computed() that is read by the template. The places where work happens have multiplied, and a chunk of it fires on a much finer-grained schedule than mount → change → destroy. CDD tools that visualise lifecycle as those three states are visualising a real but shrinking slice of what the component actually does.

Re-renders are continuous, not cycle-based. Zoneless Angular still has ApplicationRef.tick() and a ChangeDetectionScheduler. What is gone is the implicit tick driven by zone.js intercepting every async operation. In a zoneless app, change detection runs because the scheduler decided to run it, which in practice is because a signal flagged a view dirty. From the outside that looks less like a single render cycle and more like a graph of signals notifying their dependents at fine granularity. An addon that says “re-run my check on each render cycle” has no clean event to listen to, because there is no single render cycle visible at the public surface of the component.

The pre-signal mental model is not catastrophically wrong. It is just lossy. And every layer of CDD tooling pays the price of that loss somewhere.

Where the tooling lags, concretely

Three examples I have hit while building ng-prism.

1. Controls and args still treat inputs as a flat property bag.

The Storybook args object and ng-prism’s own variant config both look like this:

{ variant: 'primary', label: 'Save', disabled: false }

That is fine as a starting state. It is wrong as a model of how the component actually consumes its inputs. A signal-based input has a default expression, can be required, can be tied through computed() into other state, and can be read multiple times per render. None of that is in the args object. The args object is a snapshot of a tree it cannot see.

2. A11y, perf, and visual-diff addons assume “render happened, now check”.

axe-core, layout measurements, screenshot capture: they all rely on a moment where the DOM has settled and you can inspect it. Angular does give you some primitives here. afterNextRender() fires once after the next render. afterEveryRender() (named afterRender() before Angular 20) fires after every render. ApplicationRef.isStable exposes a stable-state observable. None of those quite match what an a11y or visual-diff addon actually wants, which is closer to “the reactive graph has been quiet for N milliseconds across multiple microtasks, and I am now safe to walk the DOM”. That concept does not exist as a framework primitive. So you debounce, you wait for animation frames, you hope. In ng-prism the built-in a11y audit runs after a 500 ms debounce on signal changes (A11yAuditService.scheduleAudit, default debounceMs = 500), which works in practice but feels like a workaround. I am not sure a true “graph is quiescent” signal even makes sense in a system designed to update continuously.

3. The “scenario” unit is too coarse.

A Storybook story is one set of args. An ng-prism variant is one set of input values. Neither can express “this component embedded in a parent that emits a signal stream over time”. The interesting bug in a signal-based component is rarely the static case. It is the transition. It is the moment when an upstream computed() updates twice in the same microtask and your effect runs once instead of twice. You cannot represent that with { variant: 'primary' }.

What I tried in ng-prism, honestly

ng-prism started life with the same implicit model as Storybook. Components have inputs. Inputs have values. Variants are named tuples of input values. The decorator looks similar:

@Showcase({
  title: 'Button',
  variants: [
    { name: 'Primary', inputs: { variant: 'primary', label: 'Save' } },
    { name: 'Danger',  inputs: { variant: 'danger', disabled: true } },
  ],
})

What ng-prism does that I think is on the right track:

Uses setInput() for every input push, never property assignment. So at least the signal contract is respected. This is the absolute floor and a surprising number of Angular-adjacent tools still get it wrong.
Drives the rendering loop with effect() on a signal of input values, not with a render-cycle event. The renderer reacts to whatever upstream signal happens to change.
Treats the scanner as a build-time concern. Signal inputs are recognised via the TypeScript Compiler API at build time, so the runtime never has to read decorator metadata. The decorator itself is literally a no-op, just a marker.
Supports zoneless via an opt-in flag in the ng add schematic (--zoneless), which wires up provideZonelessChangeDetection() and drops zone.js from polyfills. The default still ships with zone.js, but nothing in the renderer depends on an implicit zone tick; change detection runs because something signal-shaped told it to.

What ng-prism does not solve yet, and where I think the whole category is still stuck:

The variant model is still a flat input snapshot. I cannot describe a component as “embedded in a parent that pushes this signal stream over 2 seconds”.
computed() derived state is invisible in the UI. You see the inputs you set. You do not see the graph that hangs off them. For some components, that graph is the interesting part.
The a11y, perf, and box-model panels all hook into renderer output the way an old addon would. They debounce. They do not subscribe to the actual reactive graph the component is part of.
The “code snippet” feature generates a template string from input values. It cannot show how the component would behave in a parent where one of those inputs is itself a signal.

I am being deliberate about this in the docs. I do not want ng-prism to claim a level of insight into signal-based components that it does not yet deliver.

What I think the next generation should look like

This is the speculative part, so take it as opinion.

I think the unit of CDD will stop being “the component with these props” and start being “the component embedded in a reactive context”. A scenario will look more like a tiny fixture: this component, mounted under this parent, with these signal sources feeding it, over time. Closer to a Playwright scenario than a Storybook story.

I think tooling will need to render the reactive graph, not just the visual output. For a serious component, the dependency graph between inputs, computed(), and effect() is the documentation. We render the box and the controls. We should also be able to render the graph.

I think the addon model needs to flip. Instead of hooking into a render lifecycle that no longer exists, addons should subscribe to specific signals. Visual diff subscribes to the renderedElement signal. A11y subscribes to the same. Perf subscribes to a “graph quiescent” signal that the framework would have to expose. None of this lives in Storybook’s current architecture, and bolting it on is not obviously the right answer.

And I think “args” as a UI metaphor has run its course. A signal-based component is not configured by setting properties. It is wired up. The control surface should reflect that.

Closing

I am not arguing that Storybook is bad, or that the people maintaining it have missed something obvious. They built the right tool for the framework world of 2018, and that tool became a standard. Standards lag by definition. That is fine.

What I am arguing is that the abstraction is starting to leak in ways that matter. Signal-based Angular is a different kind of component runtime, not a faster version of the old one, and the surface area that CDD tooling was designed against has changed underneath it.

I am building ng-prism partly to test that hypothesis in code, not just in prose. I am also fully prepared to find out that I am wrong, that an args table plus setInput() plus a debounce is good enough for 95% of cases, and that the rest is academic. Possible. But I do not think it is, and I would rather find out by building than by predicting.

If you have hit the same kind of friction, or you have a counter-example where the args model still maps cleanly onto a signal-heavy component, I would genuinely like to hear it. That is more useful than another “signals are great” thread.

ng-prism is open source on GitHub. Feedback and counter-examples especially welcome.

Why umask 022 creates 755 folders and 644 files

authur — Mon, 18 May 2026 13:23:49 +0000

If you have ever wondered why umask 022 creates directories like 755 but files like 644, the short answer is:

directories start from 777
regular files usually start from 666
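umask removes (masks out) permission bits from those defaults

So 022 does not mean "set permissions to 022". It means "remove these permission bits from the default mode". Strictly, the bits are cleared (default AND NOT umask) rather than arithmetically subtracted. The subtraction shorthand below gives the right answer only when every masked bit is set in the default, so for files under 027 or 077 think of it as clearing bits, not as octal arithmetic.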

The common example

For directories:

777 - 022 = 755

For regular files:

666 - 022 = 644

That is why a newly created folder often becomes:

drwxr-xr-x

And a newly created file often becomes:

-rw-r--r--

Why files do not start from 777

Directories need the execute bit so users can enter and traverse them.

Regular files usually do not start as executable. That is why the default starting point for many newly created files is 666, not 777.

So this is expected:

umask 022 -> folders 755
umask 022 -> files 644

More examples

umask 002 -> folders 775, files 664
umask 027 -> folders 750, files 640
umask 077 -> folders 700, files 600

002 is common when a shared group needs write access.

027 is stricter: group can read and enter directories, but others get no access.

077 is private: only the owner gets access.
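
The examples above can be reproduced by clearing bits rather than subtracting. The sketch below is a small illustration; default_mode is a hypothetical helper name, and 777 and 666 are the usual requested defaults for directories and regular files.

```python
import stat


def default_mode(base: int, umask: int) -> int:
    """Clear every bit set in umask from the requested base mode."""
    return base & ~umask & 0o777


for mask in (0o022, 0o002, 0o027, 0o077):
    dir_mode = default_mode(0o777, mask)
    file_mode = default_mode(0o666, mask)
    print(
        f"umask {mask:03o} -> "
        f"folders {dir_mode:03o} {stat.filemode(stat.S_IFDIR | dir_mode)}, "
        f"files {file_mode:03o} {stat.filemode(stat.S_IFREG | file_mode)}"
    )

# Plain octal subtraction gives 637 for files under umask 027, which is wrong:
# masking only clears bits, so it can never add the execute bit that 637 implies.
print(f"{0o666 - 0o027:03o}")  # 637
```

The loop prints 755/644, 775/664, 750/640, and 700/600, matching the list above, while the last line shows where the subtraction shorthand stops working.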

chmod vs umask

A lot of confusion comes from mixing up chmod and umask.

chmod changes permissions on existing files or directories.

umask controls the default permissions for new files and directories.

So if a file already exists, changing your umask will not change that file. You would use chmod for that.

If new files are being created with the wrong permissions, then checking the current umask makes sense.
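
The difference is easy to observe in a scratch directory. The sketch below assumes a normal Linux filesystem (not one mounted with uid, gid, fmask, or dmask options) and uses hypothetical file names; mode_of is an illustrative helper.

```python
import os
import tempfile


def mode_of(path: str) -> str:
    """Return the rwx permission bits of path as a three-digit octal string."""
    return f"{os.stat(path).st_mode & 0o777:03o}"


with tempfile.TemporaryDirectory() as workdir:
    previous = os.umask(0o022)  # set the process umask and remember the old one
    try:
        file_path = os.path.join(workdir, "report.txt")
        dir_path = os.path.join(workdir, "shared")

        # Request the usual defaults explicitly: 666 for files, 777 for directories.
        os.close(os.open(file_path, os.O_CREAT | os.O_WRONLY, 0o666))
        os.mkdir(dir_path, 0o777)
        created = (mode_of(file_path), mode_of(dir_path))

        # Changing the umask afterwards does not touch the existing file...
        os.umask(0o077)
        after_umask_change = mode_of(file_path)

        # ...but chmod does.
        os.chmod(file_path, 0o600)
        after_chmod = mode_of(file_path)
    finally:
        os.umask(previous)  # restore the original umask

print(created, after_umask_change, after_chmod)
```

Under those assumptions the file is created as 644 and the directory as 755, the file stays at 644 after the umask changes, and only chmod moves it to 600.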

Quick checklist

When debugging a permissions problem, check these in order:

What user is creating the file?
What group owns the parent directory?
What is the current umask?
Is the filesystem a normal Linux filesystem, or something mounted with options like uid, gid, umask, fmask, or dmask?
Are you trying to fix an existing file, or the defaults for future files?

That last question usually tells you whether you need chmod or umask.

I made a small browser-local calculator for this because I kept checking the same examples repeatedly:

https://webutilslab.com/umask-calculator/?ref=devto

It runs in the browser and is mostly useful for quickly verifying cases like 022, 027, and 077.

Qwen 3.6 enable_thinking — The MoE Pitfall That Broke My Agent JSON Parsing

SleepyQuant — Mon, 18 May 2026 13:21:03 +0000

I lost two hours last week to a Qwen 3.6 quirk that doesn't show up in any quickstart guide. My agent kept returning malformed JSON. Logs showed the model output started with <think> and a 200-token reasoning monologue before the actual JSON I asked for. Parser exploded every time.

The fix is one keyword argument. The frustration is that nothing in the obvious places — model card, MLX docs, generic chat template examples — tells you about it.

If you're running Qwen 3.6 MoE for an agent setup and your structured outputs are broken, read on.

The symptom

I had a tool-calling loop that asked Qwen to emit JSON. Something like:

prompt = "Return a JSON object with keys 'action' and 'target'."
response = generate(model, tokenizer, prompt)
data = json.loads(response)

Worked fine with Qwen 2.5. Broke immediately with Qwen 3.6. The output looked like:

<think>
The user wants a JSON object. I need to think about what action and target make sense.
Let me consider the context...
[200 more tokens of reasoning]
</think>

{"action": "search", "target": "weather"}

JSON parser saw the <think> block as garbage, threw a JSONDecodeError. Easy enough to spot once I logged the raw output. But it took me a while to realize this was a model feature, not a prompt problem.
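
The failure can be reproduced without loading a model. The sketch below uses a shortened, hand-written copy of the output above and shows a defensive strip as a stopgap. strip_think is a hypothetical helper, and it only handles complete <think>...</think> blocks, so a response truncated mid-thought would still fail to parse.

```python
import json
import re

raw_output = (
    "<think>\n"
    "The user wants a JSON object. I need to think about what action and target make sense.\n"
    "</think>\n\n"
    '{"action": "search", "target": "weather"}'
)

# Parsing the raw text fails because it does not start with a JSON value.
try:
    json.loads(raw_output)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True

# Stopgap: drop any complete <think>...</think> block before parsing.
THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove complete <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK_RE.sub("", text).strip()


data = json.loads(strip_think(raw_output))
print(parse_failed, data)  # True {'action': 'search', 'target': 'weather'}
```

Stripping after the fact still pays for the reasoning tokens, so it is a safety net rather than a replacement for turning the behaviour off at the source.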

What's actually happening

Qwen 3.6 ships with reasoning mode on by default. The chat template injects <think> and </think> markers, and the model is trained to fill them with its chain of thought before producing the user-facing answer. For interactive chat, this is sometimes useful: you can show the reasoning to the user or hide it, and the reasoning does measurably improve answer quality on hard problems.

For an agent loop that parses structured output, it's silently destructive. Every response starts with hundreds of tokens you have to strip before you can use the actual answer. And worse, the reasoning length is unpredictable — sometimes 50 tokens, sometimes 800 — so your max_tokens budget gets eaten by thinking instead of output. On a memory-tight Mac running a 35B model already, those wasted tokens also fragment Metal cache faster — separate problem but they compound. (I wrote up the memory side in my MLX memory safety checklist if that's the angle you hit first.)

The fix

In apply_chat_template, pass enable_thinking=False:

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # <-- this
)
response = generate(model, tokenizer, text)

That's it. No <think> blocks, no reasoning preamble, just the answer. JSON parses cleanly. max_tokens budget goes to the actual response.

Where the flag has to go

This took me embarrassingly long to figure out. The flag belongs at template apply time, not at generation time. You can't pass it to model.generate() and have it work. You can't set it as a tokenizer kwarg at load time. It only has effect inside apply_chat_template.

I tried these wrong things first:

# These do nothing — flag is ignored
generate(model, tokenizer, prompt, enable_thinking=False)
tokenizer = AutoTokenizer.from_pretrained(model_id, enable_thinking=False)
model.generate(prompt, enable_thinking=False)

If you've inherited a codebase where chat formatting is wrapped in a custom function, the wrapper probably calls apply_chat_template somewhere. That's the spot. Patch it there.

When you actually want thinking on

For interactive chat where a user reads the response, leaving enable_thinking=True (the default) usually helps. The model is genuinely smarter on multi-step reasoning when it gets to think out loud. Math problems, code debugging, multi-constraint planning — all measurably better with thinking on.

So the rule isn't "always disable." It's "disable for any path where the output gets machine-parsed, keep it on for any path where a human reads it."

In my own setup (a multi-agent local stack on M1 Max — full hardware notes in the 19 GB memory compression writeup), I split into two generate functions:

def generate_for_agent(messages, max_tokens=512):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=False  # parser-safe
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)

def generate_for_chat(messages, max_tokens=2000):
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
        enable_thinking=True  # quality boost for chat
    )
    return generate(model, tokenizer, text, max_tokens=max_tokens)

Two functions, two contexts. Same model, same tokenizer, different chat template flag. Clean separation.

Why the docs don't surface this

This is my speculation, not authoritative, but here's what I think happened. Qwen 3.6 launched as Alibaba's flagship reasoning model. The whole pitch is "thinks before it answers." Disabling that flag in the quickstart would undercut the marketing of the feature itself. So the docs assume you want thinking on by default, and the flag is buried in the API reference, not the first-page tutorial.

If your use case is agent JSON, you'll find this gotcha on day one. If your use case is human chat, you might never need to touch the flag and won't see why anyone would.

It's a real-world case where the default optimizes for the most demo-worthy path, not the most common production path.

Verification

After patching, you can verify the flag took effect by inspecting the rendered template before generation:

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
print(text[-200:])  # tail of the prompt

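You should see the assistant generation prompt with no open <think> marker. If the tail ends in an open <think>, the flag didn't apply, most likely because you're calling a wrapper that doesn't pass it through. One caveat: some templates (the original Qwen3 one, for example) render the disabled state as an empty <think></think> pair, so an empty pair in the tail also means the flag worked.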
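
A template-agnostic variant of the same check is to render the prompt twice, once with the flag on and once with it off, and confirm the two strings differ. If they are identical, something between your call and the template is dropping the flag. This is only a sketch: flag_changes_prompt is a hypothetical helper, and the two stub classes merely imitate a template that honors the flag and a wrapper that swallows it. Their output strings are made up.

```python
def flag_changes_prompt(tokenizer, messages):
    """Return True if enable_thinking alters the rendered prompt."""
    common = dict(tokenize=False, add_generation_prompt=True)
    on = tokenizer.apply_chat_template(messages, enable_thinking=True, **common)
    off = tokenizer.apply_chat_template(messages, enable_thinking=False, **common)
    return on != off


class HonorsFlag:
    """Stub: renders differently depending on enable_thinking."""

    def apply_chat_template(self, messages, enable_thinking=True, **kwargs):
        suffix = "<think>" if enable_thinking else ""
        return "assistant: " + suffix


class DropsFlag:
    """Stub: imitates a wrapper that ignores enable_thinking."""

    def apply_chat_template(self, messages, **kwargs):
        return "assistant: "


messages = [{"role": "user", "content": "ping"}]
print(flag_changes_prompt(HonorsFlag(), messages))  # True
print(flag_changes_prompt(DropsFlag(), messages))   # False
```

With a real tokenizer you would pass the loaded tokenizer object in place of the stubs.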

You can also check by inspecting the first 100 tokens of any response. Reasoning-on output starts with <think>. Reasoning-off output starts with the actual answer.
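
That response-side check is easy to automate. The sketch below assumes the reasoning arrives as a single leading <think>...</think> block, which matches the description above but is still an assumption about the exact output format. split_reasoning is a hypothetical helper name, and the sample response is invented for illustration.

```python
import re

# A leading <think>...</think> block, allowing whitespace on either side.
_THINK_RE = re.compile(r"^\s*<think>(.*?)</think>\s*", re.DOTALL)


def split_reasoning(response):
    """Return (reasoning, answer); reasoning is None if no leading block."""
    match = _THINK_RE.match(response)
    if match is None:
        return None, response.strip()
    return match.group(1).strip(), response[match.end():].strip()


sample = '<think>\nplan the call\n</think>\n\n{"ok": true}'
reasoning, answer = split_reasoning(sample)
print(reasoning)  # plan the call
print(answer)     # {"ok": true}
```

On an agent path, a non-None reasoning value is a hint that the flag never reached the template. An unterminated block (for example, one cut off by max_tokens) is not matched by this pattern and comes back as part of the answer.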

What this isn't

This is specifically Qwen 3.6 behavior. Earlier Qwen versions (2.5 and below) don't have the enable_thinking flag because reasoning mode wasn't a feature yet. Other reasoning-mode models (DeepSeek-R1, the o1 family on the OpenAI API) have similar dynamics but different flags or modes, so check their respective chat templates or API docs.

If your output isn't parsable but doesn't have <think> blocks, the cause is somewhere else. Common alternatives I've hit:

Trailing whitespace or newlines in the response: strip before parsing
Markdown code-fence wrapping around the JSON: strip the opening triple-backtick line (usually tagged json) and the closing triple-backtick line
Model adding explanatory text before/after the JSON: tighten the system prompt with explicit "no preamble, no explanation"

The <think> block fix only solves the reasoning-leak case. The other cases need other fixes.
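
For those three cases, a tolerant parser can absorb most of the damage on the code side. This is a minimal sketch, not a robust JSON extractor: parse_model_json is a hypothetical helper, and it assumes the payload is a single JSON object. Prose that itself contains braces will still break it, so tightening the prompt remains the better fix for the third case.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker
# Optional fence (with or without a "json" tag) wrapped around the payload.
_FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n?" + FENCE + r"$", re.DOTALL
)


def parse_model_json(response):
    """Best-effort parse of a single JSON object from a model response."""
    text = response.strip()              # case 1: stray whitespace/newlines
    fenced = _FENCE_RE.match(text)
    if fenced:                           # case 2: markdown fence around JSON
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # case 3: prose around the object; take first "{" through last "}"
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start:end + 1])


print(parse_model_json('Here you go: {"region": "Asia"} Done.'))
# {'region': 'Asia'}
```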

The smaller lesson

When a new model breaks an existing pipeline silently, the bug is usually in the chat template, not the generate call. The template is the interface between your code and the model's expectations. Most upstream API changes happen there.

For Qwen 3.6, the gotcha is enable_thinking. For the next model in two months, it'll be something else. The diagnostic habit — log the rendered template, not just the response — saves hours over the year.
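
One low-effort way to build that habit is to log the rendered prompt in the same place it is created. The sketch below uses the standard logging module. render_and_log and the logger name are hypothetical, and the tokenizer is assumed to expose the same apply_chat_template call used earlier in this post.

```python
import logging

logger = logging.getLogger("chat_template")


def render_and_log(tokenizer, messages, tail_chars=200, **template_kwargs):
    """Render the chat prompt and log its tail at DEBUG level."""
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, **template_kwargs
    )
    # The tail holds the generation prompt, where a thinking marker would appear.
    logger.debug("rendered prompt tail: %r", text[-tail_chars:])
    return text
```

Calling logging.basicConfig(level=logging.DEBUG) once at startup makes the tail show up in the console; leaving the level at the default keeps production logs quiet.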

If you've hit a different Qwen 3.6 surprise that nobody flags, I'd genuinely like to know. Reply on the post.

Come along for the ride — see me fall or thrive, whichever comes first.