Stories by Jarek Potiuk on Medium

Intention is all you need

Jarek Potiuk — Sun, 19 Apr 2026 21:56:38 GMT

Why, after many years of writing code to automate my maintainer work, I now write English instead — and why the result is better, faster, and more honest.

A few weeks ago, an AI agent I’ve been using to help with Apache Airflow security triaging made a classic mistake.

A researcher reported a vulnerability via a GitHub forwarder. The agent did almost everything right: it picked the correct recipient and wrote a perfectly professional body. But then it hallucinated a subject line and opened a brand-new email thread instead of replying to the original one, and I was politely reminded to keep things in the thread.

Even six months ago, my fix for this would have been to write a pre-commit hook or a brittle Python wrapper around the Gmail API. This time, I just added two paragraphs to a Markdown file.

That file is now what every agent reads to avoid that mistake. It’s also how I build almost all my maintainer tooling now. The reason it works isn’t that the AI is “smart”; it’s that maintainer knowledge has always been about intent, and we finally have a place to write that intent down without burying it in code.

The Body of Work

As a PMC member for Airflow and a member of the ASF security committee, I often do all kinds of triaging. I’m jumping between Gmail, private GitHub trackers, Vulnogram, and mailing lists. Every step requires the kind of judgment that often lives in the heads of a few senior maintainers and the security team.

The volume isn’t what kills you; it’s physically switching and clicking through windows and mindlessly copying and pasting between those different tools, all while keeping context across different reports. Every report has a “gotcha”: maybe the reporter is anonymous, or it’s a contested duplicate, or the reporter provides additional information. Traditional automation fails here because there is no single “if/then” path. It doesn’t need a script; it needs guidance.

The Skill is the Tool

Over the last few months, I’ve moved my entire security workflow into a toolkit of six “skills” powered by agent prompts. They do the heavy lifting: reconciling state between GitHub and Vulnogram, allocating CVEs, and drafting advisories.

But here’s the kicker: almost all of this is just English. It’s a numbered list of steps: look at this, check that, show the human this before you hit send. The only actual code is a tiny Python script for CVE formatting — and even the tests for it look more like English examples than unit tests.

This is quite an inversion of my old workflow.

The Old Way: Notice a pattern → write a script → hook it into CI → spend time fixing edge cases in a script no one else understands.
The New Way: Notice a pattern → realise what you want to do → tell your agent to update the skill it used → review and correct the proposal. If it messes up, add a paragraph to the file or fix the writing.

The second way is faster, but more importantly, it’s more honest. A script is a leaky abstraction of what a maintainer wants; a skill file is the actual intent.

Debugging in English

When an agent fails now, the fix is a rule, not a patch.

In the threading incident, I realized the agent was improvising because I hadn’t codified the “ASF-relay” case. The fix wasn’t a new Gmail-threading library; it was two paragraphs in a file called AGENTS.md. One rule explained that threadId is mandatory, and the other explained exactly how to handle relayed reports.

No new validator, no exception classes. Just English. A few weeks later, when I realized the rule was too rigid, I updated the prose to allow for a fallback. The fix follows the problem.

Architecture by Osmosis

The interesting side effect is that a clean architecture emerged naturally from the writing.

I recently did a big refactor of the toolkit. I realized I had three things mixed together: general security lifecycles, Airflow-specific labels, and tool mechanics (like GitHub CLI commands). Then I asked my agent to refactor it. I told it which abstractions I wanted, and we split the tree into four layers: generic skills, project folders, tool adapters, and a user config.

I didn’t design this on a whiteboard. I just noticed that the prose was not well organized: I found it difficult to know where to look for things, and the different layers of abstraction were mixed together.

In particular, if I wanted to check how something worked, it took me quite a bit of time to find where it was described.

It turned out that refactoring “English” instructions works much like refactoring code, and it has a similar effect.

Closing the Loop

This approach even helps with “AI slop.” We now have a section in Airflow’s AGENTS.md for external contributors using AI. It includes a self-review checklist: don't fabricate diffs, check for N+1 queries, remove unrelated changes. We’re even using these rules to help Copilot review incoming PRs. We’re taking the tribal knowledge senior reviewers have internalized over a decade and finally putting it on paper.

What This Isn’t

To be clear:

It doesn’t replace testing. My JSON generator still has a suite of hard tests.
It doesn’t replace accountability. I still sign off on every change with a Generated-by tag. I'm responsible for the result; the skill just handles the mechanics.
It doesn’t replace review. It just changes what we review. I can read and amend a skill file in seconds, and so can any other maintainer — even if they aren’t a “coder.”

The Transformation

For fifteen years, I’ve tried to turn my expertise into code. But there was always a translation loss, because code wants to tell a computer what to do, whereas I wanted to tell a collaborator what I cared about.

Writing “skills” lets me express my own and my team’s expertise in plain English for other intelligent readers, human or machine. If you’re a maintainer, try it. Pick one tedious task. Don’t write a script. Write a set of instructions (just prose) and point an agent at it. When it fails, don’t patch a bug. Add a paragraph. Or even better, ask your agent to express it for you.

You’ll eventually realize that your “tooling” has become a body of work that is readable, reviewable, and actually reflects the wisdom you’ve picked up over the years.

Intention really is all you need.

Modern Python monorepo for Apache Airflow Ⓡ — Part 4

Jarek Potiuk — Mon, 15 Dec 2025 21:44:32 GMT

Part 4. Shared “static” libraries in Airflow monorepo

In the first three parts of the series (Part 1. Pains of big modular Python projects, Part 2. Modern Python packaging standards and tools for monorepos, and Part 3. Monorepo on steroids — modular prek hooks), I described the problems that maintainers of big Python projects face when trying to modularise them, how modern Python packaging standards help to solve them, and how other tools following the same philosophy and approach, like prek, help us modularise the project and make contributing easy for literally thousands of contributors.

However, if you put this all together (monorepo, modern packaging, and tools like prek), it turns out that it enables us to come up with new ideas and solve some problems that we always wanted to solve but could not in the past. Meet “static” code sharing in Python.

Difficulties of code sharing in Python

In big projects like Airflow, you usually have a number of components that you want to reuse: logging, configuration, settings and observability come to mind. Also, when your components share certain “skills” (ways of handling certain features) and you want to release those components separately, you often want code that is written once and used in many components. While you would normally use some shared libraries for those functionalities, often you want to use “your own specific way” of using them, which you want to share across multiple components. Or you simply want to implement your own custom logic for certain functionality only once, and use it in many places. Classic DRY principles apply here.

However, DRY has its own share of limitations, especially when you want those components to be usable on their own while also being installable together in the same environment. The main problem you have to deal with is Coupling. The classic approach to sharing such functionality in Python is to build a shared distribution in the form of a library, which you release separately and use in those components, installing the library only once.

Code sharing pictured as a shared box used by different tools

And that works fine as long as you want to release and install your components together: all of them use the same version of the shared code and life is good. But… here is where coupling problems start. What happens if you want to release the components independently, each component uses a different version of the same shared library at release time, and you want to run both components and the library they share in the same interpreter? Coupling kicks in immediately.

The Python packaging ecosystem and the way libraries are loaded in the interpreter do not give us an easy answer for what to do when two components in one interpreter want to use the same shared library but expect different versions of it. Simply put, you can only have a single version of a library loaded at a time. Which version do you install then? Will it continue to work when Component A uses version 1.0 and Component B uses version 3.7 of your library? Will 3.7 work for Component A? Initially you can solve the problem by following strict SemVer versioning, never introducing breaking changes, and running your tests against older versions of your libraries.

Problems with classic Python libraries used for sharing

But this escalates quickly as the number of shared components grows: the number of shared libraries you have to test against causes a matrix of compatibility tests to materialise. The matrix quickly grows in size; every additional shared library adds a new dimension, and the number of combinations you have to test against very quickly becomes unmaintainable. You need to maintain compatibility code, and when one of your shared libraries finally starts using another shared library, you give up on even imagining how many combinations you have to test together. It’s impossible to reason about the side effects that different combinations might have.
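To get a feel for how fast that matrix grows, here is a rough sketch. The library names and version lists are made up for illustration; they are not Airflow's real shared libraries or numbers.

```python
from itertools import product
from math import prod

# Hypothetical shared libraries and the versions still in use somewhere.
supported_versions = {
    "shared-logging": ["1.0", "1.1", "2.0"],
    "shared-config": ["1.0", "2.0", "3.0", "3.7"],
    "shared-observability": ["0.9", "1.0"],
}


def compatibility_matrix(versions_by_library):
    """Return every combination of library versions that would need a test run."""
    names = sorted(versions_by_library)
    combos = product(*(versions_by_library[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


matrix = compatibility_matrix(supported_versions)

# The matrix size is the product of the per-library version counts,
# so each new shared library multiplies the number of test runs.
expected_size = prod(len(versions) for versions in supported_versions.values())
print(len(matrix), expected_size)  # 24 24
```

Adding a fourth library with five supported versions would already push this toy matrix to 120 combinations.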

All Python maintainers of any sizeable project had to — at some point in time — deal with this problem of libraries with conflicting requirements for shared code:

A depends on library X in version < 2.0
B depends on library X in version >= 2.0

How do I install A and B together? Classic.
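The conflict can be made concrete with a toy check. This is only a sketch: it compares plain dotted integer versions, while real installers follow the full Python version specifier rules. The available versions of X are invented for the example.

```python
import operator

# Comparison operators supported by this toy constraint checker.
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def parse(version):
    """Turn '3.7' into (3, 7) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraint):
    op, bound = constraint
    return OPS[op](parse(version), parse(bound))


def installable(candidates, constraints):
    """Return the candidate versions that satisfy every constraint at once."""
    return [v for v in candidates if all(satisfies(v, c) for c in constraints)]


available_x = ["1.0", "1.9", "2.0", "3.7"]  # hypothetical releases of library X
needs_a = ("<", "2.0")   # A depends on X < 2.0
needs_b = (">=", "2.0")  # B depends on X >= 2.0

print(installable(available_x, [needs_a]))           # ['1.0', '1.9']
print(installable(available_x, [needs_b]))           # ['2.0', '3.7']
print(installable(available_x, [needs_a, needs_b]))  # []
```

Each component is satisfiable on its own, but no single version of X satisfies both, and a single environment can hold only one.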

Coupling issues when you want to use conflicting requirements for shared libraries

And this is not a new issue of course — different ecosystems solved similar problems in various ways:

The npm ecosystem allows JavaScript packages to use specific versions of shared libraries and install them at the same time in a single JavaScript environment
Dynamic .so libraries in Linux/Unix can have several major/minor versions installed at the same time, and your application might use a specific version of the library

This is again the “classic” solution to the problem; however, it has some drawbacks, namely that the overhead of managing those shared versions remains. You need to keep SemVer compatibility and release your packages appropriately when breaking changes, new features or bugfixes happen. This adds a lot of overhead, and when you have many contributors contributing to the shared code, you need to be extra careful and add guardrails so that breaking changes or new features are not introduced accidentally, and you need to make sure that your components use the right versions when they switch to new features. The libraries need to be installed in all the versions that can be used, and upgrade scenarios balloon. When this approach is used at scale, you might end up with literally 100s of versions of shared packages installed (if you did not know why your “node_modules” folder takes 2GB of disk space, now you know). This is not a perfect solution, and early decisions in Python were made not to handle this in the same way.

So… what can we do in Python when you want to build different components from a single monorepo, and you want to have a shared library there, but you want to be able to use the shared code in the exact version that was in the repo at the time your package was released? And be able to install two components released at different times, each of them effectively using a different version of the shared library? Benefiting from DRY code reuse, but not paying the price of Coupling?

Picture of having the cake and eating it too, for packages

That sounds like the “eat cake and have it too” conundrum, again. We’ve shown how to solve a similar conundrum in Part 2. Modern Python packaging standards and tools for monorepos, so maybe we can do it this time as well?

Solution: Static, shared libraries in Python monorepo projects

Again, we do not have to reinvent the wheel. People have dealt with this for many years, in custom ways and in other systems, and there are some solutions that we can learn from:

The static “.a” libraries in Unix/Linux systems work in exactly the way we want here: when you use a specific version of such a library, the library code is included in the resulting component. When you run the component, you don’t even have to have the library installed separately; it comes embedded in the component you run
For years, Python maintainers solved the problem of using specific versions of libraries by “vendoring-in” those libraries into their projects. They simply copied the library code to a different module inside their code, and modified the library to make sure it imports its packages from that new module. They often stripped the library down, removing unnecessary functionality, but usually in a way that still allowed upgrading the vendored-in code when a new version of the library was released

Those are two different solutions to a similar problem, with the difference that for Linux/Unix system libraries the tooling (compilers and linkers) handles it for you: you declare whether your library should be statically or dynamically linked, and the compiler and linker figure out the rest. In Python, all that work has to be done manually. `pip 25.3` has 19 packages vendored-in in its `_vendor` module, and the pip maintainers even have their own vendoring tool that we tried to use in the past (https://pypi.org/project/vendoring/), but even the description of the tool contains the words “home-grown setup” and the disclaimer “if the project is going to be a PyPI package, it should not use this tool.” Not too encouraging for reuse.

But… we thought that we had a less generic problem of the same kind. We did not want to “vendor-in” external packages; we wanted to share our own libraries between different components. Libraries that we already share in a single monorepo. We already have those projects organised in a workspace: we know where the projects are, and the code is available in the same commit, in the same repo. So we thought it should not be difficult to automatically “vendor-in” (or rather “share-in”) such shared code.

Meet Airflow shared “static” libraries

How Airflow shared libraries work

Initially we wanted to plug into the build backends of Python. With modern packaging we can add build hooks that are executed at different times, and it seemed that we could use the pip vendoring tool and copy Python files, replacing the imports, when those hooks were running. And while we had a running prototype, it was brittle, did not integrate well with IDEs and required additional actions during development, like running `uv sync` in order to copy newly updated code from the shared library to the component using it. The devex of that solution was not good.

And then came the “a-ha!” moment, and with a “hold my beer” I explored a different solution, much more low-tech and based on a prehistoric thing from Linux/Unix: symbolic links. Symbolic links are already heavily used by the “shared libraries” concept there (you symbolically link major library versions to specific minor and patch-level libraries), and we already used symbolic links in Airflow source code for other things. It also helped that Airflow development has POSIX requirements: in order to develop Airflow on Windows, you need to use WSL2 and check out Airflow on a POSIX filesystem anyway, so using symbolic links in our repo was not limited by Windows.

It turned out that when we symbolically link a library module to another (“_shared”) module inside the target component, things work as expected. The shared code can be edited and used directly in the IDE — directly in the module where it is linked. The imports work locally. We could modify the hatchling build backend we used so that in the resulting packages, the symbolic links are actually replaced with copies of the shared library folders.

Left side: code in repo (two different commits, different version) — right side — two components deployed with different shared library versions embedded in their packages.

What you end up with is what you see in the picture above:

In the sources of Airflow, when you check out the repository, all components use and share the same library code (“current version”)
However, when you build a distribution package for a specific component, the shared library code is “frozen”: the symlinks are replaced with a copy of the shared library as it was in the repository at the commit used to build the component
This way, each component can effectively have a different snapshot of the same library, copied inside its own package and used only by this component. If two components installed at the same time were built from different commits of the monorepo, each component will have its own, potentially different, snapshot of the shared library.
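The “link in the repo, copy at build time” idea can be sketched with the standard library alone. This is not the modified hatchling backend that Airflow uses; the directory names (`logging_utils`, `component_a`) and the `build_component` helper are hypothetical, and the sketch assumes a POSIX filesystem where symlinks are available.

```python
import os
import shutil
import tempfile
from pathlib import Path


def build_component(component_src: Path, dist_dir: Path) -> Path:
    """Copy a component tree, following symlinks so linked code gets embedded."""
    target = dist_dir / component_src.name
    # symlinks=False (the default) copies what a link points to, not the link.
    shutil.copytree(component_src, target, symlinks=False)
    return target


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"

    # The shared library lives once in the repository.
    shared = repo / "shared" / "logging_utils"
    shared.mkdir(parents=True)
    (shared / "__init__.py").write_text("VERSION = 'commit-1'\n")

    # The component sees it through a symlink under its "_shared" module.
    component = repo / "component_a"
    (component / "_shared").mkdir(parents=True)
    os.symlink(shared, component / "_shared" / "logging_utils", target_is_directory=True)

    # "Build" the component: the package receives a frozen copy.
    dist = Path(tmp) / "dist"
    dist.mkdir()
    built = build_component(component, dist)

    # The repository moves on to a later commit.
    (shared / "__init__.py").write_text("VERSION = 'commit-2'\n")

    in_package_is_link = (built / "_shared" / "logging_utils").is_symlink()
    frozen_text = (built / "_shared" / "logging_utils" / "__init__.py").read_text()
    live_text = (component / "_shared" / "logging_utils" / "__init__.py").read_text()

print(in_package_is_link)   # False
print(frozen_text.strip())  # VERSION = 'commit-1'
print(live_text.strip())    # VERSION = 'commit-2'
```

The built package keeps the snapshot from the commit it was built from, while the checkout keeps following the current shared code through the link.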

There were some limits and problems to solve:

you could only use relative imports in those libraries
you should not import from the original shared module in a component that uses it; only use the “_shared” module for it
pyproject.toml files had to be synchronized: when it comes to references between libraries and requirements, we should merge the requirements of the component using a library with the library’s requirements, and make sure they do not conflict
how do we explain all the limits and ways to import the code to 1000s of contributors so that mistakes do not slip through the review process?

Well, if you read Part 2. Modern Python packaging standards and tools for monorepos and Part 3. Monorepo on steroids — modular prek hooks, you might even guess what helped us to find solutions:

uv workspaces
prek

Workspaces and `uv sync` give us the “dependency synchronization”: one of the main features of a workspace is that it automatically checks for conflicts between the multiple distributions you have in your repo and errors out when it sees them, and its dependency resolution mechanism automatically figures out a set of dependencies that is good for all distributions.

Prek and its hooks allow us to write simple hooks that parse modified code with an AST parser, figure out the imports, and error out, with clear instructions on what to do, if you imported your code wrongly: when you used a non-relative import of shared code, or when you imported code from another `_shared` module you do not depend on. We even vibe-coded such a hook.

But there is more: prek also makes sure that all the declarations of shared libraries are done properly and consistently. There are a few places in pyproject.toml files that have to be kept in sync with the list of shared libraries (requirements, hatchling configuration), and symbolic links have to be created appropriately. The prek hooks we wrote make sure that all of that is synchronized.

What we ended up with is the concept of shared libraries that are part of Airflow now, and we are gearing up to release our more modularised Airflow, with multiple independent packages, in Airflow 3.2: https://github.com/apache/airflow/tree/main/shared. Currently we have 7 such shared libraries, but there are a few more in the works.

We now have a really nice working mechanism for sharing code following the DRY philosophy, without having to pay the Coupling price.
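A minimal sketch of what such an import check could look like, using only Python's `ast` module. It is not the actual Airflow hook: the function names, the `logging_utils` and `secrets_masker` library names and the `component_a` package are all hypothetical.

```python
import ast


def imported_names(tree):
    """Yield (node, dotted_name) for every name imported in the module."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                yield node, alias.name
        elif isinstance(node, ast.ImportFrom):
            prefix = f"{node.module}." if node.module else ""
            for alias in node.names:
                yield node, prefix + alias.name


def absolute_self_imports(source, library_package):
    """Line numbers where shared-library code imports itself non-relatively."""
    lines = set()
    for node, name in imported_names(ast.parse(source)):
        is_absolute = getattr(node, "level", 0) == 0  # plain `import x` has no level
        if is_absolute and (name == library_package or name.startswith(library_package + ".")):
            lines.add(node.lineno)
    return sorted(lines)


def undeclared_shared_imports(source, declared):
    """Names of `_shared` libraries that are imported but not declared."""
    found = set()
    for _, name in imported_names(ast.parse(source)):
        parts = name.split(".")
        if "_shared" in parts:
            index = parts.index("_shared")
            if index + 1 < len(parts) and parts[index + 1] not in declared:
                found.add(parts[index + 1])
    return sorted(found)


shared_code = """\
import logging
from .formatters import JsonFormatter
from logging_utils.handlers import FileHandler
"""

component_code = """\
from component_a._shared.logging_utils import get_logger
from component_a._shared.secrets_masker import mask
"""

print(absolute_self_imports(shared_code, "logging_utils"))           # [3]
print(undeclared_shared_imports(component_code, {"logging_utils"}))  # ['secrets_masker']
```

A real hook would turn these findings into an error message telling the contributor exactly which import to change, and exit with a non-zero status so the commit is blocked.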

Are you feeling hungry? I’m not. I ate the cake. But I still have it too and we can share it with you.

The Bright Future of Python Monorepos

Looking back, it’s incredible how far modern standards and tooling for Python have come. What was once a custom, often painful, endeavor for projects like Airflow, is now streamlined and efficient. uv as our chosen development environment, with uv sync at its core, has provided a level of stability and reproducibility we could only dream of before. The workspace concept, championed by Astral, has allowed us to truly manage our hundreds of distributions as independent yet deeply connected entities. And it’s now being implemented by others.

When we worked together with Charlie and Jo, they saw Airflow’s adoption as a way to validate the assumptions behind the uv and prek approach, showing that a project of our scale can embrace these tools and even innovate further. We all believe we have proven that even huge monorepos can be managed effectively with the right tools, and that speed and UX are essential for developer productivity and project health.

Moving beyond our custom solutions to adopt industry-standard PEP packaging has been a massive step forward. For us, it’s not just about managing more distributions; it’s about doing it in a way that’s sustainable, scalable, and welcoming to a diverse contributor base.

The future is indeed bright! Modern Python tooling has finally enabled standard and slick ways of managing huge projects, allowing us to focus on building great software rather than battling our build systems.

Bright future of monorepos in Python

One thing that we have to do now is to make sure that those approaches to workspaces, and possibly Airflow’s approach to shared “static” libraries within the monorepo, are discussed, agreed and standardised in the Python community. I hope that next year I will be able to work with the Python Packaging Authority, find a few co-authors, discuss and agree on the workspace solutions with Charlie, Ofek and other people who are involved, and turn the workspace solution into a common standard that other projects will be able to use and a number of other tools might implement.

Modern Python monorepo for Apache Airflow Ⓡ — Part 4 was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.

Modern Python monorepo for Apache Airflow Ⓡ — Part 3

Jarek Potiuk — Mon, 15 Dec 2025 21:44:11 GMT

Part 3. Monorepo on steroids — modular prek hooks

In the two previous parts of the series — Part 1. Pains of big modular Python projects and Part 2. Modern Python packaging standards and tools for monorepos, I described the challenges of big projects that want to go modular, and how modern Python packaging standards and tooling make it easy to have big monorepo-bound projects with literally 100s of distributions.

But this was just the beginning, because it turned out that we had bigger needs, and this modern tooling enabled us to streamline our development efforts and let literally thousands of people (as of this writing Airflow has more than 3500 contributors) contribute to Airflow. All of them should be able to contribute without the fear of breaking anything, and without having to remember special things to take care of when contributing to their parts. Unless they want to work on splitting and separating the distributions themselves, they just work on their part and do not need to know all the details.

Thousands of contributors contributing to Apache Airflow

This is possible partly because of the packaging standards and tools (standard pyproject.toml files in each distribution, and `uv sync` that does what you expect when you are in each distribution), and partly because we make good use of other fantastic tools that we adopted. When we miss something, we can use them to more easily manage the custom solutions we still need, those that have not yet made it into standards or do not have tooling implementing them (yet).

Meet pre-commit management with `prek` hooks and how it enables us to get the most out of our monorepo by introducing the shared libraries concept.

Prek: Monorepo friendly pre-commit hooks

For many years we used `pre-commit`, a great tool used by the Python community that lets you organize and re-use pre-commit hooks to keep your code tidy and run various checks and code processing automatically and efficiently when you commit. It effectively implements the shift-left approach, where checks, fixes and automation run as close as possible to development time, integrating nicely with the development workflow and detecting and fixing problems locally even before your code is verified in CI. For years, that helped us iterate on the code much faster and allowed us to scale our development to thousands of contributors, effectively making the pre-commit hooks the first level of reviews and checks: everything that reviewers could complain about has been meticulously coded into hooks, so that contributors can iterate and fix things on their own, without even bothering the maintainers.

However, at our scale the original `pre-commit` tool showed its shortcomings. We have more than 170 pre-commit hooks in the Airflow repo, and while they usually run fast thanks to the way pre-commit works, when you modified just a small part of the code, the overhead of merely running the hooks was already quite significant. We also did not have much success convincing the maintainer of pre-commit, Anthony Sottile, to add the features Airflow needed, or even to accept our contributions of them, even ones as simple as auto-completion of hook ids on the command line. We were seen as an outlier, and our needs were apparently too much for the pre-commit author to support both scenarios, unlike the “easy for small, scalable for huge” philosophy of Charlie and the Astral team. We needed something that would handle our scale and enable our 170+ hooks to work together in an orchestrated way.

Hundreds of prek hooks working in orchestration to keep Airflow project in order

Meet Jo, the creator of prek. What started as a weekend hobby project (initially named prefligit, which created some typosquatting issues, hence the change) turned into great tooling that Airflow adopted instead of `pre-commit`, and we never looked back. Jo wanted to see what it would take to make a Rust-based 1:1 replacement for `pre-commit`; he shared that he was largely inspired by the work Charlie did with uv and ruff. According to Jo, prek was designed to bring the same speed, modern UX, and monorepo support that uv brought to package management, but to the world of pre-commit hooks. Jo was also very eager to collaborate and respond to our needs, so we worked together to make it happen. As you can see, there is a common pattern here: in Airflow we not only use what others create but also take an active part in shaping those tools.

With prek’s monorepo support, we were able to modularize our static checks, splitting those 150+ scripts into more than 15 modularized configurations. This “divide and conquer” approach dramatically improved the maintainability and performance of the static checks part of the CI/dev pipeline, making it a much smoother experience for contributors. Not to mention the speed and smoothness of installing and managing the execution environments. While the pre-commit author refused to support uv as a way to install and manage the necessary dependencies and relied on pure-pip installation, prek uses uv, which makes the initial development experience far faster at Airflow scale. Instead of waiting several minutes the first time you run pre-commit, setting up the whole prek environment takes tens of seconds at most. This is the difference between something you might use, and something you absolutely want to use.

This led us to this:

Find command showing 11 pre-commit-configs in Airflow repository

We have only just started to separate our pre-commit hooks into the distributions we have, and we do not need one config per distribution as we do with pyproject.toml, but we already see the benefits of this approach and how nicely it plays with our monorepo setup.

Just as with `uv` and workspaces, you can simply do this:

Command showing easy way to run prek in the folder of distribution we want to run prek for

And it will run only the hooks that should be run for this distribution. This helps our contributors focus on what matters to them: if they work on a specific distribution, they can locally and manually run only the subset of hooks specific to that distribution, which makes iterations even faster and the overhead smaller.
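One way to picture how hooks can be scoped to a distribution is a “nearest config wins” lookup. The sketch below is only an illustration of that idea, not prek's actual implementation, and the directory layout is a hypothetical one.

```python
from pathlib import PurePosixPath


def nearest_config_dir(changed_file, config_dirs):
    """Return the deepest directory in config_dirs that contains changed_file."""
    path = PurePosixPath(changed_file)
    # .parents walks from the closest directory up to the repository root (".").
    for parent in path.parents:
        if str(parent) in config_dirs:
            return str(parent)
    return None


# Hypothetical directories that each hold their own hook configuration file.
config_dirs = {".", "providers/amazon", "task-sdk"}

print(nearest_config_dir("providers/amazon/src/hook.py", config_dirs))  # providers/amazon
print(nearest_config_dir("task-sdk/tests/test_x.py", config_dirs))      # task-sdk
print(nearest_config_dir("README.md", config_dirs))                     # .
```

With a lookup like this, a change inside one distribution only triggers the hooks configured for that distribution, and everything else falls back to the repository-level configuration.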

Before prek, our pre-commit checks were a major bottleneck. The long execution time discouraged local runs, leading to more CI failures and longer feedback loops. prek changed all that; the modularization and speed improvements meant developers could run relevant checks quickly and efficiently, catching issues much earlier. This flexibility, combined with easy support for more programming languages, has been instrumental in evolving Airflow’s architecture, allowing us to implement “static” code sharing that heavily relied on `prek` keeping things in order.

This leads us to Part 4. Shared “static” libraries in Airflow monorepo of the series — describing how we solved another “have cake and eat it too” problem with the modern monorepo workspace and prek hooks — sharing code between independent distributions.

Modern Python monorepo for Apache Airflow Ⓡ — Part 3 was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.

Modern Python monorepo for Apache Airflow Ⓡ — Part 2

Jarek Potiuk — Sun, 14 Dec 2025 20:43:24 GMT

Part 2. Modern Python packaging standards and tools for monorepos

This blog post is the second part in the series of posts describing the Apache Airflow Ⓡ approach to developing a big, modular project in a monorepo. Part 1. Pains of big modular Python projects described the challenges of big projects that want to follow a modular structure, allowing people to work on the whole project and on separate parts at the same time. This part focuses on the difficulty of managing custom monorepo approaches, and how modern Python packaging makes it a breeze.

Problems with custom modularity

The custom monorepo approach for Airflow had become a source of significant overhead, as highlighted by Airflow maintainers- my friends Ash Berlin-Taylor, Kaxil Naik and Amogh Desai — but also many other maintainers and contributors had a lot to complain about. When things broke only a few (or even one — me) of us could diagnose and make fixes, there was a lot of spaghetti-code resulting from implementing incremental fixes when we needed to fix things quickly. The sheer amount of custom code to handle our multiple distributions built from a single source tree was overwhelming.

Many discussions with other maintainers made it clear that we needed to move past the existing setup’s limitations, particularly given that we were exploring support for other languages, such as GoLang. The traditional Python packaging landscape, with its numerous distributions, disconnected tooling and “everyone has their own ways”, presented a considerable challenge. The absence of standardized packaging and robust tooling made it necessary for the team to develop custom solutions, which ultimately created substantial friction for contributors.

Also, such a custom approach was only possible for things that had a very well-defined interface. We had a very clear and simple API separating the so-called Providers from the main core of Airflow, but the core of Airflow itself had become an amazing spaghetti of seemingly independent modules and functionalities, and we were already paying the price of leaky abstractions between them, circular dependencies, hidden complexities and “god-like” behaviours inside Airflow. Our modularity was leaking all over the place, and applying the custom solution that we had developed for Providers was not an option. Not even close.

Modular spaghetti code

The Shifting Landscape of Python Tooling

But then the brewing revolution in Python tooling came to the rescue. We followed it very closely and even took an active part in discussing some of those changes. The whole Python Packaging Authority team had worked relentlessly over the last few years, recognizing that the lack of good standards, and even of common terminology and conventions, had been a huge drag on the community, and that standardising many of the approaches was absolutely necessary. This was especially true because more recently created languages, having learned from the past, came out with excellent development and packaging tooling from the start: Rust’s `cargo` was often touted as the “best of the breed”.

The Python Packaging Authority consists mostly of volunteers contributing code and ideas to the Python Software Foundation, so it was next to impossible for them to come up with a project or set of projects that would serve the needs of both small and large users. Tools like `pip`, `pipx`, `hatch` and others were there, but given the capacity of the team and their way of working, which includes accepting contributions from many different contributors, they could not produce a complex, coherent set of tools that solves all the possible edge cases and issues. So the team did something that I see as a very smart choice: instead of developing the tools themselves, they focused their efforts on developing standards that would enable others (commercial players, other volunteers, etc.) to develop tooling that fits and follows the standards, and even to introduce their own ways where the standards were not there yet. And they worked very closely with those who developed those tools, keeping a firm grip on the standards while recognising the needs of the users and tools that those standards should serve. That was a deliberate, albeit slow, effort; however, “slow” in this case meant “well thought out and discussed”. In the last few years many foundational PEPs have been discussed, proposed and approved, laying the foundation for Python projects to use them and for tool authors to build on top of them.

Python Packaging team discusses PEPs (fictional)

This led to significant improvements in packaging standards. It made us not only look at the possibility of removing our custom code, but actually do it, and made us more hopeful that the standards were catching up with our needs. That let Airflow take the first step in improving its structure: switching to the modern, standard ways of configuring our build, including pyproject.toml, separation of packaging tools into frontend and backend, and many other PEPs. The most important of those PEPs are listed in the appendix at the end of the post.

But we needed something more. The key innovation, later introduced by uv, was the concept of a “workspace”: a unified environment where multiple, interconnected Python projects can coexist and be managed efficiently. And while there were several other solutions that attempted to address the problem, none of them built on those modern PEP standards. Many of those solutions required specific ways of writing your code.

This is where tools built on top of the standards, and well-invested VC money, stepped in. We were one of the first users of ruff and uv that came from the Astral team, and the speed and development experience they provided were just fantastic. And the fact that they are based on well-established, PyPA-approved standards made it possible for us to rely on what the tools provided, knowing that others are also implementing the standards and that we are not vendor-locking ourselves (which would be quite ironic for a project of the Apache Software Foundation, whose focus is on making sure our users are not vendor-locked).

Charlie Marsh, the creator of uv and ruff at Astral, shared in many of his interviews and blog posts that his work was driven by a core philosophy: make tools “easy for small projects, scalable for huge ones.” He described how, a few years ago, the Python ecosystem lacked the robust, performant tooling needed to truly empower large-scale monorepos, and how packaging felt fragmented.

When we discussed our needs with Charlie and the Astral team, they stressed that uv was designed to tackle these challenges head-on, focusing on speed and user experience, which is critical because developer friction impedes productivity. Workspace was one of the features necessary to address the challenge of “big projects”: Airflow in the open-source space, but also many internal projects in enterprises, where most big deployments involve gluing together a number of separately distributed smaller projects. We shared our needs with the Astral team, had a few discussions with them and explained what we would need to be able to use the workspace feature. We also spoke with Ofek, the hatch maintainer, who had also thought about adding a workspace feature, but since hatch is volunteer-driven, we knew that uv would probably get there earlier. Several months later the initial version of the workspace feature materialized in uv. Initially the docs even used the Airflow terminology of Providers, which shows that Airflow was seen as one of the important users. A few iterations later we gave it a shot and… we are full-on-workspace now.

Here we are. If you go to the Airflow repository, you will find this:

Find command showing 122 pyproject.toml files in Airflow’s repository

Yep. We have 122 distributions in our repository. And yes, thanks to the uv workspace feature, we can work in any one of those or with all of them together.
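
For readers who want to reproduce that count without `find`, here is a rough Python equivalent. It is only a sketch: the function name `find_pyproject_files` and the list of skipped directories are assumptions made for illustration, not part of Airflow's tooling.

```python
from pathlib import Path

# Directories assumed to never hold real distributions (illustrative list).
SKIPPED_DIRS = {".git", ".venv", "node_modules"}


def find_pyproject_files(repo_root: Path) -> list[Path]:
    """Return every pyproject.toml under repo_root, skipping tooling directories."""
    found = []
    for path in repo_root.rglob("pyproject.toml"):
        relative_parts = path.relative_to(repo_root).parts
        # Keep the file only if none of its parent folders is in the skip list.
        if SKIPPED_DIRS.isdisjoint(relative_parts):
            found.append(path)
    return sorted(found)
```

Calling `len(find_pyproject_files(Path(".")))` from the root of a checkout gives the number of distributions that declare themselves with a `pyproject.toml`.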

We can now truly do what we always wanted:

each distribution in the same repo is separate and isolated
as a contributor, you can work on each distribution separately (cd distribution-folder; uv sync), using only explicitly declared inter-dependencies
you can also work on the whole project and run tests and various checks and refactorings together
you can also have internal distributions that serve as an “integration layer”, relying on several other internal distributions. While we do not use that extensively, it makes it possible to have “integrations” of separate components as separate distributions that “bring them all together”. We currently have “airflow-core” and “task-sdk” as such “integration” layers, even if they bring their own functionality
code does not leak between distributions. After uv sync, you simply cannot import and use code that you have not declared as a dependency
we can easily add more and more distributions. They are fully standard Python distributions, each with its own pyproject.toml; IDEs understand the code and dependencies, and all the modern features of code introspection, auto-completion and agentic AI development work out of the box

If you want to check how it is done in Airflow, you can take a peek at the most important part of the workspace definition, the main pyproject.toml defining the workspace: https://github.com/apache/airflow/blob/main/pyproject.toml#L1330

What you find there is basically all you need to declare your workspace. All the distributions inside are simply standard Python distributions with standard pyproject.toml files, each with its own “src”, “tests” and “doc” folders and basically everything you need for those distributions to be developed standalone, as isolated projects. A distribution only reaches out to other distributions in the repo that it has explicitly declared in its `pyproject.toml`.

All that you need to develop such a distribution is:

Running a single distribution’s tests in isolation from the monorepo components that the distribution does not use

This is all you need to run the tests of the distribution with all the dependencies it needs (and only those dependencies), and nothing else. This also makes sure your virtualenv has all the dependencies, including development dependencies, that are needed to develop the distribution. The development dependencies are added thanks to the “dev” dependency group, which uv automatically adds when you run uv sync.

And the best thing: this workspace feature serves as a model for other tools. The `hatch` tool mentioned before, developed under the Python Packaging Authority, now also has a workspace feature that was largely developed by looking at how `uv` did it, and by learning from some of the decisions made there and from the ways Airflow and other projects started to use the workspace feature. And we could very easily switch to hatch if we wanted to.

But this was just the beginning. While we had a great way to do development and run tests, we had more needs and more tools in use. Namely, we wanted to split our pre-commit hooks (170+ of them) across the distributions as well, to keep the modularity. And then we had an even more complex need: shared code that we want to use across different distributions, without releasing those libraries as regular Python distributions and managing all the versioning between them.

This is explained in Part 3 (modular prek hooks) and Part 4 (shared “static” libraries) of this series.

Appendix. The list of PEPs that shaped the Python packaging landscape in recent years and that we made sure Airflow follows.

PEP-440 Version Identification and Dependency Specification
PEP-517 A build-system independent format for source trees
PEP-518 Specifying Minimum Build System Requirements for Python Projects
PEP-566 Metadata for Python Software Packages 2.1
PEP-561 Distributing and Packaging Type Information
PEP-660 Editable installs for pyproject.toml based builds (wheel based)
PEP-621 Storing project metadata in pyproject.toml
PEP-685 Comparison of extra names for optional distribution dependencies
PEP-723 Inline script metadata
PEP-735 Dependency Groups in pyproject.toml
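
To make one of these standards concrete: PEP 685 says that extra names are compared after normalization, using the same rule PEP 503 defines for project names. A minimal sketch follows; the helper names are made up for illustration, and the extra names in the usage note are only examples.

```python
import re


def normalize_extra(name: str) -> str:
    """Normalize an extra name: runs of '-', '_' and '.' become '-', then lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def same_extra(left: str, right: str) -> bool:
    """Two extra names refer to the same extra when their normalized forms match."""
    return normalize_extra(left) == normalize_extra(right)
```

With this rule, `same_extra("cncf.kubernetes", "cncf-kubernetes")` is true, so tools no longer disagree about whether those two spellings name the same optional dependency set.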

Modern Python monorepo for Apache Airflow Ⓡ — Part 2 was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.

Modern Python monorepo for Apache Airflow Ⓡ — Part 1

Jarek Potiuk — Sun, 14 Dec 2025 20:42:57 GMT

Part 1. Pains of big modular Python projects

This series of blog posts describes the journey of devex, development and packaging tooling in Apache Airflow Ⓡ, in the context of building a huge project with multiple components, where the components are largely independent of each other and can be worked on separately, but also have to work together and be tested together, and where you sometimes want to change many or even all of them at once.

Airflow Workspace with UV and Prek

This series consists of 4 parts. Let’s dive straight into Part 1.

Challenges of huge modular projects

Traditionally in Python, as in other languages, you could attempt to solve this by having separate repositories and treating them as independent, separately developed projects. While that helps when you (or your team) want to work on those components independently, it brings a lot of challenges when it comes to bringing those components together. When you work on those projects separately, the integration effort required to combine them is often huge, or even almost insurmountable.

On the other hand, trying to keep everything in one repo and source tree has other challenges: code and abstractions leak between components, the components implicitly start depending on each other, you often end up with spaghetti code that cuts across all of them, and you quickly stop understanding what is going on when your logic is spread across the whole repo. When you want to install different versions of components at the same time and they depend on each other in implicit ways, that becomes simply impossible, and you end up with a multi-component setup that pretends to be modularised but is in fact a giant monolith.

Can we eat cake and have it too? Let’s find out by looking at the journey of Apache Airflow, where we always followed the monorepo approach and developed custom ways of handling independent components. Thanks to the recent improvements in Python tooling, it has actually become **easy** to eat the cake and still have it, i.e. to have a truly modular application that is kept in a single monorepo, where you can work with either the parts or all of it with equal ease, and where integration is simply embedded in your daily work, so you do not have to pay a separate price for it.

Eating the packaging cake, and having it too

The Monorepo Challenge in Airflow

Managing the CI/DEV environment for Apache Airflow has been my focus for the last five years. Airflow is a decade-old, colossal project. Its sheer size is daunting, featuring over 700 dependencies — a number that often makes Python developers uneasy. While it began as a monolith, we undertook a significant effort in 2020 with Airflow 2 to meticulously separate it into approximately 60 distinct distributions, encompassing Airflow itself and its various providers.

But as the project grew, so did the pain. We’re now releasing close to 100 distributions, often twice a month, and the need to further modularize the “Airflow core” became evident. The goals were clear: reduce complexity, improve maintainer and contributor experience, and embrace a more modern, scalable approach. The community aspect is crucial here: a monorepo offers a unified development experience and shared infrastructure that’s hard to replicate with disparate repositories for a project of Airflow’s scale. At the same time, with projects of this magnitude, people usually focus on a small part of it, so being able to modularise the project in a way that lets you laser-focus on a particular part of Airflow, while keeping everything in sync, is crucial.

If you look at our PyPI repository — you will find that we have 146 projects (distributions) now.

And all of them come from the single repository: https://github.com/apache/airflow

We build and release those distributions regularly, from different branches, and we need to make sure that they are isolated but also that they work together when we install all of them in a single virtual environment.

How we are solving the challenges — head to those parts to find out:

* Part 2. Modern Python packaging standards and tools for monorepos
* Part 3. Monorepo on steroids — modular prek hooks
* Part 4. Shared “static” libraries in Airflow monorepo

Modern Python monorepo for Apache Airflow Ⓡ — Part 1 was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.

Unraveling the Code: Navigating a CI/Release Security Vulnerability in Apache Airflow

Jarek Potiuk — Wed, 13 Dec 2023 00:11:48 GMT

Introduction:

In the ecosystem of open-source development, where lines of code interweave and developers are using more and more complex tools and processes to build and release them, projects like Apache Airflow navigate a delicate balance between innovation and security.

Recently, the project faced a case of a vigilant bug bounty hunter who discovered a flaw within the GitHub Actions workflow. This blog post sends you on a journey — from the revelation of the vulnerability to the meticulous remediation steps taken to fortify the project’s defenses.

Software supply chain security

The Discovery:

The narrative unfolds with the discovery of a detailed bug bounty report, titled “Code Execution in Github Actions workflow allows secret exfiltration.” This report became the lodestar, pointing towards an anomaly within the execution of CI commands during the crucial release preparation phase. Specifically, commands used during the package preparation such as breeze prepare-provider-packages and breeze prepare-airflow-packages came under scrutiny.

CI and release build context:

To appreciate the gravity of the vulnerability, let’s delve into the intricacies of Apache Airflow’s release preparation process. These tasks, integral to the project’s package preparation, traditionally use the breeze command that is also used during CI jobs. The reason for having dedicated commands is simple: reproducibility. The breeze development environment is a wrapper around common actions executed in CI, in the development environment and in the release process. It adopted a Docker containerization strategy for several reasons: avoiding the “works for me” syndrome, eliminating the need for separately maintained virtual environments, facilitating the handling of complex multi-step operations and long commands to build and run containers for testing, and ensuring a controlled environment during tasks like executing setup.py for airflow packages.

The Vulnerability Unveiled:

The crux of the matter emerged with the discovery of a subtle yet impactful typo in the GitHub Actions workflow. This inadvertent oversight allowed code from public pull requests to escape the confines of the container during the pivotal “Build image” workflow, which by its nature had to have write access to the GitHub Container Registry in order to share cached Docker images. The Docker cache immensely speeds up the tests in CI and is essential in setting up the local development environment. While seemingly innocuous, the true peril lurked in the sophisticated realm of Docker cache poisoning.

The Danger:

Docker cache poisoning took center stage as the exceptional danger stemming from the vulnerability. In essence, a malicious actor wielding write access to the project’s cache could manipulate an image. This manipulation involved injecting seemingly legitimate commands while discreetly infusing altered binary code — an elusive act that made detection challenging. The ultimate risk materialized in the compromise of the release manager’s image, potentially introducing malicious code into the packages slated for release.

Assessing the Risk:

Being part of the Apache Software Foundation, the Apache Airflow project has a very sound process of release preparation and verification, with multiple, independent PMC members (a minimum of 3) verifying the provenance and integrity of the released artifacts. It is a multi-stage process where PMC members verify the signatures, checksums and licenses of the released artifacts and check that the packages are generated from sources tagged with a signed tag in the Git repository. So it was likely that any attempt to tamper with the process would have been caught during verification. While the report showed that the vulnerability was real, other safeguards still held. It was not as bad as it could have been if our processes were not sound, documented and meticulously followed, as the ASF processes mandate.
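
As an illustration of just the checksum step, here is a minimal sketch of comparing an artifact against a published SHA-512 file. It assumes the checksum file follows the common `sha512sum` layout (hex digest first, file name after); the function names are hypothetical, and signature and license checks are not covered here.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact: Path, checksum_file: Path) -> bool:
    """Compare the artifact digest with the first token of the checksum file."""
    expected = checksum_file.read_text().split()[0].lower()
    return sha512_of(artifact) == expected
```

A checksum only proves that the file was not altered after the digest was published, which is why it is combined with signature verification and a comparison against the signed Git tag.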

Assessing the risk landscape, it becomes evident that while the danger wasn’t immediate, the potential for exploitation loomed large. The hypothetical attack scenario involved a bug within our CI, an anonymous pull request that did not even have to be approved, and meticulous manipulation to avoid detection. It was a sophisticated dance that was complex to perform, but it could be attempted in a targeted attack against the most popular workflow orchestrator, used by tens of thousands of users, big and small, all over the world.

Funding security improvements of Apache Airflow:

As it happened, the issue was reported just as a team of individual contributors and PMC members had received funding from the Sovereign Tech Fund to improve the security and release processes of Apache Airflow as part of the Contribute Back Challenge (Round 1). The funding was announced on the ASF blog. The initiative is rather important in the light of upcoming security regulations, and is one of the components of a long-term strategy on open source. Since regulations in this area are coming in multiple regions (for example the CRA in Europe, which is at the last stage of negotiations), it is more and more important that there are various models of funding “ground security-focused work” that might otherwise be seen as an afterthought, since people working on OSS projects mostly focus on developing “new features”.

This was a very nice coincidence because the individuals that got the funding have been focused on the very subject and also had the time reserved, plans in progress and money to support the investment to actually implement a lot of improvements in the process to address the issue.

Remediations Implemented:

Now, let’s unravel the layers of strategic remediations that were meticulously implemented to fortify Apache Airflow against analogous threats.

Reducing Reliance on CI Image: A paradigm shift in approach was proposed — reducing heavy reliance on the CI image. The contemplation of leveraging a “generic” Python image was introduced, aiming to achieve the same level of isolation without inheriting potential risks associated with a compromised CI cache.
Process Improvements: Critical process enhancements took center stage. Release managers, entrusted with critical tasks, cut out reliance on the CI image. They were given a new process that uses reproducible builds, official Python images only and a local environment, rather than shared, remote binary CI images. This deliberate move minimized the risk of unauthorized modifications during the vital release and verification process.
Reviewing the build tooling: A review of the tooling we use in CI showed that Docker isolation was not something the Airflow team could depend on entirely. In 2020, the Platypus Attack was revealed, which allowed an attacker to steal secrets from the machine they were running on. While it had been addressed in the general case, Docker containers turned out to be potentially vulnerable to stealing secrets from the host they were running on. Docker released a security advisory on it in October 2023, and once the Airflow team realized that it undermined some of our assumptions regarding this scenario, we immediately upgraded to the latest versions of Docker, which address the vulnerability by disabling access to the powercap device.
Enhanced Verification Methods: A robust upgrade to the verification process was introduced. By incorporating reproducible builds and non-shared-container builds, the PMC members now have simpler, more robust ways to verify the provenance of generated code. Local verification, combined with comparisons against GitHub tags, rebuilding the packages in a reproducible way and byte-to-byte reproducibility added an additional layer of security.
Retrospective Inspection of Past Releases: Acknowledging the potential vulnerability duration, a meticulous retrospective examination of past releases was recommended. This involved comparing code in historical releases to the current state, playing detective to identify and rectify potential tampering — an exhaustive yet imperative audit trail for added security.
Future work: While Airflow has already hardened and improved the release process, the work is not complete yet. Airflow Provider packages already have reproducible builds, but the core airflow package is not there yet: more changes are needed, including incorporating modern Python tooling, to make it happen. The team is on a good track to get there in this, and hopefully the next, round of the “Contribute back challenge”.
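
The idea behind byte-to-byte reproducibility can be shown with a small sketch. This is not how Airflow builds its packages; it is a toy archive builder, with a hypothetical function name, that removes the usual sources of non-determinism (file ordering, timestamps, ownership) so that two independent builds of the same inputs produce identical bytes. The fixed timestamp mirrors the `SOURCE_DATE_EPOCH` convention used by reproducible-build tooling.

```python
import hashlib
import io
import tarfile


def build_archive(files: dict[str, bytes], source_date_epoch: int = 0) -> bytes:
    """Build an uncompressed tar archive whose bytes depend only on the inputs."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.PAX_FORMAT) as archive:
        for name in sorted(files):  # fixed ordering, independent of dict order
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = source_date_epoch  # fixed timestamp instead of "now"
            info.mode = 0o644
            info.uid = info.gid = 0  # no builder-specific ownership
            info.uname = info.gname = ""
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


# Two "builders" listing the same files in a different order.
first = build_archive({"b.py": b"print('b')\n", "a.py": b"print('a')\n"})
second = build_archive({"a.py": b"print('a')\n", "b.py": b"print('b')\n"})
print(hashlib.sha256(first).hexdigest() == hashlib.sha256(second).hexdigest())
```

When builds are reproducible in this sense, a verifier can rebuild the package locally from the tagged sources and compare digests, instead of trusting a binary produced on shared infrastructure.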

Conclusion:

In conclusion, the journey from vulnerability discovery to remediation stands as a testament to the Apache Airflow community’s unwavering commitment to security and code integrity.

This blog post serves as a valuable lesson in vigilance, collaboration, and the continuous pursuit of excellence in code craftsmanship. As Apache Airflow navigates the intricate landscape of open-source development, this blog post stands as a beacon for other projects, urging them to maintain a vigilant stance, foster collaboration, and elevate their code’s resilience. The supply chains and the build and release processes of many open source projects potentially have flaws and weaknesses that could be exploited by malicious attackers.

It’s crucial to keep a tab on your project’s build and release process and tooling. Having funding for individuals who are experts in the projects they voluntarily contribute to is also helpful in making it happen. Even in well established and mature projects there are often things that can be improved and hardened, extra layers of protection can be added and continuous vigilance, quick reaction to raised issues and time to perform deeper analysis are necessary to keep up with security challenges.

Credits

The credits for finding the original issue go to Harish (@d3ku100 on hackerone), not only for reporting it but also for being persistent in explaining the issue and helping us verify the fixes once we applied them.

Appendix:

Due to its limited size, the article does not dive deeply into the details of the vulnerability and remediations, but since Airflow is an Open Source project, those interested in a deeper dive are free to take a look at some of the Pull Requests that implement the improvements mentioned in this article. Also, if there is enough interest, I might write a more detailed post diving deeper into the problems discovered.

Fixing the original typo that caused the issue:

Switching to Python Official images for builds

Switch building airlfow packages to generic images instead of CI image by potiuk · Pull Request #35739 · apache/airflow

Modernizing package building and reproducible builds

Upgrading Docker to avoid Platypus attack

Bump version of docker/docker-compose for stability · apache/airflow-ci-infra@9e31fb0

Unraveling the Code: Navigating a CI/Release Security Vulnerability in Apache Airflow was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.

Data Engineering @ Community over Code conference

Jarek Potiuk — Mon, 03 Apr 2023 17:00:34 GMT

As a follow up from last year, together with Ismaël Mejía we are co-chairing the Data Engineering Track at Community over Code NA (former ApacheCon) in Halifax, Nova Scotia, in October 2023.

You can see the videos from last year:

https://s.apache.org/data-engineering-videos-2022

Following last year’s Data Engineering track, we want to build on what we learned and show our vision for the Track.

Why Data Engineering track and why @ Community over Code?

In the last decade, many distributed databases and open-source projects emerged for processing data at scale. They’ve quickly become the standard tools we use in the industry and became the backbone of modern data processing. However, processing data is not the only task we need to build a reliable and consistent data platform.

The Data Engineering track is about the open-source tools and libraries we use to clean the data, orchestrate workloads, do observability, visualization, data lineage and many other tasks that are part of data engineering. It is about the often-unheard open-source tools that are part of (or integrate with) the open-source data ecosystem and the role they play in the modern data stack.

Call For Presentations closes 00:01 UTC on July 13th, 2023, so there is quite some time yet, but we encourage you to submit your talk now, rather than wait for the last moment!

Why do we think focus on Data Engineering is needed?

In the world of big and small data — data is the king. Fast crunching and processing the data is at the heart of every business. There are plenty of open-source tools that focus on data processing, and they do their job marvelously. Each of the tools is a stepping stone enabling Data Scientists to make good use of the data.

However, to crunch the data in all kinds of organizations in a consistent and repeatable way, you need some ways to keep your data processing in order. Cleaning the data, visualization, orchestrating workloads, observability, data lineage and discovery: generally, those are not easy tasks. On the surface they might look trivial or non-essential, but as your business scales, they all become indispensable, and you need tools and platforms that are engineering-focused rather than data-focused in order to scale your business without hiccups.

The Data Engineering track is all about the indispensable tools you need to use in order to get your data under control. You don’t often hear about the tools and platforms used to keep your data in check from the data scientists and analysts. The goal of those tools is to be invisible and do the job. If your data engineering tools did a good job — you rarely talk about them. So let’s talk about the Data Engineering tools and explain the role they play in a modern data stack.

What projects fall into the Data Engineering umbrella?

We think there are many projects that deserve the attention of Data Engineers. We prepared a selection of such projects which we found relevant. But feel free to bring more of such projects to our attention. Let us know in private messages or comments if you think some projects deserve to be added to the list. Naturally, we come from the Apache Software Foundation, and the ASF projects come first to our mind.

The ASF projects:

Airflow — https://airflow.apache.org/
Atlas — https://atlas.apache.org/
Beam — https://beam.apache.org/
Datasketches — https://datasketches.apache.org/
Dolphin Scheduler — https://dolphinscheduler.apache.org/
Hop — https://hop.apache.org/
NiFi — https://nifi.apache.org/
Skywalking — https://skywalking.apache.org/
Superset — https://superset.apache.org/
Zeppelin https://zeppelin.apache.org/

Non-ASF Open-Source projects

Amundsen and other Data Governance and Discovery tools
Metadata: OpenMetadata, and others
Marquez and OpenLineage, Data Observability tools
Great Expectations, re_data, and other Data Quality tools
JupyterHub/Python (Notebook management)
Multiple Data Visualization Tools
Prefect, Alluxio and other orchestration tools
Your own — not yet known — tool that integrates with the ecosystem

There are also multiple non-open-source projects in this space.

If you want to submit your talk and share your experience, reminder:

Call For Presentations closes 00:01 UTC on July 13th, 2023, so there is quite some time yet, but we encourage you to submit your talk now, rather than wait for the last moment!

Shared volumes in Airflow — the good, the bad and the ugly

Jarek Potiuk — Mon, 25 Jul 2022 08:18:00 GMT

This is my highly personal take on using shared volumes for Airflow to share DAG files (and Plugins — but I will use DAG files to shorten it) between Airflow components.

I know this might be a controversial subject. I have shared my view with a number of people in Airflow Slack and in GitHub Issues/Discussions, and I know that what I write here might be seen as controversial.

But hey, a Medium blog post is a nice way to express your thoughts in a (hopefully) clearer way than an ad-hoc discussion, and the blog is mine, so why not share my opinion here?

I think it’s a good opportunity to describe why I think shared volume is often not the best choice for sharing DAGs and Plugins in your installation. I hope after reading it, you will understand when it might make sense but also when moving from shared volumes to Git Sync might be a good idea.

My view on the subject is that while shared volumes are easy and good to start with, eventually, as your Airflow installation grows and matures, moving to direct Git Sync is the best approach.

Possible evolution of the Airflow shared volume approach

Context

Shared volumes are one of the ways you can share DAG files (and Plugins) among your components. There are a few ways you can use, as we described in the documentation of our Helm Chart. There are also a few other ways not mentioned in our Helm Chart documentation (because the Helm chart is cloud-agnostic). Managed deployments of Apache Airflow, for example, offer object-storage (S3/GCS) synced solutions. But those cloud-based solutions are equivalent to using shared volumes (implemented in our Chart by PVCs, Persistent Volume Claims).

The possible options for DAG sharing essentially boil down to:

pre-baking the DAGs in the Airflow Image
sharing the DAGs via shared volumes (that’s the PVC/GCS/S3 approach)
using Git-Sync to synchronize your DAG files

Pre-baking the images has some obvious drawbacks, mainly that you need to redeploy the image to change or add DAG files. But I personally think that the general benefits and usability of “shared volumes” are quite overestimated by many users. Often, unknowingly, they stick to them even when their installation grows and requires better engineering practices, at which point “shared volumes” are more of an obstacle than a help.

The good

Let’s start with the good things. Why might shared volumes be a good choice?

Simplicity for your users is the main thing that comes to my mind. What’s easier for your users than “folders” that they can “drop their files in”? There is nothing simpler. Just dedicate a volume on a shared network where they can drag & drop their files, or run a cp command to copy the files, and in a few seconds (or minutes, but we will get to that shortly) the files will magically appear in the DAG folder of the scheduler and workers and get executed. And if your users are mostly data scientists, who are used to iterating on their files locally, experimenting, and quickly deploying stuff by just copy-pasting, they do not need to learn any new tools or follow any rigorous deployment workflow: hey, we just copy the file here and … it works.

Sounds cool? Yeah. Because it is cool.

This is how Shared Volumes work

When you have a small-ish installation and a handful of DAGs that are mostly accessed by one user, this is a perfect solution. And yeah, in such a case I’d heartily recommend deploying Airflow with shared volumes.

The bad

But there is a nasty side to this that you do not see at first. When your orchestration needs grow and your team grows, shared volumes start to show their nasty “Hydra-like heads”. As a software engineer without much hair left on my head, I still recall the times when we did the same with our software.

That was not long ago (you will likely not believe it, but it was in the 21st century) when I started to work in a small startup (I will not mention the name here), and to my utter disbelief I found out that we were editing the code directly on a shared volume on a large “company server” without any version control. And the startup owner (a software engineer) claimed that this was “enough”. He was perfectly happy with “we have the regular backup” and “our server has a disk array to keep it”. Yep, 21st century it was. If you write any software in any company nowadays, you would be quite surprised to see this happen. I was, even back then. (Side story: the next day I introduced Mercurial, modern at the time, to keep a bit of sanity in my new position.) It was just scary not to have modern control over your software development practice.

Hmmm, does it remind you of something?

Yeah, DAGs ARE code. If you keep them in a shared folder, how do you keep track of what happens with the DAG code? If you have a team of people working on the same set of DAGs and sharing some code, how would they resolve conflicts? Would they overwrite each other’s code? If you manage a team of people working on your DAGs, wouldn’t you start tearing your hair out in despair over what might happen if they DO start overwriting each other’s code? (I’d certainly start if I had any hair left.)

Enter Data Engineering Best Practices.

In the last few years we’ve seen tremendous growth of maturity in this area, following what happened in software engineering a few decades earlier. Airflow is one of the best examples. There is a reason why it is the most popular, truly open-source orchestrator in the world of data engineering: it promotes good data engineering practices, and it is actually one of the most popular data engineering platforms out there in general.

And you can even see how Data Engineering Best Practices became one of the most important subjects for Airflow users. If you look at the talks of Airflow Summit 2020, Airflow Summit 2021 and finally Airflow Summit 2022, you will see how “Data Engineering Best Practices” are maturing: first as “something we want to do”, then “something we attempt to do”, and in the final year “something we already do and, BTW, if you don’t, you are behind”. I know this for a fact because, as one of the organizers, I watched ALL the talks from ALL the summits (so you don’t have to), and every year I am amazed how our users mature in terms of engineering practices.

Shared volumes do not help with those practices. In some cases (versioned object storage) you can at most keep track of the history of uploaded files. But there is no way to see what changed, who changed it and when. You have no idea which version was used at any given time. You do not know if someone introduced a “fat finger” typo in any of your DAGs.

What you then start to do is introduce those practices (and this is generally a very good idea). You start to keep your DAG files in version control (usually Git), you start to track the history, you start to see who changed what, possibly (and that’s highly recommended) you introduce code reviews, and maybe even (this is fantastic if you do) tests in CI that fail if you detect some DAG problems.
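To make the “tests in CI” idea concrete, here is a minimal sketch of such a check. It is only an illustration: the function name `find_broken_dag_files` and the file names are made up for this example, and it only catches syntax errors using the Python standard library. With Airflow installed in CI, a stronger check would be to load the folder with Airflow’s `DagBag` and fail when its `import_errors` is not empty.

```python
import pathlib
import tempfile
from typing import Dict


def find_broken_dag_files(dag_folder: str) -> Dict[str, str]:
    """Compile every .py file under dag_folder and collect syntax errors."""
    errors: Dict[str, str] = {}
    for path in sorted(pathlib.Path(dag_folder).rglob("*.py")):
        try:
            compile(path.read_text(encoding="utf-8"), str(path), "exec")
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


# Demo with a throw-away folder: one valid file and one with a "fat finger" typo.
with tempfile.TemporaryDirectory() as folder:
    pathlib.Path(folder, "good_dag.py").write_text("x = 1\n")
    pathlib.Path(folder, "bad_dag.py").write_text("def broken(:\n")
    broken = find_broken_dag_files(folder)

# In CI you would fail the build when this mapping is not empty.
print(broken)
```

A check like this runs in seconds on every pull request, which is exactly the kind of safety net a plain shared folder cannot give you.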

All this is great, and if you are growing out of the “small orchestration needs”, applying those practices is very important for your business.

But then there is the next step: you probably still move the DAGs to the shared volumes. And I saw many ways of doing it: someone manually syncing the files, automated scripts that packaged the files and unpacked them to the shared volume, and recently I even heard of an automated process that regularly sent the files over an SSH connection to an AWS instance to put them on an EFS shared volume. This all seems complex and brittle.

Many organizations, when introducing good engineering practices, only do it in the “DAG authoring” part.

Good Engineering Practices AND shared volumes

But what if the organisations also introduced them on the DAG distribution side?

The initial thought when you introduce good Data Engineering practices is:

“How can we put git files into the shared volumes of Airflow?”

But I rather think that in many of those cases the question should be:

“How can we take git files and send them to Airflow?”

Notice the lack of “shared volumes” there? Yeah, that’s intentional. Shared volumes are not necessary in this case, and they actually get in the way.

Enter Git-Sync

Git Sync actually fits in here very nicely.

It removes the middle-man shared volumes completely.
It allows all Airflow components to independently synchronize their code with the GitSync “repository” in a way that is very efficient (Git was created to store and distribute changes in the code) and very flexible.
It plugs into your DAG development workflow. You can designate branches as “Releasable”, tag the releases, and keep track of what has been deployed when.
You can combine DAG code coming from multiple independent repositories into single one via submodules and

Git-sync is a perfect fit for all the modern Data Engineering Practices, making your DAG code directly deliverable to your Airflow.

This is how Git sync works

Do not just take my word for it. I am but a humble committer of Airflow, and you might be surprised that I do not really run Airflow in production myself. But maybe you will believe other users. Here is the fantastic “Manage DAGs at scale” presentation from the Airflow Summit 2022, where Anum Sheraz from Jagex described how they manage 190(!) Git DAG repositories (and are extremely happy with this setup).

So — if you have not thought about removing the shared file system from the picture — you can, because you do not need it any more when you start improving your engineering practices.

But, this should not be the reason for you to switch. There is the famous saying “The fact that you CAN do something does not automatically mean that you SHOULD”. Let me argue why you SHOULD.

The ugly

There is one really ugly part of using Shared Volumes that you don’t realize until your team grows, the number of your DAG files grows, and the files start to change a lot.

The problem is that stability, when you grow, can usually only be bought with (much) more money than you anticipated (and even that has limits).

If you use the cloud (who doesn’t nowadays?) and you are bought into your cloud platform, you would not really deploy your own file system. You would deploy something like (for example) Amazon EFS. Let’s stick to this example, but this chapter applies pretty universally to many other shared volumes like that. The more DAGs you have and the more they change, the quicker you will find out that the very basic, “almost free” offering of EFS is not nearly enough.

Many of our users who had stability problems with their Airflow tracked it down to stability and performance of the underlying filesystem. After some periods of instability they bought many more IOPS and poof! magically their Airflow stability became rock-solid.

Why is this so?

Partly because the customers believe in the magic of shared volumes, and partly because Airflow uses the DAGs folder in a way that reveals that shared volumes are in fact not magical at all.

You need to understand what happens under the hood when you use shared volumes. A shared volume (EFS included) provides you with the ILLUSION (hence the magic) of something that works like a local filesystem. In most cases that illusion seems to work: you display the content of the folder, you read and write your files, and all seems to work as if your files were right there on your local disk.

I am afraid I have to act a little nasty and break the illusion. The illusionist is just pretending this is happening. In fact there are a lot more things going on, and there is no escaping the fact that the files are actually pulled over the network from some kind of storage which is (bear with me) somewhere in the network on a different machine (or usually distributed among multiple machines). The details differ between filesystems, but there is no magic (or rather, quoting Arthur C. Clarke, “Any sufficiently advanced technology is indistinguishable from magic”).

Under the hood, EFS uses NFS (Network File System). While the “Elastic File System” name is cool, and it is sold as a “serverless” solution, in fact it has servers in the Amazon network; they are just managed by the AWS team. If you are interested, here is a nice description of how NFSv4.1 works. EFS simply uses NFS (version 4 of which is standardised by the IETF as RFC 3530).

Airflow Scheduler works in the way (and this is by design) that it continuously scans the DAGs folder and reads all the files there. Continuously. Non-stop. All files.

Let it sink in for a while.

This basically means that your EFS is bombarded all the time by requests to scan and read DAG files. All the time.

If you look at the NFS protocol implementation, all the communication with the servers happens via Remote Procedure Calls (RPC). And they are serialized. This basically means that the more small requests there are, the more time is spent waiting for them one after another. NFSv4.1 has good support for bundling those together, but when you have continuous scan/read commands for multiple files, this can only help a little.

This is what NFS (and EFS) looks like under the hood

NFSv4.1 has a clever trick: the server can grant a delegation to a client for particular files, based on the access patterns, which makes it possible for the client to cache the files and act on them as if they were accessed locally. But the problem with this feature is that you cannot control it from the client side (it is entirely server-based), and there are many factors that can break it (for example, delegation does not work at all if you are behind a NAT gateway, as it requires server callbacks). This is not a well-known fact, and you generally have neither control over nor knowledge of whether it is used.

Even assuming your local EFS client has some cache to store the files, if that cache is not big enough and you have more files in your folder than fit in it, the local cache will be continuously evicted and the files, even those that were “delegated” to you, will be downloaded again. From the user’s perspective it is a bit of magic. You open a file, read it, and it looks like the file was locally available, but with Airflow’s pattern of scanning and rescanning the folder and re-reading all the files continuously, all the files might actually have to be continuously downloaded over the network.
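A toy simulation shows why a cache that is slightly too small does not help with this access pattern. This is a sketch under stated assumptions: it assumes a least-recently-used (LRU) eviction policy and a scanner that reads every file in the same order on each pass. Real EFS/NFS client caching is more complicated, and `cache_hit_rate` is a hypothetical helper, not part of any library.

```python
from collections import OrderedDict


def cache_hit_rate(num_files: int, cache_capacity: int, scans: int) -> float:
    """Simulate repeated full-folder scans through an LRU cache of file contents."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    total = 0
    for _ in range(scans):
        for file_id in range(num_files):
            total += 1
            if file_id in cache:
                hits += 1
                cache.move_to_end(file_id)  # mark as most recently used
            else:
                cache[file_id] = True  # "download" the file into the cache
                if len(cache) > cache_capacity:
                    cache.popitem(last=False)  # evict the least recently used file
    return hits / total


# Cache holds 90% of the files, yet every single read misses.
too_small = cache_hit_rate(num_files=1000, cache_capacity=900, scans=10)
# Cache holds all files: only the first scan has to download them.
big_enough = cache_hit_rate(num_files=1000, cache_capacity=1000, scans=10)
print(too_small, big_enough)
```

With a cyclic scan, each file is evicted shortly before it is needed again, so a cache covering 90% of the folder behaves like no cache at all.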

Even if you look at AWS EFS performance tips they mention that local file caching might be enabled but it has no impact on latency, which means that anyhow the EFS servers are contacted with every single access to every single file:

The distributed nature of Amazon EFS enables high levels of availability, durability, and scalability. This distributed architecture results in a small latency overhead for each file operation. Because of this per-operation latency, overall throughput generally increases as the average I/O size increases, because the overhead is amortized over a larger amount of data.
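The amortization argument in the quote can be worked through with simple arithmetic. The numbers below (2 ms of per-operation latency, 100 MB/s of bandwidth) are illustrative assumptions, not measured EFS figures, and `effective_throughput_mb_s` is a made-up helper name.

```python
def effective_throughput_mb_s(io_size_kb: float, latency_ms: float, bandwidth_mb_s: float) -> float:
    """Throughput of one operation: payload divided by (fixed latency + transfer time)."""
    size_mb = io_size_kb / 1024
    transfer_seconds = size_mb / bandwidth_mb_s
    total_seconds = latency_ms / 1000 + transfer_seconds
    return size_mb / total_seconds


# A small DAG file (4 KB): the fixed latency dominates.
small_file = effective_throughput_mb_s(io_size_kb=4, latency_ms=2, bandwidth_mb_s=100)
# A large file (1 MB): the latency is amortized over much more data.
large_file = effective_throughput_mb_s(io_size_kb=1024, latency_ms=2, bandwidth_mb_s=100)
print(f"{small_file:.1f} MB/s vs {large_file:.1f} MB/s")
```

Under these assumptions the small read gets only a tiny fraction of the available bandwidth, which is why many small DAG files are the worst case for this kind of filesystem.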

There is also an interesting observation you can make (this is a little side-comment). Do you know what you are paying for when you use EFS (and generally other similar distributed volumes)? Yes, it’s mentioned above: reliability, durability, scalability, distributed architecture. This sounds really cool. But … think for a while. If you follow good engineering practices and keep your files in a “solid” Git server, do you ACTUALLY need any of those? Your Git server is already reliable, durable and scalable. It is likely already highly distributed and quickly accessible from anywhere (if not, then with the pandemic and your employees distributed all over the world, it generally should be). Do you need any of those on a filesystem that JUST keeps snapshot copies of something you already version, store, and back up? Do you really think you should pay for all those features you actually don’t need?

Coming back to the main topic: the DAG files are, by their nature, rather small. Or they should be, if they are not. DAGs are code, and the Good Engineering Practices (when you apply them) are very explicit about it: keep your Python modules small.

This basically means that your EFS needs all the IOPS it can get to deliver the scalability, reliability and distribution (none of which you actually need) in order to sustain the constant pressure. The more your system grows, the more you will experience increased latency. The more good engineering practices you implement for your DAG code, the worse it gets. And when it does not have enough IOPS, nasty things happen.

Enter atomic updates

When EFS does not get enough IOPS to sync files, what you can observe is that some files are refreshed with a delay. And the problem is that filesystems like EFS do not provide “whole DAG folder” consistency. If you have delays in networking (not enough IOPS) and a lot of changes in your files to distribute, it is pretty normal that some files have a newer version and some have an older one.

Imagine your DAG:

Example DAG using shared function

The “my_company.common_code” is a module that is shared between multiple DAGs.

Imagine one of the DAG authors renames the function from “shared_function” to “shared_util”. The change is made in both the DAG file and the “common_code” file, and the files get copied to the EFS.

Surprisingly, what might happen next is that your Airflow component ends up in a situation where the first file is still old and the second file is new.

What happens then? It depends. In the Scheduler, you get an import error. In a worker, you get a task failure. And both cases are pretty mysterious, because locally you can see that everything is correct.

This is because shared volumes do not guarantee atomic changes to more than one file. If this happens, you might end up in a situation where your Airflow is not stable and starts to have more and more random failures: the more DAGs and changes you have, the worse it gets.
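You can reproduce this kind of failure locally without any shared volume. The sketch below fakes a half-synced DAG folder in a temporary directory: the DAG file is still the old version, while the common code has already been updated. The file contents are my assumption of what such files could look like (a real DAG file would also define the DAG itself), and `my_dag.py` is a made-up file name.

```python
import importlib
import pathlib
import sys
import tempfile

with tempfile.TemporaryDirectory() as dag_folder:
    package = pathlib.Path(dag_folder, "my_company")
    package.mkdir()
    (package / "__init__.py").write_text("")
    # NEW version of the shared module: the function is already renamed.
    (package / "common_code.py").write_text(
        "def shared_util():\n    return 'shared result'\n"
    )
    # OLD version of the DAG file: it still imports the old name.
    pathlib.Path(dag_folder, "my_dag.py").write_text(
        "from my_company.common_code import shared_function\n"
    )

    sys.path.insert(0, dag_folder)
    importlib.invalidate_caches()
    try:
        importlib.import_module("my_dag")
        import_error = None
    except ImportError as exc:
        import_error = str(exc)
    finally:
        # Clean up so the demo leaves no trace in the interpreter.
        sys.path.remove(dag_folder)
        for name in ("my_dag", "my_company.common_code", "my_company"):
            sys.modules.pop(name, None)

# The message names the function that "does not exist", although both
# files look perfectly fine in the author's Git checkout.
print(import_error)
```

The confusing part is that nothing is wrong with either commit: the error exists only in the mixed state that the component happened to see.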

Now, what happens if your company is growing and you start experiencing it? Yep, you start to buy more and more IOPS, even though this was not initially needed or anticipated (you anticipated needing more storage, but you never expected to need more IOPS).

This is almost inevitable when you grow. And I’ve heard this story a number of times from our users.

Git sync to the rescue

Yep. You guessed it. Git-sync is free of both of those problems.

You only need local volumes for the Scheduler when you use Git-Sync. Those local volumes can be scanned as fast as you want, as often as you want, and you are all but guaranteed that you will not be surprised by anything other than needing more storage capacity as you grow.
Git-Sync has built-in atomic updates. You are guaranteed that what you see in your filesystem DAG folder at a given time will stay like that. If your DAG File Processor starts parsing a DAG, it will parse it with a fully consistent version of the DAG and all other files — linked together with the Git Commit they came from.
Your workers benefit from the same atomic consistency and local volume access: once git-sync has done its job, there is NO MORE network communication involved when your worker picks up another task and parses the DAG file to run your code. This all happens locally within the worker!
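The atomic update can be pictured as a symlink swap. As far as I understand it, git-sync checks each revision out into its own directory and then flips a symlink to point at it; the sketch below imitates that idea with the standard library only. The `publish` and `read_dag_folder` helpers, directory names and file contents are made up for illustration, and the code assumes a POSIX filesystem, where renaming over an existing symlink is atomic.

```python
import os
import pathlib
import tempfile
from typing import Dict, List


def publish(root: pathlib.Path, commit: str, files: Dict[str, str]) -> None:
    """Write a full snapshot into its own directory, then switch to it in one step."""
    checkout = root / commit
    checkout.mkdir()
    for name, content in files.items():
        (checkout / name).write_text(content)
    tmp_link = root / "dags.tmp"
    os.symlink(commit, tmp_link)  # new link, not yet visible as "dags"
    os.replace(tmp_link, root / "dags")  # one rename: readers see old or new, never a mix


def read_dag_folder(root: pathlib.Path) -> List[str]:
    """Read every file through the 'dags' link, as a scheduler or worker would."""
    return sorted(path.read_text() for path in (root / "dags").iterdir())


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    publish(root, "commit_aaa", {"my_dag.py": "v1", "common_code.py": "v1"})
    first = read_dag_folder(root)
    publish(root, "commit_bbb", {"my_dag.py": "v2", "common_code.py": "v2"})
    second = read_dag_folder(root)
    current_target = os.readlink(root / "dags")

print(first, second, current_target)
```

Because the switch is a single rename, there is no moment when the DAG file comes from one commit and the common code from another.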

There are people who question the performance of Git-Sync vs. shared volumes. But I think they discount the fact that the Git protocol was designed from the ground up to track and share changes to source files and is highly, highly optimized for that purpose, and that the modern services for sharing your code (GitHub/GitLab) are built with scalability in mind. They also discount the fact that Git-Sync only needs to sync changes to DAGs, and ONLY when they change. The trade-off Git-Sync makes is pulling “some changes” rarely vs. “continuous EFS scanning and downloading”. Those two are a few orders of magnitude apart.
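A back-of-the-envelope calculation illustrates the gap. All numbers here are assumptions picked for illustration (1000 DAG files, a full rescan every 30 seconds with no effective caching, 20 commits a day touching 5 files each), not measurements from any real installation.

```python
def file_reads_per_day(num_files: int, scan_interval_seconds: float) -> int:
    """Files read over the network per day if every scan re-reads every file."""
    scans_per_day = 24 * 60 * 60 / scan_interval_seconds
    return int(num_files * scans_per_day)


def files_synced_per_day(commits_per_day: int, files_changed_per_commit: int) -> int:
    """Files transferred per day when only changed files travel over the network."""
    return commits_per_day * files_changed_per_commit


shared_volume_reads = file_reads_per_day(num_files=1000, scan_interval_seconds=30)
git_sync_transfers = files_synced_per_day(commits_per_day=20, files_changed_per_commit=5)
ratio = shared_volume_reads / git_sync_transfers
print(shared_volume_reads, git_sync_transfers, ratio)
```

Under these assumptions the shared volume serves millions of file reads a day, while the sync only moves about a hundred files.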

Conclusion

OK, that was a bit of a long one, but let me summarize it:

When you start small and you care about the convenience of your DAG authors uploading their changes, shared volumes might be a good option.
But when you grow and want to apply good engineering practices, instead of leaving the shared volumes as a middle-man between Git and Airflow, better use Git-sync. You will end up with a simpler, more stable and, most of all, cheaper solution that will not surprise you with a sudden cost increase when you grow.

Let me just summarise it with this image.

Shared volumes in Airflow — the good, the bad and the ugly was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.

Magic Loop in Airflow — reloaded

Jarek Potiuk — Sat, 23 Jul 2022 19:49:26 GMT

Quite recently, we published in Airflow the Magic Loop blog post by Itay Bittan who was inspired by our discussion in Airflow Slack on how he could improve execution delay for dynamic DAG parsing.

This is a follow-up and, well, even more magic is involved (or in fact complexity, but more about that later). It describes how “simple magic” can become even more magical, until finally it gets so magical that you decide to replace the “magic” with a “robust product”, because the magic is a bit too, well, magical.

And it comes with a “Community over Code” twist, so if you are interested in how a true Open-Source community works, read on.

Image credit: https://freesvg.org/benbois-magic-ball

The story begins

It all started with a Slack discussion, “I see delays in our task execution”, started by Itay Bittan, with me trying to understand the problem and trying to help. There were a few observations we made during the discussion:

they create many DAGs in a loop (1000s) in a single DAG file
when any task is executed they experience a ~2 minute delay for parsing all the DAGs (because every task execution effectively has to FULLY PARSE the DAG file it comes from)
each of the DAGs is independent of the others and creating them has no side effects

I suggested implementing a “magic” loop: skip the creation of all the other DAGs when a task is being executed (unlike when the DAG Processor parses the file for scheduling). And BTW, the “magic loop” name was invented by Itay.
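To give a feel for the idea, here is a minimal sketch of such a loop. It is not Itay’s code: the helper names, the `customer_*` DAG ids and the way the DAG id is detected are my assumptions for illustration. It relies on the task being started as a separate `airflow tasks run <dag_id> <task_id> ...` process, so the DAG id can be read from the command line; it does not cover other ways of launching tasks.

```python
from typing import List, Optional, Sequence


def dag_id_from_argv(argv: Sequence[str]) -> Optional[str]:
    """Return the DAG id if the process was started as `airflow tasks run <dag_id> ...`."""
    args = list(argv)
    if len(args) > 3 and args[1:3] == ["tasks", "run"]:
        return args[3]
    return None  # scheduler, DAG processor, or anything else: parse everything


def generated_dag_ids(customers: Sequence[str], argv: Sequence[str]) -> List[str]:
    """The 'magic loop': build only the DAG that the current task belongs to."""
    current_dag_id = dag_id_from_argv(argv)
    built = []
    for customer in customers:
        dag_id = f"customer_{customer}"
        if current_dag_id is not None and current_dag_id != dag_id:
            continue  # skip the expensive creation of unrelated DAGs
        built.append(dag_id)  # a real DAG file would construct the DAG object here
    return built


# In a real DAG file you would pass sys.argv; fixed lists keep the demo reproducible.
task_run = generated_dag_ids(
    ["a", "b", "c"], ["airflow", "tasks", "run", "customer_b", "my_task", "run_1"]
)
full_parse = generated_dag_ids(["a", "b", "c"], ["airflow", "scheduler"])
print(task_run, full_parse)
```

When the file is parsed for scheduling, nothing is skipped, so the scheduler still sees all the DAGs; only task execution takes the shortcut.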

The solution

Not much time passed before Itay implemented it and nicely described how it worked for them: they got a whopping 120 ms instead of a 2 minute delay when executing the DAG.

That sounded like a FANTASTIC return on a simple solution. And my immediate thought was: yeah, there are plenty of users out there who must have a similar problem, so let’s help them by making it a “1st class citizen” in Airflow. I wrote a devlist post about it, after some time got some responses, and finally decided to add guidelines about it to the official Airflow documentation. But since this is the “product” documentation, I wanted to make it “product” quality.

Now, if you are familiar with the “solution” vs. “product” difference, what happened next reflected it exactly. Over the years, I’ve learned that it is a very, very difficult journey to start with a simple solution, turn it into a reusable one, and finally turn it into a product. The rule of thumb I’ve learned is:

solution to a specific problem = costs you x
making it reusable = costs you 3x
turning it into a product = 3x making it reusable = 9x the original cost

And boy … how foretelling and UNDERESTIMATED it turned out to be this time.

Making it reusable

When I implemented the documentation Pull Request about it, I not only had to describe it, but also review and test the code and what I came up with was not what I was expecting initially.

Initially the code of Itay was — essentially — this:

Original code solving the problem

Looks simple enough, right? I thought: yeah, I will have to add a little robustness, and it should be easy; let me just make sure that all the configurations are well covered. Some of my fellow Airflow committers (like Felix Uellendal) encouraged me to continue when I posted the first version, but then the vigilance of others (in this case Bas Harenslak) kicked in, and the question was asked: “are all cases covered here?”

I eagerly attempted to review and check the Airflow code, and … a few hours later, what was literally a few lines of code described by Itay in his post turned out to need, unsurprisingly, way more than 3x the amount of code (but it was well tested and quite robust, I thought).

The original code only worked for the specific configuration they had at Itay’s company. They used the Kubernetes Executor, and hacking it to retrieve the current DAG and task was simple. However, Airflow is more than that. Airflow can run with the Kubernetes Executor, Celery Executor, Local Executor or CeleryKubernetesExecutor; the last three can run tasks either by starting a new interpreter or by forking; and there can be custom executors as well. After looking at the code, digging in and running some tests (and a few iterations of those), I came up with this:

Reusable solution of the problem.

Hmm… it’s even more than 3x as complex, and in parts it is based on reading the current process title (the only way to read the information in the “fork” case that I could rely on). And it was not perfect either. There were at least a few (not very likely) edge cases I could see that would break it (and fall back to the original “all DAGs” parsing).

But, well, there is also a probability that some of this will actually fail (hence the try/except Exception around it).

I proceeded with proposing the PR, but there was something worrying about it in the back of my head. Suddenly the idea of “simple hack” turned into “really complex hack”.

The power of community

While I got some “great”, “fantastic improvement” comments as part of the review, there were also some other comments.

First, Ping Zhang, one of the Airflow committers, mentioned something that I also wanted to do: we should turn it into a “reliable feature” of Airflow, reliably passing the context in which the DAG is parsed. And yeah, I had actually proposed it from the beginning. Then Ping mobilized me to make another PR with a “proper” implementation (but more about that later).

But then there was a wake-up call. One of the committers and PMC members, Jed Cunningham, expressed his distress about it. First very gently, but when I iterated, asked for more feedback and tried to address what I thought was the problem, he came back with something like “I am not sure if I am able to express it, but …”.

And after reading it — I couldn’t agree more.

First of all, this is a lot of code to copy & paste, and it will outlive the PR. If we make it a part of the official documentation, it will stay there. It will become part of the product, and suddenly we will have to start maintaining it and somehow keep compatibility. Even if I added an “experimental” label, we all know that those “experimental” bits tend to remain in the code forever. I was already on the verge of “should we really do that?”, and Jed’s comments were the last straw.

After reading and re-reading it, I decided: “no, we cannot actually put it into our documentation”. But I can write a blog post about it (you are reading it right now). We should not make it part of the product: Airflow is definitely a product, not a solution, and we should treat it with all the seriousness it deserves. So I knew I would have to close the “documentation” PR and work more on the “product” approach. There you go.

Turning it into a product

Finally the “product” version of the PR looks like this:

Proper “product” implementation

Yep. This is the PR that implements the same feature in the “product” way. This is the change that provides appropriate context to parsing. And it has it all:

it contains all the tests and harnesses that protect against various edge cases
it provides a future-compatible API implementation that we will be able to take care of and maintain in the future
it is implemented in a way that even allows a “python __future__” kind of approach, so that we could have it backported to an earlier version of Airflow

This is even more than 9x the original code. Yet it also provides our users with a very simple and straightforward, Pythonic API that they can rely on when optimizing their DAGs.
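For reference, in released Airflow versions (2.4 and later) the parsing context is available through `get_parsing_context()` from `airflow.utils.dag_parsing_context`. The sketch below shows how a dynamic DAG file can use it. The `dag_ids_to_build` helper and the `generated_dag_*` names are hypothetical, and the `ImportError` fallback is only there so that the snippet also runs where Airflow is not installed.

```python
from typing import List, Optional, Sequence

try:
    # Provided by Airflow 2.4+; tells the DAG file which DAG (if any) is needed.
    from airflow.utils.dag_parsing_context import get_parsing_context

    def current_dag_id() -> Optional[str]:
        return get_parsing_context().dag_id

except ImportError:
    # Fallback for environments without Airflow: behave like a full parse.
    def current_dag_id() -> Optional[str]:
        return None


def dag_ids_to_build(all_dag_ids: Sequence[str], selected: Optional[str]) -> List[str]:
    """Keep every DAG id on a full parse, or only the selected one during task execution."""
    return [
        dag_id
        for dag_id in all_dag_ids
        if selected is None or selected == dag_id
    ]


all_ids = [f"generated_dag_{thing}" for thing in ("one", "two", "three")]
# In a real DAG file you would create a DAG object for each id returned here.
to_build = dag_ids_to_build(all_ids, current_dag_id())
only_two = dag_ids_to_build(all_ids, "generated_dag_two")
print(to_build, only_two)
```

The DAG file no longer needs to know which executor is used or how the task process was started; it only asks Airflow for the context.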

And this is likely what we are going to go forward with. But it is going to be only available in the next version of Airflow, because even if we want to help our users, we cannot do it in a way that will be unsustainable. Finally Felix commented on the PR something along the lines of “this is so much cleaner and better, indeed it was a bad idea to share it in the original form”.

Conclusions

If you’ve made it this far, here is one important takeaway from this post: never underestimate the effort needed to turn a simple solution into a fully-featured product. Even if you originally have a few lines of code, making it a maintainable, long-term usable product feature will often take far more time and effort. Even the 3x and 9x multipliers are underestimates. Also, if you are developing a product, stopping at the 3x “reusable” solution is not good enough, and you should resist the temptation to release it when it’s a “half-product” or when it makes your product vulnerable to long-term maintenance issues.

But there is a deeper learning and wisdom hidden in the story.

As you can see from the whole story, even if you come up with the best idea in the world, see great improvements as the result, and initially see that it solves a lot of problems, it might not be best to get it out immediately. Airflow is an Apache Software Foundation project, “Community over Code” is one of the most important principles of the Apache Way, and you can see the embodiment of it in this story.

Community over Code

What started with a user who had a problem and raised it in the community discussion channel, followed by an idea to solve it, turned into a small but useful feature in Airflow, avoiding a few traps along the way. And it happened only because we have a great community of people who collaborate, trust each other and are able to (more or less gently) argue and see different perspectives. Great things happen as a result of such cooperation.

Bouncing ideas off each other, not being afraid to express your doubts, and especially not being afraid to change your approach as a result of the feedback you get: this is the true power of Airflow. This is one of the reasons you can rely on the fact that what we come up with in Apache Airflow is not the result of a decision by a single power, person or organization (as sometimes happens in other “open source” projects) but the result of true collaboration between all the different individuals who ARE the community.

This is what the Apache Way all but guarantees.

Magic Loop in Airflow — reloaded was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.

Running Airflow on ARM M1/M2? Hell yes, but upgrade to Airflow 2.3+

Jarek Potiuk — Sun, 17 Jul 2022 12:58:35 GMT

Over the last few months I had a lot of questions and discussions in Airflow Slack and in Airflow GitHub Issues about running Airflow on ARM (which was really all about running Airflow on Apple Silicon, M1/M2). Does it work? Can we use it? What if I buy the new Apple laptop for my users who want to run Airflow locally (mostly for DAG development)?

Does Airflow support ARM/M1?

In short, the answer is:

ARM/M1/M2+ Airflow = Love (if you are using Airflow 2.3+ that is) — image credit Apple and ARM.

Yes — as of Airflow 2.3.0 there is full development support for ARM devices. The whole development toolchain supports it, and we even have experimental support for running Airflow in Production on ARM. We publish reference, multi-platform images for both ARM and AMD64 platforms.

First things first however. One takeaway from this article that you should remember — If you have not upgraded to Airflow 2.3+ yet and there is even slight change your team has Mac M1/M2 — upgrade NOW!

One small caveat: on ARM, Airflow natively supports only Postgres and SQLite as the metadata DB, because ARM support in MySQL and MSSQL is not yet great. But according to our 2022 survey, Postgres accounts for about 80% of our installation base, so if you are still using MySQL or thinking about using MSSQL... well, guess what?

For those who were previously complaining about Airflow being unstable and slow — yeah, it used to be like that, but now you can reap the benefits of developing Airflow and Airflow DAGs on your brand new MacBook.

Is that it? Is this the end of the article?

Well, it could be. If you just want to focus on DAG development on your MacBook, you can stop reading now and go do it. But I think it’s worth understanding a bit about how we got here, what it really means, and what you are missing if you have not yet upgraded to 2.3+ and want to make your Airflow development experience not only better but actually “possible”.

A bit of history

ARM is a very interesting company. Not everyone knows that, unlike Intel or AMD with their CPUs, ARM does not produce its own processors. Yep. You heard that right. ARM just designs processors (and, more generally, other chips and system-on-chip (SoC) architectures) and sells licenses to produce them to others.

And it’s a pretty successful model. You might not be aware of it, but you almost certainly have an ARM system in your pocket right now: virtually all phones run on ARM-based processors and SoCs licensed by ARM.

Up until very recently the footprint of ARM in “bigger” devices and on servers was rather small. There were very few devices, a number of producers attempted to use ARM in laptops, and Amazon added its Graviton series of servers in 2018/2019, but for all practical purposes personal computing and servers were running on the x86 architecture (Intel/AMD).

Enter 2020

This was, as some say, a year to remember. For a number of reasons, of course; you can still think of many things as “before 2020” and “after 2020”. But there is one more reason 2020 was a turning of the tide, this time for the ARM architecture. Apple finally released a line of its iconic MacBooks (both Air and Pro) running on ARM. This had been long anticipated, and in November 2020 it finally happened.

When I gave my talk about the Production Docker Image at Airflow Summit 2020 (more than 2 years ago and a few months before the ARM MacBooks were released), I knew this was going to be the next frontier for the image (as you might see, the talk was delivered straight from my home office). “ARM support might be the big one” was the last thing I had to say there.

Screenshot from “Production Docker Image” talk at Airflow Summit 2020

What happened in 2021 for Airflow

Not much on the surface. If you wanted to use Airflow on ARM in 2021 you’d be out of options.

As initially suspected, the transition to ARM images took quite a lot of time for Airflow. Why? The answer is rather simple if you understand how the modern supply chain for open-source software works. For those of you who do not know, it looks like this:

Credits — Randall Munroe https://xkcd.com/2347/

Airflow sits very much at the top of that picture. We have more than 600(!) open-source dependencies that make it possible to build the most popular truly open-source workflow orchestrator. Managing all those dependencies is a challenge of its own, which I recently gave a whole talk about (and not the least of the problems is that Airflow is both a library and an application at the same time). But the problem is that in order for Airflow to support ARM, ALL of those 600 dependencies must support it. And many of them are transitive: our direct dependencies depend on others, which depend on yet more dependencies, and so on, down to the operating system and kernel. And all those dependencies are (as you can see above) thanklessly maintained by people in the open-source community.
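To make the “ALL of them must support it” point concrete, here is a minimal sketch that walks a dependency graph transitively and lists the packages that still block ARM. The graph and the “ARM-ready” set are made up for illustration (they are not Airflow's real dependency data or anybody's real release status), and the function names are hypothetical.

```python
from collections import deque


def transitive_dependencies(graph, root):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    queue = deque(graph.get(root, ()))
    while queue:
        package = queue.popleft()
        if package in seen or package == root:
            continue
        seen.add(package)
        # Follow the dependencies of this dependency as well.
        queue.extend(graph.get(package, ()))
    return seen


def blocking_arm_support(graph, root, arm_ready):
    """Return the sorted transitive dependencies that are not ARM-ready yet."""
    return sorted(
        dep for dep in transitive_dependencies(graph, root) if dep not in arm_ready
    )


# Hypothetical, heavily simplified graph; the real one has 600+ entries.
EXAMPLE_GRAPH = {
    "airflow": ["pandas", "flask-appbuilder"],
    "pandas": ["numpy"],
    "flask-appbuilder": ["flask"],
    "flask": [],
    "numpy": [],
}

# Invented status for the sake of the example only.
EXAMPLE_ARM_READY = {"flask", "flask-appbuilder", "pandas"}

print(blocking_arm_support(EXAMPLE_GRAPH, "airflow", EXAMPLE_ARM_READY))
```

In this toy graph a single package two levels down is enough to keep the whole stack off ARM, even though Airflow never depends on it directly.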

Each of those maintainers (who mostly are not directly paid for their work, but that is a subject for another, future post) needs time and effort to catch up. When you are developing a library that is the base for tens or hundreds of thousands of other projects that billions of people in the world use, you had better make any such change carefully. You need to test all the edge cases, run a CI pipeline on ARM devices, go through a few rounds of release candidates and likely fix some teething problems reported by your “eager” users before you can announce full support. That also means that their dependencies need to get CI support, that CI infrastructure providers need to provide free ARM support (many still do not), and that the maintainers themselves need a way to test on ARM hardware (more on that below).

Here is a very, very small fragment of our dependency chain to illustrate it:

A very small fragment of our dependency chain

This is one reason our Production support for ARM is still “experimental”. We are waiting for our users to try it out and report teething problems. Our CI is, for now, limited to making sure the Docker images for ARM build properly, but when we start seeing interest in “server” support we will also have to enable running tests. Many Airflow dependencies, like “numpy”, “pandas” and “scikit”, are a huge part of any data processing toolchain and are developed with performance in mind, so they have to be compiled into platform-specific libraries.
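One way to see what “platform-specific” means in practice is to look at wheel file names. The standard wheel naming scheme ends with a platform tag, which is "any" for pure-Python packages and names an OS and CPU architecture for compiled ones. Below is a minimal sketch that reads that tag. The function names are hypothetical and the file names are illustrative examples of the naming scheme, not a statement about which files any project actually publishes.

```python
def wheel_platform_tag(filename):
    """Return the platform tag, i.e. the last dash-separated field of a wheel name."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # name-version-python-abi-platform, with an optional build tag after version.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel file name: {filename}")
    return parts[-1]


def is_platform_specific(filename):
    """Return True when the wheel only works on specific platforms."""
    return wheel_platform_tag(filename) != "any"


# Illustrative file names only.
for name in (
    "numpy-1.23.1-cp310-cp310-macosx_11_0_arm64.whl",
    "requests-2.28.1-py3-none-any.whl",
):
    print(name, "->", "platform-specific" if is_platform_specific(name) else "pure Python")
```

A pure-Python wheel installs the same way everywhere, while a compiled one needs somebody to build and publish a separate file for every platform, ARM included.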

So the ARM support has to “bubble up” in the supply chain. And it takes many months.

Therefore in 2021 I was just observing and checking how many of our dependencies still needed to upgrade and, more importantly, which versions of those dependencies we needed in order to support ARM. Many of the dependencies released ARM support only in the newest versions of their libraries (for very good reasons). For us this meant that we even had to work with the maintainers of some of our important dependencies to encourage and help them migrate to those newer versions as well (thanks especially to the Flask Application Builder, Snowflake, Apache Beam and Google teams for cooperating on that).

How bad was the ARM experience?

Bad. Running Airflow was next to useless on M1.

One other thing happened: I decided to buy a second-generation M1 MacBook just to “feel the pain” and to be able to test the future ARM support for MacBook users. The second-generation MacBook Pro seemed like a great option: Apple had finally reverted the TouchBar/no MagSafe/bad keyboard/lack of HDMI decisions (all of them very wrong IMHO). There were other factors involved, some of them tax-related (but let’s not talk about that). I knew from the beginning that it would not be “usable” for my Airflow work. But I did not expect what I got. It was unusable. It was so unusable that for a few months (until I added ARM support for Airflow) I barely used the brand new MacBook at all.

This experience was confirmed and very nicely summarized a few months later, when I helped my friend Szymon Nieradka with his Airflow endeavors. Szymon is (hands down) one of the best Project Managers I have worked with, one of those PMs who have a strong technical background and are not at all afraid to “do stuff” when necessary. So that is what he did on his brand new MacBook in mid-2022, when he tried to use Airflow 2.2.4 (the version used by the company he was managing a huge project for at the time).

Then we had a WhatsApp conversation in which he complained about how slow Airflow was.

Screenshot of a WhatsApp conversation with my friend

Sorry for the Polish. For those who do not read it, here is my “free translation”:

Szymon:

Airflow 2.2.4 “eats up” 100% of 8 cores
Airflow 2.3.2 “eats up” 50% of 1 core
So I got a 16x boost (not just the 10x that was promised earlier in the conversation; translator’s comment)

Jarek:

I told you your jaw would drop (a few floors down; translator’s free interpretation :) )
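For the record, the 16x in Szymon's message is just the ratio of CPU used before and after. A tiny sketch of the arithmetic follows; the function name is mine, and treating “percent of N cores” as a number of fully busy cores is a simplification.

```python
def busy_cores(utilization_percent, cores):
    """Convert 'X% of N cores' into the equivalent number of fully busy cores."""
    return utilization_percent / 100 * cores


before = busy_cores(100, 8)  # Airflow 2.2.4: 100% of 8 cores
after = busy_cores(50, 1)    # Airflow 2.3.2: 50% of 1 core

print(before / after)  # 16.0
```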

You might not find it surprising that the company has now migrated to 2.3.3, and this was one of the big reasons for the migration.

Then came 2022

2022 was another year in which a lot of things happened (this time purely man-made rather than natural phenomena), but for Airflow it was the year ARM support went mainstream. By April, when 2.3.0 was released, my MacBook had been pretty much my main development platform for a few months, and all the dependencies that were important to us had already caught up. A number of other people used it for their “main” development as well, and all the initial teething problems had already been addressed. We managed to migrate to the latest versions of those dependencies, and the whole CI and development environment we have (Breeze, the development environment I developed for Airflow) was prepared for regular ARM releases and daily use. We converted our build toolchain to use buildx, which allowed us to build multi-platform images.
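For readers who have not used buildx, here is a minimal sketch of what a multi-platform build invocation looks like, composed in Python without running anything. This is not the actual Airflow build script: the image tag and the helper name are hypothetical. The `--platform`, `--tag` and `--push` flags of `docker buildx build` are real.

```python
import shlex


def buildx_command(image_tag, platforms, context="."):
    """Compose a 'docker buildx build' command for a multi-platform image."""
    return [
        "docker", "buildx", "build",
        # One comma-separated list: buildx builds the image once per platform.
        "--platform", ",".join(platforms),
        "--tag", image_tag,
        # Push the multi-platform result to the registry named in the tag.
        "--push",
        context,
    ]


# Hypothetical image tag, used only for illustration.
command = buildx_command("example.org/my-airflow:dev", ["linux/amd64", "linux/arm64"])
print(shlex.join(command))
```

The result is a single tag in the registry that resolves to the right image for whichever architecture pulls it.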

So we could officially announce ARM support for Airflow 2.3.0:

Docker Hub multi-platform image of Airflow

Looking forward to what the future brings

We have stable CI for development and experimental support for running in Production. What’s next?

First things first: if you are reading this, you are undoubtedly interested in running Airflow on ARM. If you are not on Airflow 2.3+ yet, MIGRATE NOW.

You might think it is not necessary. Every now and then I get the question: how can I improve my Airflow 2.1 or 2.2 experience on M1? I recently even had a conversation on Slack with a person who had been assigned by their company to make the Airflow 2.2.4 they were running ARM-compatible.

The answer is simple. DON’T. It is more effort than migrating to Airflow 2.3.

Migrate NOW. It will be far simpler than making Airflow 2.2 or earlier compatible with ARM. I know what I am talking about. I’ve been working with Airflow dependencies for more than 4 years, and I can assure you the amount of work you’d have to do is vast. And not only in Airflow itself: you would likely have to fork Airflow and make it use some of the newer dependencies, and you likely won’t avoid having to fork and fix some of those dependencies too. If you are using just a small subset of them you might even succeed, but it will very likely block you from using numpy, pandas, scikit and many other libraries that need platform-specific builds for ARM.

And it’s just much more straightforward to migrate to Airflow 2.3. Airflow 2 follows SemVer rigorously. Not only are all Airflow 2 releases backwards compatible, but in Airflow 2.3 we also introduced easy ways to move back and forth between versions if you find some previously undetected errors that prevent you from migrating. It should be safe and painless to migrate.

Then, once most of our users have migrated, I hope it will be smooth sailing.

Google JUST (4 days ago) announced ARM support for GKE on Google Cloud Platform (https://cloud.google.com/blog/products/containers-kubernetes/gke-supports-new-arm-based-tau-t2a-vms), so the server side of the ARM vs Intel/AMD race is inevitably just beginning to heat up. You can expect many more announcements in that space in the coming months. And… we are pretty much ready for the revolution there. Our production image already (experimentally) supports ARM, and all the heavy lifting has already happened. The only thing left is to make our CI builds run the whole test harness for ARM images. We could do it even now, but we are waiting until we start getting questions and feedback about it, as it requires a considerable increase in “infrastructure” cost (luckily we have some sponsors that make it possible to run our CI workflows).

So — if you want to use ARM and Airflow together — give it a spin. It’s there, waiting for you (after you migrate to 2.3+ that is).

Running Airflow on ARM M1/M2? Hell yes, but upgrade to Airflow 2.3+ was originally published in Apache Airflow on Medium, where people are continuing the conversation by highlighting and responding to this story.