Opteryx Engineering - Medium

When we Stopped Using Regex for REGEXP_REPLACE

Justin Joyce — Fri, 24 Apr 2026 13:20:10 GMT

This post is also published on the Opteryx Engineering Blog

TL;DR

Our REGEXP_REPLACE performance on ClickBench was 10x slower than peers. Swapping regex engines barely moved the needle. The fix wasn’t finding a faster regex engine, it was avoiding regex entirely. We built a specialised DFA and reduced query times to near parity.

The Problem

REGEXP_REPLACE was eating our lunch.

I’ll be honest, when we first started publishing to ClickBench, we were so far from the pack that the performance of one specialized query wasn’t going to close that gap.

After many iterations of the engine, queries like Query 28 running ~10x slower than engines like DuckDB stand out like a sore thumb.

When a single operator dominates runtime, that naively feels like an easy win.

We Tried the Obvious Thing First

First step: swap the regex engine.

DuckDB uses RE2. It’s well known, fast, and predictable. Google wrote RE2 and uses it in BigQuery, so it looked well-suited to our problem, and we integrated it into Opteryx.

Result? Barely any change.

Maybe 5% in some cases. Hard to separate from noise. That’s not a breakthrough — that’s measurement variance.

Conclusion: the regex engine wasn’t the bottleneck.

The Pivot

If a faster regex engine wasn’t the answer, what was?

Most REGEXP_REPLACE calls we see don’t need full regex — they’re mostly extract or replace calls.

We had brought a regex to a slice fight.

Turns out, you don’t need backtracking to do patterned string slicing. You could use a deterministic finite automaton (DFA).

What We Built

We built a specialised engine, based on a DFA, for the patterns we actually see.

Not a full regex engine. A constrained one. Simple, even.

Design:

compile pattern → deterministic automaton (once, at initialization)
operate directly on buffers (no Python strings)
track the capture points through the pattern
slice the buffer at the capture points to extract the value

The key idea is embarrassingly simple: compile the pattern into a rule that can be executed in a single monotonic pass through the string.
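As a rough illustration of that design, here is a minimal Python sketch of a hand-unrolled automaton for a single pattern shape. The pattern (^https?://(?:www\.)?([^/]+)/.*$ replaced with \1) and the name extract_host are assumptions chosen for illustration, not taken from Opteryx, and the sketch ignores newline edge cases that a real regex engine would handle.

```python
def extract_host(buf: bytes) -> bytes:
    """Single forward pass standing in for
    REGEXP_REPLACE(s, '^https?://(?:www\\.)?([^/]+)/.*$', '\\1').

    Returns the input unchanged when the pattern does not match,
    which is what a replace with no match would produce.
    """
    pos = 0
    # State 1: literal "http", then an optional "s".
    if not buf.startswith(b"http", pos):
        return buf
    pos += 4
    if buf.startswith(b"s", pos):
        pos += 1
    # State 2: literal "://".
    if not buf.startswith(b"://", pos):
        return buf
    pos += 3
    # State 3: optional "www." prefix; the capture starts after it.
    start = pos + 4 if buf.startswith(b"www.", pos) else pos
    # State 4: the capture runs up to the next "/".
    end = buf.find(b"/", start)
    if end == start:
        # "www." was the entire host; a regex would backtrack and capture it.
        start = pos
    if end == -1 or end == start:
        return buf  # no "/" after the host, or an empty host
    # Slice the buffer at the capture points.
    return buf[start:end]
```

The function works on bytes rather than str, and the only allocation is the final slice. A production version would generate this kind of scanner from the pattern instead of hard-coding it.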

Results

From our lab ClickBench runs (non-published):

We went from 60 seconds down to 10. Same workload, same query shape, just a different approach.

This isn’t a small optimisation. It changes the cost profile of the query.

Engineering Lessons

One of the strengths of SQL is that the engine is free to choose how it processes the user’s request. Here, we were given a regex so we naively processed it as a regex, rather than treating the regex as a language for describing intent and choosing the fastest way for us to meet that ask.

Final Thought

The lesson isn’t “build a faster regex engine”.

It’s “recognise when you don’t need one”.

— Justin

When we Stopped Using Regex for REGEXP_REPLACE was originally published in Opteryx Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Rewriting the Memory Model: Moving Beyond Arrow

Justin Joyce — Fri, 17 Apr 2026 11:30:08 GMT

TL;DR

Arrow helped us get started, but it became a performance barrier in the execution path.

On ClickBench we were near the front of a group of medium-performance engines, but there was a clear gap to the fastest ones. That usually points to something structural rather than something you can tune away.

So we stopped trying to optimise around it and started replacing Arrow with something designed for how Opteryx actually runs queries.

⸻

The starting point

Arrow solved real problems for us early on.

It gave us:

a lot of functionality without having to build everything ourselves
compatibility with Parquet and the wider ecosystem

That mattered. When you’re building a query engine, there’s a lot of value in starting from something stable.

And to be clear, Arrow is a good fit for a lot of systems. If your problem is interoperability or general-purpose data processing, you won’t go far wrong with it.

This isn’t a post about Arrow being bad.

It’s about hitting a limit.

⸻

The problem we kept seeing

As we worked through performance issues in Opteryx, we kept seeing the same pattern.

We could make things faster locally:

tighten loops
reduce allocations
optimise operators

But the gains stopped stacking.

That usually means the problem isn’t local anymore. It means the architecture is doing something expensive over and over again.

In our case, it was moving data between representations.

A typical path looked like:

Arrow arrays as the source
NumPy views or copies for some operations
Python objects where neither worked cleanly

Every step between those had a cost.

⸻

A concrete example

This showed up clearly when we reworked our LIKE operator.

The original path moved data from Arrow → NumPy → Python.

When we processed the Arrow buffers directly, we cut about 7 seconds off a run over 100 million strings.

That wasn’t a clever optimisation. It was just removing transitions.
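To make "processing the buffers directly" concrete, here is a minimal sketch using only the Python standard library. It mimics the Arrow variable-length string layout (n+1 offsets plus one contiguous data buffer) and searches each row's byte range in place. The helper names build_string_column and contains_mask are hypothetical, nulls are ignored, and the loop here still runs in Python, whereas the real gain comes from doing the same walk in compiled code.

```python
from array import array

def build_string_column(values):
    """Pack strings into an Arrow-style layout: n+1 int offsets plus one
    contiguous UTF-8 data buffer (nulls are ignored in this sketch)."""
    offsets = array("i", [0])
    data = bytearray()
    for value in values:
        data += value.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)

def contains_mask(offsets, data: bytes, needle: bytes):
    """Search each row's byte range in place, without creating per-row
    Python string objects."""
    return [
        data.find(needle, offsets[i], offsets[i + 1]) != -1
        for i in range(len(offsets) - 1)
    ]
```

Because the search is bounded by each row's start and end offsets, a needle that happens to straddle two adjacent rows in the data buffer is not reported as a match.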

⸻

The conversion tax

The easiest way to think about it is a conversion tax.

The engine wasn’t operating on one representation end-to-end. It kept crossing boundaries, and every boundary had overhead.

Two things dominated.

CPU overhead

Arrow’s null handling is fine on its own. The problem is what happens when you mix it with Python and multiple execution paths:

extra checks in tight loops
more branching
Python object access creeping into hot paths
vectorised paths dropping back to interpreted ones

None of that is catastrophic individually, but together it adds up.

Memory overhead

We were also holding the same data more than once.

Combinations of:

Arrow buffers
NumPy arrays
Python structures

In theory Arrow supports zero-copy. In practice, null handling and layout differences often meant we couldn’t take that path cleanly.

So we ended up duplicating and adapting data instead.

⸻

We tried to push Arrow further

Before replacing it, we tried to make it work.

We accessed buffers directly. We avoided higher-level APIs. We pushed more work into compiled code.

That helped, but only up to a point.

We kept running into the same issue: even when the data was “zero-copy”, the execution wasn’t. The loops were still in Python, or the control flow still depended on it.

At the same time, the engine was getting more complicated. Each workaround made one path faster and something else harder to reason about.

At some point it became clear we were optimising around the mismatch instead of removing it.

⸻

The decision

Once we framed it properly, the direction was obvious.

If we wanted to move the performance ceiling, we needed to own the memory model used in execution.

That’s where Draken came from.

⸻

What Draken is designed to do

Draken isn’t trying to replace Arrow everywhere. It’s about removing it from the hot path.

A few things mattered:

Keep Python out of tight loops

If something is performance-critical, it shouldn’t be running through Python object machinery.

Iteration, null handling, operator execution — all of that needs to stay in native code.

Stop translating data

The engine shouldn’t need to convert data just to move between stages.

A query engine does enough work already. Converting representations shouldn’t be part of it.

Control the layout

We want the memory layout to match how the engine actually executes, not how a general-purpose format needs to behave.

We still align with Parquet where it makes sense — things like dictionary encoding still matter — but we’re not bound to Arrow’s internal structure.

⸻

Why this changes the ceiling

The gains here don’t come from a single optimisation.

They come from removing whole categories of work:

no Python in tight loops
no repeated conversions
null handling designed for our execution model
memory layout chosen for operators, not interchange

That makes things faster, but more importantly it makes further improvements easier.

⸻

The engineering lesson

Arrow was a good starting point, but it wasn’t the right execution format for Opteryx long term.

The mismatch between memory model and execution model kept showing up as overhead.

Owning the format let us remove that friction.

That’s where the gain comes from — not a faster component, but a simpler path.

⸻

One unexpected win

One thing we didn’t expect: Arrow often materialises dictionary-encoded columns into dense representations.

By keeping dictionary encoding in our internal format, we got a nice side-effect:

fewer comparisons
smaller working sets
less memory traffic

That wasn’t the goal, but it turned out to be a useful win.
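The "fewer comparisons" effect can be sketched in a few lines of Python. This is an illustration of the general idea, not Draken's implementation, and filter_dictionary is a hypothetical name: the predicate runs once per distinct dictionary value, and each row is then answered by an index lookup.

```python
def filter_dictionary(dictionary, indices, predicate):
    """Filter a dictionary-encoded column.

    dictionary: the distinct values
    indices:    one dictionary position per row
    """
    # Evaluate the predicate once per distinct value...
    hits = [predicate(value) for value in dictionary]
    # ...then answer every row with a cheap table lookup.
    return [hits[i] for i in indices]
```

For a column with three distinct values and millions of rows, the predicate still runs only three times.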

10x Faster Memory Management: Optimising Opteryx’s Core Memory Pool

Justin Joyce — Mon, 13 Apr 2026 19:15:01 GMT

This post is also published on the Opteryx Engineering Blog

TL;DR

A small, surgical change to the memory pool produced a 10x improvement in allocation/commit throughput. We moved metadata tracking out of Python and into a compact C++ structure, preserved the public API, and avoided a large rewrite. The result: much higher throughput, lower variance, and no behavioural changes for users.

The problem

The MemoryPool is central to query execution: it allocates buffers, manages lifetime, supports zero‑copy reads, and compacts segments. For years the pool tracked segment metadata using Python dicts. That was simple and readable — but slow.

In a tight allocate→read→release loop (the exact pattern used across query plans and streaming workloads) Python hash-table lookups and object overhead dominated the hot path. The metadata lookups were the bottleneck.

The change

We did three things, incrementally and carefully:

Replaced the Python dict used for metadata with a C++ unordered_map.
Moved metadata into a compact C struct (SegmentMetadata) with no Python object overhead.
Kept the public Python API identical; used_segments remains a lazily-evaluated Python dict for compatibility.

The key principle was minimalism: replace just the slow part and keep everything else stable.
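The compatibility pattern can be sketched as follows. This is a simplified Python stand-in, not the actual Cython/C++ code: the class name SegmentIndex and the start and length fields are assumptions for illustration, and a plain dict of tuples plays the role of the native map.

```python
class SegmentIndex:
    """Hypothetical sketch: compact internal bookkeeping with a lazily
    built, dict-shaped view kept for API compatibility."""

    def __init__(self):
        # Stand-in for the native map: segment id -> (start, length).
        self._meta = {}

    def commit(self, segment_id, start, length):
        self._meta[segment_id] = (start, length)

    def release(self, segment_id):
        return self._meta.pop(segment_id)

    @property
    def used_segments(self):
        # Built only when a caller asks for it, so the hot path
        # (commit/release) never pays for the richer representation.
        return {
            sid: {"start": start, "length": length}
            for sid, (start, length) in self._meta.items()
        }
```

Callers that read used_segments see the same shape as before, while the hot path touches only the compact store.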

Why this works

Metadata access is performance‑critical but implementation‑local. Users call the same APIs; they do not rely on Python dict semantics for internal bookkeeping.
Moving metadata to C++ removes Python interpreter and object costs from the hot path.
Keeping the public API stable means tests, consumers, and integrations continue to work without change.

We also retained Python RLock for synchronization because C++ template types cannot be embedded in Cython classes in our current layout — a pragmatic compromise that keeps thread-safety intact.

Results

Benchmarks (small allocations: 50k commits of 100 bytes):

Old implementation: 12,839 ops/sec
New implementation: 134,104 ops/sec

Improvement: 10.4x faster

This is a meaningful change, not a micro‑tweak — it shifts the envelope for memory‑bound workloads and reduces variance introduced by the Python runtime.

Where it matters

Read cache: we’re planning to use the MemoryPool as a read-caching layer as part of continual IO-stack improvements — enabling hot-block reuse, reducing physical IO, and improving tail latency for common queries.
Morsel exchange: during the execution-engine rewrite the pool will act as the morsel exchange between operators, enabling efficient, zero-copy morsel handoffs and clearer ownership boundaries for execution stages.
Zero‑copy flows: lower latency between producers and consumers when memory handoffs are fast and predictable.
Classic Opteryx: historically the MemoryPool served as the buffer pool; these planned uses extend that role into caching and operator exchange while preserving the same minimal, native hot path and public API.

How we approached it

This was not a rewrite. The steps were:

Profile to confirm the real bottleneck (dict lookups and object churn).
Design a minimal C++ metadata representation and choose a container (unordered_map).
Implement the C++ layer behind the existing Cython/Python bindings.
Preserve the Python-facing API and lazy compatibility layers.
Run the full test-suite and benchmark under representative loads.

The result was surgical: small, reviewable changes with a large impact.

The broader lesson

Optimising a mature codebase usually works best as a targeted, incremental effort. Identify the true hot path, replace the implementation with a low‑overhead equivalent, and keep the surrounding behaviour stable. You get the performance gains without the risk and cost of a full rewrite.

If you’re struggling with latency or throughput in a Python project, look for implementation details that are purely internal state — those are often the best places to move into faster languages without changing your public contract.

Making LIKE Faster: From 93 Seconds to Single Digits

Justin Joyce — Sat, 11 Apr 2026 21:31:26 GMT

This post is also published on the Opteryx Engineering Blog

TL;DR

The LIKE operator, particularly the '%needle%' pattern (substring search), went from 93 seconds in Opteryx 0.19.0 to 7.46 seconds in 0.26.2. The improvement came in stages: fixing the IO stack, recognising that '%needle%' is really a CONTAINS check (not a regex), direct buffer access, and adopting the Volnitsky algorithm with a sieve prefilter. We're now at ~2x overhead compared to single-threaded C++ engines, and we're pushing further by removing Arrow and eliminating GIL overhead in the search path.

The Problem

Substring search in SQL is ubiquitous: filtering user IDs, matching partial text, finding patterns in logs. The LIKE operator handles this, but implementation details matter enormously.

Our reference point is ClickBench query 20, an example workload scanning ~100 million rows of Parquet data.

January 2025, Opteryx 0.19.0:

93.27 seconds

That’s unacceptable. For comparison, C++/Rust engines like DuckDB and ClickHouse doing the same work were completing in under 4 seconds.

Profiling showed a large part of the problem wasn’t algorithmic — it was fundamental. The IO stack was stalling execution. Improving string search was pointless if the data wasn’t arriving.

Stage 1: IO Stack Fix (0.19.0 → 0.20.0)

The first 30 seconds of that 93-second runtime was the IO bottleneck. Rewriting the IO stack to use fine-grained byte-range reads cut this down dramatically.

By 0.20.0, roughly one month later:

56.82 seconds

Better, but still nowhere near 4 seconds. The string search itself was still slow.

Stage 2: Recognising LIKE ‘%needle%’ is CONTAINS (0.20.0 → 0.22.0)

Here’s where the insight mattered.

SQL’s LIKE operator is a wildcard pattern language, often implemented by translating the pattern to a regex:

LIKE 'needle' = exact match
LIKE 'needle%' = prefix match
LIKE '%needle' = suffix match
LIKE '%needle%' = substring match (CONTAINS)

Our original implementation was treating all LIKE patterns as full regular expressions, delegating to Arrow’s regex engine. This works, but it’s expensive.

The key observation: most LIKE queries in practice use the '%needle%' form, which is really just a substring search; a regex is overkill for that.

We added a new operator, INSTR, and a new strategy in the optimizer:

if pattern matches '%NEEDLE%':
  return INSTR(column, needle) != -1

Instead of a full regex evaluation, we now do a direct substring check. This was a dramatic win:

0.22.0:

17.57 seconds

A 3.2x improvement over the previous version. Still not 4 seconds, but progress.
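The pattern classification behind that optimizer strategy might look something like the following Python sketch. The function name classify_like and the strategy labels are hypothetical. Anything with an inner wildcard, a single-character wildcard (_), or a backslash escape is left on the general path.

```python
from typing import Optional, Tuple

def classify_like(pattern: str) -> Tuple[str, Optional[str]]:
    """Map a LIKE pattern to (strategy, literal).

    Strategies: "exact", "prefix", "suffix", "contains", or "general"
    (meaning: keep using the full pattern matcher).
    """
    starts = pattern.startswith("%")
    ends = len(pattern) > 1 and pattern.endswith("%")
    core = pattern[1 if starts else 0 : len(pattern) - 1 if ends else len(pattern)]
    # Inner wildcards or escapes need the general matcher.
    if core == "" or any(ch in core for ch in "%_\\"):
        return ("general", None)
    if starts and ends:
        return ("contains", core)
    if ends:
        return ("prefix", core)
    if starts:
        return ("suffix", core)
    return ("exact", core)
```

An optimizer could then rewrite only the "contains" case to a substring check and leave every other pattern untouched.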

Stage 3: Direct Buffer Access & Boyer-Moore-Horspool (0.22.0 → 0.26.2)

The next optimisation recognised that we were still going through Arrow’s abstraction layers for INSTR checks.

We moved to direct buffer access:

Read the raw Arrow buffer pointers
Skip Arrow’s helper functions
Implement string search in compiled code

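This allowed us to use a more sophisticated algorithm: Boyer-Moore-Horspool (BMH), a substring search that skips ahead when mismatches occur, so on typical text it examines only a fraction of the haystack (its worst case is still proportional to haystack length times needle length).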
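For readers who have not met BMH before, here is a textbook version in Python. It is a sketch of the standard algorithm, not the compiled implementation described here, and bmh_find is an illustrative name.

```python
def bmh_find(haystack: bytes, needle: bytes) -> int:
    """Boyer-Moore-Horspool: index of the first match, or -1."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    if m > n:
        return -1
    # For each byte value: how far the window may slide when that byte
    # is aligned with the needle's last position.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Slide by the shift of the byte under the window's last position.
        pos += shift[haystack[pos + m - 1]]
    return -1
```

When the byte under the window's last position does not occur in the needle at all, the window jumps by the full needle length, which is where the speed-up over a byte-by-byte scan comes from.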

We also added a sieve prefilter: before running the full search, we check if the needle’s characters appear at all in the haystack. If the needle contains a rare character (say 'z'), a quick scan can reject rows cheaply.
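A minimal sketch of the prefilter idea, assuming the simplest possible form: every distinct byte of the needle must appear somewhere in the haystack before the full search is worth running. The names make_sieve and contains are hypothetical, and a compiled implementation would more likely use a bitmap or check only the rarest byte.

```python
def make_sieve(needle: bytes):
    """Build a cheap rejection test for one needle."""
    required = bytes(set(needle))  # distinct byte values in the needle

    def may_contain(haystack: bytes) -> bool:
        # False means the needle cannot occur; True means "maybe".
        return all(b in haystack for b in required)

    return may_contain

def contains(haystack: bytes, needle: bytes, sieve) -> bool:
    # Only run the full search on rows that survive the sieve.
    return sieve(haystack) and needle in haystack
```

The sieve can produce false positives (all bytes present but not adjacent), never false negatives, so the full search remains the final check.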

Results, 0.26.2:

7.46 seconds

Now we’re only 1.8x slower than the C++ engines. We were pleased with this given that, at its heart, Opteryx is mostly written in Python.

Stage 4: Volnitsky & Removing Arrow (Current Work)

We’re currently deep in an engine refactor:

Removing Arrow as the internal representation unlocks faster processing of dictionary-encoded arrays (common in real datasets). Columns like “country” or “status” are often dictionary-encoded; Arrow’s generic abstraction didn’t let us specialise on this.
Pushing code off the Python interpreter to reduce GIL contention. String search loops are candidates for this — they’re tight loops over uniform data.

We’ve adopted the Volnitsky algorithm, which builds on BMH with multi-byte searches:

Faster in practice on real strings
Better prefilter (sieve) integration
Clearer separation between “find first candidate” and “verify match”
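The bigram idea behind Volnitsky can be sketched in Python as below. This is a simplified illustration, not the engine's code: it uses a dict where the original algorithm uses a small fixed-size hash table, and volnitsky_find is an illustrative name. Note how "find a candidate" (the bigram lookup) and "verify the match" (the slice comparison) are separate steps.

```python
def volnitsky_find(haystack: bytes, needle: bytes) -> int:
    """Bigram-sampling search: index of the first match, or -1."""
    n, m = len(haystack), len(needle)
    if m < 2 or m > n:
        # Bigrams need at least two bytes; fall back to the built-in search.
        return haystack.find(needle)
    # Map each needle bigram to the offsets where it occurs,
    # largest offset first so the leftmost candidate is tried first.
    offsets = {}
    for i in range(m - 2, -1, -1):
        offsets.setdefault(needle[i:i + 2], []).append(i)
    step = m - 1  # every possible match contains exactly one sampled bigram
    pos = m - 2
    while pos <= n - 2:
        # Find candidates: does this haystack bigram occur in the needle?
        for off in offsets.get(haystack[pos:pos + 2], ()):
            start = pos - off
            # Verify the candidate with a full comparison.
            if start + m <= n and haystack[start:start + m] == needle:
                return start
        pos += step
    return -1
```

Only one bigram in every m - 1 positions is inspected, and most of those lookups miss, so the verification step runs rarely on typical text.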

Early results suggest we can hit the 4 second bar set by other engines, and as all things go in circles, 2.8s of our current 3.8s lab benchmark is IO again.

The Lesson

Optimising a substring search operator looks simple in isolation, but the real wins come from:

Fixing fundamentals first (IO stack). A fast algorithm on slow data is still slow.
Recognising pattern specialisation ('%needle%' is CONTAINS, not regex).
Direct buffer access avoiding abstraction overhead.
The right algorithm (BMH and Volnitsky beat naive search by orders of magnitude).
Preparing for parallelism (GIL-free code paths).

This progression from 93 seconds to around 4 seconds in lab runs shows that cumulative, targeted optimisations compound. We’re not done yet.

What If the Docs Wrote Themselves?

Justin Joyce — Fri, 03 Apr 2026 20:36:13 GMT

This post is also published on the Opteryx Engineering Blog

TL;DR

Updating documentation is hard to get right. Code changes quickly, documentation lives somewhere else, and keeping the two aligned usually depends on somebody remembering to update both systems.

Imagine changing code and having the user documentation update itself from the same source of truth. That means less drift, more consistent coverage, and less time spent maintaining duplicate descriptions.

We are moving toward that model in Opteryx by extracting API metadata from the source repository into JSON, then using the documentation repository to turn that structured data into user-facing documentation. It is still a work in progress, but it is already improving the quality of the metadata in the codebase.

If you are maintaining technical docs by hand, this is the prompt to start moving them closer to the code.

The Problem

Documentation drifts.

That is especially true for product areas that change frequently, and it is particularly visible in systems like Opteryx, where the user-facing surface includes:

data types
functions
operators
APIs

These areas already exist as structured concepts in the engine, but the documentation for them often lives elsewhere, maintained separately, and updated later.

That creates a familiar set of problems:

docs and implementation fall out of sync
small API changes are easy to miss
coverage becomes uneven
quality depends on whether someone remembered to update a second system

None of this is unusual. But it is a signal that the documentation process is too detached from the thing being documented.

The Shift in Approach

The direction we are moving toward is to document the thing at the thing.

Instead of treating the documentation site as the primary source of truth, we are treating the source code and its metadata as the starting point.

The current shape of the pipeline is:

source repo
  -> extract API metadata
  -> generate JSON manifest
  -> hand off to docs repo
  -> render user documentation

The point is not just automation for its own sake.

The point is that the engine already knows a great deal about its user surface. If we can extract that information in a structured form, we can generate documentation more consistently and with less manual duplication.

Why This Matters

There are a few immediate benefits to this approach.

First, it improves consistency.

If data types, functions, and operators are described through a common metadata structure, the resulting documentation becomes more uniform. The same kinds of information can appear in the same places, in the same format, across the site.

Second, it improves coverage.

Manual documentation tends to accumulate around the most visible or most recently changed features. Generated documentation makes it easier to see what is missing because gaps in metadata become explicit.

Third, it improves hygiene in the codebase.

Once annotations and inline descriptions are surfaced directly to users, poor metadata becomes much more obvious. That creates healthy pressure to improve the implementation details that describe the API surface.

This is one of the most useful side effects of the work. The tooling is not just producing docs, it is encouraging higher-quality API definitions.

What We Are Generating

Today, the focus is on structured user-facing elements such as data types, functions, and operators.

Many of these are already created programmatically inside Opteryx.

That is useful because it means we are not starting from an unstructured codebase. There is already a model of these concepts in the system. The work now is to improve the metadata around them so that the model is rich enough to support good documentation.

In practice, that means making sure each documented element can expose things like:

name
signature
description
supported argument patterns
return behavior
examples
notes on semantics or edge cases
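As a small illustration of extracting a few of these fields, the sketch below uses Python's inspect module to pull the name, signature, and description from a function and emit a JSON manifest. The helpers describe and build_manifest, the manifest shape, and the example function upper are all assumptions for illustration, not the actual Opteryx pipeline.

```python
import inspect
import json

def describe(func) -> dict:
    """Collect documentation metadata straight from the code object."""
    return {
        "name": func.__name__,
        "signature": str(inspect.signature(func)),
        "description": inspect.getdoc(func) or "",
    }

def build_manifest(funcs) -> str:
    """Serialise the metadata for hand-off to a docs repository."""
    return json.dumps({"functions": [describe(f) for f in funcs]},
                      indent=2, sort_keys=True)

def upper(value: str) -> str:
    """Convert a string to upper case."""
    return value.upper()
```

A function without a docstring produces an empty description in the manifest, which is exactly the kind of gap that becomes visible once the docs are generated from it.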

Not every part of that is complete yet, and some of the existing definitions have issues. But the important point is that these problems become easier to identify and more straightforward to fix once the documentation pipeline depends on structured metadata.

The Trade-Offs

There is no real shortcut here.

Generated documentation only works well if the underlying metadata is good. If the source descriptions are weak, incomplete, or inconsistent, then the generated output will be weak, incomplete, or inconsistent too.

So the trade-off is clear:

less manual duplication later
more discipline required in the source now

That is a trade worth making.

It shifts effort away from maintaining duplicate descriptions in separate systems and toward improving the definitions closest to the implementation. Over time that should make both the code and the documentation better.

What This Changes Culturally

This is not just a tooling change.

It changes where documentation work happens.

Instead of thinking of docs as a separate writing task that happens after engineering is done, this pushes us toward treating API metadata as part of the engineering work itself.

That is a better fit for technical surfaces that evolve quickly. It means changes can be described where they are introduced, reviewed where they are implemented, and surfaced to users through a repeatable pipeline.

For a project like Opteryx, that is a more scalable model than relying on manual synchronization.

What’s Next

The immediate next step is to keep improving the extraction pipeline and the metadata behind it.

The generated JSON is only useful if it captures enough meaning for the docs repository to produce documentation that is genuinely helpful to users.

That means we still need to refine:

the completeness of API annotations
the consistency of field definitions
the shape of the JSON manifest
how examples and semantic notes are represented

This is still work in progress, but the direction already looks right.

If documenting an API programmatically exposes weak metadata, that is not a failure of the approach.

That is the approach doing its job.

Rewriting the Opteryx I/O Stack

Justin Joyce — Sat, 28 Mar 2026 11:57:16 GMT

This post is also published on the Opteryx Engineering Blog

TL;DR

Cold queries over 100 million rows stored as Parquet on object storage originally took 5 minutes in Opteryx. Profiling showed the execution engine repeatedly stalling while waiting for the IO buffer to refill.

We rewrote the IO stack to schedule fine-grained byte-range reads based on Parquet metadata (column chunks within row groups) and pipelined reads, decompression, and decoding.

The same cold query now completes in ~10 seconds.

The key lesson: when reading from object storage, throughput alone isn’t enough — the granularity of work matters.

The Problem

While profiling Opteryx we noticed queries stalling even though the execution engine itself appeared efficient.

The workload was straightforward:

~100 million rows
stored as Parquet
object storage backend
cold query execution

A simple query was sufficient to surface the issue:

SELECT DISTINCT column
FROM dataset;

Runtime was approximately five minutes.

Profiling showed that the execution engine was frequently idle while waiting for the IO buffer to refill. CPU utilization remained low even though the network and reader threads were active.

The engine wasn’t compute-bound. It was waiting for data.

Initial Attempts

The first assumption was insufficient read parallelism.

The number of IO workers was increased:

8 → 16 → 32

This produced almost no change in query time.

The next hypothesis was runtime contention. To test this, the entire IO subsystem was moved into a dedicated process, communicating with the execution engine via a shared-memory ring buffer. This completely separated network activity from execution.

The stalls remained.

At this point it became clear the system was already close to the available network bandwidth per container. The issue wasn’t CPU scheduling or decode overhead.

The issue was how data was being delivered to the engine.

The Real Issue: Coarse Units of Work

The Parquet files in the dataset were roughly:

128MB uncompressed
~30MB compressed in object storage

The IO subsystem was issuing large contiguous reads. Even with many workers the pattern looked roughly like this:

issue read
wait for blob
large chunk arrives
engine consumes
wait for next read

The network was busy, but usable data arrived in bursts.

The time between issuing a request and receiving data was long enough to starve execution. Increasing worker count did not solve this; it simply queued more large reads.

The bottleneck was not bandwidth, it was granularity.

Rewriting the IO Stack

The solution was to redesign the IO subsystem around smaller units of work.

Parquet files contain detailed structural metadata in the footer:

row group offsets
column chunk offsets
compressed sizes
exact byte ranges

Using this information, the new IO stack schedules targeted range reads only for the column chunks required by the query.
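A minimal sketch of that planning step, assuming the footer has already been parsed into simple records. The ColumnChunk type and the functions plan_reads and range_header are hypothetical, not the Opteryx or Parquet API. Only the inclusive-end form of the HTTP Range header is standard.

```python
from typing import NamedTuple

class ColumnChunk(NamedTuple):
    row_group: int
    column: str
    offset: int            # byte offset of the chunk within the file
    compressed_size: int   # bytes on storage

def plan_reads(chunks, wanted_columns):
    """Return (start, end_exclusive) byte ranges for the needed columns,
    ordered by position in the file."""
    wanted = set(wanted_columns)
    selected = sorted((c for c in chunks if c.column in wanted),
                      key=lambda c: c.offset)
    return [(c.offset, c.offset + c.compressed_size) for c in selected]

def range_header(start: int, end_exclusive: int) -> str:
    """HTTP Range headers use an inclusive end offset."""
    return f"bytes={start}-{end_exclusive - 1}"
```

Each planned range is an independent unit of work, so a scheduler can issue, retry, or prioritise them individually.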

Reads, decompression, and decoding are pipelined so the execution engine begins receiving usable data earlier.
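The pipelining can be sketched with the standard library alone. In this simplified illustration an in-memory blob stands in for object storage, zlib stands in for the Parquet codecs, and the names fetch_range and stream_chunks are hypothetical. All range reads are submitted up front, and each chunk is decompressed and handed to the consumer as soon as it is available, in order.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    """Stand-in for a ranged GET against object storage."""
    return blob[start:end]

def stream_chunks(blob: bytes, ranges, workers: int = 4):
    """Yield decompressed chunks in order while later reads are in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() submits every fetch immediately and yields results in
        # submission order, so decoding overlaps with outstanding reads.
        for compressed in pool.map(lambda r: fetch_range(blob, *r), ranges):
            yield zlib.decompress(compressed)
```

Because the function is a generator, the consumer starts working after the first small range arrives rather than after the whole file has been read.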

The unit of work changed from:

file

to:

column chunk within a row group

This allows the system to deliver smaller fragments of data continuously rather than waiting for large reads to complete.

Results

The initial redesign reduced query time significantly:

~5 minutes → ~1 minute

After further improvements to buffering, file sizes and execution scheduling, the same cold query now completes in approximately:

~10 seconds

The improvement did not come from increasing available bandwidth. In fact, single-threaded decode performance is slightly slower with the new reader.

Instead of receiving large bursts of data separated by latency gaps, the execution engine now receives smaller fragments continuously. CPU utilization increases because the pipeline is almost never idle.

Reproducing the Pattern

This issue commonly appears when:

data is stored in object storage
reads operate on large blobs
execution pipelines are faster than IO latency

Typical symptoms include:

low CPU utilization
active network traffic
periodic stalls in execution

In these cases, adding threads or processes rarely helps if the unit of work remains large.

Conclusion

Object storage behaves very differently from local disks.

Large sequential reads introduce latency gaps that parallelism alone cannot eliminate. Even with many workers, execution pipelines can stall if each unit of work is too large.

The practical takeaway is:

Optimize how finely work can be scheduled, not just how fast it runs.

For Opteryx this meant redesigning the IO stack around fine-grained byte-range reads derived from Parquet metadata.

The system was never limited by Python. It was never limited by Parquet.

It was limited by treating object storage like a disk.

What’s Next

This IO redesign changes assumptions elsewhere in the engine.

Several components were originally optimized around file-sized units of work. Moving to fine-grained reads means revisiting parts of the execution pipeline so they can fully benefit from the new architecture.

We expect additional improvements as more of the engine adapts to this model.

Opteryx Engineering: Fixing Engine Stalls by Rethinking Parquet

Justin Joyce — Sun, 01 Mar 2026 13:03:50 GMT

We had hit a problem that didn’t make sense.

Opteryx was stalling.

Not under extreme load. Not in edge cases. Under perfectly reasonable query conditions. CPU wasn’t saturated. Memory wasn’t constrained. Reader threads were active. And yet the engine kept waiting for data.

If you build data systems, you’ve probably seen this pattern: nothing appears to be the bottleneck, but something is holding performance back.

That’s where we started.

The Symptom: An Idle Engine

We were running cold queries over just under 100 million rows stored in Parquet on object storage. A simple SELECT DISTINCT over a single column was enough to surface the issue.

The query took around five minutes.

Profiling showed something uncomfortable: execution wasn’t the bottleneck. The engine spent significant time waiting for the IO buffer to refill. Workers were issuing requests. The network was active. But the data wasn’t arriving in a way that kept the engine fed.

The first instinct was obvious: increase read parallelism.

We moved from 8 IO workers to 16, then 32.

The stalls remained.

The First Attempt: Move IO Out of Process

The next hypothesis was runtime contention. Perhaps Python’s threading model was interfering with throughput. If IO and execution were competing for scheduling, then separating them might smooth things out.

We moved the entire IO subsystem into a dedicated IO process and moved IPC to a shared memory ring buffer. Execution was completely isolated from network handling; if CPU scheduling was the issue, this would fix it.

It didn’t.

What we saw was that we were already pushing close to the available network bandwidth per container. We weren’t CPU-bound, we weren’t decode-bound, we were network-bound.

That forced us to look more carefully at how we were using the network.

The Real Problem: Our Unit of Work Was Too Large

Each Parquet file was around 128MB uncompressed, roughly 30MB compressed in object storage. Our readers were pulling large contiguous ranges at a time.

Even with 32 workers, the pattern looked like this:

Issue blob read
Wait for data to arrive
Large chunk arrives
Engine consumes
Wait for the next blob to arrive

The network was busy. But the flow was bursty.

The time between “request issued” and “usable data available” was long enough to starve execution. More threads didn’t smooth it out; they just queued more large reads and slightly increased thrashing. The engine could still consume data many times faster than the IO layer could supply it.

We weren’t short on bandwidth; we were short on granularity.
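A back-of-envelope model shows why granularity matters even when bandwidth is fixed. The sketch below assumes the container’s bandwidth is shared equally between in-flight reads and that nothing is usable until a whole read has arrived. The 100MB/s figure and the 1MB fragment size are illustrative assumptions, not measurements from Opteryx.

```python
def seconds_until_first_data(unit_mb: float, workers: int, bandwidth_mb_s: float) -> float:
    """Time before the first complete unit is available to the engine.

    Assumes `workers` reads are in flight, each receiving an equal share
    of the bandwidth, and that a unit is only usable once fully received.
    """
    per_worker_bandwidth = bandwidth_mb_s / workers
    return unit_mb / per_worker_bandwidth


BANDWIDTH_MB_S = 100.0  # assumed figure, for illustration only

# 32 workers each pulling a ~30MB compressed blob
coarse = seconds_until_first_data(30.0, 32, BANDWIDTH_MB_S)
# the same 32 workers each pulling a hypothetical 1MB fragment
fine = seconds_until_first_data(1.0, 32, BANDWIDTH_MB_S)

print(f"coarse: {coarse:.2f}s, fine: {fine:.2f}s")  # coarse: 9.60s, fine: 0.32s
```

Total throughput is identical in both cases. What changes is how long the engine sits idle before each delivery.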

The Fortuitous Timing

At the same time, we were already working on something else.

We had started building a lean Parquet decoder, Rugo, originally to move away from some of the constraints of using PyArrow as our runtime dependency. PyArrow is excellent, but it optimises for generality and safety. As consumers, we’re not meant to reach into its internals or control byte-level scheduling.

That work turned out to be well-timed because once we realised that the problem was granularity, we needed exactly the kind of control that a general-purpose library doesn’t expose.

With PyArrow, decomposing Parquet into fine-grained, independently schedulable byte ranges would have meant relying on internals we weren’t meant to touch. With Rugo, we could make that a first-class feature of the design.

The Shift: Decomposing the Parquet

The Parquet footer contains everything needed to reason about the file structurally:

Row group offsets
Column chunk offsets
Compressed sizes
Exact byte ranges

Instead of thinking in terms of “read the file” or even “read the row group,” we started thinking in terms of byte ranges.
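As a rough illustration of thinking in byte ranges, the sketch below plans reads from a simplified footer. The ColumnChunk class, the plan_ranges function, and all of the offsets are hypothetical stand-ins for this post; this is not Rugo’s actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnChunk:
    """Hypothetical, simplified view of one column chunk entry in a Parquet footer."""
    row_group: int
    column: str
    offset: int           # byte offset of the chunk within the file
    compressed_size: int  # number of bytes the chunk occupies


def plan_ranges(chunks, wanted_columns):
    """Return (start, end) byte ranges, end exclusive, for the wanted columns only."""
    wanted = set(wanted_columns)
    ranges = [
        (c.offset, c.offset + c.compressed_size)
        for c in chunks
        if c.column in wanted
    ]
    return sorted(ranges)


# Illustrative footer: two row groups, three columns, made-up offsets.
footer = [
    ColumnChunk(0, "id", 4, 1000),
    ColumnChunk(0, "name", 1004, 5000),
    ColumnChunk(0, "city", 6004, 3000),
    ColumnChunk(1, "id", 9004, 1000),
    ColumnChunk(1, "name", 10004, 5000),
    ColumnChunk(1, "city", 15004, 3000),
]

# A SELECT DISTINCT over city only needs the "city" chunks.
ranges = plan_ranges(footer, ["city"])
print(ranges)  # [(6004, 9004), (15004, 18004)]
```

Each tuple can then become one ranged read against object storage, so only 6,000 of the 18,000 data bytes in this made-up file are ever requested.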

We decomposed the file into the smallest functional units, column chunks within row groups, and issued targeted range requests only for the data required by the query. Reads, decompression, and decoding were pipelined over these smaller fragments, which shortened the latency between issuing a request and processing its data.
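The pipelining idea can be sketched with the standard library alone. Here an in-memory dictionary stands in for object storage, zlib stands in for the Parquet codec, and the fetch function is a hypothetical placeholder for a ranged read. It shows the shape of the approach under those assumptions and is not the engine’s implementation.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor, as_completed

# Stand-in for object storage: each "fragment" is a compressed block of int64 values.
FRAGMENTS = {
    f"fragment-{i}": zlib.compress(struct.pack("<4q", *range(i * 4, i * 4 + 4)))
    for i in range(8)
}


def fetch(name: str) -> bytes:
    """Hypothetical range read; a real reader would issue a ranged GET here."""
    return FRAGMENTS[name]


def fetch_and_decode(name: str) -> list:
    """Fetch, decompress and decode one fragment as a single small task."""
    raw = zlib.decompress(fetch(name))
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


def stream_values(names, workers: int = 4):
    """Yield decoded fragments as soon as each one is ready, in completion order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_and_decode, name) for name in names]
        for future in as_completed(futures):
            yield future.result()


distinct = set()
for values in stream_values(FRAGMENTS):
    distinct.update(values)  # the engine consumes each fragment on arrival

print(len(distinct))  # 32
```

Because every task is small, the consumer sees its first values after one fragment’s worth of latency rather than one file’s worth.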

The unit of work stopped being a file.

The Outcome

The impact wasn’t marginal.

That same cold query over just under 100 million rows dropped from roughly five minutes to around one minute.

We didn’t magically gain more bandwidth, and moving to Rugo actually slightly decreased single-threaded decode speed.

What changed was continuity. The execution buffer wasn’t empty for large portions of the query time. CPU stayed fed.

We were still bandwidth-bound, but we were being more targeted and less wasteful about how we were using bandwidth.

The Lesson

If you’re building on object storage, especially with a push-based or streaming execution model, raw throughput isn’t enough.

Granularity matters.

Large, coarse units of work introduce latency gaps that parallelism alone can’t eliminate. You can add threads. You can move subsystems into separate processes. You can increase CPU. But if your work units are too large to schedule effectively, your pipeline will stall.

The practical takeaway is simple:

Don’t just optimise how fast you process data; optimise how finely you can divide it.

For us, that meant decomposing Parquet blobs into smaller, schedulable units — something that became possible because we had already started building a reader that gave us the necessary control.

The stall wasn’t caused by Python.

It wasn’t caused by Parquet.

It was caused by treating object storage like a disk.

Once we stopped doing that, the system behaved like it was designed to.

What happens next?

This is a fundamental change to the IO subsystem in the engine. We’ve made good progress and shown early gains from the revised architecture, but other parts of the engine were designed and optimised for handling a file at a time, and they need revisiting before we can get more benefit from this approach. This change will challenge design assumptions throughout the engine, so we’re anticipating more changes over the next few weeks to improve the flow of data and further reduce query times.

Opteryx Engineering: Fixing Engine Stalls by Rethinking Parquet was originally published in Opteryx Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

How We’ve Supercharged Join Performance in Opteryx

Justin Joyce — Tue, 01 Jul 2025 21:39:50 GMT

Join performance is something databases live and die on. It is not a nice-to-have; it’s a must-have.

In the early days of Opteryx, we were getting feedback about not having joins implemented. We came across https://github.com/TomScheffers/pyarrow_ops, which at the time had features that weren’t yet implemented in PyArrow, and we incorporated it into Opteryx. The joins worked, but like all brute force solutions, they didn’t scale gracefully. When PyArrow introduced native join support, we quickly adopted it. For a time, it gave us a speed boost and allowed us to move faster elsewhere. But like many general-purpose tools, it came with limitations that our users weren’t happy with.

So, once we reached speed parity with PyArrow in very specific scenarios, we stopped using PyArrow’s join implementation, first for INNER JOIN, then shortly afterwards for OUTER JOIN, and eventually for FILTER JOINs too. Most of the time, we were a little slower than PyArrow, but not by orders of magnitude, and it gave us full control to innovate.

Introducing Bloom Filters (Proof of Concept)

A few releases ago, we experimented with adding a bloom filter to the join pipeline — initially, a targeted proof of concept:

It only activated on single-column VARCHAR joins
It only supported up to 1 million rows on the build side

Despite its limitations, the experiment let us explore key questions:

Should we speculatively build a filter for every compatible join?
What filter ratio threshold makes bloom filtering not worth the effort?
Could we build a non-general-purpose bloom filter that was really fast?
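For readers who haven’t met bloom filters in a join, here is a minimal sketch of the idea. It is a generic textbook-style filter written for this post, with an arbitrary size and hash scheme chosen for illustration; it is not the specialised filter used in Opteryx.

```python
import hashlib


class BloomFilter:
    """Minimal bloom filter: no false negatives, some false positives."""

    def __init__(self, size_bits: int = 1 << 16, hashes: int = 2):
        self.size_bits = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, value: bytes):
        # Derive several bit positions from one digest (an illustrative choice).
        digest = hashlib.blake2b(value, digest_size=16).digest()
        for i in range(self.hashes):
            chunk = digest[i * 8:(i + 1) * 8]
            yield int.from_bytes(chunk, "little") % self.size_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, value: bytes) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(value))


# Build side: the join keys from the smaller relation.
build_keys = [b"london", b"paris", b"tokyo"]
bloom = BloomFilter()
for key in build_keys:
    bloom.add(key)

# Probe side: rows that fail the filter cannot match and can be skipped early.
probe_keys = [b"paris", b"berlin", b"tokyo", b"lima"]
survivors = [k for k in probe_keys if bloom.might_contain(k)]
```

Every key that really is on the build side always survives. Keys that are absent are usually dropped, which is what saves work in the probe, and the fewer rows the filter can eliminate, the less it pays for the cost of building it.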

This initial foray was promising. But the real gains came when we linked improvements across related parts of the engine.

The Unexpected Speed Bottleneck in DISTINCT

While optimising DISTINCT, we stumbled upon something eye-opening.

Our Cython implementation of DISTINCT (which was already faster than our original pure-Python version that we also derived from pyarrow_ops) converted PyArrow columns to NumPy arrays, hashed them in Python, and combined those hashes. Reasonable, right?

Except we found that when hashing a 100 million row dataset, 7 seconds were spent just converting Arrow to NumPy, before doing anything meaningful.

That was unacceptable.

So, we rethought the internals. We updated DISTINCT to access PyArrow’s memory buffers directly. In most cases, we didn’t even need to interpret the values — we just treated them as raw bytes and passed them straight into our hash function. No decoding. No conversions. Minimal allocations.
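The “treat them as raw bytes” idea can be shown without PyArrow or Cython. In this sketch a standard-library array stands in for a fixed-width Arrow buffer, and zlib.crc32 stands in for the hash function. Both are assumptions made to keep the example self-contained, and the sketch ignores hash collisions, which a real engine has to handle.

```python
import array
import zlib

# Stand-in for a fixed-width column buffer: int64 values packed back to back.
column = array.array("q", [7, 3, 7, 42, 3, 7])
buffer = memoryview(column).cast("B")  # view the same memory as raw bytes, no copy
width = column.itemsize                # 8 bytes per value

seen = set()
distinct_rows = []
for row in range(len(column)):
    raw = buffer[row * width:(row + 1) * width]  # slice of the buffer, still no decode
    h = zlib.crc32(raw)  # illustrative hash; a real engine would use a stronger 64-bit hash
    if h not in seen:
        seen.add(h)
        distinct_rows.append(row)

print(distinct_rows)  # [0, 1, 3]
```

At no point are the eight bytes turned back into a Python integer. The hash only needs the bytes, which is why the conversion step could be dropped entirely.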

The result? DISTINCT ran up to 15x faster in lab conditions.

A Shared Foundation: JOIN and DISTINCT

JOIN and DISTINCT may serve different purposes, but at their core, they both rely on building and using collections of hashes. So we applied the same philosophy to our joins.

We:

Rewrote the Bloom filter to use the same method to hash values
Rebuilt the hash map generation on the build side
Reimplemented the probe logic, which had now become the bottleneck

The improvements were immediate and substantial.

In contrived benchmark cases (e.g. 10 million vs 10 million integer INNER JOIN with no matches), we saw a 10x speedup. In more representative real-world workloads, joins were around 40% faster on average.
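The build and probe steps mentioned above follow the classic hash join shape. The sketch below shows that shape in plain Python, using the built-in hash function and hypothetical helper names. It is a teaching example under those assumptions, not the Cython code in the engine.

```python
from collections import defaultdict


def build(keys):
    """Build side: map each key's hash to the row indices that carry it."""
    table = defaultdict(list)
    for row, key in enumerate(keys):
        table[hash(key)].append(row)
    return table


def probe(table, build_keys, probe_keys):
    """Probe side: yield (build_row, probe_row) pairs for matching keys."""
    for probe_row, key in enumerate(probe_keys):
        for build_row in table.get(hash(key), ()):
            # Compare the real values to rule out hash collisions.
            if build_keys[build_row] == key:
                yield build_row, probe_row


left = [10, 20, 30, 20]
right = [20, 40, 10]

table = build(left)
pairs = sorted(probe(table, left, right))
print(pairs)  # [(0, 2), (1, 0), (3, 0)]
```

The probe loop runs once per probe-side row, so when the build side is already fast, any cost inside that loop dominates the join.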

What’s Next?

This isn’t the end — it’s the start of a new performance frontier. We’re exploring:

SIMD acceleration
More join-type-specific tuning (e.g. LEFT OUTER vs INNER)

Opteryx is designed for speed. With this new join pipeline, we’re doubling down on that promise.

Thank you for your continued support and feedback. As always, we encourage you to reach out with any insights or issues you encounter — we’re here to listen and improve together.

You can check out Opteryx for yourself: drop into GitHub and give us a star!

How We’ve Supercharged Join Performance in Opteryx was originally published in Opteryx Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Strengthening Opteryx: A Deeper Look at Our Evolving Testing Practices

Justin Joyce — Sat, 31 Aug 2024 13:23:44 GMT

At Opteryx, quality and reliability have always been at the forefront of our development process. Since our last major release, we’ve been taking significant steps to reinforce our testing suite. The number of CI tests has risen from 2987 to 3931 at last count (about a 32% increase). However, the increase in test cases has brought with it some interesting challenges and insights, particularly regarding our coverage statistics and the quality of our regression testing.

A quick summary of our testing approach can be found here. As a generalization, we’ve preferred end-to-end testing (do SQL statements create the correct result) over unit tests.

The Testing Paradox: More Tests, Same Coverage

One of the most intriguing aspects of our recent testing efforts is that, despite adding nearly 1000 new tests, our code coverage has increased by only 1%, and an initial analysis suggests this change is more likely to come from the removal of code than from the new tests.

The explanation lies in the nature of the new tests we’ve added. Most of these tests are focused on variation and combination testing rather than new functionality. What we were finding were edge cases, where a subtle variation in the query resulted in an error.
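Combination testing of this kind is easy to generate mechanically. The sketch below assumes simple arithmetic expressions as the feature under test and uses hypothetical operand and operator lists. It only produces the statements and the values they should return; running them against the engine is left out.

```python
import itertools
import operator

# Operand spellings and operators to combine; extend these lists to widen coverage.
OPERANDS = ["12", "12.0"]
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_cases():
    """Yield (sql_expression, expected_value) for every operand/operator combination."""
    for left, symbol, right in itertools.product(OPERANDS, OPERATORS, OPERANDS):
        expected = OPERATORS[symbol](float(left), float(right))
        yield f"SELECT {left} {symbol} {right}", expected


cases = list(generate_cases())
print(len(cases))  # 12
for sql, expected in cases[:3]:
    print(sql, "->", expected)
```

Two operand spellings and three operators already give twelve statements, all of which exercise code that a coverage tool would report as covered after the first one.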

Addressing New Bug Reports

One of the key drivers for reviewing our test approach was users reporting issues we thought should have been caught during regression testing.

To address this, we’ve taken a two-pronged approach:

1. Improving Test Variety: We’re focusing on expanding our tests to cover more variations of scenarios. This includes adding more tests that challenge the assumption that if one variation works, all similar ones will work. For example, just because 12.0*12 and 12+12 work, doesn’t mean 12.0+12 will work.

2. Introducing the Join Fuzzer: Recognizing the importance of joins in SQL operations, we’ve developed a new tool, the Join Fuzzer. This tool systematically generates and tests a wide range of join scenarios, helping us identify and fix bugs that might slip through more conventional tests.
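The general shape of a join fuzzer can be sketched in a few lines. In this example sqlite3 stands in for the engine under test, a nested-loop join acts as the reference, and the inputs are random integers. All of these are assumptions for illustration; the real Join Fuzzer covers far more join types and data shapes.

```python
import random
import sqlite3


def reference_inner_join(left, right):
    """Nested-loop join: slow but obviously correct, so it serves as the oracle."""
    return sorted((l, r) for l in left for r in right if l == r)


def engine_inner_join(left, right):
    """Run the same join through a SQL engine (sqlite3 stands in for the engine under test)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE l (k INTEGER)")
    conn.execute("CREATE TABLE r (k INTEGER)")
    conn.executemany("INSERT INTO l VALUES (?)", [(v,) for v in left])
    conn.executemany("INSERT INTO r VALUES (?)", [(v,) for v in right])
    rows = conn.execute("SELECT l.k, r.k FROM l INNER JOIN r ON l.k = r.k").fetchall()
    conn.close()
    return sorted(rows)


def fuzz(rounds: int, seed: int = 0) -> int:
    """Generate random key columns and check the engine against the oracle."""
    rng = random.Random(seed)
    for _ in range(rounds):
        left = [rng.randint(0, 5) for _ in range(rng.randint(0, 8))]
        right = [rng.randint(0, 5) for _ in range(rng.randint(0, 8))]
        assert engine_inner_join(left, right) == reference_inner_join(left, right), (left, right)
    return rounds


print(fuzz(50))  # 50 (an AssertionError would name the failing inputs)
```

Seeding the generator keeps any failure reproducible, which matters more than the volume of cases when the goal is to turn a fuzz failure into a regression test.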

What we found was that users were uncovering niche bugs in features which were tested, but usually only in one or two test cases, rather than sweating the test suite with tens of variations to root out edge-case bugs.

The Path Forward: Increasing Test Depth and Breadth

While our existing tests provide a solid foundation, we know there’s more work to be done. Our focus moving forward is to broaden our test coverage in ways that truly matter for our users.

Conclusion

Our recent efforts have reinforced our commitment to delivering a reliable and robust SQL Query Engine. While the increase in test cases has highlighted some interesting challenges, it has also provided us with invaluable insights that will guide our future development.

The introduction of the Join Fuzzer and our focus on expanding test coverage in more complex areas are just the beginning. We are dedicated to ensuring that Opteryx not only meets but exceeds the expectations of our growing user base.

Thank you for your continued support and feedback. As always, we encourage you to reach out with any insights or issues you encounter — we’re here to listen and improve together.

You can check out Opteryx for yourself: drop into GitHub and give us a star!

Strengthening Opteryx: A Deeper Look at Our Evolving Testing Practices was originally published in Opteryx Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Opteryx 0.15 Release

Justin Joyce — Mon, 27 May 2024 16:19:53 GMT

We are pleased to announce the release of Opteryx 0.15, which focuses on significant performance improvements, new experimental features, and important reliability and usability enhancements.

Performance Improvements

In typical query execution, I/O operations, particularly those involving cloud storage, can account for up to 90% of the total execution time. Opteryx 0.15 addresses this by introducing parallelized reading from Google Cloud Storage, which significantly reduces data retrieval times. Combined with caching improvements, this optimization can result in up to a 500% improvement in performance for specific workloads, enhancing overall query execution speed.

Query execution has also improved through additional optimization techniques and algorithmic changes. Filters are now pushed down into sub-queries and the UNNEST operation, reducing data volumes, and a new Heap Sort fused operator combines LIMIT and ORDER BY into a single efficient process.
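To see why fusing LIMIT and ORDER BY helps, consider the sketch below. It uses Python’s heapq module on a made-up list of rows, and the order_by_limit function is a hypothetical name. It illustrates the general heap-based top-N technique, not the operator as implemented in Opteryx.

```python
import heapq


def order_by_limit(rows, key, limit):
    """Return the first `limit` rows in ascending `key` order without a full sort.

    heapq.nsmallest gives the same result as sorted(rows, key=key)[:limit].
    """
    return heapq.nsmallest(limit, rows, key=key)


rows = [
    {"id": "a", "score": 7},
    {"id": "b", "score": 1},
    {"id": "c", "score": 9},
    {"id": "d", "score": 0},
]

# Equivalent in spirit to: SELECT id FROM rows ORDER BY score LIMIT 2
top = order_by_limit(rows, key=lambda r: r["score"], limit=2)
print([r["id"] for r in top])  # ['d', 'b']
```

A separate sort followed by a limit orders every row before discarding most of them. The fused form only ever needs to track the best few candidates seen so far.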

Experimental Features

Full-Text Search Enhancements: Opteryx 0.15 introduces experimental support for the MATCH AGAINST feature, enabling advanced full-text search capabilities. This capability is likely to change in the future so should not be used in production systems.

Initial SQL VIEW Support: This version includes initial support for SQL VIEWs, allowing users to create virtual tables based on query results. This feature is experimental and will be further developed in future releases.

Reliability and Usability Improvements

Insights from the ClickBench benchmark have driven several improvements in this release.

New Permissions Capabilities

New capabilities have been introduced to help users build more intelligent permissions overlays, allowing for more granular and secure data access controls. Users can now implement custom permissions models that ensure data security and compliance with access policies.

Conclusion

Opteryx 0.15 delivers substantial performance improvements, new experimental features, and enhanced reliability and usability. We encourage users to upgrade to this latest version to benefit from these advancements.

We value your feedback and look forward to your continued support and contributions as we work to further improve Opteryx. Join us on GitHub or drop in to give us a star.

Thank you for being a part of the Opteryx community.

Opteryx 0.15 Release was originally published in Opteryx Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.