feat(server): NUMA awareness #2412

Merged
spetz merged 31 commits into apache:master from tungtose:numa-awareness
Dec 15, 2025

@tungtose
Contributor

Flow

[Image: numaaware]

This PR addresses #2387.

Benchmark

I tried to launch an EC2 machine with multiple NUMA nodes, but AWS wouldn't allow it. They require me to submit a ticket and wait for approval.

Here is the bench on my local machine: Intel(R) Core(TM) Ultra 9 285H

Bench cmd: target/release/iggy-bench -m 200 -r 800MB pp tcp

Before:

2025-11-26T07:22:39.247127Z INFO bench_report::prints: Benchmark: Pinned Producer, 8 producers, 8 streams, 1 topic per stream, 1 partitions per topic, 8000000 messages, 1000 messages per batch, 8000 message batches, 100 bytes per message, 800MB of data processed
2025-11-26T07:22:39.247182Z INFO bench_report::prints: Producers Results: Total throughput: 790.38 MB/s, 7903786 messages/s, average throughput per Producer: 98.80 MB/s, p50 latency: 0.80 ms, p90 latency: 1.30 ms, p95 latency: 1.45 ms, p99 latency: 1.80 ms, p999 latency: 2.49 ms, p9999 latency: 15.00 ms, average latency: 0.84 ms, median latency: 0.80 ms, min: 0.26 ms, max: 8.52 ms, std dev: 0.06 ms, total time: 1.11 s

After:

Benchmark: Pinned Producer, 8 producers, 8 streams, 1 topic per stream, 1 partitions per topic, 8000000 messages, 1000 messages per batch, 8000 message batches, 100 bytes per message, 800MB of data processed
2025-11-26T07:29:38.128852Z INFO bench_report::prints: Producers Results: Total throughput: 798.21 MB/s, 7982140 messages/s, average throughput per Producer: 99.78 MB/s, p50 latency: 0.46 ms, p90 latency: 0.77 ms, p95 latency: 0.92 ms, p99 latency: 1.83 ms, p999 latency: 3.58 ms, p9999 latency: 4.32 ms, average latency: 0.51 ms, median latency: 0.46 ms, min: 0.17 ms, max: 1.78 ms, std dev: 0.04 ms, total time: 1.00 s
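
To make the before/after comparison easier to read, here is a minimal sketch that computes the relative change for the headline numbers in the two reports above. The figures are copied from the logs; the helper name `percent_change` is only illustrative. Keep in mind this is a single run on a single-node laptop, so the differences should be read as indicative rather than conclusive.

```rust
/// Percentage change from `before` to `after`; a negative value is a reduction.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // (metric, before, after), taken from the two benchmark reports.
    let metrics = [
        ("throughput MB/s", 790.38, 798.21),
        ("p50 latency ms", 0.80, 0.46),
        ("p90 latency ms", 1.30, 0.77),
        ("p95 latency ms", 1.45, 0.92),
        ("p99 latency ms", 1.80, 1.83),
        ("p999 latency ms", 2.49, 3.58),
        ("p9999 latency ms", 15.00, 4.32),
        ("average latency ms", 0.84, 0.51),
    ];

    for (name, before, after) in metrics {
        println!(
            "{name:<20} {before:>8.2} -> {after:>8.2} ({:+.1}%)",
            percent_change(before, after)
        );
    }
}
```

Throughput is essentially unchanged (about +1%), while p50 and average latency drop by roughly 40%. The p99 and p999 values are slightly higher in the "after" run, so the tail is not uniformly better.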

@hubcio
Contributor

hubcio commented Nov 26, 2025

I have a Ryzen 9 9950X3D, and I see these log lines (using the default config):

2025-11-26T08:16:44.969801Z  INFO main iggy_server: Using 0 shards with 0 NUMA node, 0 Core per node, and avoid hyperthread true
2025-11-26T08:16:44.969809Z  INFO main iggy_server: Using mimalloc allocator
2025-11-26T08:16:44.969812Z  INFO main iggy_server: Starting 16 shard(s)

Here's my NUMA topology (I just printed the NumaTopology, built with the hwlocality crate, in ShardAllocator::new()):

{
    topology: Topology {
        is_abi_compatible: true,
        build_flags: BuildFlags(
            0x0,
        ),
        is_this_system: true,
        feature_support: FeatureSupport {
            discovery: Some(
                DiscoverySupport {
                    pu_count: true,
                    numa_count: true,
                    numa_memory: true,
                },
            ),
            cpubind: Some(
                CpuBindingSupport {
                    set_current_process: true,
                    get_current_process: true,
                    set_process: true,
                    get_process: true,
                    set_current_thread: true,
                    get_current_thread: true,
                    set_thread: true,
                    get_thread: true,
                    get_current_process_last_cpu_location: true,
                    get_process_last_cpu_location: true,
                    get_current_thread_last_cpu_location: true,
                },
            ),
            membind: Some(
                MemoryBindingSupport {
                    set_current_process: false,
                    get_current_process: false,
                    set_process: false,
                    get_process: false,
                    set_current_thread: true,
                    get_current_thread: true,
                    set_area: true,
                    get_area: true,
                    get_area_memory_location: true,
                    allocate_bound: true,
                    first_touch_policy: true,
                    bind_policy: true,
                    interleave_policy: true,
                    next_touch_policy: false,
                    migrate_flag: true,
                },
            ),
        },
        type_filter: {
            "Bridge": KeepNone,
            "Core": KeepAll,
            "Group": KeepStructure,
            "L1Cache": KeepAll,
            "L1ICache": KeepNone,
            "L2Cache": KeepAll,
            "L2ICache": KeepNone,
            "L3Cache": KeepAll,
            "L3ICache": KeepNone,
            "L4Cache": KeepAll,
            "L5Cache": KeepAll,
            "Machine": KeepAll,
            "Misc": KeepNone,
            "NUMANode": KeepAll,
            "OSDevice": KeepNone,
            "PCIDevice": KeepNone,
            "PU": KeepAll,
            "Package": KeepAll,
        },
        objects_per_depth: [
            (
                "0",
                [
                    Machine with CpuSet(0-31) (
                      total=96336932KB,
                      DMIProductName=MS-7E59,
                      DMIProductVersion=2.0,
                      DMIBoardVendor="Micro-Star International Co., Ltd.",
                      DMIBoardName="MAG X870E TOMAHAWK WIFI (MS-7E59)",
                      DMIBoardVersion=2.0,
                      DMIBoardAssetTag="To be filled by O.E.M.",
                      DMIChassisVendor="Micro-Star International Co., Ltd.",
                      DMIChassisType=3,
                      DMIChassisVersion=2.0,
                      DMIChassisAssetTag="To be filled by O.E.M.",
                      DMIBIOSVendor="American Megatrends International, LLC.",
                      DMIBIOSVersion=2.A91,
                      DMIBIOSDate=09/09/2025,
                      DMISysVendor="Micro-Star International Co., Ltd.",
                      Backend=Linux,
                      LinuxCgroup=/user.slice/user-1000.slice/user@1000.service/app.slice/app-Alacritty@a6178dd85eea4b1e910e4663bd122a64.service,
                      OSName=Linux,
                      OSRelease=6.17.9-2-cachyos,
                      OSVersion="#1 SMP PREEMPT_DYNAMIC Tue, 25 Nov 2025 01:13:51 +0000",
                      HostName=atlas,
                      Architecture=x86_64,
                      hwlocVersion=2.12.2,
                      ProcessName=iggy-server
                    ),
                ],
            ),
            (
                "1",
                [
                    Package with CpuSet(0-31) (
                      total=96336932KB,
                      CPUVendor=AuthenticAMD,
                      CPUFamilyNumber=26,
                      CPUModelNumber=68,
                      CPUModel="AMD Ryzen 9 9950X3D 16-Core Processor          ",
                      CPUStepping=0
                    ),
                ],
            ),
            (
                "2",
                [
                    Die with CpuSet(0-7,16-23),
                    Die with CpuSet(8-15,24-31),
                ],
            ),
            (
                "3",
                [
                    L3Cache with CpuSet(0-7,16-23) (
                      size=98304KB,
                      linesize=64,
                      ways=16,
                      Inclusive=0
                    ),
                    L3Cache with CpuSet(8-15,24-31) (
                      size=32768KB,
                      linesize=64,
                      ways=16,
                      Inclusive=0
                    ),
                ],
            ),
            (
                "4",
                [
                    L2Cache with CpuSet(0,16) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(1,17) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(2,18) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(3,19) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(4,20) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(5,21) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(6,22) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(7,23) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(8,24) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(9,25) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(10,26) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(11,27) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(12,28) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(13,29) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(14,30) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                    L2Cache with CpuSet(15,31) (
                      size=1024KB,
                      linesize=64,
                      ways=16,
                      Inclusive=1
                    ),
                ],
            ),
            (
                "5",
                [
                    L1dCache with CpuSet(0,16) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(1,17) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(2,18) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(3,19) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(4,20) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(5,21) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(6,22) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(7,23) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(8,24) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(9,25) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(10,26) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(11,27) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(12,28) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(13,29) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(14,30) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                    L1dCache with CpuSet(15,31) (
                      size=48KB,
                      linesize=64,
                      ways=12,
                      Inclusive=0
                    ),
                ],
            ),
            (
                "6",
                [
                    Core with CpuSet(0,16),
                    Core with CpuSet(1,17),
                    Core with CpuSet(2,18),
                    Core with CpuSet(3,19),
                    Core with CpuSet(4,20),
                    Core with CpuSet(5,21),
                    Core with CpuSet(6,22),
                    Core with CpuSet(7,23),
                    Core with CpuSet(8,24),
                    Core with CpuSet(9,25),
                    Core with CpuSet(10,26),
                    Core with CpuSet(11,27),
                    Core with CpuSet(12,28),
                    Core with CpuSet(13,29),
                    Core with CpuSet(14,30),
                    Core with CpuSet(15,31),
                ],
            ),
            (
                "7",
                [
                    PU with CpuSet(0),
                    PU with CpuSet(16),
                    PU with CpuSet(1),
                    PU with CpuSet(17),
                    PU with CpuSet(2),
                    PU with CpuSet(18),
                    PU with CpuSet(3),
                    PU with CpuSet(19),
                    PU with CpuSet(4),
                    PU with CpuSet(20),
                    PU with CpuSet(5),
                    PU with CpuSet(21),
                    PU with CpuSet(6),
                    PU with CpuSet(22),
                    PU with CpuSet(7),
                    PU with CpuSet(23),
                    PU with CpuSet(8),
                    PU with CpuSet(24),
                    PU with CpuSet(9),
                    PU with CpuSet(25),
                    PU with CpuSet(10),
                    PU with CpuSet(26),
                    PU with CpuSet(11),
                    PU with CpuSet(27),
                    PU with CpuSet(12),
                    PU with CpuSet(28),
                    PU with CpuSet(13),
                    PU with CpuSet(29),
                    PU with CpuSet(14),
                    PU with CpuSet(30),
                    PU with CpuSet(15),
                    PU with CpuSet(31),
                ],
            ),
            (
                "<NUMANode>",
                [
                    NUMANode with CpuSet(0-31) (
                      local=96336932KB,
                      total=96336932KB
                    ),
                ],
            ),
        ],
        memory_parents_depth: Ok(
            PositiveInt(1),
        ),
        cpuset: CpuSet(0-31),
        complete_cpuset: CpuSet(0-31),
        allowed_cpuset: CpuSet(0-31),
        nodeset: NodeSet(0),
        complete_nodeset: NodeSet(0),
        allowed_nodeset: NodeSet(0),
        distances: Ok(
            [],
        ),
    },
    node_count: 1,
    physical_cores_per_node: [
        16,
    ],
    logical_cores_per_node: [
        32,
    ],
}
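
The Core entries in the dump show each physical core exposing two PUs (i and i + 16), which matches physical_cores_per_node = 16, logical_cores_per_node = 32, and the "Starting 16 shard(s)" line with avoid hyperthread set to true. As a minimal sketch of that relationship, assuming one shard per selected CPU and that the first PU of each core is the one kept: `Core` and `shard_cpus` below are hypothetical names for illustration, not the PR's ShardAllocator.

```rust
/// One physical core, described by the logical CPUs (PUs) it exposes.
/// Hypothetical type for illustration only.
struct Core {
    pus: Vec<u32>,
}

/// Hypothetical helper: returns the CPU ids shards could be pinned to.
/// With `avoid_hyperthread`, only the first PU of each core is kept.
fn shard_cpus(cores: &[Core], avoid_hyperthread: bool) -> Vec<u32> {
    let mut cpus: Vec<u32> = cores
        .iter()
        .flat_map(|core| {
            let take = if avoid_hyperthread { 1 } else { core.pus.len() };
            core.pus.iter().copied().take(take)
        })
        .collect();
    cpus.sort_unstable();
    cpus
}

fn main() {
    // Mirrors the dump: 16 cores, core i exposes PUs i and i + 16.
    let cores: Vec<Core> = (0..16).map(|i| Core { pus: vec![i, i + 16] }).collect();

    let physical = shard_cpus(&cores, true);
    let logical = shard_cpus(&cores, false);

    println!("avoid hyperthread: {} CPUs {:?}", physical.len(), physical);
    println!("all PUs:           {} CPUs {:?}", logical.len(), logical);
}
```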

@hubcio
Contributor

hubcio commented Nov 26, 2025

But if I run this program on my PC:

use hwlocality::object::types::ObjectType;
use hwlocality::topology::Topology;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let topo = Topology::new()?;
    let numa_count = topo.objects_with_type(ObjectType::NUMANode).len();

    println!("NUMA nodes: {numa_count}");
    Ok(())
}

I get: NUMA nodes: 1

@tungtose
Contributor Author

Thanks @hubcio. It is indeed one node. The problem is wrong logic in the info log: the message means "using node number 0", not "0 nodes". Let me update that.

@hubcio
Contributor

hubcio commented Dec 12, 2025

I checked your PR on AWS. Unfortunately, I'm not able to book EC2 instances with more than one NUMA node, but on my PC it works fine.

I'm gonna need you to fix one more thing: when building for musl targets, please use vendored hwloc because we don't want to link with glibc.

[target.'cfg(target_env = "musl")'.dependencies]
hwlocality = { version = "1.0.0-alpha.11", features = ["vendored"] }

[target.'cfg(not(target_env = "musl"))'.dependencies]
hwlocality = { version = "1.0.0-alpha.11" }

Other than that LGTM.
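
The same `target_env = "musl"` predicate used in the Cargo.toml split above is also available inside Rust code through `cfg!`. As a small sketch, a build could report which hwloc linkage the current target would select; the function name `hwloc_linkage` is hypothetical and not part of the PR.

```rust
/// Hypothetical helper: names the hwloc linkage implied by the
/// target-specific dependency sections for the current build target.
fn hwloc_linkage() -> &'static str {
    if cfg!(target_env = "musl") {
        // musl builds enable the `vendored` feature of hwlocality.
        "vendored"
    } else {
        // Other targets link against the hwloc installed on the system.
        "system"
    }
}

fn main() {
    println!("hwloc linkage for this target: {}", hwloc_linkage());
}
```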

@tungtose
Contributor Author

Thanks @hubcio, I have updated it, along with the CI fix.

@spetz spetz merged commit a5d5694 into apache:master Dec 15, 2025
53 checks passed