Stories by Pinterest Engineering on Medium

Making User-Sequence Data More Cost-Efficient, Faster, and Easier to Use

Pinterest Engineering — Thu, 21 May 2026 16:01:00 GMT

Authors (listed alphabetically)
Ads Feature Engineering Infra team: Ajay Venkatakrishnan, Le Zhang
Core ML Infra team: Eric Shang, Pihui Wei
ML Data team: Connor Votroubek, Yi He
User Understanding team: Camilo Munoz, Simin Li

If you work on ranking, retrieval, or recommendation systems, you’ve probably asked for some version of the same thing: “Give me the last N meaningful actions this user took, with the right enrichments, in a format that’s easy to train and serve ML models.”

On paper, that sounds simple. In practice, “user sequences” often become one of the most expensive and fragile parts of the ML data stack.

They end up powering everything from training datasets to offline analysis and online inference, so they need to be fresh and complete at the same time.
They must remain consistent as you add new events and enrichments.
And they have to do all of this while serving latency‑sensitive production workloads.

This article walks through how we redesigned our user‑sequence platform to make these sequences cheaper to run, faster to extend, and easier to debug, while still supporting demanding production use cases.

What We Mean by “User Sequence”

In this context, a user sequence is an ordered list of recent, relevant events for a user, along with the enrichments (signals) attached to each event. Enrichments are the extra signals we attach to raw events to make them useful for models: embeddings (for example, Pin or query representations), contextual features (such as surface, device, or country), and derived attributes or counters that describe how the user interacted with a piece of content over time.

A concrete example helps. Imagine a sequence made up of the last 500 engagements a user had with Pinterest Pins. Each event in that sequence might carry a timestamp, an action type, the surface where the action occurred, and a handful of embedding features or categorical attributes.

As a data primitive, user sequences are powerful. They capture temporal behavior instead of just aggregates like “how many clicks” over a period. They enable sequence‑aware models such as Transformers, sequence encoders, or attention‑over‑history architectures. And because they preserve fairly raw behavior, they can be reused across ranking, retrieval, exploration, anomaly detection, and other workloads.

The catch is that a high‑quality sequence is not just “the N latest events from a log table.” It is the result of a multi‑step process:

Ingest events from diverse sources,
Filter down to the subset of events that matter,
Enrich each event with additional signals (embeddings, metadata, and so on), and
Finally assemble those enriched events into a stable, well‑defined sequence representation.
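
The four steps above can be sketched as a single function. This is a minimal illustration, not the platform's implementation: the `Event` fields, the `build_sequences` name, and the `keep`/`enrich` callables are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Event:
    user_id: str
    timestamp: int                      # epoch seconds
    action: str                         # e.g. "click", "save"
    surface: str                        # e.g. "HF", "RP", "SR"
    enrichments: Dict[str, object] = field(default_factory=dict)


def build_sequences(
    raw_events: Iterable[Event],
    keep: Callable[[Event], bool],
    enrich: Callable[[Event], Dict[str, object]],
    max_len: int,
) -> Dict[str, List[Event]]:
    """Filter, enrich, and assemble events into per-user sequences."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    per_user: Dict[str, List[Event]] = {}
    for event in raw_events:                      # 1. ingest
        if not keep(event):                       # 2. filter
            continue
        event.enrichments.update(enrich(event))   # 3. enrich
        per_user.setdefault(event.user_id, []).append(event)
    # 4. assemble: oldest-to-newest order, truncated to the latest max_len events
    return {
        user_id: sorted(events, key=lambda e: e.timestamp)[-max_len:]
        for user_id, events in per_user.items()
    }
```

Even in this toy form, the filter and enrichment logic are passed in rather than hard-coded, which is what lets one assembly routine serve many event types.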

Doing this once is easy. Doing it in a way that supports many teams, many event types, and many models over multiple years is where things get interesting.

Context: Where Sequences Show Up and Why Quality Is Hard

User sequences sit underneath almost every user-facing surface: Home Feed (HF), Related Pins (RP), Search Results (SR), and many others. They power both organic products and ads across these surfaces at Pinterest, so any regression in sequence quality shows up quickly in user experience and revenue.

From an infrastructure point of view, they show up in three main places.

In training datasets, offline pipelines pull long history windows of enriched events per user in order to build sequence features.
In offline analysis, data scientists dissect user behavior across sessions, surfaces, or campaigns using sequence‑level queries.
And in online inference, real‑time services fetch up‑to‑date user sequences at request time to feed ranking and retrieval models.

Across these use cases, sequence quality turns out to be multi‑dimensional. Freshness measures how quickly new events and enrichments show up in the sequence. Completeness asks whether late‑arriving events, corrections, or backfills are eventually reflected. Consistent enrichment is about ensuring that the same enrichments are available across streaming and batch, and that training and serving see aligned data. Stable schemas matter as well: downstream consumers need schemas to be versioned and predictable, not silently changed.

One more constraint is that this is a multi‑tenant platform. It has to support many teams and models, each with different needs and lifecycles. That makes correctness, observability, and operability just as important as raw throughput or latency.

Goals (and Non‑Goals)

When we stepped back to redesign the platform, we framed the work with a small set of explicit goals and non‑goals.

Goals

Provide a consistent “events → enriched signals → sequences” contract.
Downstream consumers such as ML engineers and data scientists should see a stable, well‑defined interface that explains how events are filtered, enriched, and assembled into sequences, independent of the underlying runtime.
Improve cost‑efficiency at scale.
The platform should reduce storage and network usage for sequence data while keeping latency and reliability appropriate for online use.
Make onboarding new event types and enrichments faster and safer.
Adding a new signal or event type should mostly look like changing configuration and a small piece of well‑scoped code, instead of standing up a new bespoke pipeline.
Support both real‑time and batch production paths.
We want low‑latency updates for serving alongside batch backfills for historical coverage and corrections, with a clear policy for how the two paths merge.

Non‑Goals

To keep the scope tractable, we did not redesign downstream models or ranking architectures; the focus is on the platform that feeds them.
We also did not change the product definition of events (what counts as a click, a save, or a conversion). Those semantics remain owned by product and logging teams.

The Core Idea: One Definition, Many Runtimes

The key organizing principle for the redesign was simple:

Define a signal or event type once, then instantiate it consistently across multiple runtimes.

A signal definition captures which raw events to use, which enrichments to apply, and how to assemble enriched events into a sequence. That same definition is then consumed by three different kinds of workloads:

Real‑time indexing for low‑latency updates.
Batch indexing and backfill for historical data and corrections.
Online serving for fetching sequences at inference time.

This “one definition, many runtimes” approach avoids the classic split‑brain failure mode where training pipelines build sequences one way from batch tables while serving systems assemble sequences a different way from online stores. Over time, those two views naturally drift apart in subtle ways.

Instead, we rely on a single configuration surface plus a shared execution engine to keep indexing, training and serving aligned.

Architecture Overview

System Architecture Diagram

At a high level, the platform is composed of six major pieces that work together.

Ingestion (stream and batch).
Streaming ingestion handles real‑time events, while batch ingestion reads from data‑warehouse tables, log archives, or snapshots.
Enrichment and execution layer.
A shared execution engine turns raw events into enriched records based on configuration: filters, joins, and transforms. The same engine powers both streaming and batch pipelines.
Real‑time indexer.
A streaming job filters incoming events, converts them into a normalized representation, applies enrichments, and writes incremental updates to a time‑versioned store suitable for low‑latency reads.
Batch indexer and backfill pipeline.
Scheduled batch jobs read historical raw events, apply the same filter and enrichment definitions, and produce longer sequences along with reusable intermediate datasets for backfills and offline consumption.
Columnar, time‑partitioned storage.
Sequence data is stored in a columnar layout so models can read exactly the fields they need. Time partitioning keeps writes and scans focused on relevant windows, and the dataset layout supports both long‑sequence use cases and efficient truncation for shorter windows.
Online serving API.
Finally, a serving layer exposes a clean API for requesting user sequences by signal or feature name. It fetches the right columns from storage, performs request-time enrichments, and applies any final selection or trimming logic, such as “last N events within this time window.”

From the perspective of a model or client team, this all collapses into a simple contract:

Request sequence X for user U, and you’ll get a well‑defined schema of enriched events, with a documented freshness and completeness profile.
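
To make the contract concrete, here is a rough sketch of a request that selects columns and trims to the last N events within a time window. The in-memory `store` layout and the `get_sequence` signature are assumptions for illustration; the real serving API is not described in this post.

```python
from typing import Dict, List

# Hypothetical stand-in for the columnar store:
# user_id -> column name -> values aligned by event index, oldest first.
Store = Dict[str, Dict[str, List[object]]]


def get_sequence(
    store: Store,
    user_id: str,
    columns: List[str],
    last_n: int,
    min_timestamp: int,
) -> Dict[str, List[object]]:
    """Return only the requested columns for the last_n events at or after min_timestamp."""
    user_columns = store.get(user_id, {})
    timestamps = user_columns.get("timestamp", [])
    in_window = [i for i, ts in enumerate(timestamps) if ts >= min_timestamp]
    # Guard last_n <= 0 explicitly, because a [-0:] slice would return everything.
    keep = in_window[-last_n:] if last_n > 0 else []
    return {name: [user_columns[name][i] for i in keep] for name in columns}
```

A caller that needs only timestamps and action types never touches the embedding column, which is the property the columnar layout is meant to provide.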

Design Decision 1: Configuration‑as‑Code for Sequences and Enrichments

What We Did

We moved sequence and enrichment definitions into configuration‑as‑code, expressed in a regular programming language (Python) with a well‑defined schema.

Our configurations describe which sequence features exist, how they’re named, and basic metadata such as owners, retention, and lifecycle stage. Event‑type configuration describes, for each event type, which enrichments apply, what filtering logic to use, and what data sources to read from. Enrichment configuration explains how to fetch or derive additional signals (for example, embeddings) and how to map them into the event schema.

These configurations are validated, compiled into a portable JSON format, stored in managed internal object storage, and then consumed by the shared execution engine across streaming, batch and serving jobs.
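
As a rough sketch of what "validated, then compiled to JSON" could look like, the example below models the three configuration layers as Python dataclasses. Every class name, field name, and allowed lifecycle value here is hypothetical; the actual schema is internal to Pinterest.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

ALLOWED_STAGES = {"experimental", "production", "deprecated"}  # assumed values


@dataclass(frozen=True)
class EnrichmentConfig:
    name: str
    source: str          # where the signal is fetched or derived from
    output_field: str    # field in the event schema it maps to


@dataclass(frozen=True)
class EventTypeConfig:
    event_type: str
    sources: List[str]
    filter_expr: str
    enrichments: List[EnrichmentConfig] = field(default_factory=list)


@dataclass(frozen=True)
class SequenceFeatureConfig:
    name: str
    owner: str
    retention_days: int
    lifecycle_stage: str
    event_types: List[EventTypeConfig] = field(default_factory=list)


def validate(config: SequenceFeatureConfig) -> None:
    """Reject configurations that would be ambiguous or unsafe to run."""
    if not config.name or not config.owner:
        raise ValueError("name and owner are required")
    if config.retention_days <= 0:
        raise ValueError("retention_days must be positive")
    if config.lifecycle_stage not in ALLOWED_STAGES:
        raise ValueError(f"unknown lifecycle stage: {config.lifecycle_stage}")
    for event_type in config.event_types:
        outputs = [e.output_field for e in event_type.enrichments]
        if len(outputs) != len(set(outputs)):
            raise ValueError(f"duplicate output fields in {event_type.event_type}")


def compile_to_json(config: SequenceFeatureConfig) -> str:
    """Validate, then emit a deterministic JSON document for the runtimes to load."""
    validate(config)
    return json.dumps(asdict(config), sort_keys=True, indent=2)
```

Sorting keys keeps the compiled output stable across runs, so a diff of two compiled configs shows only real changes.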

Why It Mattered

This approach made onboarding dramatically faster. New event types or enrichments can now be added primarily through configuration, plus small, isolated pieces of code where absolutely necessary, instead of via entirely new pipelines. That significantly reduces the concept‑to‑production time for new signals.

Treating configuration as code also improved reviewability and safety. Diffs are human‑readable, code owners can review changes, rollbacks are straightforward, and version history lives in standard version control systems.

A clearer separation of concerns followed naturally. ML and product teams focus on what they want (events, features, and filters) while platform teams focus on how to execute that configuration reliably and efficiently.

Design Decision 2: Shared Execution Engine with Pluggable Executors

The Concept

We introduced a shared execution engine responsible for reading configuration, connecting to data sources (Kafka, logs, tables, feature stores), running filtering and featurization, calling enrichment services or joining against offline tables, and finally writing enriched results to storage.

Within this engine, an executor is a plugin that converts a raw event into one or more enriched records. In plain terms, the executor is the “business logic module” for a particular event type or grouping, while the execution engine handles everything around it.

Why It Mattered

The shared engine allowed us to reuse the same core enrichment logic in both streaming jobs that handle near‑real‑time events and batch jobs that process historical data. That minimized code duplication and reduced drift between batch and real‑time behavior.

Practical Boundaries

To keep the system maintainable, we drew a clear line between framework and plugin code.

Framework responsibilities include wiring data sources and sinks, handling concurrency, retries, and backpressure, and parsing and validating configuration. Executors own the business‑specific filtering and featurization logic and the mapping from raw events to normalized user‑event representations.
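
The framework/plugin split can be sketched as an abstract executor plus a small engine loop. This is only an illustration of the boundary: the `Executor` interface, `TransientError`, and `run_engine` are hypothetical names, and the real engine also handles concurrency, backpressure, and configuration parsing.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

RawEvent = Dict[str, object]
EnrichedRecord = Dict[str, object]


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying."""


class Executor(ABC):
    """Plugin contract: business logic for one event type or grouping."""

    @abstractmethod
    def process(self, raw_event: RawEvent) -> List[EnrichedRecord]:
        """Return zero or more enriched records for one raw event."""


class EngagementExecutor(Executor):
    """Example plugin: keep clicks and saves, map them to a normalized record."""

    def process(self, raw_event: RawEvent) -> List[EnrichedRecord]:
        if raw_event.get("action") not in {"click", "save"}:
            return []  # filtered out
        return [{
            "user_id": raw_event["user_id"],
            "timestamp": raw_event["ts"],
            "action": raw_event["action"],
        }]


def run_engine(
    source: Iterable[RawEvent],
    executor: Executor,
    sink: List[EnrichedRecord],
    max_retries: int = 2,
) -> None:
    """Framework side: iterate the source, retry transient failures, write to the sink."""
    for raw_event in source:
        for attempt in range(max_retries + 1):
            try:
                records = executor.process(raw_event)
            except TransientError:
                if attempt == max_retries:
                    raise
                continue
            sink.extend(records)
            break
```

Because the same `Executor` instance can be handed to a streaming loop or a batch loop, the business logic is written once and the two paths cannot drift in how they filter or map events.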

Design Decision 3: Lambda Architecture for Fresh and Complete Sequences

The Challenge

Sequence consumers want two things that naturally pull in opposite directions. On one hand, they need freshness: “I want this morning’s actions reflected in ranking now.” On the other hand, they care about completeness and correctness: “If late events show up tomorrow, I still want my sequences and training data to be right.”

Real‑world data is messy. Events arrive late. Enrichment sources are recomputed or corrected. Backfills introduce new historical coverage months after the fact.

The Approach

To balance these requirements, we adopted a lambda‑style architecture for user sequences.

A streaming path processes events as they arrive and maintains a near‑real‑time view of user sequences for online inference. A batch path periodically recomputes enriched events and sequences from raw historical data, producing long sequences and reusable datasets for backfills and offline analysis.

The two paths cooperate instead of competing. The streaming path maintains the “now” view of the world, while the batch path focuses on “fixing history” and ensuring that training and long‑term analytics see consistent, corrected data.
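
The post does not spell out the merge policy, so the rule in this sketch is an assumption: batch output is treated as authoritative up to a batch watermark, streaming events fill in everything newer, and events are deduplicated by a hypothetical `event_id` field.

```python
from typing import Dict, List

Event = Dict[str, object]


def merge_views(
    batch_events: List[Event],
    streaming_events: List[Event],
    batch_watermark: int,
    max_len: int,
) -> List[Event]:
    """Combine the corrected batch view with the fresh streaming view."""
    # Batch wins for history: it has seen late arrivals and corrections.
    merged = {
        e["event_id"]: e for e in batch_events if e["timestamp"] <= batch_watermark
    }
    # Streaming contributes only what batch has not covered yet.
    for e in streaming_events:
        if e["timestamp"] > batch_watermark:
            merged.setdefault(e["event_id"], e)
    ordered = sorted(merged.values(), key=lambda e: (e["timestamp"], e["event_id"]))
    return ordered[-max_len:] if max_len > 0 else []
```

Other policies are possible, for example preferring whichever copy of an event carries the newer enrichment version. What matters is that the rule is explicit and applied the same way everywhere.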

Design Decision 4: Columnar, Time‑Partitioned Storage with Table Semantics

What We Chose

Before this redesign, we stored sequences as large, consolidated “enriched event” blobs. Every online call or offline scan had to pull the whole payload — even if a model only needed a small subset of features — so request fan‑out turned directly into heavy payload size and I/O on our storage systems.

We moved sequence storage to a columnar, time‑partitioned layout that behaves like a set of tables. Each enrichment or feature lives in its own column, and reads can select only the columns they need for a given model or analysis. Data is partitioned by time bucket so that writes and scans stay constrained to relevant partitions as history grows. Engineers can query these datasets with familiar table abstractions, which makes it easy to compare runs, versions, or backfill strategies by inspecting partitions.

Why It Mattered

This design improved both efficiency and operability. Columnar storage improves compression and reduces network bandwidth by avoiding wide “enriched event” blobs when only a few features are needed. Time partitioning keeps I/O bounded even as the system accumulates long histories.
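
A toy model of the layout shows why both properties bound I/O. The nested-dictionary structure and the `scan` function are illustrative assumptions, not the actual storage format.

```python
from typing import Dict, List

# Hypothetical layout: partition key (ISO date) -> column name -> values.
Partitions = Dict[str, Dict[str, List[object]]]


def scan(
    partitions: Partitions,
    start_day: str,
    end_day: str,
    columns: List[str],
) -> Dict[str, List[object]]:
    """Read only the requested columns from partitions inside [start_day, end_day]."""
    result: Dict[str, List[object]] = {name: [] for name in columns}
    for day in sorted(partitions):
        # ISO dates sort lexicographically, so string comparison prunes partitions.
        if start_day <= day <= end_day:
            for name in columns:
                result[name].extend(partitions[day][name])
    return result
```

Partitions outside the window and columns outside the request are never read, so the cost of a query tracks what the model needs rather than how much history has accumulated.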

Operationally, having clear table semantics makes it much easier to inspect anomalous days or event types, validate new enrichments, and compare old and new pipelines side by side.

Migration, Rollout, and Measurement

Redesigning a platform is one thing; migrating existing production workloads is another. We treated migration as a first‑class project.

Migration Strategy

We followed an event type by event type approach.

For a given event type, we first ran the new pipeline in parallel with the existing one and generated “shadow” sequences. We then compared those shadow outputs to the legacy sequences over a defined period.

Because the new jobs regenerate the data from scratch, we had to accept that the output would not match the legacy data 100%, given the nature of our online systems. We therefore needed thorough validations to show that the new system produced approximately the same sequences as the legacy system.

We settled on two tiers of comparison: an event-level comparison, which checked events matched between the old and new indexing jobs field by field, and a sequence-level comparison, which compared the shadow sequence output with the legacy sequence output. Together with A/B experiments on the new data, these validations gave us confidence that we could swap our pipelines safely with no impact.
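
The post does not name the exact comparison metrics, so the two functions below are one plausible way to express the two tiers: a field-agreement rate over matched events and a Jaccard overlap over the event IDs in a sequence. The `event_id` key and both function names are assumptions.

```python
from typing import Dict, List, Sequence

Event = Dict[str, object]


def event_level_match_rate(
    legacy_events: List[Event],
    shadow_events: List[Event],
    fields: Sequence[str],
) -> float:
    """Share of matched events whose compared fields all agree."""
    shadow_by_id = {e["event_id"]: e for e in shadow_events}
    matched = agreeing = 0
    for legacy in legacy_events:
        shadow = shadow_by_id.get(legacy["event_id"])
        if shadow is None:
            continue  # unmatched events are counted by the sequence-level tier
        matched += 1
        if all(legacy.get(f) == shadow.get(f) for f in fields):
            agreeing += 1
    return agreeing / matched if matched else 0.0


def sequence_overlap(legacy_ids: Sequence[str], shadow_ids: Sequence[str]) -> float:
    """Jaccard overlap between the event IDs of two sequences."""
    legacy_set, shadow_set = set(legacy_ids), set(shadow_ids)
    union = legacy_set | shadow_set
    return len(legacy_set & shadow_set) / len(union) if union else 1.0
```

Tracking both numbers separates two failure modes: events that are present but enriched differently, and events that are missing or extra in the new sequences.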

Once we were confident in the behavior, we performed a controlled cutover by shifting consumers to read from the new architecture. We then iterated the same process across additional event types, steadily deprecating the legacy path.

What We Measured and Achieved

To stay within company policies, we only describe qualitative outcomes here.

On cost, we saw significant infrastructure cost reductions once large event types were fully migrated, primarily because of more efficient storage formats, fewer replicas where appropriate, and lower network transfer per request.

On productivity, the time to onboard new enrichments and event types dropped substantially. Most changes moved from bespoke pipeline work to configuration updates and small, composable executors.

On quality, our major recommendation surfaces saw improved engagement metrics after switching to sequences produced by the new platform, while still staying within quality and safety expectations.

Operational Readiness

Throughout migration and into steady state, we invested heavily in observability and operational hygiene.

We set up dashboards tracking sequence freshness and lag, event and enrichment coverage, schema drift and configuration rollout status, and serving latency and error rates.

These foundations turned out to be crucial. A platform that many teams rely on will eventually have bad days; the difference between a minor blip and a major incident often comes down to whether you can quickly see what went wrong and where.

Future Work

There is still plenty to improve, and many of the directions generalize beyond any single company.

We want richer self‑serve tooling so that adding new signals feels more like filling out a template than editing infrastructure code. That includes wizards for new signals, static analysis for configurations, and automated backfill orchestration for common patterns.

We are also interested in stronger correctness guarantees. Anomaly detection over both indexing and serving paths would further harden the system.

Finally, we plan to broaden coverage and add richer signals. That includes extending sequence coverage to more event types and surfaces and adding higher‑level behavioral abstractions on top of raw event sequences, such as session‑level or object‑level views. The challenge is to do that while preserving the core “events → enriched signals → sequences” contract that keeps the platform coherent.

Acknowledgements

A big thank you to everyone who contributed through discussions, design reviews, and recurring syncs that helped shape and unblock this work. In no particular order:
Alekhya Pyla, Chuxi Wang, Han Wang, Jia Zhan, Kangnan Li, Kyle Soares, Laksh Bhasin (He Him), Nilesh Gohel, Se Won Jang, Xue Xia, Yang Tang, Yi He, Anton Arboleda, Yi Pan

And thank you to Archer Liu, Haoyang Li, Hongbo Deng, Qingxian Lai, Shun-ping Chiu, and Yingjian Ding for their great management support.

Making User-Sequence Data More Cost-Efficient, Faster, and Easier to Use was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

An Engineer’s Guide to Better AI Skills: Implementing a Testing Process to Optimize Agent Performance in Any Repository or Skill

Pinterest Engineering — Tue, 12 May 2026 16:01:00 GMT

Author: Daniel Reed

The tech industry is currently seeing a massive overhaul in the way we work and many are enjoying the benefits of AI agents, particularly when automating engineer workflows and serving domain-specific knowledge. However, relying on agents to consistently invoke a custom skill can be surprisingly unreliable at times.

When adopting a new skill intended to help agents write code for Pinterest’s iOS architecture (I’ll call it rx-mvvm), we discovered that our knowledge skill sometimes wasn’t being loaded into our agents. To address this, we conducted a series of tests on Pin-agent (an internal fork of OpenAI’s Codex) and Claude Code to quantify the reliability of skill invocation and identify some best practices to maximize performance. This was a direct result of observing agents struggling to meet the skills bar during architectural reviews. We found that by applying different techniques we could track and drastically improve skill invocation rates on both tested agents.

How to Build a Skill Test Harness

Building a reliable test harness for agent skill invocation requires three key components working in concert. The Core Tool is a Bash script that orchestrates automated testing by piping prompts to your agent and capturing verbose output logs. The core execution is simple:

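# Pipe the prompt to the agent and capture the streamed JSON log (stdout and stderr).
command_success=false
if echo "$prompt" | claude --print --verbose --output-format stream-json > "$log_file" 2>&1; then
    command_success=true
fi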

The script runs all test cases in sequence, collecting logs for later analysis. We ran the entire suite multiple times to account for the nondeterministic nature of agents. Prompts fell into two categories, each defined as an array:

Positive Cases — 15 prompts covering the full spectrum of skill domains:

CORE_PROMPTS=(
    "load the rx-mvvm-architecture skill"
    "check if this follows rx-mvvm patterns"
    # ... 13 more cases
)

Negative Cases — 5 general programming prompts designed to expose false positives:

EDGE_PROMPTS=(
    "fix this Swift compilation error"
    "write unit tests for this View"
    "refactor this function"
    # ... 2 more cases
)

We then apply log-parsing heuristics to the JSON log files, detecting skill invocation by searching the streamed debug output for telltale patterns:

skill_invoked_claude() {
    local log_file="$1"

    if grep -q '"name":"Skill"' "$log_file" && grep -q '"command":"rx-mvvm-architecture"' "$log_file"; then
        return 0
    elif grep -q 'Launching skill: rx-mvvm-architecture' "$log_file"; then
        return 0
    else
        return 1
    fi
}

The script finally tallies successes across both categories and computes three key metrics with clear formulas:

CORE_SUCCESS_RATE=$(awk "BEGIN {printf \"%.1f\", ($CORE_SKILL_INVOKED / $CORE_TOTAL) * 100}")
EDGE_FALSE_POSITIVE_RATE=$(awk "BEGIN {printf \"%.1f\", ($EDGE_SKILL_INVOKED / $EDGE_TOTAL) * 100}")
OVERALL_ACCURACY=$(awk "BEGIN {printf \"%.1f\", ($TOTAL_CORRECT / $TOTAL_TESTS) * 100}")
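
As a worked example of the same arithmetic, the Python sketch below recomputes the three metrics for one run. It assumes that `TOTAL_CORRECT` counts positive prompts that invoked the skill plus negative prompts that did not, which the snippet above does not show. The counts used in the usage note are hypothetical, not measured results.

```python
from typing import Dict


def skill_metrics(
    core_invoked: int, core_total: int, edge_invoked: int, edge_total: int
) -> Dict[str, float]:
    """Success rate, false-positive rate, and overall accuracy, as percentages."""
    # Assumption: a negative prompt is "correct" when the skill is NOT invoked.
    total_correct = core_invoked + (edge_total - edge_invoked)
    total_tests = core_total + edge_total
    return {
        "core_success_rate": round(100 * core_invoked / core_total, 1),
        "edge_false_positive_rate": round(100 * edge_invoked / edge_total, 1),
        "overall_accuracy": round(100 * total_correct / total_tests, 1),
    }
```

For a hypothetical run where 12 of 15 positive prompts and 1 of 5 negative prompts invoke the skill, `skill_metrics(12, 15, 1, 5)` gives a core success rate of 80.0, a false-positive rate of 20.0, and an overall accuracy of 80.0, since (12 + 4) / 20 = 0.8.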

What we learned: optimizations

Our initial “vanilla” testing revealed that neither agent could guarantee 100% skill invocation, particularly when engineers used terse or ambiguous prompts. The baseline performance was an overall accuracy of 73% for Codex and 62% for Claude. This low reliability is unacceptable for critical engineering workflows.

Our research confirmed that the performance of both tools can be dramatically improved, with the increase being much greater for Codex than for Claude. We found there were many ways to improve skill invocation rates:

Frontmatter description:
— Including more contextual information (like architectural components) in the skill description in the frontmatter YAML (the section at the top) is a great way to improve performance.
— This gave us measurable gains that were agnostic to agent choice.
Aggressive Language:
— Applying aggressive, all-caps commands like “YOU MUST LOAD THIS SKILL IF” in the frontmatter is another way to signal importance.
— I personally think this is a little silly, not to mention ugly
AGENTS.md:
— Adding a table of skills to the AGENTS.md file, along with reasons to use each one, is another optional way to improve skill loading.
— Teams will want to balance this against the desire to save tokens in their context window by keeping their AGENTS.md files small.
Combination:
— Applying multiple techniques concurrently compounds the gains, but only if you’re a Codex user. We didn’t see these gains matched while using Claude Code.
— We were also surprised to find that asking the agents to improve on our additions did not further improve our invocation rates; they actually went down a bit.

Below is a table detailing what we found in our runs. For Codex, we used GPT 5.2-codex and for Claude we were using Opus 4.5.
(100 tests = 5 runs * (15 “positive” + 5 “negative”) tests)

Conclusion

I would be remiss if I didn’t say that the test prompts we used are intentionally terse; they’re meant to catch edge cases. This isn’t an indictment of agent skills, models or harnesses. Every single test case during every single run on both agents loaded the skill when the prompt explicitly said ‘load this skill’. The primary method of reliable skill invocation is a good plan, verbose instruction and clear intent from the developer.

The overarching lesson we learned through this process was that not only is it possible to empirically test how often we were loading the skills we expected, it’s something we should encourage, adopt and improve upon so that our agentic AI coding tools become more effective. However, even with a fully optimized skill the engineers working with AI have a responsibility to use high quality and thorough prompts. Teams should follow both of these rules to unlock the full potential of AI agents for domain specific work.

An Engineer’s Guide to Better AI Skills: Implementing a Testing Process to Optimize Agent… was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models

Pinterest Engineering — Fri, 08 May 2026 19:01:00 GMT

Huiqin Xin | Machine Learning Engineer II, Ads Vertical Modeling; Lakshmi Manoharan | Senior Machine Learning Engineer, Ads Vertical Modeling; Karthik Jayasurya | Staff Machine Learning Engineer, Ads Signals; Ziwei Guo | Senior Machine Learning Engineer, Ads Vertical Modeling; Alina Liviniuk | Machine Learning Engineer II, Ads Vertical Modeling

Motivation: The Need for Real-Time Context

In a previous post, Ads Candidate Generation using Behavioral Sequence Modeling, we introduced a candidate generator (CG) that uses a Transformer-based two-tower model to leverage a user’s offsite conversion history — a powerful signal — to predict future interactions with advertisers and specific products. This was a significant step forward, moving beyond static interest categories to model the evolving user shopping journey.

However, a key limitation of the initial sequential model was its lack of online context information. The user embeddings were inferred offline purely from historical offsite behavior, meaning that at the moment an ad was served, the model had no knowledge of what the user was currently browsing on Pinterest. This is a crucial drawback, particularly for highly contextual surfaces like Related Pins and Search, where the user’s current Pin or search query represents a strong, immediate signal of intent. For example, on the Related Pins surface, if a user is viewing a Pin of a “vintage leather armchair,” the recommended ads should be highly relevant to that specific item, not just their general, long-term interests.

This lack of context severely limited the model’s effectiveness on these surfaces; in the previous production system, less than 1% of impressions on Related Pins were attributed to this CG, indicating its candidates struggled to survive the downstream ranking and auction stages.

The Contextual Sequential Modeling Solution

To overcome this challenge, we developed the Contextual Sequential Two Tower Model, an evolution of the sequential recommender model specifically designed to incorporate real-time, online context. This approach focuses on three major areas: a new model architecture, a novel training approach, and a hybrid serving flow.

Model Architecture: Integrating the Context Layer

The core architectural change was integrating a context layer directly into the query tower of the two-tower model.

Figure 1. Contextual Sequential two-tower model architecture

As shown in the diagram above, the model now concatenates the output of the original Transformer encoder (which represents historical sequence information) with the output of the new context layer. This combined representation is then fed into the final Multi-Layer Perceptron (MLP) to derive the final user embedding.

For the Related Pins surface, the context layer’s input features are derived from the subject Pin (the Pin the user is currently viewing), specifically using features like the aggregated embedding representations of the top interest categories of the subject Pin, weighted by their confidence scores.

To further personalize the model, the user representation layer was augmented with embeddings of user demographic features, such as age, country, and gender.

Model Training with Synthetic Context

Since real-time context is only available at serving time, we had to make the model capable of learning from this signal during offline training. The solution was to use synthetic augmented data.

Figure 2. Model training with synthetic augmented data

During model training, we artificially inject pseudo-context information derived from the positive label (the conversion event) into the input sequence. For example, by projecting the interest category features from the positive item, we encourage the model to retrieve items that are semantically related to the context associated with that user session. A high dropout rate is used in the context layer during training to ensure the model still relies on the user’s historical event sequence (the Transformer output).

We opted to use synthetic augmented data over real context data due to two main challenges:

Merging onsite data with offsite data presents significant technical difficulties.
We cannot guarantee that a user has viewed ad impressions on Related Pins between two sequential offsite events.

Hybrid User Embedding Inference

Given that the context features (e.g., subject Pin features) are only known at the ad request time (online), we adopted a hybrid model inference approach.

Offline Inference: The majority of the user tower (the Transformer encoder) is inferred offline, and the last hidden state of the transformer (the encoded representations of the event sequence) is stored in the feature store. This is refreshed on a daily basis for users with new offsite activity.
Online Inference: The remaining part of the user tower — the context layer and the final MLP head — is computed online at serving time, taking the real-time context features and the pre-computed offline user signal as inputs.

This architecture and serving flow enables the user embedding to be dynamically influenced by the real-time context, ensuring the recommendations are both personalized (from sequence) and contextually relevant.
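
The split between offline and online computation can be sketched in plain Python. Everything here is a simplification under stated assumptions: the feature store is a dictionary, the context layer is one dense layer with ReLU, and the "MLP head" is a single linear layer. The class and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, bias)


def linear(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class HybridUserTower:
    def __init__(self, feature_store: Dict[str, Vector], context_layer: Layer, head: Layer):
        self.feature_store = feature_store  # user_id -> Transformer state, written offline
        self.context_layer = context_layer
        self.head = head

    def embed(self, user_id: str, context_features: Vector) -> Vector:
        """Online step: combine the cached sequence state with request-time context."""
        sequence_state = self.feature_store[user_id]            # offline, refreshed daily
        context_out = relu(linear(context_features, *self.context_layer))
        fused = sequence_state + context_out                    # list concatenation
        return linear(fused, *self.head)
```

Only the two small layers run per request; the Transformer encoder, the expensive part of the user tower, never runs on the serving path.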

Results and Business Impact

Offline evaluation

To assess the impact of integrating context features on the survival rate of model-retrieved ad candidates, we conducted an offline evaluation. Using logged features from real traffic ad data on Related Pins, we generated the model output embedding and calculated Recall@K, which measures the proportion of positive items found in the top-K retrieved items. Here, the candidates that survived the ranking funnel and were delivered to users were considered positive items. This new model demonstrated a significant improvement, achieving a 3x to 10x increase in Recall@K compared to the production model.
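
For reference, Recall@K as defined above can be computed as follows. This is a generic sketch; the function name and the example IDs in the usage note are illustrative, not taken from the evaluation.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved: Sequence[str], positives: Iterable[str], k: int) -> float:
    """Fraction of positive items that appear among the top-k retrieved items."""
    positive_set = set(positives)
    if not positive_set:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & positive_set) / len(positive_set)
```

With retrieved candidates `["a", "b", "c", "d"]` and positives `{"b", "d", "z"}`, the top 2 contains one of the three positives and the top 4 contains two of them, so recall is 1/3 at K=2 and 2/3 at K=4.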

Table 1. Recall@K for production model and contextual model

Survival Rate & Relevance

We successfully drove up the survival rate of candidates from this CG on the Related Pins surface. The median relevance of the candidates went up by ~275–300%. On the Related Pins surface overall, the ads relevance metric improved by 1.08%. Furthermore, we observed a significant increase in candidate delivery, with 2x more of the retrieved ad candidates being delivered as impressions.

Topline Business Metrics

The improvement in candidate relevance translated into a measurable lift of ~0.7% in ROAS (Return on Ad Spend), a conversion-related business metric. The benefit is larger in top countries, which account for a majority of total revenue, where the model leads to a ~1.4% ROAS lift.

Future work

We plan to explore several key enhancements:

Context Surface Expansion: A key next step is to extend the context-enhanced candidate generator to other high-stakes contextual surfaces, notably Search. This is particularly crucial for Search because maintaining high relevance between the presented ad candidates and the user’s search queries is paramount.
Advanced Fusion Techniques: Move beyond simple concatenation of context layers with the sequential encoder output. We propose using cross-attention-based fusion, where the context layer embedding acts as the query and the sequence of encoded transformer outputs serves as the key/value. This approach will allow the final user-tower embedding to dynamically capture the importance of each history event based on the real-time context.
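To make the proposed fusion concrete, here is a minimal single-head sketch in plain Python in which the context embedding is the query and the encoded events are both keys and values. It deliberately omits the learned query, key, and value projections and any multi-head structure, and the function names are hypothetical, so it illustrates the idea rather than the planned design.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(context: Vector, encoded_events: List[Vector]) -> Vector:
    """Scaled dot-product attention with the context as the single query."""
    d = len(context)
    scores = [
        sum(q * k for q, k in zip(context, event)) / math.sqrt(d)
        for event in encoded_events
    ]
    weights = softmax(scores)
    # Weighted sum of the event vectors: events aligned with the context dominate.
    return [sum(w * event[i] for w, event in zip(weights, encoded_events)) for i in range(d)]


# The context aligns with the first event, so the fused vector stays close to it.
fused = cross_attention_fuse([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])
print(fused)
```

Compared with concatenation, the weights here change per request, which is what lets the same history produce different user embeddings under different contexts.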

Acknowledgements

We would like to thank Supeng Ge, Yang Liu, Richard Huang, Yu Liu, Zhuqing Zhang, Kevin Liao, Yu Gu, and Wanyu Zhang for their dedicated help; Alice Wu, Leo Lu, Siping Ji, and Ling Leng for their incredible support and leadership; and Joachim Groeger for the valuable discussion and support.

Enhancing Ad Relevance: Integrating Real-Time Context into Sequential Recommender Models was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer

Pinterest Engineering — Fri, 01 May 2026 16:01:01 GMT

Guangtong Bai | Staff Software Engineer, Product ML Infrastructure*; Shantam Shorewala | Software Engineer II, Product ML Infrastructure*; Chi Zhang | Staff Software Engineer, AI Platform*; Neha Upadhyay | Software Engineer II, AI Platform*; Haoyang Li | Director, Product ML Infrastructure

*These authors contributed equally to this article.

Background

At Pinterest, our online ML serving systems employ a root-leaf architecture. On a high level, the architecture looks as follows:

Figure 1: Root-leaf Architecture of Online ML Serving Systems at Pinterest

In the diagram, “Client Service” is responsible for recommending organic or promoted Pins to users. In order to know if a given Pin is relevant to a particular user request, client service sends a score request to the online ML serving system to have the Pin scored by a bunch of ML models, each of which scores an aspect of “relevancy”.

The online ML serving system is composed of 2 parts:

Root: This component handles initial feature processing. Its responsibilities include retrieving necessary features from the feature store, performing required preprocessing, and distributing (fanning out) the scoring requests to the various leaf partitions.
Leaf: This is where the actual model inference takes place, typically utilizing GPU machines. It is structured into multiple partitions, each of which hosts a related group of models, such as one production model and several experimental variants.

What flows between these services is ML features. In this blog, we share how passing too many features from root to leaf created a network bottleneck and how we resolved it with Feature Trimmer.

Motivation

The root-leaf architecture provides us with significant benefits, namely:

Simplified Model Onboarding: New ML models can easily be onboarded for online serving by creating new leaf partitions, transparent to root and upstream clients.
Reduced Feature Store QPS: The system minimizes RPCs to the feature store for fetching ML features by having all leaf partitions share a large in-memory feature cache in the root.
Optimized Resource Utilization: Separating CPU (feature fetching, preprocessing) and GPU (model inference) workloads allows for optimized resource use, improving efficiency and reducing cost.

However, this setup introduced a new challenge — the network bandwidth between root and leaf became a performance bottleneck on the online serving path; we had to scale the system based on network usage rather than compute. We observed this pressure in the Ads server on both the root and leaf partitions:

On leaf partitions, peak network usage was significantly higher than peak GPU SM activity (see Figure 2). Consequently, the network bottleneck prevented us from fully utilizing the available GPU compute power.
On root, we had to use the network optimized AWS instance type m6in to ensure the server latency met our internal SLA.

Figure 2. Comparison of the network bandwidth usage vs GPU SM activity on a subset of the leaf partitions of the online ML server

That led to a straightforward idea: reduce the root-leaf network bandwidth usage to unlock immediate fleet downscaling and infrastructure savings. If we could cut bandwidth enough, we could also move the root from network-optimized m6in instances to standard m6i instances (about 20% cheaper), further reducing cost.

Enable compression to reduce network usage

The most direct way to reduce the root-leaf network bandwidth usage is to compress the requests between them.

This compression strategy is well-suited for the requests sent from the root to the leaf, which primarily carry ML features for multiple candidate Pins for a given user request. These requests are compressible for several reasons:

Feature Set Consistency: The set of features requested is identical across different candidate Pins, although the actual feature values vary.
Feature Similarity: There are groups of features that share similar representations (e.g., last_x_pins_user_viewed and last_x_pins_user_clicked).
Sparsity: Many features are sparse, containing numerous empty or zero values.

After a few quick tests, we enabled lz4 compression in fbthrift (the RPC framework used by root and leaf) for root-leaf traffic. That reduced root-leaf network usage by 20%, at the cost of a 5% increase in CPU usage and a 5 ms (~10%) increase in p90 latency.
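The compressibility argument can be illustrated with a small, self-contained sketch. It uses zlib from the Python standard library as a stand-in for lz4, and JSON as a stand-in for the Thrift wire format, so the ratio it prints says nothing about the production numbers. The payload shape and sizes are hypothetical.

```python
import json
import zlib

# Hypothetical request: 200 candidate Pins that share the same feature names
# and carry many zero values, mirroring the properties described in the text.
candidates = [
    {
        "last_x_pins_user_viewed": [0] * 50,
        "last_x_pins_user_clicked": [0] * 50,
        "pin_id": i,
    }
    for i in range(200)
]

raw = json.dumps(candidates).encode("utf-8")
compressed = zlib.compress(raw)

print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes")

# Compression is lossless: the receiver recovers the exact payload.
assert zlib.decompress(compressed) == raw
```

The repeated feature names and runs of zeros are what the compressor exploits; the CPU spent on compress and decompress is the cost side of the trade.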

Compression was a solid early win, but it didn’t change the underlying problem: we were still shipping too much unused data. The bigger lever was to stop sending unused features altogether, which led to our “Send What You Use” approach.

Send What You Use

In our root–leaf architecture, the root is shared across many leaf partitions and must fetch ML features for all models. To minimize feature store QPS, the root fetches the union of features needed across models (per candidate Pin), stores them in an efficient in-memory cache, and then fans out the full feature set to each leaf model. Each model converts and uses only the features it needs; the rest are effectively discarded before inference.

This approach was acceptable in our prior architecture, where the same GPU host handled both feature fetching/preprocessing and local model inference. In that context, the unnecessary features only increased main memory usage, which was not a bottleneck on GPU machines. However, within the new root-leaf architecture, transmitting these unneeded features across the network introduces a significant efficiency problem.

If we could send only the required features and trim everything else, similar to C++’s “include what you use” header management tool removing unnecessary #include’s, we could potentially cut root-leaf network usage by ~50%. Like compression, this trades network savings for some additional CPU work and potential latency overhead.
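A minimal sketch of the trimming step itself, assuming features are held in a plain dictionary keyed by feature name. The real system uses an internal feature format, so the types, the function name, and the feature names here are illustrative only.

```python
from typing import Dict, Iterable, List

Features = Dict[str, List[float]]


def trim_features(features: Features, allowlist: Iterable[str]) -> Features:
    """Return a new mapping that keeps only the features named in the allowlist."""
    allowed = set(allowlist)
    return {name: value for name, value in features.items() if name in allowed}


# Hypothetical union of features the root fetched for one candidate Pin.
fetched = {
    "feature_id_1": [0.1, 0.2],
    "feature_id_2": [1.0],
    "feature_id_3": [0.0] * 8,
    "feature_id_4": [3.0, 4.0],
}
# Hypothetical input list of one leaf model.
model_inputs = ["feature_id_1", "feature_id_2"]

trimmed = trim_features(fetched, model_inputs)
print(sorted(trimmed))
```

The function builds a new dictionary instead of mutating the input, because the same fetched features are fanned out to several leaf models with different allowlists.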

Figure 3: Overview of the ML inference engine with root-leaf setup and feature trimming

To make this work, the root must know the exact feature list required by each leaf model. Since models refresh continuously, we also need to keep the feature allowlist on root in sync with the feature expectations of the latest model version on the leaf.

Source of Truth: Model Signature

The source of truth for which features are needed by a model is its model signature. Model signature defines the inputs and outputs of a model, similar to a function signature. As a version of a model finishes training, its model signature is exported as an extra file alongside the TorchScript artifact in the .pt archive file. Below is what a model signature looks like:

❯ unzip -p model.pt archive/extra/module_info.json | jq
{
  "input_names": [
    "feature_id_1",
    "feature_id_2",
    "feature_id_3",
    ...
  ],
  "output_names": [
    "output_score_1",
    "output_score_2"
  ]
}

When the leaf loads a specific model version from the .pt archive, it not only deserializes the weights from the TorchScript artifact, but also builds a feature converter from the model signature. The converter transforms input features from internal company format into native PyTorch tensors before passing them to the model. Because it knows the model’s inputs, it converts only the required features and discards the rest.
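Because the .pt archive is a zip file, the signature can be read without loading the model. The sketch below builds a tiny stand-in archive in memory and reads it back, using the archive/extra/module_info.json path from the command shown earlier. The helper name is hypothetical, and a real archive also contains the TorchScript model.

```python
import io
import json
import zipfile
from typing import BinaryIO, List

SIGNATURE_PATH = "archive/extra/module_info.json"


def read_input_names(pt_archive: BinaryIO) -> List[str]:
    """Read the model signature from a .pt archive and return its input names."""
    with zipfile.ZipFile(pt_archive) as archive:
        signature = json.loads(archive.read(SIGNATURE_PATH))
    return signature["input_names"]


# Build a stand-in archive in memory that holds only the signature file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        SIGNATURE_PATH,
        json.dumps(
            {
                "input_names": ["feature_id_1", "feature_id_2"],
                "output_names": ["output_score_1"],
            }
        ),
    )
buffer.seek(0)

print(read_input_names(buffer))
```

The returned list is exactly what a feature allowlist for that model would contain.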

A crucial convention is that a model’s signature remains unchanged across different versions. If a signature modification is necessary — for instance, to introduce a new input feature — a new model is forked from the original. This practice is essential because it underpins the fallback mechanism for the versioned lookup feature of the Feature Trimmer, a topic discussed in detail later in the “Versioned Lookups and Fallback” section.

Model Deploy Synchronization

Feature Trimmer only works if the root knows exactly the features that the leaf model expects. That sounds simple until you factor in reality: models are refreshed frequently (hourly to daily), multiple models are shipped together as a “bundle”, and rollouts happen gradually (canary → prod, rolling deploys, occasional rollbacks).

This section explains how we keep the root up to date with what’s actually deployed on the leaf without adding heavy runtime dependencies or introducing brittle, manually managed configs.

At a high level, our approach is:

Treat the model signature, exported as module_info.json, as the source of truth.
Publish signatures as lightweight artifacts that can be consumed by deployment pipelines.
Aggregate per-model signatures into a per-bundle artifact that is deployed to the root alongside existing root configs.
Use the same staged delivery semantics as model rollout (canary, automated canary analysis, prod, rollback), so trimmer config changes ride the same operational rails as everything else.

Figure 4: Root configurations artifact generation and delivery integrated with existing model deployment

Publish module_info.json as a standalone artifact

To make the model signature easy to ship and consume, we export module_info.json as a standalone file as part of the model training workflow, next to other model files (for example, alongside the model artifact and config files). This is important for synchronization as it ensures signatures are available before deployment, and available in a form that can be aggregated and deployed without any heavy runtime dependencies.

Generate a bundle-level module_info mapping during bundle build

In production, roots don’t serve a single model; they typically serve bundles containing multiple models (and sometimes multiple versions during a rollout window). So instead of deploying N per-model signatures independently, the bundle pipeline generates one bundle-level artifact that looks like:

{
  "model_A": [
    {
      "version": "1",
      "input_names": ["feature_id_1", "feature_id_2", "..."],
      "output_names": ["score_1", "..."]
    },
    {
      "version": "2",
      "input_names": ["feature_id_1", "feature_id_2", "..."],
      "output_names": ["score_1", "..."]
    }
  ],
  "model_B": [
    {
      "version": "7",
      "input_names": ["feature_id_9", "..."],
      "output_names": ["score_x", "..."]
    }
  ]
}

During the build step, the model deploy pipeline iterates over the model versions that will be shipped in the bundle.

If a model version includes module_info.json, the pipeline parses it and records the signature.
If the signature is missing, the pipeline logs a warning and skips that version rather than failing the entire build. This keeps the system resilient while signature publishing is being rolled out across use cases.

Finally, the bundle-level module_info file is packaged and uploaded together with other root configuration files, so the root receives one coherent “configs” package.
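The build step described above can be sketched as follows. The input shape (model name to version to parsed signature, with None standing for a missing module_info.json), the function name, and the logger name are assumptions for illustration; the actual pipeline reads the signatures from the published model files.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger("bundle_build")  # hypothetical logger name

Signature = Dict[str, List[str]]


def build_bundle_module_info(
    versions: Dict[str, Dict[str, Optional[Signature]]],
) -> Dict[str, List[dict]]:
    """Aggregate per-version signatures into one bundle-level mapping."""
    bundle: Dict[str, List[dict]] = {}
    for model_name, by_version in versions.items():
        for version, signature in by_version.items():
            if signature is None:
                # Missing signature: warn and skip instead of failing the build.
                logger.warning("No module_info.json for %s version %s", model_name, version)
                continue
            bundle.setdefault(model_name, []).append(
                {
                    "version": version,
                    "input_names": signature["input_names"],
                    "output_names": signature["output_names"],
                }
            )
    return bundle


shipped = {
    "model_A": {
        "1": {"input_names": ["feature_id_1"], "output_names": ["score_1"]},
        "2": None,  # this version did not publish a signature
    },
    "model_B": {"7": {"input_names": ["feature_id_9"], "output_names": ["score_x"]}},
}
bundle = build_bundle_module_info(shipped)
print(bundle)
```

Skipping rather than failing means a model without a signature simply has no allowlist on the root, so its requests go out untrimmed.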

Deploy root configs through the same staged delivery flow

Once the bundle build produces the root-config package, deployment follows the standard staged delivery pattern:

Deploy root configs to Canary
Deploy model configs to Canary
Run Automated Canary Analysis (ACA)
Deploy root configs to Production
Deploy model configs to Production

This is important because it integrates the feature trimmer into the existing model deployment system and ensures that the “root’s trimming view of the world” is updated using the same guardrails and rollback mechanics as other model changes.

We deploy root configs before rolling out new leaf model versions because the feature trimmer keys feature allowlists by model name + version. If a versioned request arrives without a matching allowlist, we skip trimming to avoid stale configs, which can cause a temporary rollout gap. To prevent this, we ship a backwards-compatible root artifact containing allowlists for both the current and pending versions. This is discussed in more detail in the later “Versioned Lookups and Fallback” section.

On successful completion, the root hosts receive the bundle-level signature mapping at a known location on disk, and the trimmer can begin using it for per-model feature allowlisting.

A Closer Look into Trimmer Internals

Feature Allowlist or Blocklist

Once the root hosts have an idea of which features each model requires, we only keep the needed features in the fan-out request to leaf partitions. This allowlist approach, compared to a blocklist where we keep features not in the list, does not carry the burden of tracking all the features that might be in development or deprecated. Given the evolving nature of ML models and volume of experiments at Pinterest, the blocklist is significantly larger for any given model and it is probable that it will grow faster than the allowlist in the future.

Concurrent Updates Across Model Bundles

As mentioned earlier, a model bundle can contain multiple ML models. Additionally, the model bundles do not map 1:1 to the root cluster — each root cluster can receive traffic for multiple bundles. The bundles, each with their own module_info artifact, are deployed independently and often at different cadence. Further, we need to support independent rollbacks for even a single model bundle.

Figure 5: Concurrent update handling for multiple bundles

A feature trimmer module is initialized on each root host when it comes online. This module maintains a consolidated, in-memory mapping from models to their versioned feature allowlist. Each trim request is efficiently serviced by looking up the model name and version within this consolidated map. The consolidated map uses the model name and version as nested keys for fast read access as follows.

{
  "model_A": {
    "version_N": ["feature_id_1", "feature_id_2", "..."],
    "version_M": ["feature_id_1", "feature_id_2", "..."]
  },
  "model_B": {
    "version_N": ["feature_id_3", "feature_id_4", "..."],
    "version_K": ["feature_id_4", "feature_id_5", "..."]
  }
}

This per-model feature allowlist map needs to be continuously refreshed as the model bundle is updated. Here is how it is managed:

Configuration: The root cluster is configured with the active model bundles, and the file path for each corresponding module_info.json is set using GFlags.
Initial Loading: The feature trimmer module loads the content of each module_info.json file into an independent in-memory map.
Monitor for Content Updates: A file watcher is attached to each module_info.json. Any content refresh triggers a reload of its contents into the in-memory map for the given model bundle.
Consolidation: On initial loading or when any model bundle is refreshed, the module:
— Scans and merges all independent maps.
— Creates a new consolidated map.
— Atomically replaces the current active consolidated map with the new one.
Concurrency Management with Read-Write Lock:
— Concurrent reads of the consolidated and independent maps are managed with a shared lock.
— Write access during the map replacement is managed with a unique lock.
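The consolidation steps above can be sketched in Python as follows. Python's standard library has no reader-writer lock, so this sketch swaps in a different technique: a mutex serializes writers, and readers take a snapshot of the consolidated map, relying on attribute rebinding being atomic in CPython. The class and method names are hypothetical, and file watching is left out.

```python
import threading
from typing import Dict, List, Optional

Allowlists = Dict[str, Dict[str, List[str]]]  # model -> version -> feature names


class AllowlistStore:
    """Keeps one map per bundle plus a consolidated map that is swapped whole."""

    def __init__(self) -> None:
        self._per_bundle: Dict[str, Allowlists] = {}
        self._consolidated: Allowlists = {}
        self._write_lock = threading.Lock()

    def update_bundle(self, bundle_name: str, allowlists: Allowlists) -> None:
        """Called on initial load and whenever a bundle's file changes."""
        with self._write_lock:
            self._per_bundle[bundle_name] = allowlists
            merged: Allowlists = {}
            for bundle in self._per_bundle.values():
                for model, versions in bundle.items():
                    merged.setdefault(model, {}).update(versions)
            # Rebinding the attribute replaces the active map in one step.
            self._consolidated = merged

    def lookup(self, model: str, version: str) -> Optional[List[str]]:
        snapshot = self._consolidated  # readers never see a half-built map
        return snapshot.get(model, {}).get(version)


store = AllowlistStore()
store.update_bundle("ads", {"model_A": {"1": ["feature_id_1"]}})
store.update_bundle("homefeed", {"model_B": {"7": ["feature_id_9"]}})
print(store.lookup("model_B", "7"))
```

Keeping the per-bundle maps separate is what makes an independent rollback cheap: re-running update_bundle for one bundle rebuilds the consolidated view without touching the others.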

Versioned Lookups and Fallback

Figure 6: Request flow for versioned lookup and fallback

Each scoring request sent to the root cluster must include the model name and optionally, the model version. If the version is omitted, it defaults to the latest version. The feature trimmer parses these fields to determine the version-specific feature allowlist for the requested model.

If no feature allowlist exists for the model, the request proceeds untrimmed.
If both model name and version are specified and found, the specific version’s allowlist is used.
If the model name is found but the version is either not specified or not found, the trimmer uses the latest version of the allowlist. This design choice is based on the assumption at Pinterest that the model signature remains consistent across versions, which also simplifies the deployment by avoiding the need to keep multiple versions in memory during a rolling deployment.
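The three lookup rules can be sketched as a single function. It assumes version strings are integers so that "latest" means the numeric maximum; how the trimmer actually orders versions is not stated. The function name is hypothetical, and the per-version lists differ in the example only so the output shows which version was chosen.

```python
from typing import Dict, List, Optional

Allowlists = Dict[str, Dict[str, List[str]]]  # model -> version -> feature names


def resolve_allowlist(
    allowlists: Allowlists, model: str, version: Optional[str] = None
) -> Optional[List[str]]:
    """Return the allowlist to trim with, or None to send the request untrimmed."""
    versions = allowlists.get(model)
    if not versions:
        return None  # rule 1: no allowlist for this model
    if version is not None and version in versions:
        return versions[version]  # rule 2: exact name and version match
    # Rule 3: version omitted or unknown, so fall back to the latest version.
    # Assumption: versions parse as integers.
    latest = max(versions, key=int)
    return versions[latest]


allowlists = {
    "model_A": {
        "1": ["feature_id_1"],
        "2": ["feature_id_1", "feature_id_2"],
        "10": ["feature_id_1", "feature_id_3"],
    }
}
print(resolve_allowlist(allowlists, "model_A", "2"))
print(resolve_allowlist(allowlists, "model_A", "99"))
```

Comparing versions numerically matters here: a plain string comparison would rank "2" above "10".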

The adoption of the feature trimmer is expected to reduce network bandwidth consumption for root-leaf connections. This places the trimmer on the critical failure path: failure to trim score requests can cause a significant spike in network bandwidth, potentially leading to cascading failures. Therefore, robust handling of artifact (module_info.json) corruption or deployment failures is essential.

We have implemented the following safeguards:

Initialization Failure Guardrail: Upon Feature Trimmer module initialization, any failures while parsing the required module_info artifacts are emitted to our observability dashboard and trigger an on-call alert. We specifically chose not to block host launch on initialization failure. This decision preserves our ability to respond to capacity-related incidents, especially if a deeper issue is affecting the Feature Trimmer module itself.
Isolate Failures from a Single Model Bundle: The feature trimmer loads the module_info contents for each model bundle into a separate map in its memory. If a model bundle’s file gets corrupted on disk during an update, the feature trimmer keeps using the old, in-memory version for that bundle. Because each bundle has its own map, the feature trimmer can still successfully update the information for all the other model bundles.
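The per-bundle isolation can be sketched as a reload function that only replaces a bundle's map after the new contents parse successfully. The function name, the logger name, and the use of a plain dictionary as the per-bundle store are assumptions for illustration.

```python
import json
import logging
from typing import Dict

logger = logging.getLogger("feature_trimmer")  # hypothetical logger name


def reload_bundle(per_bundle: Dict[str, dict], bundle_name: str, file_contents: str) -> bool:
    """Parse new contents for one bundle; on failure keep the previous map."""
    try:
        parsed = json.loads(file_contents)
        if not isinstance(parsed, dict):
            raise ValueError("module_info must be a JSON object")
    except ValueError as error:  # json.JSONDecodeError is a subclass of ValueError
        logger.error("Failed to reload %s: %s", bundle_name, error)
        return False
    per_bundle[bundle_name] = parsed
    return True


maps = {"ads": {"model_A": [{"version": "1", "input_names": ["feature_id_1"]}]}}

# A corrupted file for one bundle leaves its old map in place.
print(reload_bundle(maps, "ads", "{not valid json"))
# A valid file for another bundle still loads.
print(reload_bundle(maps, "homefeed", '{"model_B": []}'))
```

The old map is replaced only after parsing succeeds, so a bad file never leaves a bundle with an empty or partial allowlist.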

The fundamental assumption that the model signature is consistent across different model versions allows us to implement these precautions, ensuring the Feature Trimmer remains reliably operational even in the event of intermittent deployment failures.

Efficiency Wins

Reduced Network Stress

The Ads root-leaf server setup was the biggest beneficiary of this launch. Figures 7 through 9 show the network performance of the Ads root and leaf clusters following the launch of the feature trimmer module.

Figure 7. Comparison of the network bandwidth usage vs GPU SM activity on a subset of the leaf partitions of the online ML server after feature trimmer was enabled. The reduction in network usage allowed us to tune the cluster size and batch size config to improve the GPU utilization.

Figure 8: Comparison of the network bandwidth consumption before and after launch of the feature trimmer on the Ads root cluster. It dropped from a peak of 4GBPS to <1.5GBPS even after downsizing the root cluster by 27%.

Figure 9: Comparison of network bandwidth performance on Ads leaf partitions after the launch of the feature trimmer. The peak usage dropped from 1000–1200 MBPS in some clusters to <200MBPS for all clusters.

Later, we also applied the feature trimmer to other use cases such as Homefeed and Related Pins and saw latency and network reductions similar to Ads, amplifying the overall impact of this initiative. Figures 10 and 11 show the network savings in the Homefeed root and leaf clusters.

Figure 10: In our Homefeed Root cluster, outbound network usage dropped substantially from ~1.2–2.1 GB/s to ~0.45–1.1 GB/s

Figure 11: We saw 65–75% reduction in inbound network usage across Homefeed GPU leaf clusters

As a result, we reduced the Homefeed root cluster fleet size by 33% and are still working on rightsizing the Homefeed leaf clusters, unlocking significant infrastructure savings.

Latency Improvement

While the payload size reduction directly contributed to the network performance improvement, we also saw a reduction in CPU utilization on the root cluster and a reduction in both server-side and client-side root latency. We believe this is largely because a smaller payload leads to fewer CPU cycles spent on SerDe (serialization/deserialization). This additional latency headroom allowed Ads to trade some latency for further cost savings, and the remainder was used to unblock future experiments (see latency increases in late June).

Figure 12: Ads client (AdMixer) P90 latency dropped significantly as well, from a peak above 90 ms before the launch to a peak below 80 ms after the feature trimmer was enabled.

For our Related Pins surface, the p99 model score latency for most models sat at roughly 130–180 ms before the feature trimmer, with frequent spikes above 200 ms. After the feature trimmer was enabled, the p99 baseline shifted down to roughly 95–125 ms for most models, a notable ~25–30% drop in latency.
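As a quick arithmetic check of the stated drop, the snippet below compares the lower ends and the upper ends of the two latency ranges. Pairing the range endpoints this way is an assumption about how the percentage was derived.

```python
def percent_drop(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100


low = percent_drop(130, 95)    # lower ends of the two p99 ranges
high = percent_drop(180, 125)  # upper ends of the two p99 ranges
print(f"{low:.1f}% to {high:.1f}%")
```

Both values land at about 27% and 31%, consistent with the reported figure.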

Figure 13: Feature Trimmer reduces Related Pins model p99 latency by ~25–30%. Note that the feature trimmer was not available for some models because they did not have a valid feature allowlist so these models still see the same peak latency post rollout.

Cost Saving

Based on the efficiencies realized in terms of network performance and client latency, we were able to resize the ML servers at Pinterest to realize significant cost savings:

Ads was the biggest beneficiary of this project — the team could downsize the root cluster by 27% without any performance regression. On the leaf side, the network improvement allowed us to tinker with the batching logic to finetune GPU utilization without impacting any other metrics, representing roughly 5% of the total GPU capacity at the time.
— The latency reduction unblocked future improvements and marginally reduced the failures due to server timeouts — this led to a marginal 0.17% increase in revenue as well.
Across other use cases like Search and Notification, we saw approximately 45% and 65% drops in egress network throughput, with no material change in p99 latency. Because these clusters were initially network-bound, feature trimmer allowed us to move to more optimized instance types, resulting in ≥30% cost reduction for both.
— This realized an additional $0.98M in annual infrastructure cost savings from rightsizing the clusters.

Overall, this project saved over $4M in annual infrastructure costs for Pinterest while creating headroom to test bigger models and features without latency or network performance concerns. It effectively shifted the bottleneck from network to CPU cycles on the root cluster. This also allows the team to switch focus to optimizing the payload between the client and the root to further finetune the resource utilization end-to-end.

Wrap Up

Feature Trimmer successfully addressed a critical network bottleneck in Pinterest’s root-leaf ML serving architecture, moving beyond simple payload compression to implement a “Send What You Use” philosophy. By establishing the model signature as the source of truth for required features and deploying a robust, version-aware feature allowlisting system in sync with model rollouts, we significantly reduced the data volume passed between the root and leaf clusters. This optimization resulted in substantial network bandwidth reduction, improved client-side latency, and ultimately delivered significant cost savings.

In Part II of this blog series, we will shift focus to how request feature compression further optimizes the network connection between the client and the root. Keep an eye out for the next installment to discover how we achieve even greater efficiencies in our ML serving infrastructure.

Acknowledgement

This project would not have been possible without former team members Yiran Zhao and Queena Zhang’s early exploration and prototyping. We extend our sincere gratitude to the following individuals for their invaluable support in deploying Feature Trimmer into production: Miao Wang, Randy Carlson, Runze Su, Qifei Shen, and Tao Mo. We would also like to thank Nazanin Farahpour, Howard Nguyen, Bo Liu, Sihan Wang, Renjun Zheng and Zheng Liu for their helpful review of this blog post.

Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

From Clicks to Conversions: Architecting Shopping Conversion Candidate Generation at Pinterest

Pinterest Engineering — Mon, 27 Apr 2026 16:01:05 GMT

Introduction

At Pinterest, conversion ads are crucial for matching users with products they are likely to purchase, boosting value for both users and advertisers¹. While conversion actions like checkout or add-to-cart are highly valuable, they are also technically challenging to optimize for. Because they occur offsite, conversion events are significantly sparser and noisier than onsite engagement signals. Historically, Pinterest’s shopping ads retrieval relied on engagement-based models. While effective for driving interaction, this system was not designed to optimize for lower-funnel conversions. This gap motivated us to build a dedicated candidate generation model tailored for conversions, aiming to surface higher-intent products and improve advertiser performance.

We launched our first shopping conversion model in 2023, achieving meaningful wins across both conversion and engagement, including a higher clickthrough rate (CTR). Further iterations in 2025 unlocked even stronger conversion value and improved Return on Ad Spend (RoAS) for our advertisers. This blog post documents our journey building this conversion candidate generation model, from its technical design and challenges to the key learnings of deploying it to our 600+ million monthly active users at Pinterest.

Training Data Design

Modeling conversion events is challenging. Unlike frequent, real-time onsite engagements (e.g., clicks), offsite conversions are reported by advertisers, making the data sparse, noisy, and delayed. Despite these difficulties, conversions remain one of the most valuable signals for a purchase intent model, offering a far stronger indication of advertiser value and true user intent than engagement alone. To address the inherent sparsity of conversions, we made several key design decisions:

Multi-Surface Model: We train a single model across all shopping surfaces (Homefeed, Related Pins, Search) to avoid fragmenting sparse conversion labels. At the same time, we incorporated surface-specific features to learn contextual differences between these surfaces.
Dual Positive Signals: We supplement primary conversion signals with onsite engagement data (clicks, repins). This broadens data coverage, improving model generalization and ad funnel survival rates. To mitigate click data noise and decrease false positive clicks, we apply a log-based re-weighting function w based on the click duration:

where t is the non-negative click duration in seconds and tₘₐₓ is a tunable constant used to cap the re-weighting function.

Negative Sampling: On top of the existing in-batch negatives, we use ad impressions with no engagement as “harder negatives.” These samples can reflect the real distribution of served ads, exposing the model to a more representative inventory and promoting robust contrastive learning.
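The click re-weighting formula itself is not reproduced in this text. Purely as an illustrative assumption, one log-based form that matches the description (non-negative duration t, capped by a tunable tₘₐₓ) is w(t) = log(1 + min(t, tₘₐₓ)) / log(1 + tₘₐₓ). The sketch below implements that assumed form; it is not necessarily the function used in production, and the default cap of 30 seconds is hypothetical.

```python
import math


def click_weight(t: float, t_max: float = 30.0) -> float:
    """Assumed log-based weight in [0, 1] that grows with click duration."""
    if t < 0:
        raise ValueError("click duration must be non-negative")
    # Durations beyond t_max are capped, so the weight saturates at 1.
    return math.log1p(min(t, t_max)) / math.log1p(t_max)


for seconds in (0, 1, 5, 30, 300):
    print(seconds, round(click_weight(seconds), 3))
```

Under this form, very short clicks, which are more likely to be accidental, contribute little to the loss, while the logarithm keeps long clicks from dominating.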

In summary, our multi-task approach uses engagement prediction as an auxiliary task to stabilize training and boost performance. The crucial challenge is balancing the two tasks, ensuring the high-value conversion signal is not diluted by the more frequent engagement data.

Feature Engineering

At the core of our model are features that capture critical signals about our users and shopping catalog, grouped into two categories: User-side and Pin-side.

User-side features are split into two types. First, context features capture a user’s real-time intent, which is vital for applications like Related Pins and Search. Examples include a subject Pin’s visual and GraphSAGE² embeddings. Second, preference & historical features capture long-term interests for personalization. These include demographics, aggregated historical actions, and sequential data processed by a Transformer to create a user history embedding.

Pin-side features take a multi-faceted approach, incorporating ID features, multi-modal/content features for semantic understanding, and performance features tracking engagement.

This structured representation of users and Pins ensures an effective matching process, delivering both personalization and relevance in recommendations.

Model Architecture and Loss Function Design

We use a two-tower model for retrieval, where user and Pin features are encoded separately, as there are no explicit user-Pin interaction features at this retrieval stage. To capture richer relationships among features within each tower, we employ DCN v2 (Deep & Cross Network v2)³ as the foundation of our cross layers. This enhances the model’s capacity to model non-linear interactions and boosts retrieval quality. After the cross layers, the output embeddings are fed into the final MLP head(s).

1. Parallel DCN v2 and MLP Cross Layers Architecture
Early in our iterations, our cross-layer design was simple: a stacked architecture where DCN v2 cross network processed the input first, feeding its output into an MLP for dimension reduction. While efficient, we hypothesized that this sequential arrangement imposed a fundamental limit on the model’s learning capacity. To move beyond the sequential design, we designed a new parallel architecture by adding an MLP in parallel (see Figure 1). Its success stems from eliminating the primary drawback of a sequential flow: the information bottleneck. In the old setup, the MLP could only learn from features already processed by DCN v2, potentially losing valuable signals from the original input.

Figure 1: Sequential (left) and Parallel DCN v2 and MLP (right) Cross Layers Architecture

In contrast, our parallel design allows both the cross network and the deep network to learn directly and simultaneously from the same input features. This effectively decouples the learning tasks: the cross network captures richer and more expressive explicit feature interactions by applying cross operations that combine the original input with each successive layer’s output to construct higher-order feature crosses, while the 3-layer MLP learns implicit abstract patterns in parallel. Because the cross network always references the original input at every layer, no information is lost or distorted by a preceding MLP transformation. The combined output of both branches yields a richer and more expressive representation, unlocking a higher level of performance.
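A minimal plain-Python sketch of the parallel structure follows. The cross layer uses the standard DCN v2 update x_{l+1} = x_0 ⊙ (W x_l + b) + x_l. Combining the two branches by concatenation, the ReLU activation, and all function names are assumptions for illustration, since the text only says the outputs are combined.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights, bias)


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def cross_layer(x0: Vector, xl: Vector, w: Matrix, b: Vector) -> Vector:
    """DCN v2 cross layer: x0 * (W @ xl + b) + xl, with an elementwise product."""
    projected = [p + bi for p, bi in zip(matvec(w, xl), b)]
    return [x0i * pi + xli for x0i, pi, xli in zip(x0, projected, xl)]


def dense_relu(x: Vector, w: Matrix, b: Vector) -> Vector:
    return [max(0.0, v + bi) for v, bi in zip(matvec(w, x), b)]


def parallel_block(x0: Vector, cross_params: List[Layer], mlp_params: List[Layer]) -> Vector:
    """Feed the same input to the cross network and the MLP, then concatenate."""
    xl = x0
    for w, b in cross_params:
        xl = cross_layer(x0, xl, w, b)  # every layer re-uses the original input x0
    h = x0
    for w, b in mlp_params:
        h = dense_relu(h, w, b)
    return xl + h  # list concatenation of the two branch outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
out = parallel_block([1.0, 2.0], [(identity, zero)], [(identity, zero)])
print(out)
```

In the sequential variant, the MLP would receive xl instead of x0, which is the information bottleneck described above.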

We applied this design to both the Pin and query towers, validating it on the conversion task where it delivered a +11% gain in offline recall@1000⁴. Given its success in boosting core learning ability, particularly in its ability to surface stronger feature interactions while keeping a low latency for the retrieval task, this parallel architecture was subsequently adopted by all our production engagement retrieval models, achieving similar recall improvements as well as significant gains in online metrics.

2. From a Multi-Head to a Unified Multi-Task Architecture
In the first version of our model, we designed a multi-head structure to make full use of both the conversion data and the engagement data. To leverage the relative abundance of click data, we used shared encoders followed by separate engagement and conversion heads. The engagement head helped stabilize shared parameters, while the conversion head preserved the unique purchase-intent signal. The two heads were trained simultaneously, each with its own sampled softmax loss (see Figure 2). To balance the influence of engagement data without diluting the conversion signal, different loss weights were applied. At serving time, only the conversion Pin and query embeddings were used.
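The weighted two-head objective can be sketched as follows. This is a simplified in-batch softmax loss, where the positive for query i is Pin i and the other Pins in the batch act as negatives. It omits the sampling-bias correction that sampled softmax implementations usually apply, and the weights and function names are hypothetical.

```python
import math
from typing import List

Scores = List[List[float]]  # scores[i][j] = similarity of query i and Pin j


def in_batch_softmax_loss(scores: Scores) -> float:
    """Mean negative log-likelihood of the diagonal (positive) pairs."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def multi_head_loss(
    conv_scores: Scores, eng_scores: Scores, conv_weight: float, eng_weight: float
) -> float:
    """Weighted sum of the conversion-head and engagement-head losses."""
    return conv_weight * in_batch_softmax_loss(conv_scores) + eng_weight * in_batch_softmax_loss(
        eng_scores
    )


uniform = [[0.0, 0.0], [0.0, 0.0]]   # the model cannot tell positives from negatives
confident = [[8.0, 0.0], [0.0, 8.0]]  # positives score far above in-batch negatives
print(multi_head_loss(confident, uniform, conv_weight=1.0, eng_weight=0.5))
```

Lowering eng_weight relative to conv_weight is one way to keep the more frequent engagement examples from dominating the shared encoder's gradients.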

Figure 2: Multi-head architecture, 2023 (left) and Unified multi-task architecture, 2025 (right)

Through in-depth data analysis and several online experiments, we identified sparsity and noise in the conversion labels as one of the main bottlenecks on the previous model’s performance. To better stabilize query embeddings in regions of low conversion coverage, we moved from a multi-head architecture to a unified single-head multi-task architecture (cf. Figure 2). Merging the conversion and engagement heads allows the final embeddings to benefit directly from the multi-task optimization during serving.

Building on top of this, we also observed that conversion data at the Pin level exhibit high variance, making it challenging to reliably model purchase intent from Pin-level supervision alone. To address this, we introduce an advertiser-level loss function as an additional training objective, enabling the model to better capture conversion signals at a more stable and consistent granularity. With other model improvements and feature additions, we saw on average an increase of +42% recall@100⁴ for conversion tasks compared to our previous 2023 model.

Conclusion

In summary, our modeling journey in crafting the shopping conversion candidate generation was driven by the necessity of overcoming the inherent sparsity and noise of offsite conversion events. We addressed this through a sequence of loss design and architectural innovations. Key modeling decisions included the adoption of a unified model across all surfaces and the strategic use of conversion and click duration-weighted engagement data. Architecturally, we leveraged a highly effective Parallel DCN v2 and MLP Cross Layers architecture, and we progressed from an initial separate multi-head design to a unified multi-task architecture that introduced an advertiser-level matching objective to better align with the natural granularity of the conversion signal.

Introducing this new CG to production in 2023 delivered a 2.3% increase in shopping conversion volume and a 2.7% lift for the shopping impression to conversion rate. Beyond conversions, it also improved the Pinners’ shopping experience, with CTR increasing by 1.5% and CTR over 30 seconds rising by 2.2%. Building on this foundation, further iterations and refinements throughout 2025 continued to push the model’s performance forward, resulting in a 3.1% improvement in RoAS for US shopping campaigns⁴, reinforcing that strong advertiser outcomes and a great Pinner experience are not at odds, but deeply intertwined.

Acknowledgments

Ads Retrieval: Yang Liu, Jay Ma (former), Peifeng Yin (former), Qingmengting Wang, Richika Sharan, Jitong Qi, Yufeng Su, Huiqin Xin

Ads Ranking: Weiwei Ying (former), Yiwei Sun (former), Aayush Mudgal, Hongda Shen, Han Sun

Ads Signal: Jiayin Jin (former), Daniel Yang (former), Chongyuan Xiang, Lakshmi Manoharan, Litian Tao, Siping Ji

Leadership: Alice Wu, Leo Lu (former), Ling Leng (former), Hari Venkatesan (former), Behnam Rezaei (former), Jamieson Kerns

References

¹ A. Mudgal, et al. 2024. Evolution of Ads Conversion Optimization Models at Pinterest. Pinterest Engineering Blog.

² W. L. Hamilton, et al. 2017. Inductive Representation Learning on Large Graphs. In NIPS.

³ R. Wang, et al. 2020. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. WWW ’21: Proceedings of the Web Conference 2021.

⁴ Pinterest Internal Data, US, 2023 to 2025.

From Clicks to Conversions: Architecting Shopping Conversion Candidate Generation at Pinterest was originally published in Pinterest Engineering Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication at Pinterest

Pinterest Engineering — Mon, 20 Apr 2026 16:01:04 GMT

Shanhai Liao | Senior Software Engineer, Content Acquisition and Media Platform; Di Ruan | Senior Staff Software Engineer, Content Acquisition and Media Platform; Evan Li | Senior Engineering Manager, Content Acquisition and Media Platform

Introduction

Accurate content understanding underpins Pinterest’s ability to drive distribution and engagement. This requires deep insight not just into the image itself, but also the outbound links or items to which those images point. At the foundation of this process lies a deceptively simple problem: URL normalization.

When Pinterest ingests content from millions of merchant domains, the same product page often appears under many different URLs. A single pair of shoes might be referenced by dozens of URL variations — each one decorated with different tracking parameters, session tokens, or analytics tags. While downstream systems can eventually deduplicate by content identity, the inability to recognize these duplicates at the URL level means every variation is independently fetched, rendered, and processed. At scale, this redundant ingestion and processing represents a significant waste of computational resources — rendering the same page dozens of times simply because its URLs differ in irrelevant parameters.

Item canonicalization — ensuring that identical items represented by different URLs are unified — is critical for organizing shopping catalogs and presenting a consistent experience to users. For many partners, a provided item ID determines canonical identity, but in its absence, the onus falls to advanced URL normalization to deduplicate effectively.

This post details the technical journey behind the Minimal Important Query Param Set (MIQPS) algorithm: a system that automatically learns which URL parameters matter for content identity, enabling dynamic and precise URL normalization at scale.

Background: The URL Normalization Challenge

Consider a typical product URL from an e-commerce site:

https://example.com/shoes?id=42&color=red

This URL identifies a specific product variant. But in practice, the same product page is often reached through URLs like:

https://example.com/shoes?id=42&color=red&utm_source=facebook&session=abc123
https://example.com/shoes?id=42&color=red&ref=pinterest&click_id=xyz
https://example.com/shoes?id=42&color=red&tracking=campaign_spring

Figure 1: The URL duplication problem. Multiple URLs with different tracking parameters all resolve to the same product content.

The parameters utm_source, session, ref, click_id, and tracking are all neutral - they don’t change the content of the page. Meanwhile, id and color are non-neutral - they determine which product and variant are displayed.
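
To make the neutral versus non-neutral split concrete, here is a minimal Python sketch that collapses the three example URLs above into one. It assumes the set of non-neutral parameters is already known; the helper name `strip_neutral_params` and the constant `NON_NEUTRAL` are hypothetical and are not part of Pinterest's system.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption for illustration: the non-neutral parameters are given up front.
# In the post they are learned per domain by MIQPS.
NON_NEUTRAL = {"id", "color"}


def strip_neutral_params(url: str, keep: set[str]) -> str:
    """Drop every query parameter whose name is not in `keep`."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in keep
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


variants = [
    "https://example.com/shoes?id=42&color=red&utm_source=facebook&session=abc123",
    "https://example.com/shoes?id=42&color=red&ref=pinterest&click_id=xyz",
    "https://example.com/shoes?id=42&color=red&tracking=campaign_spring",
]

# All three variants normalize to the same URL, so the set has one element.
normalized = {strip_neutral_params(u, NON_NEUTRAL) for u in variants}
print(normalized)
```

The hard part, which the rest of the post addresses, is producing the `keep` set automatically for domains with no curated rules.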

The challenge is distinguishing between the two. For well-known e-commerce platforms, this can be solved with curated rules. Shopify URLs, for example, use variants as the key product differentiator. Salesforce Commerce Cloud uses parameters like start, sz, prefn1, and prefv1. For these platforms, static allowlists are sufficient.

But Pinterest ingests content from a large number of domains, operating on a wide variety of platforms.

For this long tail of domains, URL parameter conventions vary wildly. Static rules cannot scale to cover them all. We need a dynamic, data-driven approach.

The MIQPS Algorithm

The core insight behind MIQPS is straightforward: if removing a query parameter changes the content of a page, that parameter is important; if it doesn’t, the parameter is noise and can be safely stripped. Crucially, this analysis runs independently per domain — each merchant site gets its own MIQPS map, because the same parameter name can be meaningful on one domain and irrelevant on another.

The algorithm operates in three steps.

Step 1: Collect the URL Corpus

As Pinterest’s content ingestion pipeline processes URLs from domains, the system accumulates a corpus of observed URLs per domain. This corpus is stored durably and represents a snapshot of all the URL variations seen for a given domain. It serves as the input to the MIQPS analysis.

Step 2: Group URLs by Query Parameter Pattern

Not all URLs from a domain share the same set of query parameters. A product page URL might carry {id, color, utm_source} while a category page might carry {category, page, sort}. Analyzing them together would be meaningless.

Moreover, the same parameter name can play different roles depending on its context. Consider the parameter `ref`: on a product page URL like `example.com/product?id=42&ref=homepage`, `ref` is purely a tracking parameter and is neutral - removing it doesn’t change the product displayed. But on a comparison page URL like `example.com/compare?ref=99`, the same `ref` parameter identifies which items to compare and is non-neutral. By grouping URLs by their full parameter pattern, the algorithm evaluates each parameter within its specific context, correctly classifying it as neutral in one pattern and non-neutral in another.

To address this, the algorithm groups URLs by their query parameter pattern — the sorted set of parameter names present in the URL. For example:

URLs sharing the same query pattern are grouped together. The top K patterns by URL count are selected for analysis, focusing computational resources on the patterns that matter most.
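
The grouping step can be sketched in a few lines of Python. This is an illustration under simple assumptions (an in-memory list of URLs, a pattern represented as a sorted tuple of names); the function names `query_pattern` and `top_k_patterns` are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def query_pattern(url: str) -> tuple[str, ...]:
    """Return the sorted, de-duplicated parameter names in the URL's query string."""
    names = {name for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)}
    return tuple(sorted(names))


def top_k_patterns(urls: list[str], k: int) -> list[tuple[tuple[str, ...], int]]:
    """Return the k most common query patterns with their URL counts."""
    return Counter(query_pattern(u) for u in urls).most_common(k)


corpus = [
    "https://example.com/shoes?id=42&color=red&utm_source=facebook",
    "https://example.com/shoes?utm_source=email&color=blue&id=7",
    "https://example.com/list?category=boots&page=2&sort=price",
]

# The two product URLs share one pattern regardless of parameter order.
patterns = top_k_patterns(corpus, k=1)
print(patterns)
```

Sorting the names makes the pattern independent of the order in which parameters appear, so `?id=..&color=..` and `?color=..&id=..` land in the same group.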

Step 3: For Each Pattern, Test Each Parameter

For each query parameter within a pattern, the algorithm determines whether it is neutral or non-neutral through empirical testing:

1. Sample: Select up to S URLs with distinct values for the parameter under test.

2. Compare: For each sampled URL, compute the content ID — a fingerprint derived from the page’s rendered visual content — for both:
— The original URL (with the parameter present)
— A modified URL (with the parameter removed)

3. Classify: If removing the parameter changes the content ID in at least T% of samples, the parameter is classified as non-neutral (important). Otherwise, it is neutral (safe to drop).
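
The sample, compare, and classify steps above can be sketched as follows. This is a simplified illustration, not Pinterest's implementation: `content_id` stands in for whatever fingerprint function is available, `fake_content_id` is a toy replacement for real page rendering, and selecting the S sample URLs is assumed to have happened before the call.

```python
from typing import Callable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_param(url: str, param: str) -> str:
    """Return the URL with every occurrence of `param` removed from its query."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def is_non_neutral(
    sample_urls: list[str],
    param: str,
    content_id: Callable[[str], str],
    threshold_pct: float,
) -> bool:
    """True if removing `param` changes the content ID in at least threshold_pct% of samples."""
    if not sample_urls:
        raise ValueError("need at least one sample URL")
    mismatches = sum(
        content_id(url) != content_id(remove_param(url, param)) for url in sample_urls
    )
    return 100.0 * mismatches / len(sample_urls) >= threshold_pct


def fake_content_id(url: str) -> str:
    """Toy fingerprint: pretend the rendered page depends only on the `id` parameter."""
    query = dict(parse_qsl(urlsplit(url).query))
    return f"page-for-id-{query.get('id')}"


samples = [
    "https://example.com/shoes?id=1&utm_source=a",
    "https://example.com/shoes?id=2&utm_source=b",
]
utm_is_important = is_non_neutral(samples, "utm_source", fake_content_id, threshold_pct=50)
id_is_important = is_non_neutral(samples, "id", fake_content_id, threshold_pct=50)
print(utm_is_important, id_is_important)
```

With the toy fingerprint, removing `utm_source` never changes the content ID, so it is classified as neutral, while removing `id` always does, so it is classified as non-neutral.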

The content ID is a hash of the page’s visual representation, meaning two URLs that render the same visible content will produce the same content ID, even if their underlying HTML differs slightly. This particular fingerprinting approach leverages Pinterest’s in-house page rendering infrastructure, which is tailored to our content pipeline. The core MIQPS algorithm, however, is agnostic to how the content fingerprint is produced — it only requires a function that returns the same identifier for the same page content. Third parties looking to adopt a similar approach could substitute alternatives such as DOM tree hashing, HTTP response body checksums, or even simpler heuristics like comparing the `title` element and Open Graph metadata across URL variants. The key principle remains the same: compare some representation of the page content with and without each parameter to determine its importance.

A natural question is: why not simply use the **canonical URL** declared in the page’s HTML (via the `<link rel="canonical">` tag) to resolve duplicates? If the merchant provides a canonical URL, two variant URLs pointing to the same product should share the same canonical, making deduplication trivial. In practice, however, canonical URLs are unreliable at scale. Many merchant sites omit them entirely, set them incorrectly (e.g., pointing every page to the homepage), or include tracking parameters in the canonical URL itself. Because we cannot assume canonical URLs are present or correct across the long tail of merchant domains, MIQPS uses visual content comparison as a ground-truth signal that works regardless of how well-maintained a site’s metadata is.

<h3>Algorithm Parameters</h3>

The behavior of the MIQPS algorithm is governed by a small set of tunable parameters:

<figure></figure>

Two additional design choices make the algorithm practical at scale:

<ul>
<li>Early exit optimization: If the mismatch rate already exceeds T% after N successful tests, we stop testing that parameter early. 
This avoids unnecessary page rendering calls for parameters that are clearly non-neutral.</li>
<li>Conservative default: When fewer than N sample URLs are available for a parameter, it is treated as non-neutral by default. The system errs on the side of keeping parameters rather than dropping ones that might matter.</li>
</ul>

<h3>Putting It Together</h3>

Figure 2: The MIQPS computation pipeline.

The output of this pipeline is a MIQPS map: a mapping from each query parameter pattern to the set of non-neutral parameters within that pattern. This map is published to a configuration store and consumed at runtime during URL normalization.

<figure></figure>

<h3>Multi-Layer Normalization Strategy</h3>

MIQPS does not operate in isolation. In production, URL normalization combines static rules with the dynamically computed MIQPS. Static rules capture known conventions — curated allowlists for recognized e-commerce platforms and regex patterns for widely used parameter naming schemes. These rules handle cases where we already have high confidence about which parameters matter.

MIQPS complements these static rules by covering the long tail of domains where no predefined rules exist. A URL parameter is kept if it is matched by either the static rules or the MIQPS non-neutral set. Only parameters that pass neither check are stripped. This combination ensures broad coverage: static rules provide immediate, reliable handling for known platforms, while MIQPS dynamically adapts to everything else.

<h3>Anomaly Detection: Guarding Against Regressions</h3>

Computing MIQPS is inherently dependent on external page rendering. Pages can change, rendering infrastructure can have transient issues, and a domain’s URL structure can shift between analysis runs. 
Without safeguards, a bad MIQPS computation could cause the system to start dropping parameters that are actually important — leading to content deduplication errors and degraded catalog quality.

To address this, the system includes an anomaly detection layer that compares each newly computed MIQPS against the previously published version. The comparison follows a set of conservative rules:

<ul>
<li>Parameter removed from non-neutral set (anomaly): If a parameter that was previously classified as non-neutral is now classified as neutral, the pattern is flagged as anomalous. This is the dangerous case — it means we would start stripping a parameter that we previously determined was important.</li>
<li>Parameter added to non-neutral set (not anomalous): If a previously neutral parameter is now classified as non-neutral, this is not considered an anomaly. It simply means we discovered a new important parameter, and the worst case is keeping slightly more parameters than necessary.</li>
<li>Pattern removed entirely (not anomalous): If a query pattern from the previous MIQPS is absent in the new one, this is not flagged. Patterns can naturally disappear as a domain’s URL structure evolves.</li>
</ul>

If more than A% of existing patterns are flagged as anomalous, the entire MIQPS update is rejected and the previous version is retained. This ensures the system never regresses — it errs on the side of over-keeping parameters rather than accidentally dropping ones that affect content identity.

<h3>System Architecture and Integration</h3>

The MIQPS system fits into Pinterest’s content processing pipeline as follows:

<figure><figcaption>Figure 3: End-to-end system architecture. The content ingestion pipeline produces a URL corpus per domain. An offline job analyzes parameter importance via content ID comparison, then publishes the MIQPS to a config store after anomaly checks. 
The URL processor reads the MIQPS at runtime to normalize URLs during content processing.</figcaption></figure>

The architecture has three distinct phases:

<ul>
<li>Content Ingestion: As URLs are processed from domains, the system writes each unique URL to a per-domain corpus stored in S3. This happens continuously as part of normal content processing.</li>
<li>MIQPS Computation: After a content processing cycle completes for a domain, an offline job is triggered. This job downloads the URL corpus, runs the MIQPS algorithm (grouping, sampling, content ID comparison), performs anomaly detection, and publishes the result to both a config store (for runtime consumption) and S3 (for archival and debugging).</li>
<li>URL Normalization: At runtime, the URL processor loads the MIQPS map from the config store at initialization. For each URL it processes, it looks up the query pattern, retrieves the non-neutral parameter set, and strips all parameters not matched by any of the four normalization layers.</li>
</ul>

This separation of concerns means the expensive content ID comparison happens offline and asynchronously, while runtime URL normalization is a fast, in-memory lookup.

An alternative design would be to determine parameter importance **in realtime** — rendering the page with and without each parameter at the moment a URL is first encountered. This would eliminate staleness entirely and provide immediate coverage for newly discovered domains. However, we chose the offline approach for several reasons:

- Latency: Each content ID computation requires rendering a full page, which takes seconds. Testing every parameter in a URL would multiply this cost, adding unacceptable latency to the content processing pipeline.
- Cost: Offline analysis scales with the number of domains, while realtime analysis would scale with the number of URLs — orders of magnitude more expensive.
- Reliability: Transient rendering failures in an offline job are isolated and retryable. 
In a realtime path, they would directly block content processing.

In practice, the offline approach is a natural fit because URL parameter conventions change infrequently — on the order of weeks or months. The small amount of staleness between computation cycles is an acceptable tradeoff for the massive savings in cost, latency, and operational complexity.

<h3>Conclusion</h3>

URL normalization may seem like a mundane infrastructure problem, but at Pinterest’s scale — with a large number of domains and billions of URLs — getting it right has outsized impact on content quality.

The MIQPS algorithm brings several key properties to this challenge:

<ul>
<li>Dynamic and data-driven: MIQPS automatically adapts to each domain’s URL conventions without requiring manual configuration or domain-specific rules. As a domain’s URL structure evolves, the algorithm discovers new patterns and adjusts accordingly.</li>
<li>Layered and defense-in-depth: The multi-layer normalization strategy combines static allowlists, regex patterns, and dynamically computed MIQPS. Each layer catches a different class of parameters, and a parameter only needs to match one layer to be preserved.</li>
<li>Conservative and regression-resistant: The anomaly detection system ensures that MIQPS updates never regress — previously important parameters cannot be silently dropped. 
The system consistently errs on the side of keeping parameters rather than stripping them.</li>
<li>Scalable and cost-efficient: By grouping URLs by pattern, focusing on the top K patterns, and using early exit optimizations, the algorithm keeps computational costs manageable even across hundreds of thousands of domains.</li>
</ul>

By aligning normalization strategies with proven content identity signals, MIQPS ensures every unique item or experience is surfaced cleanly — improving search and recommendations, downstream catalog management, and ultimately the user experience.

<hr>
<a href="https://medium.com/pinterest-engineering/smarter-url-normalization-at-scale-how-miqps-powers-content-deduplication-at-pinterest-4aa42e807d7d">Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication at Pinterest</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.
</article>
<article>
<h1>Finding zombies in our systems: A real-world story of CPU bottlenecks</h1>

Pinterest Engineering — Wed, 15 Apr 2026 16:01:04 GMT

Vaibhav Shankar; Staff Software Engineer | Raymond Lee; Staff Software Engineer | Chia-Wei Chen; Staff Software Engineer | Shunyao Li; Sr. Software Engineer | Yi Li; Staff Software Engineer | Ambud Sharma; Principal Engineer | Saurabh Vishwas Joshi; Principal Engineer | Charles-A. Francisco; Senior Engineer | Karthik Anantha Padmanabhan; Director, Engineering | David Westbrook; Sr. Manager, Engineering

One day in early 2025, the Kubernetes platform team at Pinterest (<a href="https://medium.com/pinterest-engineering/pincompute-a-kubernetes-backed-general-purpose-compute-platform-for-pinterest-8ad408df2d6f">PinCompute</a>) got a ping from our partners on the ML platform team. 
Their <a href="https://medium.com/pinterest-engineering/ray-infrastructure-at-pinterest-0248efe4fd52">Ray-based training jobs</a>, which often take hours of computation on expensive GPU hardware, were crashing. Not every time, but often enough that it was becoming noticeable. Their logs indicated that their distributed training jobs were seeing intermittent loss of network connectivity, and that ultimately caused their jobs to crash. Their ask was simple:

<ol>
<li>Why is this happening?</li>
<li>Can you please make it stop?</li>
</ol>

What started there led to an investigation lasting more than three months and a great lesson in profiling performance bottlenecks. Read on to learn from our fun story about CPU bottlenecks, AWS network drivers, and yes, how we discovered Zombies in our system!

<h3>Background: Ray at Pinterest</h3>

At Pinterest, Ray has become the backbone of our next-gen ML training and inference. Over the past few years, it has enabled us to scale systems, accelerate experimentation, and significantly boost the performance of models powering our diverse ML workloads.

We have previously shared deep dives on our progress, including: Ray Infrastructure (provisioning Ray clusters on in-house K8s clusters at scale [<a href="https://medium.com/pinterest-engineering/ray-infrastructure-at-pinterest-0248efe4fd52">blog</a>]), Batch Inference with Ray (scaling to hundreds of nodes [<a href="https://medium.com/pinterest-engineering/ray-batch-inference-at-pinterest-part-3-4faeb652e385">blog</a>][<a href="https://www.youtube.com/watch?v=HDSy09hrm2I">talk</a>]), Ray for Training (distributed dataloaders and throughput optimization [<a href="https://www.youtube.com/watch?v=yqVLRONwDJs">talk</a>]), and Last-Mile Data Processing (reducing experimentation cycles [<a href="https://medium.com/pinterest-engineering/last-mile-data-processing-with-ray-629affbf34ff">blog 1</a>][<a 
href="https://medium.com/pinterest-engineering/scaling-pinterest-ml-infrastructure-with-ray-from-training-to-end-to-end-ml-pipelines-4038b9e837a0">blog 2</a>]).

Today, we run more than half of the offline ML workload company-wide on Ray, provisioning tens of thousands of Ray clusters per month, a feat made possible only by a robust Kubernetes environment.

<h4>Network Model & Challenges</h4>

<figure><figcaption>Figure 1: Ray architecture at Pinterest</figcaption></figure>

What makes network stability challenging is Ray’s unique network model.

Ray operates as a highly “network-active” system. A Ray cluster generates constant, intensive inter-pod gRPC traffic that is fundamental to the cluster’s operation, with the following two distinct layers:

<ul>
<li>Control Plane: Handles stateful operations, such as node health checks, task submission, actor scheduling, and the maintenance of Object References.</li>
<li>Data Plane: Handles the high-volume transfer of values within the Object Store. Our large-scale ML training relies on this plane to move data rapidly between nodes.</li>
</ul>

Because this traffic is highly distributed and latency-sensitive, the impact of network instability is often non-deterministic, manifesting across various components of a Ray cluster:

<ul>
<li>Job Hanging: Caused by actor state corruption following brief network interruptions. [<a href="https://www.google.com/search?q=link&authuser=1">github issue</a>]</li>
<li>ObjectFetchTimedOutError / ObjectLossError</li>
<li>ActorDiedError</li>
<li>Node failed the health check and crashed</li>
<li>…</li>
</ul>

All of these occurrences resulted in one common outcome: our Ray training jobs would crash (some use cases saw a success-rate drop of more than 25%), resulting in the loss of expensive compute hours and a significant slowdown in model building and experimentation. 
After grinding for over a month seeking solutions for individual issues in the Ray stack, the ML Platform team realized it was necessary to turn our attention to lower-level network issues with our friends from the PinCompute team.

<h3>Symptom 1: Network driver resets</h3>

At Pinterest, our Kubernetes clusters are backed by AWS EC2 instances, which leverage the ENA Network driver (<a href="https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/RELEASENOTES.md">ref</a>) as a standard traffic component. This Network driver works with AWS Elastic Network Interfaces (ENIs) and sets up receive and transmit queues for buffering packets. Our first sign that something was wrong was that whenever the ML training jobs failed with network connectivity issues, the failure correlated with a Network driver ‘<a href="https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/ENA_Linux_Best_Practices.rst">reset</a>’, as seen in our system logs.

<pre>[] ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1
[] ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time but is scheduled
# .... Bunch of stats excluded....
[] ena 0000:20:03.0: ENA Large LLQ is disabled
[] ena 0000:20:03.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.11.0g</pre>

From the reference docs:

Q: What is [the] ENA device reset?

A: ENA device reset is a self healing mechanism that is triggered when the driver detects unexpected device behavior. Example of such behavior could be an unresponsive device, missing keep-alive events from the device, Tx completions timeouts, netdev timeout etc. 
The device reset is a rare event, lasts less than a millisecond and might incur loss of traffic during this time, which is expected to be recovered by the transport protocol in the instance kernel.

Ok, so the driver saw Tx threads paused for an extended period of time (hardcoded to 5s in AWS ENA Kernel drivers) and reset the device, which could cause some packet drops. A typical reason for resets was documented as <a href="https://github.com/amzn/amzn-drivers/blob/master/kernel/linux/ena/ENA_Linux_Best_Practices.rst#cpu-starvation">CPU starvation</a>, i.e., when the Network driver’s threads don’t get CPU time for several seconds. So perhaps something CPU intensive was starving out the Network driver threads?

<h3>Symptom 2: CPU utilization</h3>

Our next observation was that some of the machines where we saw network resets exhibited high system CPU usage, which correlated nicely with the CPU starvation theory in the ENA documentation. We speculated that our training jobs were using inefficient memory allocators and that this was causing a high rate of page faults.

<figure><figcaption>Figure 2: Page faults per second on impacted machines</figcaption></figure>

We did what many reasonable people would do:

<ul>
<li>We tried using Huge pages (by turning on <a href="https://docs.kernel.org/admin-guide/mm/transhuge.html">TransparentHugePages</a>) to reduce page faulting.</li>
<li>We experimented with more efficient memory allocators like <a href="https://jemalloc.net/">jemalloc</a>.</li>
<li>We tried to give the training jobs their own CPU cores by providing them CPU affinity via <a href="https://man7.org/linux/man-pages/man1/taskset.1.html">taskset</a>.</li>
<li>Out of desperation, we played with interrupt pinning for ENA drivers by steering network interrupts to other cores.</li>
</ul>

Nothing worked. While we saw some drops in overall CPU utilization and page faulting from the memory allocators and huge pages settings, the network resets continued. 
They sometimes happened very early in a training job run and sometimes several hours into their execution. Across hundreds of training job runs, it was hard to predict when exactly we’d see a network reset, if at all.

One mitigation did work, albeit briefly, and it’s everyone’s favourite IT Crowd advice: yes, we turned it off and on again. When we rebooted machines with a high number of resets, they were able to support running ML jobs just fine... that is, until they weren’t. We clocked it at approximately one week of uptime, after which the network resets returned on the rebooted machines.

<h3>Symptom 3: Availability zone differences</h3>

To further understand the problem, the ML platform team started emitting metrics whenever an ENA reset was observed. Once the metrics were available, the team noticed something odd: the network resets were happening on machines in only one AWS Availability Zone, while all their jobs with identical parameters were running just great in other zones.

<figure><figcaption>Figure 3: Network resets per Availability Zone</figcaption></figure>

The PinCompute team runs zonal clusters (one Kubernetes cluster per Availability Zone), but when the team looked at our cluster configurations across different zones, they seemed identical. They were running the same version of Kubernetes and the same system image. So, did we get a bad hardware batch!? We reached out to our excellent AWS support team and, after several engagements, were convinced that the issue was definitely not on the AWS side. Their analysis was clear: there was something on our machines in the us-east-1a zone that was heavily using the CPU and starving the network threads. So why would only one availability zone’s machines exhibit this network reset behaviour?

<h3>Profiling attempts: perf and mpstat</h3>

We decided it was time to stop with high-level metrics and start profiling what was actually using the high amounts of CPU. 
Performance engineers know all about <a href="https://www.brendangregg.com/perf.html">perf</a> and its versatility. perf is a Linux profiler that can provide insights into ‘hot’ code paths and a call stack indicating CPU time spent by a particular process on a machine. Initially, our rudimentary snapshots of perf revealed the same suspected actors: page faulting and some heavy computation from our ML jobs. However, this didn’t indicate CPU starvation all on its own.

<figure><figcaption>Figure 4: perf snapshot on an impacted machine</figcaption></figure>

We realized that for CPU starvation to happen, it may take as little as one CPU core to be heavily utilized and block an unlucky network thread that was scheduled onto that core. Moreover, we realized that our GPU machines had 96 vCPU cores, which meant that an overall perf view told us very little about what was happening in each individual core.

To address this, we used <a href="https://linux.die.net/man/1/mpstat">mpstat</a> to get an overview of per-core utilization on a per-second basis for an hour to identify if specific cores were using up large amounts of CPU. In our offline analysis, we found that a single CPU core (CPU 39 in the following screenshot) was sometimes using 100% of its system CPU for multiple seconds! This also correlated with when a network reset happened. 
We were finally closing in on the root cause!

<figure><figcaption>Figure 5: 100% System CPU utilization on a single core (Core 39) when profiled per second.</figcaption></figure>

Given these network resets were happening at unpredictable times and we lacked perf runs from the times of the reset, we were still missing one key detail: what process was using up the CPU for this extended period of time?

<h3>Temporal profiling: Time is an important factor</h3>

We realized that if there was a sporadic process (think something in your crontab or some kind of periodic sync loop in a process) that was causing high CPU utilization at specific times on the machine, then a random perf sample wouldn’t tell us about that. We needed a tool like <a href="https://github.com/intel/gprofiler">gProfiler</a> to be running for an extended period of time and then ‘time travel’ to a specific point in time to look at what was happening on the CPU cores at that time. Unfortunately, at the time of this incident, we didn’t have gProfiler running everywhere within our fleet, but the principles were sound! Thanks to some creative setup from our ML platform team, we created the following experimental setup:

1. Reserved a small number of machines (via Kubernetes <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/">taints</a>) for analysis.

2. Kicked off a series of training jobs in parallel on these machines. For simplicity, we repurposed our in-house Hyper-parameter tuning to orchestrate identical model training across reserved machines, allowing each training run’s resource footprint to remain fairly constant.

3. Kicked off a script that ran perf in 2 minute increments with profiles and CPU stacks data saved to disk. The script looked a bit like this and ran on all of our reserved machines as a system process.

<pre># Bash program to generate CPU stacks snapshots on a machine. 
# Run perf record for 2 minutes at a time, since each perf data file can become very large for longer periods.
# Record the start time in the filename for 'time traveling' later!
# Running this 360 times covers roughly a 12 hour period of profiles.
for i in {1..360}
do
  sudo perf record -F 97 -g -a -o perf-$(hostname)-$(date +"%Y%m%d-%H-%M-%S")-120s.data -- sleep 120
done

# Generate perf stacks (match only the .data files so reruns skip existing .stacks files)
for datafile in perf-*.data
do
  perf script --header -i "$datafile" > "$datafile.stacks"
done</pre>4. We ran the data collection overnight (~12 hours) and waited for a reset to be triggered. Since our ML training jobs typically ran for 8–12 hours, we were confident that we would observe a reset over this period across at least a subset of the training jobs. Sure enough, when we came to analyze the data the next day, we found that network driver resets had been triggered along with job failures. Unlike before, we now had perf data to examine from the time of the reset! We fetched the perf results for the 2 minute time window around the reset event and visualized them with the excellent <a href="https://github.com/Netflix/flamescope">Flamescope</a> tool, courtesy of our friends at Netflix. Flamescope lets us view a 2 minute CPU stack profile with a time travel view, so we can zoom into a subset of the time window and observe what was happening on the CPU at that time. From the ENA reset logs, we found that the reset had happened about 70 seconds into this profile, so we zoomed in to a 5 second region around the reset in the high-level view.<figure><figcaption>Figure 6: Temporal high-level view of CPU utilization from flamescope. X-axis is time from 0–120 seconds for the 2 minute snapshot</figcaption></figure>Our first observation was that the kubelet, our lightweight Kubernetes agent, was occupying ~6.5% of total CPU usage a few seconds before an ENA reset. 
This was alarming and interesting because the rest of the time, the kubelet barely broke 1% of CPU usage.<figure><figcaption>Figure 7: Profile of the CPU just before ENA resets. Notice the high kubelet utilization.</figcaption></figure>We zoomed in a bit deeper and found that the kubelet was spending a lot of time inside a kernel function: mem_cgroup_nr_lru_pages.<figure><figcaption>Figure 8: Zoomed in profile of the CPU stacks for the kubelet process</figcaption></figure>We now had a suspect: something was causing the kubelet to iterate over all the <a href="https://docs.kernel.org/admin-guide/cgroup-v1/memcg_test.html">memory cgroups</a> on the host and spend significant time on the CPU. While we were researching this, we came across this <a href="https://blogs.oracle.com/linux/zombie-memcg-issues">excellent post</a> on the Oracle blog describing the problem of zombie memory cgroups. Could we be running into this problem? Fortunately, that blog post guided us perfectly and we saw the following on a machine that was experiencing network driver resets:<pre># Kernel tracked cgroups (including zombies)
$ cat /proc/cgroups | grep memory | awk '{print $3}'
68680

# Actual cgroups
$ find /sys/fs/cgroup/memory/ -type d | wc -l
240</pre>Yup, we definitely had zombies! Nearly 70,000 memory cgroups were tracked in the kernel but only 240 were actually in use. Iterating over that long list inside the kernel was likely what was causing the CPU utilization spikes on a single core, and if a network thread landed on that core at just the wrong time, it could become starved! But what was causing the buildup of memcgs?<h3>Beware of system defaults</h3>Our theory at this point was that the buildup of memcgs came from some crashlooping container, which kept re-creating cgroups and leaking memcgs. 
We didn’t see any such container created by Kubernetes but spotted a container that was always only a few seconds old when we queried the docker API:<pre>$ docker ps -a
CONTAINER ID   IMAGE                            COMMAND    CREATED          STATUS                             PORTS   NAMES
c6fdfc760921   amazon/amazon-ecs-agent:latest   "/agent"   11 seconds ago   Up 10 seconds (health: starting)           ecs-agent</pre>Why was the Amazon ECS Agent running (and repeatedly crashing!) on our Kubernetes nodes? This was certainly unintentional given <a href="https://aws.amazon.com/ecs/">ECS</a> is an alternative container orchestration platform that we weren’t using. It turns out that for our GPU instances, we were leveraging the <a href="https://docs.aws.amazon.com/dlami/">AWS Deep Learning AMI</a> (Ubuntu 20.04) as a base machine image and it set up ecs-agent as a default systemd unit. As part of the machine’s bootstrap process, it also started the ECS agent, which over several days of crashing accumulated a massive number of memory cgroups. The ECS Agent was correctly crashing: we had not given our machines permissions to join an ECS cluster, so it was natural that the container failed to start up. This also explained why rebooting the machines gave us temporary relief: rebooting reset the memcg counts! We fixed the issue by simply turning off the ECS agent systemd unit in our base images and rebooting all our machines to purge the zombie memcgs. After this, we noticed that the memory cgroup counts remained stable and, most importantly, Ray Training jobs were running with their expected high success rate again. The problem of ENA resets and the zombies in our machines was fully resolved and our ML training teams could go back to building awesome new models to serve Pinterest customers!<h3>Hold on! What about the availability zones disparity?</h3>Oh... right. Well, erm, we messed up a little. See, when we said that the two node configurations were identical across the two clusters, that was only mostly true. 
For our Kubernetes cluster in the unaffected availability zone, we had an independent bug where we delivered the same Kubernetes binary via two different URLs to the two clusters. Long story short, the difference in URLs caused a last step that emitted a metric to fail and caused the node bootstrap script to get marked as failed. This prevented the ECS agent from starting up because its systemd unit depended on the bootstrap script to successfully complete, which in turn allowed the nodes to remain ‘healthy’, at least from the perspective of not accumulating memcgs! The Kubernetes team was aware of this different URL issue and was independently working on fixing that as well, which in turn would have brought the network reset issue to the unaffected Availability zone as well.<h3>Key Takeaways</h3><ul><li>Introducing fleet wise metrics to track transient issues on the Platform is helpful to identify failure patterns. In this case, it helps us understand that the issue was correlated to AZ/Cluster setup, further leading us to isolate and consistently reproduce the problem.</li><li>Create reproducible, closed environments for iterative debugging. In our case, the partnership between the PinCompute and ML Platform teams to set up debugging experiments was critical to quickly identifying the root cause of the issue.</li><li>Invest in profiling tools and especially temporal profiling tools! They’re great and will save you hours and hours when working on hard-to-debug performance problems. At Pinterest, we’re developing and rolling out <a href="https://github.com/intel/gprofiler">gProfiler</a> in close collaboration with Intel for debugging situations like this.</li><li>Be aware of what processes are running on your base OS images. Sometimes, the defaults aren’t necessarily the right ones for your environment. 
Invest in profiling the success rate of your systemd units and watch out for the impact of regular failures.</li><li>When looking at differences between two environments that look the same but act differently, look closer.. You’re probably missing some piece of configuration that is causing the two paths to diverge. Better yet, invest in good automated tooling to ensure your environments are truly identical.</li></ul><hr><a href="https://medium.com/pinterest-engineering/finding-zombies-in-our-systems-a-real-world-story-of-cpu-bottlenecks-ea4722e552eb">Finding zombies in our systems: A real-world story of CPU bottlenecks</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story. </article> <article> <h1>Scaling Recommendation Systems with Request-Level Deduplication</h1> Pinterest Engineering — Mon, 13 Apr 2026 19:01:01 GMT Authors: Matt Lawhon | Sr. Machine Learning Engineer; Filip Ryzner | Machine Learning Engineer II; Kousik Rajesh | Machine Learning Engineer II; Chen Yang | Sr. Staff Machine Learning Engineer; Saurabh Vishwas Joshi | Principal EngineerAt Pinterest, scaling our recommendation models delivers outsized impact on the quality of the content we serve to users. Our <a href="https://arxiv.org/abs/2507.12704">Foundation Model</a> (oral spotlight, ACM RecSys 2025), for example, achieved a 100x increase in transformer dense parameter counts and a 10x increase in model dimension; translating directly into meaningful quality improvements across multiple recommendation surfaces.¹But a 100x scaleup creates massive infrastructure pressure. Storage, training, and serving costs all threaten to grow proportionally unless you’re deliberate about efficiency. 
The single highest-impact technique we’ve deployed to hold costs in check across all three dimensions is request-level deduplication: a family of techniques that ensures we process and store request-level data once, not once per item.In this post, we’ll walk through what request-level deduplication is, why it matters so much for modern recommendation systems, and how we applied it across the full ML lifecycle , from storage compression to training correctness and speedups to serving throughput gains.<h3>Background</h3>A request is triggered when a user opens their feed, kicking off the recommendation funnel:<ul><li>Retrieval: Aggregate user and request information into one or multiple embeddings, then fetch a large set of potentially relevant items from the entire corpus using techniques like nearest neighbor search.</li><li>Ranking: Aggregate user, request, and item information to make predictions about relevance or engagement. Typically there are early-stage ranking models (which need cheap per-item inference since they score many items) and late-stage ranking models (which can afford more expensive per-item inference since fewer items are ranked).</li></ul><figure></figure>The same user data flows through every stage of this funnel, and within each stage, it’s duplicated across every item scored. Request-level deduplication refers to the category of techniques that eliminate this redundancy when storing, moving, or transforming this data.The impact can be extremely high because:<ul><li>Request-level data is massive. It largely consists of user sequences, approximately 16K tokens encoding all actions a user has taken on the platform. These sequences power sequential user understanding components like the <a href="https://arxiv.org/abs/2507.12704">Pinterest Foundation Model</a> and <a href="https://arxiv.org/abs/2506.02267">TransAct</a>. 
Each sequence is duplicated identically for every candidate item scored, hundreds to thousands of copies per request.</li><li>Processing this data is expensive. The computation associated with user tower models in retrieval and user sequence understanding components in ranking represents a significant proportion of total recommendation system compute.</li></ul><figure></figure><h3>Storage</h3>One of the key ways deduplication pays off is at the storage level. A row in a training dataset typically consists of [request/user, content item, engagement label], and we can have hundreds or thousands of content/engagement labels associated with a single request. Without deduplication, the same massive user sequence is stored redundantly for every single content interaction.<figure></figure>By leveraging <a href="https://iceberg.apache.org/">Apache Iceberg</a> with user ID and request ID based sorting (<a href="https://medium.com/pinterest-engineering/how-pinterest-accelerates-ml-feature-iterations-via-effective-backfill-d67ea125519c">How Pinterest Accelerates ML Feature Iterations via Effective Backfill</a>, <a href="https://medium.com/pinterest-engineering/scaling-pinterest-ml-infrastructure-with-ray-from-training-to-end-to-end-ml-pipelines-4038b9e837a0">Scaling Pinterest ML Infrastructure with Ray</a>), we achieve 10–50x storage compression on user-heavy feature columns.² When rows sharing the same request are physically co-located, columnar compression algorithms handle the deduplication automatically. Beyond raw storage savings, request-sorted data enables improved dataset tooling:<ul><li>Bucket joins: Matching keys are co-located, eliminating expensive shuffle operations.</li><li>Efficient backfills: We can update only affected user segments rather than reprocessing entire datasets.</li><li>Incremental feature engineering: Adding new request-level features becomes a localized operation: we can append new columns to existing row groups without duplicating the entire 
dataset.</li><li>Stratified sampling: Request-sorted data enables user-level sampling, ensuring training datasets maintain proper diversity without over-representing highly active Pinners.</li></ul><h3>Training</h3><h4>Addressing Independent and Identically Distributed (IID) Disruption</h4>Early experiments with request-sorted data revealed 1–2% regressions on key offline evaluation metrics in our ranking models.² The root cause was the disruption of the IID assumption.<figure></figure>With IID sampling, each batch contains engagements spread across many users, yielding stable and representative statistics. With request-sorted data, batches become concentrated around fewer users, causing batch-level statistics to fluctuate dramatically based on individual user behavior. Each gradient update is computed from a less representative slice of the data: the model sees a noisier, more biased view of the training distribution, which slows convergence and degrades final quality. The specific vulnerability lies in Batch Normalization (BatchNorm), which normalizes intermediate values by computing mean and variance across the batch. Standard BatchNorm computes these statistics independently on each device’s local batch. When batches are request-sorted and highly correlated, a batch dominated by a single power user will have dramatically different statistics than one with a casual browser.<h3>Fix: Synchronized Batch Normalization (SyncBatchNorm)</h3>SyncBatchNorm aggregates statistics across all devices before normalization. This effectively increases the “statistical batch size” used for computing means and variances, even though each device still processes its local request-sorted batch. The result is that normalization statistics are computed over a much more diverse set of users and requests, restoring the representative statistics that standard BatchNorm enjoyed with IID data. In practice, this simple one-line change fully recovered the performance gap. 
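To make the effect on the statistics concrete, here is a small standard-library sketch (not the production training code) that compares per-device batch statistics with statistics pooled across devices. The activations are made-up numbers. In PyTorch, the switch itself is typically made by converting the model with `torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)` before wrapping it for distributed training.

```python
from statistics import fmean, pvariance


def local_stats(device_batches):
    """Standard BatchNorm: each device uses its own batch mean and variance."""
    return [(fmean(batch), pvariance(batch)) for batch in device_batches]


def synced_stats(device_batches):
    """SyncBatchNorm: mean and variance are computed over all devices' samples."""
    pooled = [x for batch in device_batches for x in batch]
    return fmean(pooled), pvariance(pooled)


# Hypothetical activations from request-sorted batches, one user per device.
device_batches = [
    [9.0, 9.5, 10.0, 10.5],  # device 0: dominated by a power user
    [0.5, 1.0, 1.5, 2.0],    # device 1: dominated by a casual browser
]

local = local_stats(device_batches)             # very different per device
mean_sync, var_sync = synced_stats(device_batches)  # one shared estimate
```

With local statistics, the two devices normalize against means of 9.75 and 1.25; with synchronized statistics, both use the pooled mean of 5.5.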
The communication overhead of synchronizing statistics across devices was negligible compared to the training speedups gained from deduplicated computation.<figure></figure><figure></figure>With IID sampling, the probability that a randomly sampled in-batch negative is actually a positive for the anchor user is negligible: users engage with a tiny fraction of the total item corpus. With request-sorted data, however, batches are concentrated around fewer users, and each user may have dozens or hundreds of engagements grouped together. Many in-batch “negatives” are actually items the user engaged with, they’re false negatives. The false negative rate jumps from ~0% with IID sampling to as high as ~30% with request-sorted data, depending on the number of unique users per batch.²Training the model to push apart items the user actually engaged with actively degrades retrieval quality.<h3>Fix: User-Level Masking</h3>To address this, we extended our existing identity masking to also exclude negatives that belong to the same user as the anchor. 
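A minimal sketch of such a mask, using only the Python standard library. The function names are hypothetical, and the loss is an assumed common form of in-batch softmax in which each logit is corrected by subtracting the log of the item's estimated frequency and the positive stays in the denominator; it is not necessarily the exact production formulation.

```python
import math


def negative_mask(user_ids):
    """mask[i][k] is True when candidate k may be used as a negative for anchor i.

    Comparing user ids drops the anchor's own row (identity masking) and
    every other engagement from the same user (user-level masking).
    """
    n = len(user_ids)
    return [[user_ids[k] != user_ids[i] for k in range(n)] for i in range(n)]


def masked_info_nce(sim, log_p, user_ids):
    """Mean in-batch loss with logit correction and user-level masking.

    sim[i][k] : similarity s(x_i, y_k)
    log_p[k]  : log of the estimated frequency of item y_k
    Assumption: the positive's corrected logit is kept in the denominator.
    """
    mask = negative_mask(user_ids)
    n = len(user_ids)
    total = 0.0
    for i in range(n):
        pos = sim[i][i] - log_p[i]
        negs = [sim[i][k] - log_p[k] for k in range(n) if mask[i][k]]
        denom = math.exp(pos) + sum(math.exp(v) for v in negs)
        total += math.log(denom) - pos
    return total / n


# Hypothetical request-sorted batch: engagements 0-2 come from the same user.
user_ids = ["u1", "u1", "u1", "u2", "u3"]
mask = negative_mask(user_ids)
sim = [[0.0] * 5 for _ in range(5)]
loss = masked_info_nce(sim, [0.0] * 5, user_ids)
```

For the first anchor, only the engagements from `u2` and `u3` remain as negatives; the two other `u1` engagements are masked out instead of being pushed away.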
The standard InfoNCE loss with logit correction:<figure></figure>becomes:<figure></figure>where:<ul><li>s(·,·) is the similarity function (e.g., dot product) between user and item embeddings</li><li>x_i is the user embedding for the anchor engagement i</li><li>y_i is the positive (target) item for engagement i</li><li>y_k represents candidate negative items from batch B</li><li>x_k is the user associated with candidate k</li><li>p_y values are streaming frequency estimates (<a href="https://research.google/pubs/pub48840/">Yi et al., 2019</a>) used for logit correction</li><li>x_k ≠ x_i is the new constraint: only use engagements from other users as negatives</li></ul>This simple masking change allowed us to successfully adopt request-sorted data for retrieval model training while preserving model quality.<h3>Manifesting Training Speedups</h3>The previous sections focused on correctness, ensuring model quality is preserved when switching to request-sorted data. Here we discuss how to actually realize the compute and memory savings that deduplication enables.<h4>Data Loading</h4>Our data loading infrastructure, shared across ranking and retrieval models, is designed to maintain deduplication as long as possible in the pipeline. All preprocessing and feature transformations operate on deduplicated request-level data. We only reduplicate (expand) at the very end, on GPU or directly in the model’s forward pass. This minimizes CPU-to-GPU transfer costs and memory allocation overhead.<h4>Retrieval Models</h4><figure></figure>Achieving request-level compute deduplication in retrieval models is straightforward thanks to the two-tower architecture. Since the user tower has no item dependencies by definition, we rewrite the forward pass to run the user tower on the deduplicated batch of R unique requests rather than the full batch of B user-item pairs. The item tower continues to operate on the full batch. 
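The rewritten forward pass can be sketched as follows, in plain Python rather than a tensor framework. The `user_tower` stand-in, its toy embedding, and the request ids are all hypothetical; the point is only that the expensive tower runs R times and its outputs are gathered back to B rows.

```python
def dedup_requests(request_ids):
    """Return the unique request ids (first-seen order) and, for each row of
    the full batch, the index of its request in that unique list."""
    index_of = {}
    inverse = []
    for rid in request_ids:
        if rid not in index_of:
            index_of[rid] = len(index_of)
        inverse.append(index_of[rid])
    return list(index_of), inverse


def user_tower(request_id):
    """Stand-in for an expensive user tower; returns a toy 2-d embedding."""
    user_tower.calls += 1
    return [float(len(request_id)), 1.0]


user_tower.calls = 0

# Full batch of B = 5 user-item pairs coming from R = 2 unique requests.
request_ids = ["r1", "r1", "r1", "r2", "r2"]
unique_ids, inverse = dedup_requests(request_ids)

unique_emb = [user_tower(rid) for rid in unique_ids]  # R tower evaluations
full_emb = [unique_emb[j] for j in inverse]           # expand back to B rows
```

In a tensor framework the last line becomes an index/gather operation, which is also what routes the gradients of all B rows back to the R deduplicated user embeddings.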
Gradients for the user tower are computed at the deduplicated level and appropriately accumulated.Though conceptually simple, the savings compound in practice, memory allocation, I/O, and compute all benefit, particularly for large user sequence models where the user tower dominates training cost.<h3>Ranking Models: Deduplicated Cross-Attention Transformer (DCAT)</h3>Ranking models present a greater challenge because transformer architectures used for user understanding typically have item dependencies: each candidate item attends to the user history, coupling request-level and item-level computation.To address this, we developed DCAT, described in detail in the <a href="https://arxiv.org/abs/2507.12704">Pinterest Foundation Model paper</a>. The key insight is to separate the transformer into two components:<ol><li>Context: Apply the transformer to the user’s historical action sequence once per deduplicated request. The keys and values (KV) from each layer are cached.</li><li>Crossing: Each candidate item performs cross-attention with the cached user history KV, reusing the deduplicated context computation.</li></ol>This optimization, implemented with custom <a href="https://triton-lang.org/">Triton</a> kernels for both training and serving, achieved significant throughput gains over standard self-attention with <a href="https://arxiv.org/abs/2205.14135">FlashAttention</a>.<h3>Training Impact</h3><figure></figure>Taken together, request-level deduplication delivered a 4x end-to-end training speedup for retrieval and a ~2.8x speedup for ranking (40% from deduplicated data loading compounded with a 2x gain from DCAT cross-attention).²<h3>Serving</h3>For retrieval, serving has always been correctly deduplicated by design: we embed the user once and search against the item index. No changes were needed.For ranking, the DCAT architecture provides the same deduplication benefit at serving time as it does during training. 
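The context/crossing split can be illustrated with a toy single-head, single-layer example in plain Python. This is a simplified sketch, not the Triton implementation: the projections are identities, and the vectors and function names are made up.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def build_context_cache(history):
    """Context step: runs once per request. With identity projections the
    cached keys and values are simply the history vectors."""
    build_context_cache.calls += 1
    return {"keys": history, "values": history}


build_context_cache.calls = 0


def cross_attend(query, cache):
    """Crossing step: one candidate attends over the cached keys and values."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in cache["keys"]
    ]
    weights = softmax(scores)
    return [
        sum(w * value[j] for w, value in zip(weights, cache["values"]))
        for j in range(d)
    ]


history = [[1.0, 0.0], [0.0, 1.0]]                 # user action sequence
candidates = [[1.0, 1.0], [2.0, 0.0], [0.0, 0.0]]  # items to score

cache = build_context_cache(history)               # once per request
outputs = [cross_attend(c, cache) for c in candidates]  # once per item
```

However many candidates are scored, the history is processed a single time; only the cheap cross-attention step scales with the number of items.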
The context transformer processes the user’s action sequence once per request, the key-value (KV) cache stores the intermediate representations, and each candidate item cross-attends to this cached context. This avoids redundantly recomputing the full user sequence for every item scored.The result is a 7x increase in ranking serving throughput.² This is what made it possible to deploy a 100x larger model without proportional serving cost increases, absorbing the full Foundation Model scaleup while holding infrastructure budgets in check.<h3>Conclusion</h3>Request-level deduplication delivered impact across every layer of our ML lifecycle:<ul><li>Storage: 10–50x compression on user-heavy feature columns via Iceberg and request sorting²</li><li>Training: 4x retrieval speedup and 2.8x ranking speedup from deduplicated data loading and DCAT²</li><li>Serving: 7x throughput increase via DCAT and custom Triton kernels²</li></ul>Three lessons stand out:<ol><li>Request-level deduplication is a cross-cutting technique. It improves storage, training, and serving simultaneously because the same fundamental redundancy exists at every layer.</li><li>Simple fixes unlock big wins. SyncBatchNorm and user-level masking are minimal code changes with outsized impact. The hardest part was identifying the problems; the solutions were straightforward.</li><li>Impact compounds across the stack. Storage compression enables faster data pipelines, training speedups accelerate experimentation velocity, and serving throughput reduces infrastructure cost, freeing capacity for the next round of model scaling.</li></ol>¹ <a href="https://arxiv.org/abs/2507.12704">Pin Foundation Model</a>, ACM RecSys 2025. ² Pinterest Internal Data, Global, 2025.<h3>Acknowledgements</h3>This work reflects joint efforts across multiple teams at Pinterest. 
We’d like to thank: Devin Kreuzer, Piyush Maheshwari, Hanlin Lu, Xue Xia, Abhinav Naikawadi, Yuming Chen, and Aditya Mantha (Personalization); Kousik Rajesh, Xiangyi Chen, Zelun Wang, Hanyu Li, Pong Eksombatchai, Jaewon Yang, Yi-Ping Hsu, and Hongtao Lin (Applied Sciences); Raymond Lee, Sheng Huang, Neha Upadhyay, Nazanin Farahpour, Henry Feng, Alekhya Pyla, Rubin Fergerson, and Shengtong Zhang (ML Platform); Shivin Thukral, Joseph Bongo, Zach Barnes, and Yang Cao (Search); and Anya Trivedi, Akshay Iyer, and Rui Liu (Notifications).<hr><a href="https://medium.com/pinterest-engineering/scaling-recommendation-systems-with-request-level-deduplication-93bd514142d9">Scaling Recommendation Systems with Request-Level Deduplication</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story. </article> <article> <h1>Performance for Everyone</h1> Pinterest Engineering — Wed, 08 Apr 2026 16:01:01 GMT Author: Lin Wang (Android Performance Engineer)<figure></figure><h4>Default Feature</h4>For mobile apps, performance is considered as the “default feature”, which means apps are expected to run fast and be responsive. It’s just as if we expect a watch to show the time. With no exceptions at Pinterest, we measure, protect and improve performance for all of our key user experiences’ surfaces, such as “Home Feed” and “Search Result Feed”.<h4>Hard to Measure</h4>Among all the performance metrics, the user perceived latency is a crucial one. It measures how much time the user spends since they perform an action until they see the content. This is also called “Visually Complete”.Visually Complete can be very different from app to app or even from surface to surface within one app. 
On Pinterest’s “Video Pin Closeup” surface, Visually Complete means the full-screen video has started playing; on our “Home Feed” surface, it means all the images are rendered and the videos are playing; on our “Search Auto Complete Page”, it means the text of the autocomplete suggestions is rendered along with the avatar images.<figure></figure>Because the definition of Visually Complete varies this much, engineers had to write customized measurement logic for each surface, which takes a lot of engineering effort and carries an ongoing maintenance cost. This became a major barrier for product engineers who wanted to work on performance, especially on newly created surfaces. On average, it takes two engineer-weeks to implement a User Perceived Latency metric on the Android client and wire it up to all the toolsets for production usage.<h4>All-In-One Solution</h4>Over the years, the performance team at Pinterest has been thinking about how to offer performance measurement to product engineers at the lowest possible cost, so that more of them can easily access their feature’s User Perceived Latency and work on performance. Recently, we found an answer. In a nutshell, we built the Visually Complete logic into the base UI class (e.g. BaseSurface). As a result, the Perceived Latency of any UI surface (existing or new) is measured automatically as long as the feature is built on top of this base UI class.<h4>Walk the View Tree</h4>First, we define a few common media view interfaces: PerfImageView, PerfTextView, and PerfVideoView. Each of them exposes a few methods that report its rendering status: isDrawn(), isVideoLoadStarted(), x(), y(), height(), width(), etc.<figure></figure>At the BaseSurface level, we have access to the root Android ViewGroup (e.g. RootView), so we can walk the view tree starting from the RootView and visit every view in it. 
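The walk can be sketched in a language-agnostic way as below. This is a Python illustration rather than the Android implementation: the `View` class, its fields, and `is_visually_complete` are hypothetical stand-ins for the Perf* interfaces, and on-screen visibility is reduced to a single flag instead of being derived from x(), y(), height(), and width().

```python
from dataclasses import dataclass, field


@dataclass
class View:
    kind: str = "group"               # "group", "image", "text", or "video"
    visible: bool = True
    drawn: bool = False               # stands in for isDrawn()
    video_load_started: bool = False  # stands in for isVideoLoadStarted()
    children: list = field(default_factory=list)


def is_visually_complete(root):
    """Walk the tree from the root; every visible image and text view must be
    drawn and every visible video must have started loading."""
    stack = [root]
    while stack:
        view = stack.pop()
        if not view.visible:
            continue  # assumption: hidden views and their subtrees are ignored
        if view.kind in ("image", "text") and not view.drawn:
            return False
        if view.kind == "video" and not view.video_load_started:
            return False
        stack.extend(view.children)
    return True


root = View(children=[View(kind="image", drawn=True), View(kind="video")])
before = is_visually_complete(root)   # the video has not started yet
root.children[1].video_load_started = True
after = is_visually_complete(root)
```

The timestamp at which this check first passes, relative to the user's action, gives the User Perceived Latency for the surface.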
We then consider only the visible views and check whether every PerfImageView, PerfTextView, and PerfVideoView instance has been drawn or, in the case of a video, started.<figure></figure><h4>In Production</h4>Since its release on Android, this system has continuously visualized the User Perceived Latency of more than 60 surfaces at any given time. It has been well received by many product teams, who have started to protect and improve their surfaces’ performance.<figure></figure><h4>Interesting Cases</h4><ul><li>Since all surfaces are measured by the same standard, we can compare multiple surfaces’ performance fairly.</li><li>For some features with a short shelf life (e.g. a Christmas landing page), we previously weren’t able to code their latency metrics in time, but now those latency metrics are ready as soon as the surface is built.</li></ul><h4>Conclusion</h4>Offering performance metrics to product engineers for free makes Pinterest’s performance more visible and encourages everyone to protect and optimize the User Perceived Latency on their surfaces. Following the success on Android, we have also extended the same concept to iOS and web platforms.<h4>Acknowledgements</h4>Special thanks: Arun K<hr><a href="https://medium.com/pinterest-engineering/performance-for-everyone-21a560260d08">Performance for Everyone</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story. </article> <article> <h1>Evolution of Multi-Objective Optimization at Pinterest Home feed</h1> Pinterest Engineering — Tue, 07 Apr 2026 16:01:01 GMT Homefeed: Jiacong He, Dafang He, Jie Cheng (former), Andreanne Lemay, Mostafa Keikha, Rahul Goutam, Dhruvil Deven Badani, Dylan Wang Content Quality: Jianing Sun, Qinglong Zeng ML Serving: Li Tang<h3>Introduction</h3>In feed recommendation, we recommend a list of items for the user to consume. 
It’s typically handled separately from the ranking model where we give probability predictions of user-item pairs.Pinterest’s feed recommendation follows a cascaded system design with retrieval [1][2], pre-ranking [3], ranking [4][5], and re-ranking. While most of these prior works focus on optimizing immediate actions for each candidate Pin, this work will primarily focus on how we build the final layer of the recommendation funnel for multi-objective optimization. This is a critical part of our recommendation system as it helps us balance short-term and long-term engagement, drive new use case adoption, and satisfy various business requirements. Throughout the years, we have made substantial improvements on this layer through both algorithmic and infrastructure upgrades. In this tech blog post, we will share our experiences, learnings and improvements we’ve made over the years on this critical layer.<h3>Overall System Design</h3><figure><figcaption>Figure 1. Cascaded Design of Pinterest Funnel.</figcaption></figure>Figure 1 illustrates the cascaded funnel design of our feed recommendation system from retrieval to ranking to the multi-objective optimization component. While earlier stages mostly optimize for certain positive actions (e.g., saves) given an impression, the multi-objective optimization layer tackles a different problem: determining the best composition of a feed served to the user. This is critical as users tend to have lower intent when visiting Home Feed and their browsing behavior will be significantly impacted by what they see. For example, visually repetitive content is less engaging and is likely to reduce the user’s session length and the likelihood that a user will revisit Pinterest.<h3>Multi-Objective Optimization Design</h3>In this section, we describe the detailed design of our multi-objective optimization layer.<h4>Diversification</h4>Feed diversification is an important factor for continued user satisfaction. 
We empirically found that when removing the feed-level diversity component, users’ immediate actions (e.g., saves) increase on day 1 but quickly turn negative by the second week. This also comes with a reduced session time and other negative downstream effects which significantly reduces the user’s long-term satisfaction. It is important to note that when users engage with less diverse content, engagement signals will also be affected, reinforcing the system to generate less diverse content.To achieve better short-term and long-term engagement, we applied a diversity-based re-ranking algorithm in our feed as the main part of the multi-objective optimization layer. It is also one of the most important parts of the multi-objective re-ranking system.<h4>V1: Determinantal Point Process (DPP)</h4>DPP is widely used in the industry for feed diversification [6][7]. In our first generation of feed diversification, we leveraged DPP as the main component.Mathematically, DPP is parametrized by a kernel matrix Lₙₓₙ where the diagonal entry Lᵢᵢ measures the relevance/quality of the i-th item, and the off-diagonal entries Lᵢⱼ = Lⱼᵢ measure the similarity between item i and j. Practically, we use learned embedding such as GraphSAGE [8] and categorical taxonomy as a lever to determine item and item similarity. Thus, DPP’s kernel matrix can be generalized to L = f₀(Λ) g𝜓(S) f₀(Λᵀ) where Λ is the diagonal matrix whose diagonal entries are relevance scores of items, f₀(·) is a monotonic increasing element-wise transformation.Our first version of the feed diversification algorithm was implemented in 2021 based on the DPP algorithm.Since its launch, it has become one of the most impactful components in our system. As the system becomes increasingly responsive through more real-time signal adoption such as in TransACT[5], we have found out that user satisfaction improves when they have more diverse feed recommendations through DPP. 
We conducted an ablation study by removing the DPP component and found that the user’s time spent impression dropped by over 2% after the first week.<h4>V2: Sliding Spectrum Decomposition</h4>Sliding Spectrum Decomposition (SSD) [9] is a position‑adaptive diversification method for recommendation systems that views a candidate feed as a mixture of latent “spectra” (topics/intents/styles). As we render the feed top‑down, SSD repeatedly decomposes the local similarity structure within a sliding window and rebalances exposure: under‑represented spectra are promoted while over‑represented spectra are softly penalized. This yields locally smooth yet globally balanced diversity, complementing slate‑global methods like DPP. Mathematically, let X ∈ ℝⁿˣᵈ be item embeddings and S ∈ ℝⁿˣⁿ a symmetric similarity matrix built from learned representations (e.g., GraphSAGE). At position t with window size w, restrict S to the window to obtain S⁽ᵗ⁾ and compute a top-K spectral decomposition S⁽ᵗ⁾ ≈ U⁽ᵗ⁾ Λ⁽ᵗ⁾ (U⁽ᵗ⁾)ᵀ, where uₖ⁽ᵗ⁾ is the k-th column of U⁽ᵗ⁾ and λₖ⁽ᵗ⁾ is the k-th diagonal entry of Λ⁽ᵗ⁾. Let r ∈ ℝⁿ be base relevance scores. SSD tracks cumulative exposure Eₖ(t) per local spectrum k and defines an adjusted utility: Uᵢ(t) = f(rᵢ) − β ∑ₖ₌₁ᴷ wₖ(t)·(uₖ⁽ᵗ⁾[i])², where f(·) is a monotone transform of relevance, β controls diversity strength, and wₖ(t) increases with exposure relative to current spectral mass (e.g., wₖ(t) ∝ Eₖ(t) / (ε + λₖ⁽ᵗ⁾)). The next item is i⁎ = argmaxᵢ Uᵢ(t); exposures are updated and the window slides. Compared to DPP, sliding spectrum decomposition has lower computational complexity because it avoids Cholesky-style decompositions of the similarity matrix. The original paper introducing the SSD algorithm (<a href="https://arxiv.org/pdf/2107.05204">link</a>) gives a comprehensive comparison between different variations of the DPP algorithm and SSD:<figure><figcaption>Table 1: Comparisons of greedy inference complexity for SSD and DPP with dense item embeddings. In general, we have 𝑁 > 𝑇 > 𝑤 and 𝑑 > 𝑤. 
[9]</figcaption></figure>

Moreover, SSD’s implementation logic is built from standard linear-algebra blocks (windowed similarity, top-K eigen/SVD, weighted penalties, etc.) and can be implemented cleanly in PyTorch with straightforward operations. It avoids positive semi-definite enforcement, log-determinants, and the fragile numerical issues common in DPP (e.g., jittered kernels, Cholesky failures), enabling a “PyTorch-style” model approach with vectorized scoring and lower serving latency.

In early 2025, we launched the SSD algorithm, with its diversification logic implemented in PyTorch and executed on our company-wide model serving clusters. SSD’s simplicity allowed us to incorporate more features for evaluating pairwise Pin similarities, ultimately leading to an improved balance between engagement and diversification.

<h4>Unified Soft-Spacing Framework</h4>

SSD further enabled us to incorporate quality goals when evaluating pairwise Pin similarities in the backward window. For content less aligned with our quality standards, we added a quality penalty score on top of the SSD objective, which we call “soft spacing” because it lets us avoid clustering this content together while still balancing engagement and diversification.

We define the soft spacing penalty as qᵢ(t) = 𝟙[cᵢ ∈ R] · ∑_{d=1}^{w} (1/d) · 𝟙[c_{t−d} ∈ R], where c_{t−d} is the item placed d positions before the current position t and w is the size of the backward window. The penalty applies when item i belongs to the sensitive set R and nearby previously placed items in the backward window also belong to R, with each prior item inversely weighted by its distance. 
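The penalty defined above can be sketched in a few lines of plain Python. This is an illustrative reading of the formula, not the production code: the function name `soft_spacing_penalty` and the boolean-flag representation of membership in the sensitive set R are assumptions made for the example.

```python
def soft_spacing_penalty(candidate_is_sensitive, placed_is_sensitive, window):
    """Compute q_i(t) for one candidate.

    candidate_is_sensitive: True if the candidate item belongs to the sensitive set R.
    placed_is_sensitive: flags for already placed items, oldest first, so that
        placed_is_sensitive[-d] describes the item d positions back.
    window: backward window size w.
    """
    if not candidate_is_sensitive:
        return 0.0  # the outer indicator 1[c_i in R] is zero
    penalty = 0.0
    for d in range(1, window + 1):
        if d > len(placed_is_sensitive):
            break  # fewer than w items have been placed so far
        if placed_is_sensitive[-d]:
            penalty += 1.0 / d  # closer items weigh more
    return penalty


# Sensitive items sit 1 and 3 positions back, so the penalty is 1/1 + 1/3.
placed = [True, False, True]
print(soft_spacing_penalty(True, placed, window=3))
print(soft_spacing_penalty(False, placed, window=3))
```

Because the weights decay as 1/d, a sensitive item directly adjacent to the candidate contributes three times as much as one three positions back, which is what makes the spacing "soft" rather than a hard minimum-gap rule.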
We then subtract the soft spacing penalty term from the adjusted utility Uᵢ(t), scaled by a coefficient λ that balances it against the other objectives.

This is an important next step for improving content quality on Pinterest and protecting users from content that warrants additional caution. In the past, we usually relied on strong enforcement such as filtering, which sometimes leads to a less satisfying user experience when there is no backfill. In mid-2025, we launched the soft spacing penalty on content with elevated quality risk to restrict its distribution and uphold the highest quality standards at Pinterest. In late 2025, we further abstracted the logic into an easy-to-use, config-based framework, making it more extensible as quality needs evolve.

<h4>System Infrastructure Evolution</h4>

At the launch of DPP, the main multi-objective optimization (blending) layer was composed of a sequence of “nodes.” Several Lightweight Reranking nodes first performed low-latency reordering to optimize for short-term engagement and coarse diversity. Candidate Pins were then passed to the DPP node, where the more time-intensive DPP algorithm was applied. Before the system output the final recommendation list, additional heuristic reordering logic was still needed, such as the spacing strategies mentioned earlier. This chain of nodes is embedded within the Home Feed recommendation backend system. While this setup is relatively robust because it can directly leverage existing backend dependencies, it makes iteration on blending-layer logic challenging: local testing is limited, and experimenting with new features is difficult.

With the introduction of SSD, a significant portion of the blending layer’s logic, including much of the diversification logic, has been migrated to PyTorch and is now hosted within the company’s model serving cluster. 
Our ongoing efforts aim to move more heuristic logic from the blending layer to the model server, thereby simplifying chain execution within the blending layer.

The evolution of the blending layer is illustrated in the figure below:<figure><figcaption>Figure 2. Homefeed Blender System Infrastructure Evolution.</figcaption></figure>

<h4>Evolution of Diversity Signals</h4>

With DPP, our feed diversification stack relied primarily on categorical signals (taxonomy labels such as home decor, fashion, and cooking) and on GraphSAGE as the primary mechanism for defining similarity between Pins.

In early 2025, we migrated our diversification process to a CPU-served SSD algorithm implemented in PyTorch. This made it easier to incorporate richer embedding representations when computing pairwise Pin similarity. SSD’s lower serving latency, relative to DPP, allows us to use a broader set of signals. Specifically, SSD uses the following embeddings to represent Pins and drive diversification:
<ul><li>Visual embeddings: capture visual redundancy and style similarity.</li><li>Text embeddings: capture overlap in titles and descriptions.</li><li>Graph embeddings (GraphSAGE): capture relatedness in the Pin graph, including co-engagement patterns and neighborhood similarity.</li></ul>

In Q2 2025, we added soft-spacing capabilities to address a business need: reducing clustered content exposure without relying on brittle, one-size-fits-all hard-spacing rules. As part of this work, we incorporated content quality signals that identify content requiring additional caution, allowing SSD to demote a candidate when similar content has appeared within a preceding window.

In Q3 2025, we upgraded SSD’s visual embedding to use PinCLIP image features [10]. PinCLIP provides a stronger multimodal visual representation, learned through image-text alignment with additional graph-aware objectives. 
Critically, this signal is also available in near real time, which improves representation quality and, in turn, downstream similarity and diversification behavior for recently ingested Pins.

More recently, in Q4 2025, we added a Semantic ID signal [11] to address a practical gap: while embeddings are excellent at capturing how close two Pins are, they do not always provide a stable, category-like notion of semantics that is useful for controlling diversity. Semantic IDs provide a hierarchical representation derived through coarse-to-fine discretization of content representations, enabling us to reason more explicitly about semantic overlap between items. In SSD, we apply a penalty term that discourages recommending too many Pins with high Semantic ID prefix overlap. This improves both perceived diversity and engagement by reducing repeated content clusters.

For future work, we are focusing on ensuring diversity across user-specific interests and on properly representing the interests the user has historically engaged with.<figure><figcaption>Figure 3: Diversity component timeline</figcaption></figure>

<h4>Ongoing and Future Work</h4>

We currently have several ongoing efforts to optimize the final layer, organized into two major workstreams: 1) a unified generative post-ranking model that optimizes the final slate generation in an end-to-end manner, and 2) a reinforcement-learning-based value model. We will share more details in later blog posts.

<h4>Acknowledgement</h4>

We would like to thank all of our collaborators across Pinterest. 
Ruimin Zhu, Yaron Greif, Ludek Cigler, Jason Madeano, Alekhya, Jaewon Yang, Xianxing Zhang

References:
[1] <a href="https://medium.com/pinterest-engineering/establishing-a-large-scale-learned-retrieval-system-at-pinterest-eb0eaf7b92c5">Establishing a Large Scale Learned Retrieval System at Pinterest</a>
[2] <a href="https://medium.com/pinterest-engineering/advancements-in-embedding-based-retrieval-at-pinterest-homefeed-d7d7971a409e">Advancements in Embedding-Based Retrieval at Pinterest Homefeed</a>
[3] <a href="https://medium.com/pinterest-engineering/pinterest-home-feed-unified-lightweight-scoring-a-two-tower-approach-b3143ac70b55">Pinterest Home Feed Unified Lightweight Scoring: A Two-tower Approach</a>
[4] <a href="https://arxiv.org/abs/2209.08435">Rethinking Personalized Ranking at Pinterest: An End-to-End Approach</a>
[5] <a href="https://arxiv.org/abs/2306.00248">TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest</a>
[6] <a href="https://arxiv.org/abs/1207.6083">Determinantal point processes for machine learning</a>
[7] <a href="https://jgillenw.com/cikm2018.pdf">Practical Diversified Recommendations on YouTube with Determinantal Point Processes</a>
[8] <a href="https://arxiv.org/abs/1706.02216">Inductive Representation Learning on Large Graphs</a>
[9] <a href="https://arxiv.org/abs/2107.05204">Sliding Spectrum Decomposition for Diversified Recommendation</a>
[10] <a href="https://arxiv.org/pdf/2603.03544">PinCLIP: Large-scale Foundational Multimodal Representation at Pinterest</a>
[11] <a href="https://arxiv.org/pdf/2305.05065">Recommender Systems with Generative Retrieval</a>

<hr><a href="https://medium.com/pinterest-engineering/evolution-of-multi-objective-optimization-at-pinterest-home-feed-06657e33cd10">Evolution of Multi-Objective Optimization at Pinterest Home feed</a> was originally published in <a href="https://medium.com/pinterest-engineering">Pinterest Engineering Blog</a> on Medium, where people are continuing 
the conversation by highlighting and responding to this story. </article> </main></body></html>