Test Kitchen/FAQ

This page includes information about the purpose, functionality, and status of Test Kitchen.

General FAQs

What core capabilities does the Test Kitchen deliver?

Simpler approach and tools for on-wiki instrumentation
Coordination and analysis of experiments (A/B tests)
Off-wiki instrumentation support

What are the benefits of Test Kitchen?

Test Kitchen provides standardized tools and processes to:

Reduce experiment setup time from 10 weeks to 1 week
Enable testing across multiple wikis, languages, and platforms
Support testing with registered, temporary, and anonymous users
Automate data collection and analysis
Ensure compliance with privacy policies and security requirements
Make experiment results publicly accessible to movement stakeholders

What's the best route for an engineer on my team, who has not used Test Kitchen before, to get familiar and able to use it for an upcoming experiment?

Start by identifying which workflow fits your needs: Conduct an experiment if you're running an A/B test, or Measure product health if you're setting up baseline instrumentation. These pages will walk you through the process end to end.

Follow the Local development setup guide to get your environment ready.

If you get stuck or want feedback on your approach, the #talk-to-experiment-platform Slack channel is the best place to ask questions. Sharing your measurement plan (template) or instrumentation spec (template) will help us give you useful guidance.

Test Kitchen & Events

How does Test Kitchen relate to the Event Platform?

Test Kitchen may be thought of as an "opinionated" Event Platform client – and because it's opinionated there's less for you to do. It does not replace the Event Platform; it works alongside it, making instrumentation accessible to engineers and community members who don't have Data Engineering support.

Firstly, Test Kitchen owns and maintains a set of base schemas with which your events will be validated, so that you do not have to create a new schema for each new instrument. These schemas include properties for the most common instrument-agnostic data that teams might need to answer their questions, e.g. session ID, pageview ID, namespace and title of the current page. The base schemas also have properties that can hold instrument-specific interaction data.

Secondly, your code passes event names and data to Test Kitchen, rather than streams and events. That is, rather than writing an instrument that submits events to a specific stream, you write an instrument that dispatches events, constructed by Test Kitchen from your event data, to zero or more interested streams.

If you need custom data not available in the base schemas, you can create a custom schema that references them. Note that you are responsible for owning and maintaining any custom schemas you create.
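As a rough illustration of the dispatch model described above (event name and data in, zero or more interested streams out), the sketch below routes one event to every stream that has registered interest in its name. This is not the Test Kitchen SDK: the `Dispatcher` class, the `stream_interests` mapping, the stream names, and the event fields are all hypothetical and only show the shape of the idea.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Dispatcher:
    """Toy stand-in for an opinionated event client (hypothetical, not the real SDK)."""

    # Hypothetical configuration: stream name -> event names it is interested in.
    stream_interests: dict[str, set[str]]
    # Events "submitted" to each stream, kept in memory so they can be inspected.
    submitted: dict[str, list[dict[str, Any]]] = field(default_factory=dict)

    def dispatch(self, event_name: str, data: dict[str, Any], context: dict[str, Any]) -> list[str]:
        """Build one event from the caller's data and send it to every interested stream."""
        # The instrument supplies only a name and its own data; shared context is merged in here.
        event = {"name": event_name, **context, "custom_data": dict(data)}
        targets = [s for s, names in self.stream_interests.items() if event_name in names]
        for stream in targets:
            self.submitted.setdefault(stream, []).append(event)
        return targets


# Hypothetical stream names and interests.
dispatcher = Dispatcher({"stream.a": {"click"}, "stream.b": {"click", "scroll"}})
context = {"page_title": "Main Page"}

click_targets = dispatcher.dispatch("click", {"element": "button"}, context)
hover_targets = dispatcher.dispatch("hover", {}, context)  # no stream is interested
```

In this sketch the instrument never names a stream. Which streams receive the event is decided entirely by configuration, so an event with no interested stream is simply not submitted anywhere.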

Will it replace the current Event Platform?

No. Test Kitchen represents a new model of data collection to make the instrumentation process faster and easier to use. It is a client of the Event Platform itself. It's designed to make instrumentation accessible to engineers and community members who don't have Data Engineering support, by removing the need to create schemas and manage stream configuration manually.

On-Wiki Instrumentation

What will be different if I use Test Kitchen for instrumentation?

Key Differences:

There’s no need to deploy a new schema. Instrumentation happens in the codebase in accordance with the Test Kitchen SDK's rules.
Enablement of data collection from an instrument is performed by the Feature Engineer in consultation with an Analyst.
All configuration of event streams happens in the control plane.
Data Models can be applied to event streams directly for faster and more flexible querying.

Workflow comparison: Test Kitchen vs. legacy process
Process Steps	Test Kitchen	Legacy Process
Teams Involved	1-2	3-4
Number of steps to start collecting data	3	10-12
Time needed to start collecting data	1 day post code deploy	6-10 weeks
Requires schema development	No	Yes
Can be included in volunteer projects	Yes	No
Automatic data collection termination	Yes	No
Data Collection Guideline support	Yes	No

Can I still use the existing process for instrumentation?

Yes. Backwards compatibility with the current Event Platform will be maintained, so you and your team can keep using the existing process if you prefer.

Will I be forced to migrate all my existing schemas and instrumentation to Test Kitchen?

No. Backwards compatibility with the current Event Platform will be maintained, so your existing data will continue to be collected.

What is the process for creating new instrumentation?

Follow this guide for measuring product health.

If I migrate existing instruments, does that impact existing Analytics scripts, visualisations, etc.?

Only if you decommission your old instrumentation. You have two paths:

Run both in parallel: if you leave the existing instrumentation in place, running alongside Test Kitchen, nothing will break.
Migrate fully: if you choose to deprecate existing instrumentation and re-instrument using Test Kitchen, you will need to change your queries to match your new data model. Your data model can produce exactly the same output if desired, in which case you would only need to change the table names where appropriate.

How do we know that data quality is consistent with existing instrumentations?

Because Test Kitchen is itself an Event Platform (EP) client, you can expect the same baseline event rate and data quality regardless of whether you use Test Kitchen or not. We are always running experiments^[1]^[2] to test, measure and improve data quality.

Test Kitchen provides the same common contextual attributes as existing instrumentation, and does not modify instrumentation-specific data that it is passed.

Experimentation & Feature Flags

What is experimentation and why does it matter?

Experimentation, or A/B testing, is how we test the actual impact of product changes with our users. When we make product decisions, we’re acting on assumptions – that a change will make something easier, more useful, or more engaging. Experimentation lets us test if those assumptions are correct before committing to them.

This matters because our assumptions are frequently wrong. Without experimentation, we risk making confident decisions based on intuition that the data would have contradicted.

Test Kitchen provides a unified platform for running experiments consistently across wikis, languages, and platforms. This consistency means results are comparable across teams, best practices are encoded by default, and teams can run experiments without needing specialist support for each one.

Learn more about why and how we experiment at the Wikimedia Foundation.

What is a feature flag?

Feature flags enable you to change your product's behavior from a central location without requiring an entirely new deployment. For example, you can turn a change to a toolbar on or off, or change the placement of buttons in a UI. Engineers and PMs can set a global value for everyone, use traffic rules to assign values to user demographics, and run experiments between different implementations of a feature.

Experiment Design

What's the difference between using `mw-user` and `edge-unique` as my identifier type, and how do I choose?

These identifier types determine who is enrolled in your experiment and how enrollment is managed.

mw-user enrolls users via their CentralAuth global ID. Enrollment is consistent across all wikis and sessions — a user who logs out and back in will remain in the same experiment group. This includes temporary accounts. Use mw-user when you need reliable data about logged-in user behavior.

edge-unique enrolls clients via an anonymous cookie (wmf-uniq). Enrollment covers all user traffic — logged-in, temporary and logged-out — but is based on the cookie rather than user identity. If the cookie is cleared, the client may be re-enrolled into a different group. Note that experiments using edge-unique can only use client-side instrumentation — MediaWiki does not have access to the wmf-uniq cookie. Use edge-unique when your experiment targets logged-out users.

Because edge-unique experiments include both logged-in and logged-out users, results should be analyzed separately by authentication status: logged-out, permanent, and temporary accounts. If you need reliable insights specifically about logged-in user behavior, we recommend running a separate mw-user experiment in parallel or as a follow-up.
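One way to picture analyzing edge-unique results separately by authentication status is the grouping sketch below. The record layout (`auth_status`, `group`, `converted`) and the sample rows are invented for illustration. Real analysis would run against your own data model, ideally together with your Product Analyst.

```python
from collections import defaultdict


def conversion_by_segment(records):
    """Return {(auth_status, group): conversion rate} for an iterable of subject records."""
    totals = defaultdict(lambda: [0, 0])  # key -> [converted count, subjects seen]
    for record in records:
        key = (record["auth_status"], record["group"])
        totals[key][0] += int(record["converted"])
        totals[key][1] += 1
    return {key: converted / seen for key, (converted, seen) in totals.items()}


# Hypothetical per-subject records; field names are assumptions, not a real schema.
records = [
    {"auth_status": "logged-out", "group": "control", "converted": False},
    {"auth_status": "logged-out", "group": "control", "converted": True},
    {"auth_status": "logged-out", "group": "treatment", "converted": True},
    {"auth_status": "permanent", "group": "treatment", "converted": False},
]

rates = conversion_by_segment(records)
```

Keeping authentication status in the grouping key means control and treatment are only ever compared within the same segment.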

For more details, see Experiment design: identifier type.

Note: A logged-in user enrolled in both an edge-unique and an mw-user experiment may fall out of the edge-unique experiment if their cookies are cleared. Their mw-user enrollment will not be affected.

Note: A user enrolled in an edge-unique experiment who then creates an account will still be enrolled in the experiment after account creation, because edge uniques work for logged out, permanent, and temporary account users.

Is `experiment.subject_id` consistent for a logged-in user across platforms and wikis? Is it consistent for both logged-in and logged-out users?

Yes and yes! When using mw-user, the user will be consistently enrolled and assigned across all wikis and across all their sessions. The user may log out and log back in, and their enrollment and assignment will not change.^[3] When using edge-unique, the user will be consistently enrolled and assigned within a top-level domain (e.g. wikipedia.org) for the lifetime of their wmf-uniq cookie.^[4]^[5]

How many wikis and/or users would we need to achieve statistical significance?

The goal should never be to achieve statistical significance. Statistical significance (p-value) only tells you how likely you are to observe an impact as large or larger than what you observed if there were actually no real impact.

You're likely thinking of statistical power: the probability of correctly detecting an impact when one actually exists. It's possible to run an underpowered experiment, but if the true effect is small, you won't be able to detect it reliably.

To conduct a proper power analysis, you need to know the baseline metric and the minimum effect size you want to detect. Since we lack historical data, we don't know realistic improvement targets (1%, 5%, 10%, or 20%). Running an experiment on enough wikis or for long enough could detect a tiny +0.01 percentage point change that's statistically significant at the 0.05 level, but this wouldn't represent a practically meaningful change.
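To make the two inputs of a power analysis concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. It is not the method Test Kitchen or your analysts necessarily use. The baseline rate and effect sizes in the example calls are made-up numbers, and `sample_size_per_group` is a hypothetical helper.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate subjects needed per group for a two-sided two-proportion z-test.

    baseline: conversion rate in the control group (e.g. 0.10 for 10%)
    mde: minimum detectable effect as an absolute difference (e.g. 0.01 for +1 point)
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


# Illustrative inputs only: a 10% baseline with two different minimum effects.
n_one_point = sample_size_per_group(baseline=0.10, mde=0.01)
n_tiny = sample_size_per_group(baseline=0.10, mde=0.0001)
```

Because the required sample grows with the inverse square of the minimum effect, shrinking the effect you want to detect by a factor of 100 raises the required sample by roughly a factor of 10,000. This is why a very large experiment can flag a change too small to matter in practice.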

When it comes to selecting wikis, ask: "How generalizable are the results?"

The goal of experimentation is to produce results you can generalize from and use to predict the impact your intervention would have if rolled out to everyone. Therefore, determine a small, representative set of wikis (~20) that collectively represent your overall population, since that's ultimately where you'll deploy the change. Focus on representativeness over raw numbers — a well-chosen sample of diverse wikis will give you more actionable insights than a large sample from similar wikis.

How does Test Kitchen handle running multiple experiments at the same time?

Test Kitchen is designed to support multiple concurrent experiments. Each experiment generates a unique subject ID by combining the unique identifier with the experiment name. As the number of experiments increases, the likelihood of being in more than one experiment increases.

To help you account for this, when analytics events are logged using Experiment.send(), the system will soon (T421152) automatically include an experiment.other_assigned field. This records the assignments of any other experiments the user is currently enrolled in, so you can identify whether another experiment may have influenced your results.
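The idea of deriving a per-experiment subject ID by combining the unique identifier with the experiment name can be sketched as follows. The hash function (SHA-256), the `:` separator, and the example identifier and experiment names are assumptions chosen for illustration. The actual derivation used by Test Kitchen may differ.

```python
import hashlib


def subject_id(identifier: str, experiment_name: str) -> str:
    """Derive a per-experiment subject ID (illustrative scheme, not the real one)."""
    # Assumption: SHA-256 over "identifier:experiment_name".
    combined = f"{identifier}:{experiment_name}"
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()


user = "12345"  # hypothetical unique identifier
id_a = subject_id(user, "experiment-a")
id_b = subject_id(user, "experiment-b")
```

With a scheme like this, the same user gets a stable ID within one experiment, while their IDs in different experiments are distinct and are not meant to be joined directly.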

What’s the best way to minimize collisions between experiments?

The best way to minimize collisions is to run experiments on different wikis. If two experiments are running on the same product surface for the same users, it becomes difficult to attribute changes in behavior to one experiment rather than the other — even if the features being tested are unrelated.

If you need to run experiments on the same wiki, check the experiment.other_assigned field (after T421152) in your analytics events to identify users who were enrolled in multiple experiments, and account for this in your analysis.

In an A/B/C test, when Test Kitchen divides 100% of a test group by 3 groups, where does the remainder go?

We compute a hash of the user and scale it to a value between 0 and 1 (the scaledHash), then multiply this value by the number of buckets/groups. The integer part of the result is the index of the bucket or group. This means that any remaining value ends up in the last treatment group.

If we have 2 buckets, the assignment, based on the calculated hash, would be something like the following:

0 - 0.5: group 1
0.5 - 1: group 2

And for three groups would be the following:

0 - 0.33: group 1
0.33 - 0.66: group 2
0.66 - 1: group 3
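The assignment rule above can be sketched in a few lines. The way the scaled hash is derived here (SHA-256, first 8 bytes) is an assumption made so the example is runnable, and `scaled_hash` and `assign_group` are hypothetical names. Only the "multiply by the number of groups and take the integer part" step comes from the description above.

```python
import hashlib
import math


def scaled_hash(subject_id: str) -> float:
    """Map a subject ID to a float between 0 and 1 (assumed scheme: SHA-256, first 8 bytes)."""
    digest = hashlib.sha256(subject_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def assign_group(scaled: float, groups: list[str]) -> str:
    """Pick a group by multiplying the scaled hash by the group count and flooring."""
    index = math.floor(scaled * len(groups))
    # Clamp so that a value of exactly 1.0 still lands in the last group.
    index = min(index, len(groups) - 1)
    return groups[index]


three_groups = ["group 1", "group 2", "group 3"]
example = assign_group(scaled_hash("some-subject-id"), three_groups)
```

For example, a scaled hash of 0.34 with three groups gives an index of 1 (the second group), and anything from two thirds up to 1 falls in the third group.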

Running an experiment

When we turn off an experiment, do events stop immediately?

Yes. The updated configuration propagates across all caching nodes in under 3 minutes. Once that's complete, the experiment is fully off: traffic is no longer split, no users are seeing treatments, and no data is being collected.

If you're considering ending an experiment before its planned date, make sure you have enough data to draw meaningful conclusions first. Stopping early, especially after seeing a promising early result, increases the risk of a false positive.^[6] If you're unsure whether you have sufficient data, ask in #talk-to-experiment-platform before turning it off.

How long should I run an experiment?

Run your experiment long enough to capture at least one full week of data, to account for day-of-week effects in user behavior. Beyond that, the right duration depends on your key metrics, as well as the minimum effect size you want to detect. An experiment with second-week retention as a key metric will take at least two weeks to collect the data needed, and smaller expected effects require longer run times to detect reliably.

As a general rule, decide on your intended run time before starting the experiment and stick to it. Stopping early because results look promising increases the risk of a false positive. If you're unsure how long to run your experiment, work with your Product Analyst or ask in #talk-to-experiment-platform.
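The effect of stopping as soon as results look promising can be illustrated with a small simulation of A/A tests, where there is no real difference between groups. The setup is deliberately simplified and entirely hypothetical: unit-variance observations, a z-test with known variance, 14 daily looks, and 100 subjects per group per day. It compares how often an experiment is flagged as significant at any daily look with how often it is flagged at the planned end only.

```python
import math
import random

Z_CRIT = 1.96  # two-sided threshold for p < 0.05


def false_positive_rates(n_experiments=2000, looks=14, batch=100, seed=0):
    """Simulate A/A tests (no true effect) and return (peeking_rate, fixed_rate)."""
    rng = random.Random(seed)
    sd_batch = math.sqrt(batch)  # std. dev. of a sum of `batch` unit-variance observations
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        diff = 0.0  # running difference between the two groups' sums
        n = 0       # observations per group so far
        z = 0.0
        significant_at_any_look = False
        for _look in range(looks):
            # Each group's daily sum is drawn directly; neither group has a real effect.
            diff += rng.gauss(0, sd_batch) - rng.gauss(0, sd_batch)
            n += batch
            z = diff / math.sqrt(2 * n)  # z-statistic for the difference in means
            if abs(z) > Z_CRIT:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look  # would have stopped early at some look
        fixed_hits += abs(z) > Z_CRIT            # significant at the planned end only
    return peeking_hits / n_experiments, fixed_hits / n_experiments


peeking_rate, fixed_rate = false_positive_rates()
print(f"flagged at any daily look:   {peeking_rate:.1%}")
print(f"flagged at planned end only: {fixed_rate:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the "any look" rate is noticeably higher even though no experiment has a true effect. This is the false-positive inflation that committing to a run time in advance avoids.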

What kinds of changes are not suitable for experimentation?

Not every product decision needs an experiment. Experimentation is most valuable when you have a clear metric, a large enough user base to detect meaningful effects, and genuine uncertainty about whether the change will help.

User research and usability testing are better suited to understanding why users behave the way they do, or to evaluating a design before it's built. If you're exploring a new concept or trying to identify pain points, talk to users before reaching for an experiment.
Surveys are better suited to understanding user attitudes, preferences, and self-reported behavior — things that don't show up in behavioral data. If you want to know how users feel about a change, a survey will tell you more than an A/B test.
Log analysis and observational data are better suited to understanding existing behavior patterns without introducing a change. If you're still forming your hypothesis, analyze what's already happening first.

A good experiment tests a specific, well-formed hypothesis. If you're not yet at that stage, another method will likely serve you better. Talk to your Product Analyst or ask in #talk-to-experiment-platform if you're unsure which approach fits your question.

Do metrics used in an experiment have to be in the Metrics Catalog?

No. But if your metric doesn't yet exist, you'll need to define it, add it to the Metrics Catalog, and register your experiment for automated analytics. Work closely with your Product Analyst to do so.

For example, a team wanting to measure "the percent of users who choose three or more topics of interest" could create a custom metric and add it to the catalog for reuse by other teams later. That said, we'd encourage teams to consider more continuous metrics where possible — such as average topics selected per user or topic selection completion rate — as these tend to produce more statistically robust results.

Interpreting Results

What should I do if my experiment shows a negative result?

A negative or neutral result is a valid and valuable outcome: it tells you that your assumption was wrong, and saves you from investing further in a direction that doesn't help users. It is not a failure.

If your experiment shows a negative result, consider: Was the change implemented as intended? Was the experiment adequately powered to detect the effect you expected? Is there a subset of users for whom the change did have a positive effect that's worth exploring further?

Document your findings and share them internally and with our community. Negative results are just as important for the team's collective knowledge as positive ones. Ask in #talk-to-experiment-platform if you need help interpreting your results.

What do I do with my experiment results once I have them?

Once your experiment has run its course:

Review results with your Product Analyst to ensure they're being interpreted correctly
Make a ship/no-ship decision based on the results
Document your decision and the reasoning behind it
Share results with internal and movement stakeholders publicly on a project page
Shift towards cleaning up the code

If you need a second opinion on your results or your interpretation, #talk-to-experiment-platform is the best place to ask.

Aftermath

What do I do with the code when we've made our decision?

Once you've made your ship/no-ship decision, clean up promptly. Leaving experiment code in place longer than necessary adds technical debt and makes the codebase harder to maintain.

If you're shipping the change: remove the feature flag or toggle wrapping the treatment, and make the treatment the default behavior.

If you're not shipping: remove the treatment code entirely, remove the instrument code that uses the Test Kitchen SDK, remove the stream configuration, and remove the instrument from the instrument list.

In both cases, follow the Decommission an instrument guide to turn off data collection. If your experiment used a custom metric that won't be reused, consider whether it should be removed from the Metrics Catalog.

How do I decommission an experiment?

Follow the Decommission an instrument guide.

References

[1] https://phabricator.wikimedia.org/T401453
[2] https://phabricator.wikimedia.org/T420738
[3] https://wikitech.wikimedia.org/wiki/Test_Kitchen/Conduct_an_experiment#Experiment_design:_identifier_type
[4] https://wikitech.wikimedia.org/wiki/Test_Kitchen/Conduct_an_experiment#Experiment_design:_identifier_type
[5] https://meta.wikimedia.org/wiki/Edge_Uniques/FAQ#More_detailed_technical_answer
[6] https://thegood.com/insights/what-is-peeking/
