Tackling flaky end-to-end tests isn't about quick hacks. It's about a methodical process: digging into the root cause—be it timing, test data, or wonky selectors—and swapping out fragile fixes for genuinely resilient solutions. To build a test suite you can actually trust, you need to stop using fixed waits and start waiting for specific network or UI events to happen.
The Hidden Cost of Flaky Tests in Your CI/CD Pipeline
We've all been there. The CI/CD pipeline flashes red, and the team’s first reaction isn't to investigate a bug, but to sigh and say, "Oh, just re-run it." This is more than just a minor frustration; it’s a symptom of a deeper problem that quietly sabotages your team's momentum and, worse, their faith in your entire testing process.
When test failures become background noise, a dangerous "alert fatigue" sets in. The pipeline, which should be your most reliable guard against defects, loses all its credibility. Developers start to tune out the failures, and it's only a matter of time before a critical, user-impacting bug gets a "lucky pass" on a re-run and slips straight into production.
The Real-World Impact of Test Instability
Now, think about the cost in engineer-hours. Every minute your team spends re-running a job, trying to reproduce a failure that "only happens on CI," or debugging a test that inexplicably works on their own machine is a minute they aren't shipping new features. This productivity sinkhole is a direct blow to your delivery schedule. For agile teams trying to move quickly, this can be absolutely devastating.
A flaky test is a failing test. There is no middle ground. Treating instability as anything less than a critical bug in your test code is the first step toward a test suite that nobody trusts.
Here in Australia’s fast-paced tech scene, the problem is especially sharp. We see small engineering teams at SaaS companies, often relying on tools like Playwright or Cypress, wrestling with flakiness in 25-40% of their E2E suites.
I recently worked with a Melbourne-based fintech startup that was losing 15% of its quarterly velocity to this exact issue. One in five of their tests failed randomly due to subtle browser timing problems, burning over 200 engineer-hours a month on manual re-runs and investigations. This isn't just an anecdote; it reflects a major market trend. Application testing is a massive investment, with Australia's software testing market projected to grow by USD 1.7 billion. You can get a sense of this growth by reviewing a detailed market analysis from Technavio.
Before you can fix flaky tests, you need to understand their true cost. This isn't just a technical nuisance; it's a serious business problem. Seeing it this way makes it much easier to justify investing the time to make your tests stable and dependable. The rest of this guide is a playbook for doing just that—transforming those unreliable tests into the rock-solid assets they were always meant to be.
Your Diagnostic Playbook for Identifying Flakiness
Alright, let's get into the detective work. Before you can even think about fixing a flaky test, you have to understand its strange behaviour. The real goal here is to move from a vague "it fails sometimes" to a very specific "it fails when this happens". This shift in mindset is everything; it’s how you go from being a frustrated developer to a methodical investigator.
First things first: stop dismissing those intermittent failures as random quirks. A test that fails unpredictably is a broken test, plain and simple. Your mission, should you choose to accept it, is to reproduce the flakiness reliably. Trust me, a test that fails 100% of the time on your local machine is a thousand times easier to fix than one that fails 1% of the time in the CI pipeline.
This is the cycle we're trying to break. A single flaky test triggers a manual rerun, which seems harmless at first, but it quickly snowballs.

Every manual rerun erodes your team's confidence in the test suite and quietly burns through engineering hours, slowing everyone down.
Reproduce and Gather Evidence
Your primary task is to make the failure happen on command. One of the most effective tricks I've found is to isolate the troublesome test and just run it in a loop. It sounds crude, but it's a brute-force way to dramatically increase your chances of catching the failure in the act and spotting a pattern.
If you’re using Playwright, you can run a specific test file over and over with a simple shell command (newer Playwright versions also ship a built-in `--repeat-each` flag that does the same job):

```bash
# Run the test file 10 times to try and trigger the flaky behaviour
for i in {1..10}; do npx playwright test my-flaky-test.spec.ts; done
```
For the Cypress crowd, you can do something very similar. The idea is to focus the runs on a single spec file to amplify the odds of seeing it fail.
```bash
# Cypress doesn't have a built-in repeat flag, but a shell loop works
for i in {1..10}; do npx cypress run --spec "cypress/e2e/my-flaky-test.cy.ts"; done
```
As that loop is running, keep your eyes glued to the console. Look for anything out of the ordinary. Does it always die on the exact same step? Does the failure line up with a specific network request or a weird console error?
Capturing artifacts at the moment of failure is non-negotiable. Don't just look at a pass/fail status. The video, trace file, and console logs are your clues to solving the mystery.
Dig Deeper with Test Artifacts
Modern testing frameworks give us an incredible toolkit for forensics. They can capture the state of your application at the very moment a test goes wrong. These artifacts aren't just nice-to-haves; they are your most valuable pieces of evidence. Make sure your CI pipeline is set up to automatically save these on every failed run. If you want a bit more context, understanding the fundamentals of test scripts in software testing can help you realise what's happening under the hood.
Here’s what you should be combing through in the artifacts:
- Videos and Screenshots: These give you the visual story. You’ll often spot the culprit immediately—an unexpected pop-up, a loading spinner that never disappeared, or an element that rendered in a completely different spot than your test expected.
- Trace Files (Playwright): Honestly, the Playwright Trace Viewer is a game-changer. It’s like a time machine for your test, giving you a complete recording with DOM snapshots, console messages, and a full network log. You can scrub back and forth to see exactly what the browser was doing before, during, and after the failure.
- Console Logs: Sift through these for JavaScript errors in your application code, failed network requests (look for `4xx` or `5xx` status codes), or even `console.warn` messages that hint at an unusual application state.
By systematically reproducing failures and then meticulously analysing the evidence, you move from guesswork to a proper diagnosis. This methodical approach is the bedrock for applying the right fix and finally getting your test suite stable and reliable again.
Alright, you've managed to pin down a flaky test. Now for the hard part: fixing it for good.
It's so tempting to just slap a sleep(2000) on it and call it a day. We've all been there. But that's not a fix; it's a Band-Aid over a gaping wound, and it's guaranteed to cause more chaos later. To genuinely eliminate flakiness, you have to dig in and solve the root cause with a proper, robust solution.
This means rethinking how you select elements, getting smarter about how you handle timing, and making sure every single test runs in its own clean, predictable little bubble. Let's walk through the battle-tested strategies that turn unreliable tests into the solid, trustworthy assets they're supposed to be.

Build Resilient Selectors That Last
The number one culprit I see for flaky tests? Brittle selectors. A developer refactors a bit of UI, a CSS class name changes, a div gets added for styling, and boom—your test can no longer find the login button. This happens when your test is too coupled to the implementation details of your app's code.
The solution is to create a clear contract between your app and your tests. The most reliable way I've found to do this is with a dedicated test attribute like data-testid.
When you add data-testid="login-button" to a component, you're creating a stable hook for your tests that has nothing to do with styling or DOM structure. It’s also a big, clear sign for other developers: "This is used for testing, don't touch it unless you know what you're doing."
Cypress Example for Resilient Selectors
```javascript
// Brittle: Relies on fragile CSS classes that might change
cy.get('.MuiButton-root.MuiButton-contained.MuiButton-primary').click();

// Robust: Targets a unique, dedicated test ID
cy.get('[data-testid="login-submit-button"]').click();
```
If you can't use a dedicated test ID, your next best bet is to select elements the way a user would find them—by their accessible name or role. This forces your test to behave more like a real person interacting with your site. If you want to go deeper on this, we've written a whole post on testing user flows versus testing DOM elements that really unpacks the philosophy.
Master Asynchronous Operations with Smart Waits
Let me say this as clearly as possible: adding a fixed wait is almost always the wrong move. A cy.wait(3000) is just a wild guess. You're betting the network call or animation will finish in under three seconds. On a good day, you win. When the server is having a slow day, you lose, and your test fails.
The golden rule is this: Never wait for a fixed time; always wait for a specific application state.
Instead of guessing, your test should intelligently wait for something verifiable to happen. Modern frameworks like Playwright and Cypress have incredible built-in tools for exactly this.
- Wait for an element to appear or disappear: The classic example is waiting for a loading spinner to be gone before you try to click anything.
- Wait for a network request to complete: The most bulletproof way to handle data loading. You can actually tell your test to intercept a specific API call and only proceed once it gets a response.
- Wait for text or a value to update: Assert that an element contains the text you expect before you move on to the next step.
Playwright Example for Smart Waits
```typescript
// Don't do this: A fragile fixed wait
await page.waitForTimeout(2000);

// Do this: Start listening for the network call before triggering it,
// so the response can't slip past between the click and the wait
await page.route('/api/user/profile', route => route.fulfill({ /* mock data */ }));
const responsePromise = page.waitForResponse('/api/user/profile');
await page.getByTestId('profile-save-button').click();
await responsePromise;

// Or wait for a visual cue to disappear
await expect(page.getByTestId('loading-spinner')).toBeHidden({ timeout: 10000 });
```
The beauty of this approach is that it makes your tests both more reliable and faster. They move on the very instant the app is ready, instead of just sitting around waiting for a timer to tick down.
Comparing Flakiness Fixes: Brittle vs Robust
To really drive this home, it helps to see the anti-patterns side-by-side with the robust solutions. It becomes obvious which path leads to a stable test suite and which one leads to late-night CI debugging sessions.
| Problem Area | Brittle 'Fix' (Anti-Pattern) | Robust Solution |
|---|---|---|
| Selectors | `cy.get('div > div:nth-child(3) > button')` | `cy.get('[data-testid="submit-form"]')` |
| Timing | `await page.waitForTimeout(5000)` | `await expect(spinner).toBeHidden()` |
| Data Loading | Waiting an arbitrary time for data to show up. | Intercepting the network request and waiting for its response. |
| Test Data | Hardcoding `user@test.com` in 15 different tests. | Generating a unique user with an API call in a `beforeEach` hook. |
Choosing the robust solution every time is an investment. It might take a few extra minutes upfront, but it pays massive dividends in reliability and developer sanity.
Guarantee Test Isolation and Clean Data
Ever had a test that passes perfectly on its own but fails miserably when you run the full suite? That's a classic symptom of state pollution. One test is leaving behind a mess, and the next one is tripping over it.
Every test needs to be atomic and completely independent. It should set up its own world and tear it down when it's done, leaving the environment exactly as it found it.
The "Campsite Rule" of Testing: Always leave the test environment cleaner than you found it. Each test should be a self-contained story with a clear beginning, middle, and end.
Here are the core principles for achieving true test isolation:
- Programmatic Setup: Use API calls in `beforeEach` hooks to create the data you need (like a user account). This is infinitely faster and more reliable than trying to create data through the UI for every test.
- Unique Data: Stop hardcoding values like usernames or email addresses. Use a library or a simple function to generate unique data for every single test run to prevent collisions.
- API Teardown: After your test is finished, use an `afterEach` hook to make an API call that deletes whatever data you created. This ensures the next test gets a completely clean slate.
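To make the unique-data point concrete, here's a minimal sketch of the kind of helper you might write. The `uniqueEmail` name and email format are purely illustrative, not from any framework:

```typescript
// Minimal helper for collision-free test data.
// The timestamp plus a short random suffix keeps parallel workers
// from ever generating the same value in one run.
function uniqueEmail(prefix: string = 'testuser'): string {
  const suffix = `${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
  return `${prefix}-${suffix}@example.com`;
}

// Each test creates its own user, so no two tests fight over the same record
const email = uniqueEmail('signup');
console.log(email);
```

Call this from your `beforeEach` setup instead of reaching for a hardcoded address, and the "two tests collided on the same user" class of flakiness disappears.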
By combining resilient selectors, smart waits, and strict test isolation, you're not just patching over cracks. You're fundamentally re-architecting your tests for stability. It's the only way to truly fix flaky end-to-end tests for good.
Stabilising Tests in Your CI Environment
We’ve all heard it. The classic, "but it works on my machine," is probably one of the most frustrating phrases in software development. When your tests are rock-solid locally but start failing erratically in the CI/CD pipeline, you’re dealing with a whole new category of problems. These issues almost always boil down to environmental differences, and fixing them means shifting your focus from what the test does to where it runs.
A CI runner, whether on GitHub Actions or GitLab CI, is not your souped-up developer laptop. It has different resources, a completely different network profile, and a host of underlying configurations that can inject chaos into an otherwise perfect test suite.

Standardise Your CI Runners
Consistency is your first line of defence. If every CI job spins up in a slightly different environment, you're essentially rolling the dice with every commit. The goal here is to make the CI environment as predictable and reproducible as your local machine.
The most effective way to achieve this is through containerisation. By defining your entire testing environment in a specific Docker image, you lock down the exact versions of the operating system, browsers, Node.js, and every other dependency. This completely eliminates that "works on my machine" class of bugs because the CI environment is the local environment.
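As a sketch of what this looks like in practice, a GitHub Actions job can run entirely inside Playwright's official Docker image. The tag below is illustrative — pin it to the exact Playwright version your project actually uses:

```yaml
# Illustrative GitHub Actions job — pin the image tag to your own Playwright version
jobs:
  e2e:
    runs-on: ubuntu-latest
    container:
      image: mcr.microsoft.com/playwright:v1.44.0-jammy
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright test
```

Because the browsers and OS libraries live inside the pinned image, the job runs against the same environment every single time.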
A few key areas to standardise:
- Browser Versions: Always explicitly define the browser version in your CI configuration. A minor patch update pushed by a cloud provider can—and will—introduce subtle changes that break your tests.
- System Resources: Make sure your CI runners have enough CPU and memory. Resource starvation is a massive source of flakiness; a choked machine slows everything down, causing your carefully crafted waits and assertions to time out.
- Network Conditions: This one’s trickier to control, but just being aware that CI runners often have higher network latency is important. It really drives home the point that you need to avoid fixed delays and wait for application state or network events instead.
Implement Smart Automatic Retries
Now, let's talk about automatic retries. It's a controversial subject, and for good reason. When used carelessly, retries just paper over deep-rooted problems. But when applied strategically, they are a fantastic tool for weathering genuine, one-off CI hiccups—like a momentary network blip or a brief server load spike.
Here's a hard-won lesson: only retry on CI, and never retry more than once. A single retry is perfect for absorbing random environmental noise. Any more than that, and you're just masking a real bug that needs fixing.
Most modern testing frameworks like Playwright have this feature built-in. You can configure retries directly in your playwright.config.ts file and, crucially, tell it to only apply during CI runs.
```typescript
// In playwright.config.ts
retries: process.env.CI ? 1 : 0,
```
This simple configuration strikes the perfect balance. It gives you a safety net against rare infrastructure flakiness without letting developers ignore real, reproducible issues on their local machines.
Optimise Artifact Collection for Debugging
When a test inevitably fails in CI, you need evidence. But collecting artifacts like videos and trace files for every single test run is a waste of resources. It chews through storage and can even slow down your pipeline.
The best strategy is to only collect artifacts for failed tests, and specifically on the final retry attempt. This approach gives you a clean, actionable report for every genuine failure without cluttering your pipeline with data from successful runs or initial flaky attempts.
Here’s how you’d set this up in Playwright:
```typescript
// In playwright.config.ts, under the 'use' section
screenshot: 'only-on-failure',
video: 'retain-on-failure',
trace: 'retain-on-failure',
```
With this in place, a failure notification comes with a complete diagnostic package, making the whole debugging process worlds faster.
Parallelise Tests Without Resource Contention
Parallelisation is non-negotiable for keeping a growing test suite fast. But if you’re not careful, running too many tests at once can introduce a new source of flakiness: resource contention. Trying to run 10 browser instances on a CI runner with only two CPU cores will bring the machine to a grinding halt, leading to timeouts and all sorts of random failures.
The trick is to match your level of parallelisation to your CI runner’s resources. If your runners have four cores, start by setting your worker count to four. From there, you can tweak the number to find the sweet spot that maximises speed without overwhelming the system. It’s a delicate balance, but getting it right is critical if you want a CI process that is both fast and reliable—which is the ultimate goal when you need to fix flaky end-to-end tests.
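A reasonable starting point is to derive the worker count from the machine's cores, as sketched below. The `workerCount` helper and the cap of 4 are illustrative choices, not a standard:

```typescript
import * as os from 'os';

// Match the worker count to the runner's cores, with an illustrative cap.
// Note: os.cpus().length can over-report inside containers; on Node 18.14+,
// os.availableParallelism() is an alternative worth considering.
function workerCount(cap: number = 4): number {
  const cores = os.cpus().length;
  return Math.max(1, Math.min(cores, cap));
}

console.log(workerCount());
```

You could then wire this into `playwright.config.ts` as something like `workers: process.env.CI ? workerCount() : undefined`, keeping Playwright's default behaviour on local machines.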
Future-Proofing Your Test Suite with Monitoring and AI
Putting out fires is exhausting. Once you've squashed the flaky tests you have today, the real victory is preventing them from flaring up again tomorrow. This requires a mental shift—moving away from a reactive "fix-it" mode and towards a proactive, "future-proofing" mindset.
It's all about building a system that not only spots instability early but also makes your tests fundamentally more resilient from the get-go. This journey starts with solid monitoring. You need to treat the health of your test suite with the same seriousness you give to your application's performance. Tracking the right data points helps you see trends and nip problems in the bud, long before they bring your entire CI pipeline to a grinding halt.
Building Your Test Health Dashboard
You can't fix what you can't see. Setting up a simple dashboard to visualise your test suite's health is a genuine game-changer. It transforms that vague, frustrating feeling of "flakiness" into hard data your team can actually work with.
To get started, focus on tracking a few key metrics over time:
- Pass Rate Stability: Don't just glance at the overall pass rate. Dig deeper and track the pass rate of individual tests over their last 10-20 runs. A test that bounces between passing and failing is practically waving a red flag for flakiness.
- Test Duration Creep: Keep an eye on the average execution time for your tests. Has a specific test suddenly started taking longer? That spike could mean it's waiting on a slow resource, a classic prelude to a timeout-related failure.
- CI Retry Frequency: How often is your CI system automatically retrying failed tests? A high retry count is a massive warning sign that you're just wallpapering over cracks instead of fixing the foundations.
This data acts as your early warning system, giving you the chance to tackle instability before it erodes your team's confidence in the test suite.
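To make "duration creep" concrete, here's a small sketch of the kind of check a dashboard job might run over a test's recent timings. The `hasDurationCreep` name and the 1.5x threshold are arbitrary illustrations, not an industry standard:

```typescript
// Flag a test whose latest run is much slower than its recent average.
// durations: execution times in ms, oldest first.
function hasDurationCreep(durations: number[], threshold: number = 1.5): boolean {
  if (durations.length < 2) return false;
  const latest = durations[durations.length - 1];
  const earlier = durations.slice(0, -1);
  const avg = earlier.reduce((sum, d) => sum + d, 0) / earlier.length;
  return latest > avg * threshold;
}

// A test that jumps from ~1s to 2.4s gets flagged before it starts timing out
console.log(hasDurationCreep([1000, 1100, 1050, 2400]));
```

Tune the threshold to your suite's natural variance; the point is to surface the slowdown while it's still a warning, not yet a timeout failure.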
Creating a Quarantine for Unstable Tests
Once your monitoring flags a test as consistently flaky, you have to act decisively to protect the integrity of your main branch. A "quarantine" process is the best way I’ve found to handle this without just deleting the test and losing its coverage.
A flaky test that's allowed to run on the main branch is a broken window. It signals that unreliability is acceptable, which slowly corrodes the entire team's standards for quality.
The process is straightforward but incredibly effective.
First, you isolate the offender. The moment a test is confirmed as flaky (for example, failing 2 out of 5 consecutive runs without any code changes), it gets moved out of the main CI pipeline immediately.
Next, you assign clear ownership. A high-priority ticket is created and assigned to an engineer. The test isn't forgotten; it's now a tracked piece of technical debt that needs to be paid down.
Finally, the assigned engineer has a clear mission: either fix the root cause of the flakiness for good or, if the test provides little value, make the call to delete it. This quarantine strategy keeps your main pipeline a reliable source of truth, unblocking developers while making sure flaky tests are dealt with systematically.
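The "2 out of 5" trigger from the first step can be sketched as a simple check over a test's recent pass/fail history. The `shouldQuarantine` name and parameters here are illustrative:

```typescript
// results: pass/fail history for one test, most recent last (true = pass).
// Quarantine when more than maxFailures of the last `window` runs failed —
// with the defaults, that's the "2 out of 5" rule described above.
function shouldQuarantine(
  results: boolean[],
  window: number = 5,
  maxFailures: number = 1
): boolean {
  const recent = results.slice(-window);
  const failures = recent.filter(passed => !passed).length;
  return failures > maxFailures;
}

// Two failures in the last five runs trips the quarantine
console.log(shouldQuarantine([true, false, true, true, false]));
```

A small script like this, run against your CI results after each pipeline, is enough to open the quarantine ticket automatically rather than relying on someone noticing.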
Embracing AI for Inherently Resilient Tests
While monitoring and quarantines help you manage existing flakiness, the next frontier is building tests that are resistant to it from their very creation. This is where AI-powered tools are starting to make a real difference, fundamentally changing how we approach test authoring.
Instead of writing brittle code, you can use tools like e2eAgent.io to describe a test scenario in plain English. For example: "Click the login button, enter the user's credentials, and verify the dashboard appears." The AI agent takes these instructions and intelligently handles the tricky implementation details—like finding robust selectors and waiting for the right UI states to appear—which are the most common sources of flakiness.
You can learn more about this approach in our guide to end-to-end testing with AI.
This method helps future-proof your suite because the tests aren't so tightly coupled to the DOM structure. If a button's ID or class changes during a refactor, a smart AI can often adapt, because it understands the intent ("click the login button") rather than relying on a fragile selector. By letting your team focus on what needs to be tested, not how, you can build a suite that's not only faster to create but dramatically more robust and easier to maintain in the long run.
Common Questions About Fixing Flaky Tests
Even with a solid game plan, you're bound to run into some recurring questions when you start squashing flaky tests. Let's tackle some of the most common hurdles I've seen teams face, with practical answers to get you back on track.
How Many Times Should a Test Fail Before I Consider It Flaky?
There isn't a single magic number here, but my go-to is the "2 out of 5" rule. If a test fails twice within five consecutive runs on your main branch—with no code changes to explain it—you’ve likely got a flaky one on your hands.
Sure, a single random failure can be a fluke. The server might have blipped, or a network resource could have timed out. But when you see a repeating pattern, it points to a genuine, underlying issue with the test itself or the environment it runs in. The key is to flag it early before it destroys your team’s confidence in the entire test suite.
The goal isn't just a green build; it's a reliable signal. A test that passes 80% of the time is just as untrustworthy as one that fails 80% of the time. Both just create noise and hide real regressions.
Is It Ever Okay to Use a Fixed Wait or 'Sleep' in a Test?
Almost never. I see this all the time, and it’s a major anti-pattern. Dropping a fixed wait like cy.wait(2000) or await page.waitForTimeout(2000) into your code is a direct path to flakiness.
It’s a gamble that only ends one of two bad ways: either you slow down your entire test suite for no reason, or the wait isn't long enough when the application is under load, causing the test to fail anyway.
The professional, robust solution is to always wait for a specific condition. Your test needs to intelligently poll for a verifiable change in the application's state.
- Wait for an element to become visible or disappear.
- Wait for a specific network request to complete successfully.
- Wait until a piece of text appears on the screen.
This approach makes your tests as fast as possible while building in resilience against unpredictable load times.
My Tests Are Still Flaky After Trying Everything. What Now?
So you’ve perfected your locators, implemented smart waits, and meticulously isolated your test data, but the flakiness persists. When you get to this point, the problem often isn't your code anymore—it's the sheer complexity you're fighting. This is a common wall for teams to hit, where they realise they’re spending more time nursing tests than shipping features.
This is exactly where a tool like e2eAgent.io can be a game-changer. Instead of writing and maintaining brittle code, you provide plain English instructions like, "Click the login button and verify the dashboard appears." An AI agent handles the complex selector logic and dynamic waits behind the scenes, even adapting to minor UI changes on its own. It lets you shift your focus from how the test runs to what it should verify, which can slash your maintenance overhead.
How Do I Convince My Team to Invest Time in Fixing Flaky Tests?
You have to frame the conversation around cost and velocity, not just technical debt. To get buy-in, you need to translate the engineering pain into clear business impact.
Start tracking the hours your team loses re-running failed CI jobs, manually checking flaky results, and debugging false alarms. Present that data as a direct hit to your feature delivery timeline. For example: "Last month, we lost 20 engineer-hours to flaky tests, which pushed our next major release back by three days."
When you quantify the pain like this, an engineering headache becomes a compelling business case. It's no longer just about a developer's annoyance; it's about shipping valuable features faster and with much higher confidence.
Tired of battling brittle test code? e2eAgent.io lets you write stable end-to-end tests in plain English, allowing AI to handle the flaky parts for you. Stop maintaining, start testing. Learn more at https://e2eagent.io.
