
Issues loading Rippling

Incident Report for Rippling

Postmortem

Summary

Rippling experienced a major platform outage on Monday, August 4, 2025, between 12:35 PM and 2:08 PM PDT. The outage began when we enabled a performance monitoring feature on our core primary database as part of an investigation into improving database efficiency. The feature pushed the database past its load limit, consuming the capacity needed to serve the full volume of regular requests that power the site. The result was slow load times, errors, and widespread inaccessibility of Rippling and integrated third-party apps.

We partially mitigated the issue by reducing workloads running against the database and increasing the capacity of the database. Gaps in our telemetry prevented us from more quickly identifying the root cause. We finally mitigated the issue when we disabled the performance monitoring feature.

Since then, we’ve taken immediate steps to understand the behavior of this performance profiler (and how it appears in operational dashboards), improve monitoring, and reduce the baseline load on the database. After identifying the root cause, we introduced safeguards to prevent the profiler from being inadvertently enabled in the future.

The system remains stable, and the database has not exhibited this issue since the incident.

Timeline

Timestamp | Event | Elapsed time (min)
Aug 4 2025 12:23 PM PDT | Profiler configured to profile all queries | N/A
Aug 4 2025 12:35 PM PDT | Site began exhibiting elevated latency | 0
Aug 4 2025 12:39 PM PDT | Incident initiated and the team commenced debugging efforts | 4
Aug 4 2025 12:40 PM PDT | Database identified as the source of the issue, and relevant personnel were engaged | 5
Aug 4 2025 1:21 PM PDT | Commencement of background task load shedding | 46
Aug 4 2025 1:37 PM PDT | Primary failover initiated | 62
Aug 4 2025 2:05 PM PDT | Write IOPS capacity doubled | 90
Aug 4 2025 2:08 PM PDT | Profiler reset to its normal, safe level; site restored | 93

Root Cause

The root cause was the activation of a 100% query profiler on the core primary database by a debugging process. With this setting, the database analyzed, profiled, and wrote performance statistics for every query it executed to a collection inside the database itself. These internal writes consumed nearly all of the database's write capacity, starving the regular write operations required to serve Rippling's site.
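The write amplification at work here can be illustrated with a toy model. This is not the vendor's implementation (the report does not name the database); it assumes only the behavior described above: with full profiling on, every application query triggers one additional internal write, so the profiler's writes compete with regular traffic for the same write IOPS budget. All numbers below are hypothetical.

```python
# Toy model of the write amplification described above (illustrative only;
# not the vendor's actual implementation or Rippling's real numbers).
# Assumption: with full profiling enabled, each application operation also
# writes one stats document to an internal collection.

def effective_write_load(app_ops_per_sec: int, profiling_all: bool) -> int:
    """Total write operations the database must absorb per second."""
    profiler_writes = app_ops_per_sec if profiling_all else 0
    return app_ops_per_sec + profiler_writes

def capacity_headroom(write_iops_limit: int, app_ops_per_sec: int,
                      profiling_all: bool) -> int:
    """Remaining write IOPS available for regular traffic (negative = starved)."""
    return write_iops_limit - effective_write_load(app_ops_per_sec, profiling_all)

# Hypothetical sizing: a database with ~40% headroom at normal load.
limit, normal_load = 10_000, 6_000
print(capacity_headroom(limit, normal_load, profiling_all=False))  # 4000: healthy
print(capacity_headroom(limit, normal_load, profiling_all=True))   # -2000: starved
```

Under this model, a database comfortably sized for normal traffic goes from 40% headroom to a deficit the moment full profiling is enabled, which matches the starvation described above.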

Our current database vendor offers limited observability features, which makes configuration changes like this one difficult to detect. The vendor's support team had no tooling that could provide visibility into the situation, so we had to engage members of the vendor's core engineering team to analyze and debug the root cause. We conducted our own root cause analysis and arrived at the same conclusion as the vendor. Both analyses took two weeks to reach a definitive conclusion.

Resolution

Upon detecting the database degradation at 12:39 PM PDT on Aug 4, the incident response team immediately declared a SEV-1 incident and focused on mitigation.

After mitigating the incident and restoring site availability at 2:08 PM PDT on Aug 4, the incident response team took the following corrective and preventative actions:

  • Operationalized a load-shedding runbook to be activated upon detection of similar degradation. The runbook sheds background and write loads on the database, aiming to stabilize it as quickly as possible and shorten outage windows.
  • Offloaded high-traffic, less critical datasets from the core database to another database.
  • Reduced transaction timeouts to minimize lock contention and preserve database capacity under heavy write load.
  • Redirected a significant portion of read traffic to read replicas to reduce the baseline load on the primary database.
  • Optimized a critical data model, reducing its data size by approximately 50%.
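The load-shedding runbook above is an operational procedure, not published code, but its core decision can be sketched. The sketch below is a minimal illustration under two assumptions that go beyond the report: that background work flows through pausable queues, and that a database write-latency metric is available. Queue names and thresholds are hypothetical.

```python
# Hypothetical sketch of the load-shedding decision in such a runbook.
# Assumptions (not from the report): pausable background queues exist and
# a write-latency metric is observable. All thresholds are illustrative.

SHED_THRESHOLD_MS = 200.0    # write latency at which background work is shed
RESUME_THRESHOLD_MS = 50.0   # latency at which background work may resume

def plan_load_shedding(write_latency_ms: float,
                       background_queues: list[str]) -> dict:
    """Decide which background queues to pause given current DB write latency."""
    if write_latency_ms >= SHED_THRESHOLD_MS:
        return {"action": "pause", "queues": background_queues}
    if write_latency_ms <= RESUME_THRESHOLD_MS:
        return {"action": "resume", "queues": background_queues}
    # Between thresholds: hysteresis band, change nothing to avoid flapping.
    return {"action": "hold", "queues": []}

print(plan_load_shedding(350.0, ["reports", "sync_jobs"]))
# → {'action': 'pause', 'queues': ['reports', 'sync_jobs']}
```

The gap between the two thresholds deliberately prevents the system from rapidly toggling background work on and off while latency hovers near the limit.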

They also maintained 24x7 vigilance until the root cause was confirmed, ready to detect and immediately respond to any recurrence.

Concurrently, Rippling infrastructure engineers and the database vendor's support and core engineering teams began working on the root cause analysis. The vendor delivered its root cause analysis on Aug 15; its findings agreed with our independent analysis, which concluded on Aug 14.

Once the root cause was identified, we implemented immediate safeguards to prevent the same configuration from being inadvertently applied to the database again.

We are actively working with the vendor, who assured us of their commitment to address these observability gaps.

Action Items

We are committing to the following immediate action items:

  • Upgrading to a later version of the database server, which offers better visibility into profiler changes
  • Collaborating with our vendor to improve observability into such changes and to implement corresponding alerts
  • Auditing all code paths that could activate this profiling behavior and implementing safeguards against accidental activation
Posted Aug 29, 2025 - 15:17 UTC

Resolved

This incident has been resolved.
Posted Aug 07, 2025 - 14:17 UTC

Update

We are continuing to monitor for any further issues.
Posted Aug 05, 2025 - 15:59 UTC

Update

We are continuing to monitor for any issues. Customers using custom workflows will see delays in action execution.
Posted Aug 04, 2025 - 21:47 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Aug 04, 2025 - 21:37 UTC

Update

We are continuing to work on a fix for this issue.
Posted Aug 04, 2025 - 21:27 UTC

Update

We are continuing to work on a fix for this issue.
Posted Aug 04, 2025 - 21:19 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Aug 04, 2025 - 21:08 UTC

Investigating

We are currently investigating this issue.
Posted Aug 04, 2025 - 19:53 UTC
This incident affected: Rippling App and Unity (Workflow Automator).