Streamlining Incident Response with AI-Assisted Root Cause Analysis

Meta is advancing its suite of investigation tools with an AI-assisted root cause analysis system that leverages a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations.

Incident response is a critical component of ensuring system reliability. At Meta, we’re investing in advancing our suite of investigation tools to mitigate issues quickly. Our latest innovation is an AI-assisted root cause analysis system that leverages a combination of heuristic-based retrieval and large language model-based ranking to speed up root cause identification during investigations.

Our testing has shown that, for investigations related to our web monorepo, this new system identifies the root cause with 42% accuracy at the time the investigation is created. This is a significant improvement over traditional methods, which can be time-consuming and complex.
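One plausible reading of this figure is a top-k hit rate: the fraction of investigations whose known root cause appears among the system's suggestions. The article does not define the metric precisely, so the sketch below is only an assumption, and the function name and arguments are hypothetical.

```python
def top_k_accuracy(suggestions, true_root_causes, k=5):
    """Fraction of investigations whose known root cause is in the top k.

    suggestions: one ranked list of change IDs per investigation.
    true_root_causes: the known root-cause change ID per investigation.
    Both the metric definition and these names are assumptions.
    """
    if len(suggestions) != len(true_root_causes):
        raise ValueError("need one ranked list per known root cause")
    if not suggestions:
        return 0.0
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(suggestions, true_root_causes)
    )
    return hits / len(suggestions)
```

Evaluating against historical investigations with known root causes keeps the measurement independent of how the suggestions were produced.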

The Challenges of Investigating Issues

Investigating issues in systems that depend on monolithic repositories is hard to scale, because the number of changes, contributed by many teams, keeps growing. Additionally, responders need to build context before they can start working on an investigation, e.g., what is broken, which systems are involved, and who might be impacted. These challenges can make investigating anomalies a complex and time-consuming process.

Our Approach to Root Cause Isolation

Our system incorporates a novel heuristic-based retriever that reduces the search space from thousands of changes to a few hundred without a significant reduction in accuracy. We then rely on a large language model-based ranker to identify the root cause among those few hundred changes relevant to the ongoing investigation.
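As a rough illustration of how heuristics can narrow the search space, the sketch below keeps only changes that landed shortly before the incident and that touch a directory associated with the affected system. The `Change` type, its field names, the 24-hour lookback, and both heuristics are assumptions made for this example; the article does not say which heuristics the retriever actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """Hypothetical record for a code change; field names are illustrative."""
    change_id: str
    landed_at: datetime
    touched_dirs: frozenset


def retrieve_candidates(changes, incident_start, affected_dirs,
                        lookback=timedelta(hours=24), limit=300):
    """Keep changes that landed within `lookback` of the incident and that
    touch at least one directory in `affected_dirs` (a set of paths)."""
    window_start = incident_start - lookback
    candidates = [
        c for c in changes
        if window_start <= c.landed_at <= incident_start
        and c.touched_dirs & affected_dirs
    ]
    # Most recent changes first, capped at `limit` entries.
    candidates.sort(key=lambda c: c.landed_at, reverse=True)
    return candidates[:limit]
```

Cheap filters like these run over thousands of changes quickly, which leaves the more expensive model-based ranking with a much smaller input.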

The ranker uses a Llama model to further reduce the search space from hundreds of potential code changes to a list of the top five. We explored different ranking algorithms and prompting scenarios and found that ranking through election was the most effective way to accommodate context window limitations while still enabling the model to reason across different changes.
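A minimal sketch of ranking through election, under stated assumptions: candidates are split into batches small enough to fit in one prompt, each batch elects its top few, and the winners compete again until only five remain. The batch size of 20 and the `pick_top` callable (a stand-in for a prompt to the model) are hypothetical and not taken from the article.

```python
from typing import Callable, List, Sequence


def rank_by_election(
    candidates: Sequence[str],
    pick_top: Callable[[Sequence[str], int], List[str]],
    batch_size: int = 20,
    top_k: int = 5,
) -> List[str]:
    """Run elimination rounds until at most `top_k` candidates remain.

    `pick_top(batch, k)` stands in for a model call that returns the k
    most likely root causes within one batch (an assumed interface).
    """
    if batch_size <= top_k:
        raise ValueError("batch_size must exceed top_k so each round shrinks the pool")
    pool = list(candidates)
    while len(pool) > top_k:
        winners: List[str] = []
        for start in range(0, len(pool), batch_size):
            batch = pool[start:start + batch_size]
            # Truncate defensively so a misbehaving picker cannot stall the loop.
            winners.extend(pick_top(batch, top_k)[:top_k])
        pool = winners
    return pool
```

Because each prompt only ever sees one batch, the context window bounds the batch size rather than the total number of candidates.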

Training and Fine-Tuning

The biggest lever for achieving 42% accuracy was fine-tuning a Llama 2 (7B) model using historical investigations for which we knew the underlying root cause. We started by running continued pre-training (CPT) using limited and approved internal wikis, Q&As, and code to expose the model to Meta artifacts. Later, we ran a supervised fine-tuning (SFT) phase where we mixed Llama 2’s original SFT data with more internal context and a dedicated investigation root cause analysis (RCA) SFT dataset to teach the model to follow RCA instructions.
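To make the idea of an RCA SFT dataset more concrete, the sketch below turns one resolved investigation into a prompt/completion pair. The prompt wording, the field names, and the `build_rca_example` helper are all hypothetical; the article does not describe the actual dataset format.

```python
import json


def build_rca_example(title, description, candidate_changes, root_cause_id):
    """Build one instruction-tuning example from a resolved investigation.

    candidate_changes: list of (change_id, summary) tuples.
    root_cause_id: the change known to have caused the issue.
    The prompt template here is an illustrative assumption.
    """
    ids = [change_id for change_id, _ in candidate_changes]
    if root_cause_id not in ids:
        raise ValueError("known root cause must be among the candidates")
    listing = "\n".join(f"- {cid}: {summary}" for cid, summary in candidate_changes)
    prompt = (
        f"Investigation: {title}\n"
        f"Details: {description}\n"
        f"Candidate changes:\n{listing}\n"
        "Which change is the most likely root cause?"
    )
    return {"prompt": prompt, "completion": root_cause_id}


def to_jsonl_line(example):
    """Serialize one example as a single JSON Lines record."""
    return json.dumps(example, ensure_ascii=False)
```

Only information available when the investigation starts should go into the prompt, so that training matches how the model is used at investigation creation time.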

The Future of AI-Assisted Investigations

The application of AI in this context presents both opportunities and risks. For instance, it can significantly reduce the effort and time needed to find the root cause of an investigation, but it can also suggest wrong root causes and mislead engineers. To mitigate this, we ensure that all employee-facing features prioritize closed feedback loops and explainability of results.

By integrating AI-based systems into our internal tools, we’ve successfully leveraged them for tasks like onboarding engineers to investigations and root cause isolation. Looking ahead, we envision expanding the capabilities of these systems to autonomously execute full workflows and validate their results. Additionally, we anticipate that we can further streamline the development process by utilizing AI to detect potential incidents prior to code push, thereby proactively mitigating risks before they arise.