On-Call Automation Platform (Ops + AI)

Role: Lead for Problem Solving and Solution Implementation
Duration: Dec 2023 – Jun 2024
Company: Meituan (Beijing Sankuai Online Technology Co., Ltd.)
Tech Stack: RAG, Vector Embeddings, Semantic Search, Node.js, Real-time Dashboards, Monitoring Systems


Overview

Built an internal orchestration platform to unify alerts, monitoring, and resolution workflows for on-call operations. The system indexes past incidents as a searchable knowledge corpus for Retrieval-Augmented Generation (RAG). It reduced manual consultation workload by 50% and operational load by 37%, and improved operational safety.


Problem Statement

On-call management within the team was ad hoc, with no effective tracking system. This led to:

  • Delays in incident response
  • Inefficiencies in problem resolution
  • Increased operational costs
  • Lack of traceability for incidents and resolutions
  • High manual consultation workload for troubleshooting

Solution Architecture

Unified Orchestration Platform

  • Built a centralized platform to unify alerts, monitoring, and resolution workflows
  • Integrated with existing monitoring and alerting systems
  • Real-time dashboards for incident visibility and tracking
  • Automated anomaly detection to improve response speed
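
As a rough illustration of what unifying alerts from several systems can involve, the sketch below maps two source-specific payloads onto one common record. The source names (metrics, logs), the payload fields, and the Alert schema are all hypothetical, chosen only for this example. Python is used for brevity and is not a statement about the platform's own code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Common alert record (hypothetical schema)."""
    source: str
    service: str
    severity: str  # one of "critical", "warning", "info"
    message: str


# Hypothetical per-source severity vocabularies mapped onto one scale.
_SEVERITY_MAP = {
    "metrics": {"P0": "critical", "P1": "warning", "P2": "info"},
    "logs": {"ERROR": "critical", "WARN": "warning", "INFO": "info"},
}


def normalize(source: str, payload: dict) -> Alert:
    """Convert a source-specific payload into the common Alert schema."""
    if source == "metrics":
        level, service, message = payload["priority"], payload["app"], payload["summary"]
    elif source == "logs":
        level, service, message = payload["level"], payload["service"], payload["text"]
    else:
        raise ValueError(f"unknown alert source: {source}")
    # Unrecognized levels fall back to the lowest severity.
    severity = _SEVERITY_MAP[source].get(level, "info")
    return Alert(source, service, severity, message)


alerts = [
    normalize("metrics", {"priority": "P0", "app": "checkout", "summary": "latency p99 > 2s"}),
    normalize("logs", {"level": "WARN", "service": "search", "text": "retry budget low"}),
]
```

Once every alert shares one schema, dashboards and routing rules can be written against that schema rather than against each monitoring system separately.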

RAG-Based Incident Knowledge System

  • Indexed past incidents as a searchable corpus for Retrieval-Augmented Generation (RAG)
  • Used vector embeddings for semantic search across incident history
  • Enabled natural language queries to find similar past incidents and solutions
  • Automated knowledge base updates from resolved incidents

Operational Improvements

  • Improved traceability through comprehensive logging and dashboards
  • Reduced manual consultation by enabling self-service incident resolution
  • Streamlined on-call workflows with automated routing and prioritization

Technical Implementation

RAG Architecture

  • Vector Embeddings: Converted incident descriptions, error messages, and resolutions into embeddings
  • Semantic Search: Enabled similarity-based retrieval of relevant past incidents
  • Knowledge Base: Maintained a continuously updated corpus of incident-solution pairs
  • Query Interface: Natural language interface for searching incident knowledge
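
The retrieval step described above can be sketched in a few lines. This is a minimal example under stated assumptions: the embed function is a toy bag-of-words stand-in for a learned embedding model, the corpus entries are invented, and the final generation step (passing retrieved records to a language model as context) is omitted.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical incident-solution pairs; a real corpus would come from resolved tickets.
CORPUS = [
    {"incident": "redis connection timeout on order service",
     "resolution": "raise pool size and restart proxy"},
    {"incident": "disk usage above 90 percent on log host",
     "resolution": "rotate logs and expand volume"},
    {"incident": "payment api returns 502 after deploy",
     "resolution": "roll back release"},
]

# Embed each incident once so that queries only need to embed the query text.
INDEX = [(embed(doc["incident"]), doc) for doc in CORPUS]


def search(query: str, top_k: int = 1) -> list[dict]:
    """Return the top_k corpus entries most similar to the query."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:top_k]]


best = search("order service timeout connecting to redis")[0]
print(best["resolution"])
```

Swapping the toy embed function for a real embedding model changes the vectors but not the overall shape: embed the corpus once, embed each query, and rank by similarity.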

Monitoring & Alerting Integration

  • Integrated with the company's internal monitoring systems
  • Real-time alert aggregation and prioritization
  • Automated anomaly detection using statistical models
  • Dashboard visualization for incident trends and patterns
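
The statistical models used for anomaly detection are not specified here. One common and simple choice is a z-score test against recent history, sketched below as an assumption rather than as the platform's actual method. The threshold of 3.0 and the sample latency values are illustrative.

```python
from statistics import mean, stdev


def is_anomaly(history: list[float], value: float, threshold: float = 3.0) -> bool:
    """Flag value if it lies more than `threshold` sample standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Flat history: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical latency samples in milliseconds.
baseline = [120.0, 118.0, 125.0, 122.0, 119.0, 121.0]

print(is_anomaly(baseline, 123.0))  # within normal variation
print(is_anomaly(baseline, 180.0))  # far outside the baseline
```

A z-score test assumes roughly stable, symmetric metrics; seasonal or heavy-tailed signals usually need a different model.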

Workflow Automation

  • Automated incident routing based on severity and type
  • Prioritization algorithms for efficient resource allocation
  • Post-incident analysis and knowledge extraction
  • Automated knowledge base updates
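
Routing by type and prioritizing by severity can be sketched with a lookup table and a priority queue. The team names, incident types, and severity ranks below are hypothetical; the actual routing rules and prioritization algorithms are not described in this write-up.

```python
import heapq
from dataclasses import dataclass, field

SEVERITY_RANK = {"critical": 0, "warning": 1, "info": 2}  # lower rank is handled first
ROUTES = {"database": "dba-oncall", "network": "netops-oncall"}  # hypothetical teams
DEFAULT_ROUTE = "general-oncall"


@dataclass(order=True)
class QueuedIncident:
    rank: int  # severity rank, primary sort key
    seq: int   # arrival order, breaks ties within a severity
    title: str = field(compare=False)
    team: str = field(compare=False)


class IncidentQueue:
    """Routes each incident to a team and serves the most severe one first."""

    def __init__(self) -> None:
        self._heap: list[QueuedIncident] = []
        self._seq = 0

    def submit(self, title: str, kind: str, severity: str) -> None:
        team = ROUTES.get(kind, DEFAULT_ROUTE)
        heapq.heappush(self._heap, QueuedIncident(SEVERITY_RANK[severity], self._seq, title, team))
        self._seq += 1

    def next(self) -> QueuedIncident:
        return heapq.heappop(self._heap)


queue = IncidentQueue()
queue.submit("slow dashboard", "frontend", "info")
queue.submit("replica lag", "database", "critical")
queue.submit("packet loss", "network", "warning")

first = queue.next()
print(first.title, first.team)
```

The sequence counter keeps incidents of equal severity in arrival order, so the queue never needs to compare titles or team names.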

Key Achievements

  • 37% reduction in operational load
  • 50% reduction in manual consultation workload
  • Improved traceability and response speed through dashboards
  • Reduced time-to-detection through automated anomaly detection
  • Enhanced operational safety through knowledge reuse

Impact

The platform transformed on-call operations from a reactive, manual process into a proactive, automated system. By combining automated detection and routing with RAG-based knowledge retrieval, the team was able to:

  • Respond to incidents faster with automated detection and routing
  • Reduce operational overhead through self-service knowledge access
  • Improve system reliability by learning from past incidents
  • Scale operations without a proportional increase in manual effort