On-Call Automation Platform (Ops + AI)
Role: Lead for Problem Solving and Solution Implementation
Duration: Dec 2023 – Jun 2024
Company: Meituan (Beijing Sankuai Online Technology Co., Ltd.)
Tech Stack: RAG, Vector Embeddings, Semantic Search, Node.js, Real-time Dashboards, Monitoring Systems
Overview
Built an internal orchestration platform that unifies alerts, monitoring, and resolution workflows for on-call operations. The system uses retrieval-augmented generation (RAG) to turn past incidents into a searchable knowledge corpus, reducing manual troubleshooting effort (a 50% drop in manual consultation workload) and improving operational safety.
Problem Statement
On-call management within the team was ad hoc, with no effective system for tracking incidents. This led to:
- Delays in incident response
- Inefficiencies in problem resolution
- Increased operational costs
- Lack of traceability for incidents and resolutions
- High manual consultation workload for troubleshooting
Solution Architecture
Unified Orchestration Platform
- Built a centralized platform to unify alerts, monitoring, and resolution workflows
- Integrated with existing monitoring and alerting systems
- Real-time dashboards for incident visibility and tracking
- Automated anomaly detection to improve response speed
RAG-Based Incident Knowledge System
- Indexed past incidents into a searchable corpus that backs RAG-based retrieval
- Used vector embeddings for semantic search across incident history
- Enabled natural language queries to find similar past incidents and solutions
- Automated knowledge base updates from resolved incidents
Operational Improvements
- Improved traceability through comprehensive logging and dashboards
- Reduced manual consultation by enabling self-service incident resolution
- Streamlined on-call workflows with automated routing and prioritization
Technical Implementation
RAG Architecture
- Vector Embeddings: Converted incident descriptions, error messages, and resolutions into embeddings
- Semantic Search: Enabled similarity-based retrieval of relevant past incidents
- Knowledge Base: Maintained a continuously updated corpus of incident-solution pairs
- Query Interface: Natural language interface for searching incident knowledge
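The retrieval step described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the IncidentRecord shape and the function names are hypothetical, and embed() is a toy bag-of-words stand-in for a learned embedding model, used only so the example runs without external services.

```typescript
// Hypothetical record shape; the field names are illustrative assumptions.
interface IncidentRecord {
  id: string;
  description: string;
  resolution: string;
}

interface SearchHit {
  incident: IncidentRecord;
  score: number;
}

// Sparse vector: token -> weight.
type SparseVector = Map<string, number>;

// Toy stand-in for an embedding model: counts ASCII word tokens.
// A real system would call a learned embedding model here instead.
function embed(text: string): SparseVector {
  const vector: SparseVector = new Map<string, number>();
  const tokens = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  for (const token of tokens) {
    vector.set(token, (vector.get(token) ?? 0) + 1);
  }
  return vector;
}

function vectorNorm(v: SparseVector): number {
  let sumOfSquares = 0;
  v.forEach((weight) => {
    sumOfSquares += weight * weight;
  });
  return Math.sqrt(sumOfSquares);
}

// Cosine similarity between two sparse vectors; 0 if either is empty.
function cosineSimilarity(a: SparseVector, b: SparseVector): number {
  let dot = 0;
  a.forEach((weight, token) => {
    dot += weight * (b.get(token) ?? 0);
  });
  const denominator = vectorNorm(a) * vectorNorm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Returns the topK incidents most similar to the query, best match first.
function searchIncidents(
  query: string,
  corpus: IncidentRecord[],
  topK: number
): SearchHit[] {
  const queryVector = embed(query);
  return corpus
    .map((incident) => ({
      incident,
      score: cosineSimilarity(
        queryVector,
        embed(incident.description + " " + incident.resolution)
      ),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

In a RAG setup, the top-ranked incident-solution pairs would then be passed to a language model as context for answering the on-call engineer's question. The toy tokenizer only handles ASCII words, so a real embedding model would also be needed for non-English incident text.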
Monitoring & Alerting Integration
- Integrated with the company's internal monitoring systems
- Real-time alert aggregation and prioritization
- Automated anomaly detection using statistical models
- Dashboard visualization for incident trends and patterns
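The write-up does not say which statistical models were used for anomaly detection. As one common possibility, the sketch below flags a metric sample when its z-score against a trailing window exceeds a threshold. The function name, window size, and threshold are illustrative assumptions.

```typescript
// Returns the indices of samples that deviate from the preceding
// `windowSize` samples by more than `threshold` standard deviations.
// Assumption: a trailing-window z-score; the real platform's model is unknown.
function detectAnomalies(
  values: number[],
  windowSize: number,
  threshold: number
): number[] {
  const anomalies: number[] = [];
  for (let i = windowSize; i < values.length; i++) {
    const baseline = values.slice(i - windowSize, i);
    const mean = baseline.reduce((sum, v) => sum + v, 0) / windowSize;
    const variance =
      baseline.reduce((sum, v) => sum + (v - mean) * (v - mean), 0) /
      windowSize;
    const stdDev = Math.sqrt(variance);
    // A flat baseline has no spread, so any deviation from it is flagged.
    const isAnomaly =
      stdDev === 0
        ? values[i] !== mean
        : Math.abs(values[i] - mean) / stdDev > threshold;
    if (isAnomaly) {
      anomalies.push(i);
    }
  }
  return anomalies;
}
```

For example, with a window of 5 and a threshold of 3, a latency series that hovers around 100 ms and then spikes to 250 ms is flagged at the spike. One limitation of this simple form is that a flagged spike still enters later baselines and inflates their spread, so a production detector would likely exclude or dampen outliers.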
Workflow Automation
- Automated incident routing based on severity and type
- Prioritization algorithms for efficient resource allocation
- Post-incident analysis and knowledge extraction
- Automated knowledge base updates
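Routing by severity and type, followed by prioritization, might look like the sketch below. The severity levels, team names, and routing table are hypothetical placeholders, not the platform's real configuration.

```typescript
type Severity = "critical" | "major" | "minor";

// Hypothetical incident shape for illustration.
interface Incident {
  id: string;
  severity: Severity;
  type: string; // e.g. "database", "network"
  openedAt: number; // epoch milliseconds
}

interface RoutingDecision {
  team: string;
  page: boolean; // true: page the on-call engineer now; false: open a ticket
}

// Hypothetical routing table: incident type -> owning on-call rotation.
const ROUTES = new Map<string, string>([
  ["database", "storage-oncall"],
  ["network", "infra-oncall"],
]);
const DEFAULT_ROUTE = "general-oncall";

// Lower rank means higher priority.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  major: 1,
  minor: 2,
};

// The type selects the owning team; the severity decides whether to page.
function routeIncident(incident: Incident): RoutingDecision {
  return {
    team: ROUTES.get(incident.type) ?? DEFAULT_ROUTE,
    page: incident.severity === "critical",
  };
}

// Orders incidents by severity first, then oldest first within a severity.
// Returns a new array and leaves the input untouched.
function prioritize(incidents: Incident[]): Incident[] {
  return incidents
    .slice()
    .sort(
      (a, b) =>
        SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] ||
        a.openedAt - b.openedAt
    );
}
```

Keeping the routing table as data rather than branching logic means new incident types can be added without touching the routing function, and unknown types fall back to a default rotation instead of being dropped.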
Key Achievements
- 37% reduction in operational load
- 50% reduction in manual consultation workload
- Improved traceability and response speed through dashboards
- Automated anomaly detection reducing time-to-detection
- Enhanced operational safety through knowledge reuse
Impact
The platform transformed on-call operations from a reactive, manual process into a proactive, largely automated one. With RAG-based knowledge retrieval in place, the team was able to:
- Respond to incidents faster with automated detection and routing
- Reduce operational overhead through self-service knowledge access
- Improve system reliability by learning from past incidents
- Scale operations without a proportional increase in manual effort