Service Reliability Management (SRM)

The Harness Service Reliability Management (SRM) module helps SRE teams proactively monitor system reliability, quickly identify and respond to issues, and ensure that services meet their Service Level Objectives (SLOs).

Core Features

SLO Management

  • SLO Definition: Define and track key service level objectives
  • Error Budget Tracking: Monitor remaining error budget
  • SLO Status Dashboard: Real-time visibility into service health
  • Predictive Analysis: Predict SLO achievement based on historical data

Proactive Monitoring

  • Health Scoring: Comprehensive health assessment of services
  • Anomaly Detection: Automatically identify performance anomalies
  • Root Cause Analysis: Quickly locate problem sources
  • Correlation Analysis: Correlate logs, traces, and metrics data
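
This page does not say which technique SRM uses for anomaly detection. Purely as an illustration of the general idea, the sketch below flags points in a metric series that deviate strongly from a rolling baseline (a z-score test). The function name `find_anomalies`, the window size, and the threshold are all hypothetical choices, not SRM settings.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the preceding `window`
    samples by more than `threshold` standard deviations.

    Illustrative only: not the algorithm SRM itself uses.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to score against; skip.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical latency samples in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
spikes = find_anomalies(latencies_ms)
```

A rolling baseline adapts to gradual drift, but a spike also inflates the baseline for the samples that follow it, so production detectors usually apply more robust statistics.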

Incident Management

  • Auto Alerting: Trigger alerts based on SLO violations
  • On-Call Integration: Integrate with PagerDuty, Slack, and other alerting tools
  • Event Tracking: Maintain a complete record of fault events
  • Post-Mortem Analysis: Post-incident analysis and improvement suggestions

Use Cases

Scenario                 SRM Features
SLO Tracking             Error Budget calculation and alerting
Anomaly Detection        Automatic identification of performance degradation
Incident Response        Fast alerting and event tracking
Reliability Assessment   Service health scoring

Getting Started

1. Define SLO

Define key metrics and targets for services:
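slo:
  name: payment-service-availability
  target: 99.9%                  # share of requests that must be good
  window: 30d                    # evaluation period for the target
  indicator:
    type: availability
    good: http.status < 500      # requests counted as successful
    total: http.total            # all requests received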
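
To make the numbers in this definition concrete, here is a minimal sketch of the arithmetic behind a 99.9% target over a 30-day window. It is plain Python, not SRM code, and the function names are hypothetical. It shows the Error Budget expressed as time, and the availability indicator computed as good requests over total requests.

```python
from datetime import timedelta


def error_budget(target_percent: float, window: timedelta) -> timedelta:
    """Time within the window that may be 'bad' without breaching the SLO."""
    allowed_fraction = 1.0 - target_percent / 100.0
    return window * allowed_fraction


def availability_sli(good: int, total: int) -> float:
    """Availability as a percentage: good requests / total requests."""
    if total == 0:
        # No traffic means nothing failed; treat as fully available.
        return 100.0
    return 100.0 * good / total


budget = error_budget(99.9, timedelta(days=30))   # about 43.2 minutes
sli = availability_sli(good=999_500, total=1_000_000)
meets_target = sli >= 99.9
```

With these sample counts the indicator is 99.95%, which meets the 99.9% target while having used half of the allowed failures.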

2. Configure Monitoring

Connect data sources and set up metric collection rules.

3. Set Alerts

Configure Error Budget alert rules and thresholds.
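
This page does not describe how SRM expresses alert thresholds. One widely used way to reason about "consuming the budget too quickly" is the burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes that approach. The function names and the threshold of 2.0 are hypothetical, not SRM defaults.

```python
def burn_rate(bad: int, total: int, target_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 spends the budget exactly by the end of the window.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - target_percent / 100.0
    return (bad / total) / allowed_error_rate


def days_until_exhausted(rate: float, window_days: float) -> float:
    """Days a full budget lasts if the current burn rate continues."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical last hour of traffic: 50 failures in 10,000 requests.
rate = burn_rate(bad=50, total=10_000, target_percent=99.9)
days_left = days_until_exhausted(rate, window_days=30)

ALERT_THRESHOLD = 2.0  # assumed value for illustration
should_alert = rate >= ALERT_THRESHOLD
```

In this example the service is burning budget about five times faster than the SLO allows, so a full 30-day budget would be gone in roughly six days.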

4. Dashboard Monitoring

Monitor service health status through the SRM dashboard.

5. Incident Response

When faults occur, SRM automatically creates events and notifies relevant personnel.

Best Practices

  1. Start with Critical Services: Prioritize defining SLOs for revenue-related services
  2. Set Reasonable SLIs: Choose service level indicators (SLIs) that truly reflect user experience
  3. Error Budget Alerts: Alert when the Error Budget is being consumed too quickly
  4. Continuous Improvement: Optimize systems based on post-mortems

Key Metrics

Metric         Description
Availability   Percentage of time service responds normally
Latency        Request response time (e.g., P99 latency)
Error Rate     Ratio of failed requests to total requests
Throughput     Number of requests processed per second
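
As a rough illustration of how these four metrics relate to raw request data, the sketch below derives them from a list of request records. It is not SRM code: the `Request` type and `summarize` function are hypothetical. Availability is approximated by the share of successful requests (status below 500) rather than measured over time, and P99 uses the nearest-rank method.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def summarize(requests, window_seconds):
    """Compute the key metrics for requests observed in one time window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in window")
    failed = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank P99: smallest value with at least 99% of samples at or below it.
    rank = (99 * total + 99) // 100  # integer ceil(0.99 * total)
    return {
        "availability_pct": 100.0 * (total - failed) / total,
        "error_rate": failed / total,
        "p99_latency_ms": latencies[rank - 1],
        "throughput_rps": total / window_seconds,
    }


# Hypothetical one-minute sample: 198 fast successes and 2 slow failures.
sample = [Request(200, float(ms)) for ms in range(1, 199)]
sample += [Request(500, 1000.0), Request(503, 2000.0)]
metrics = summarize(sample, window_seconds=60)
```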