Service Reliability Management (SRM)

The Harness Service Reliability Management (SRM) module helps SRE teams monitor system reliability proactively, identify and respond to issues quickly, and keep services within their Service Level Objectives (SLOs).

Core Features

SLO Management

SLO Definition: Define and track key service level objectives
Error Budget Tracking: Monitor how much error budget remains
SLO Status Dashboard: Real-time visibility into service health
Predictive Analysis: Forecast whether an SLO will be met, based on historical data
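
Error budget tracking comes down to simple arithmetic, which a short example can make concrete. The sketch below is not the SRM implementation: it assumes a request-based SLO, and the function name `error_budget` and the sample counts are hypothetical.

```python
def error_budget(target: float, total: int, bad: int) -> dict:
    """Return the error budget for a request-based SLO.

    target: SLO target as a fraction (0.999 for 99.9%).
    total:  requests observed so far in the SLO window.
    bad:    requests that failed the SLI.
    """
    if not 0 < target < 1:
        raise ValueError("target must be between 0 and 1 (exclusive)")
    allowed = (1 - target) * total   # failures the SLO tolerates
    remaining = allowed - bad        # negative means the SLO is violated
    consumed = bad / allowed if allowed else float("inf")
    return {"allowed": allowed, "remaining": remaining, "consumed": consumed}


# Hypothetical numbers: 1,000,000 requests, 250 failures, 99.9% target.
budget = error_budget(0.999, 1_000_000, 250)
```

With these numbers the budget allows about 1,000 failed requests, of which 250 (25%) have been used.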

Proactive Monitoring

Health Scoring: Comprehensive health assessment of services
Anomaly Detection: Automatically identify performance anomalies
Root Cause Analysis: Quickly locate problem sources
Correlation Analysis: Correlate logs, traces, and metrics data
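
This page does not describe how SRM detects anomalies. As a rough illustration of the general idea only, the sketch below uses a plain z-score test: a new measurement is flagged when it lies far from the mean of recent history. The function name `is_anomalous`, the threshold of 3 standard deviations, and the sample latencies are all assumptions.

```python
import statistics


def is_anomalous(history: list[float], value: float, z_threshold: float = 3.0) -> bool:
    """Flag value when it is more than z_threshold standard deviations
    away from the mean of history."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold


# Hypothetical latency samples in milliseconds (mean 100, sample stdev 2).
recent = [100, 102, 98, 101, 99, 100, 103, 97]
spike_flagged = is_anomalous(recent, 180)    # 40 standard deviations away
normal_flagged = is_anomalous(recent, 104)   # 2 standard deviations away
```

A z-score is easy to reason about but sensitive to outliers in the history and to seasonal patterns; production detectors usually need something more robust.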

Incident Management

Auto Alerting: Trigger alerts when SLOs are violated
On-Call Integration: Integrate with PagerDuty, Slack, and other alerting and notification tools
Event Tracking: Keep a complete record of fault events
Post-Mortem Analysis: Analyze incidents after the fact and suggest improvements

Use Cases

| Scenario | SRM Features |
| --- | --- |
| SLO Tracking | Error Budget calculation and alerting |
| Anomaly Detection | Automatic identification of performance degradation |
| Incident Response | Fast alerting and event tracking |
| Reliability Assessment | Service health scoring |

Getting Started

1. Define SLO

Define key metrics and targets for services:

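slo:
  name: payment-service-availability
  target: 99.9%               # share of requests that must be good
  window: 30d                 # evaluation window for the target
  indicator:
    type: availability
    good: http.status < 500   # requests counted as good
    total: http.total         # all requests counted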
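
To show what the indicator above measures, the following sketch evaluates it by hand: a request counts as good when its HTTP status is below 500, and the SLI is the ratio of good to total requests. This is an illustration, not how SRM evaluates SLOs; the function name `availability_sli` and the sample status codes are hypothetical.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Fraction of requests counted as good (HTTP status below 500)."""
    codes = list(status_codes)
    if not codes:
        raise ValueError("no requests observed in the window")
    good = sum(1 for code in codes if code < 500)
    return good / len(codes)


TARGET = 0.999  # 99.9%, as in the definition above

# Hypothetical sample: 9,990 successes, 6 client errors, 4 server errors.
sample = [200] * 9_990 + [404] * 6 + [503] * 4
sli = availability_sli(sample)
slo_met = sli >= TARGET
```

Client errors (4xx) count as good here because the condition only excludes statuses of 500 and above, so the sample yields 9,996 good requests out of 10,000 and meets the 99.9% target.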

2. Configure Monitoring

Connect data sources and set up metric collection rules.

3. Set Alerts

Configure Error Budget alert rules and thresholds.
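
One common way to express an Error Budget threshold is the burn rate: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 uses up the budget exactly over the SLO window. The sketch below is a generic illustration, not SRM's alert rule format; the function names and the threshold of 14.4 (which corresponds to spending 2% of a 30-day budget in one hour) are assumptions.

```python
def burn_rate(target: float, total: int, bad: int) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0
    allowed_error_rate = 1 - target
    return (bad / total) / allowed_error_rate


def should_alert(target: float, total: int, bad: int, threshold: float) -> bool:
    """True when the budget is burning at least as fast as the threshold."""
    return burn_rate(target, total, bad) >= threshold


# Hypothetical last-hour counts for a 99.9% SLO: 2% errors vs. 0.1% allowed.
rate = burn_rate(0.999, total=20_000, bad=400)
alert = should_alert(0.999, total=20_000, bad=400, threshold=14.4)
```

Here the burn rate is about 20, so a 30-day budget would be exhausted in roughly a day and a half if the error rate persisted.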

4. Dashboard Monitoring

Monitor service health status through the SRM dashboard.

5. Incident Response

When faults occur, SRM automatically creates events and notifies relevant personnel.

Best Practices

Start with Critical Services: Define SLOs for revenue-related services first
Set Reasonable SLIs: Choose metrics that truly reflect user experience
Error Budget Alerts: Alert when the Error Budget is being consumed too quickly
Continuous Improvement: Use post-mortem findings to improve systems

Key Metrics

| Metric | Description |
| --- | --- |
| Availability | Percentage of time the service responds normally |
| Latency | Request response time (e.g., P99 latency) |
| Error Rate | Ratio of failed requests to total requests |
| Throughput | Number of requests processed per second |
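
As an illustration of how three of these metrics relate to raw request data, the sketch below computes error rate, P99 latency, and throughput from a list of requests. It is a minimal example under stated assumptions: the `Request` type, the nearest-rank percentile method, and the sample data are hypothetical and not taken from SRM.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of the samples are at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: list[Request], window_seconds: float) -> dict:
    """Compute error rate, P99 latency, and throughput for one time window."""
    if not requests:
        raise ValueError("no requests in the window")
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "error_rate": failed / len(requests),
        "p99_latency_ms": percentile([r.latency_ms for r in requests], 99),
        "throughput_rps": len(requests) / window_seconds,
    }


# Hypothetical window: 100 requests over 10 seconds, latencies 1..100 ms,
# with the two slowest requests failing.
window = [Request(latency_ms=i, status=500 if i > 98 else 200) for i in range(1, 101)]
summary = summarize(window, window_seconds=10)
```

For this sample the error rate is 0.02, the P99 latency is 99 ms, and the throughput is 10 requests per second.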
