
Chaos Engineering

The Harness Chaos Engineering (CE) module is built on the CNCF LitmusChaos project. It helps teams discover system weaknesses by proactively injecting faults, improving the resilience and reliability of production environments.

Core Features

Fault Library

A built-in library of 200+ pre-configured faults covering common scenarios:

Category         Fault Types
Kubernetes       Pod Kill, Container Kill, Pod CPU/Memory Stress
Cloud Platform   AWS AZ Failure, Azure VM Stress, GCP Network Latency
Network          Network Partition, DNS Failure, Packet Loss
Application      Process Kill, Service Delay, Exception Injection
Infrastructure   Disk Stress, IO Delay, Node Restart
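To illustrate how a fault from this library is wired into an experiment, here is a minimal sketch of a LitmusChaos ChaosEngine manifest running the Pod Kill (pod-delete) fault. The namespace, label, service account, and tuning values below are placeholders chosen for the example, not values from this document.

```yaml
# Sketch only: a LitmusChaos ChaosEngine targeting pods by label.
# Namespace, label, and duration values are hypothetical.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
  namespace: demo                  # hypothetical target namespace
spec:
  appinfo:
    appns: demo
    applabel: "app=payment-service"
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete             # the "Pod Kill" fault from the table above
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # seconds of fault injection
            - name: PODS_AFFECTED_PERC
              value: "25"          # keep the blast radius small
```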

Experiment Orchestration

  • Visual Editor: Design chaos experiments through a graphical interface
  • Probe System: Verify that system behavior under injected faults meets expectations
  • Timeline View: Visual representation of experiment execution and results

ChaosGuard

Enterprise-grade chaos engineering governance platform:
  • Expert Guardrails: Pre-defined best practices and safety boundaries
  • Auto Guardrails: Runtime protection of production systems from unexpected faults
  • Scaled Management: Manage chaos experiments across multiple teams and projects

Use Cases

Scenario                 Description
Resilience Verification  Verify system behavior when components fail
SLO Verification         Confirm the system meets predefined reliability goals
Drill Preparation        Train teams on fault-handling procedures
Regression Testing       Ensure new version deployments maintain resilience

Getting Started

1. Install Chaos Delegate

Install Harness Chaos Delegate in your Kubernetes cluster.

2. Create Chaos Experiment

Use Chaos Studio to design experiments, selecting target applications and fault types.

3. Define Probes

Configure verification probes to confirm expected system behavior under faults:
probes:
  - name: http-probe
    type: http
    httpProbe:
      url: "http://payment-service/health"
      responseCode: "200"
      timeout: 5s
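To make concrete what an HTTP probe like the one above actually checks, here is a small self-contained Python sketch (not Harness code, purely an illustration) that queries a health endpoint and compares the status code against the expected value, mirroring the probe's `url`, `responseCode`, and `timeout` fields. The local server is a stand-in for the service under test.

```python
# Illustrative only: the verification an HTTP probe performs, sketched in
# plain Python. Harness/Litmus implement this internally; names here are ours.
import http.server
import threading
import urllib.error
import urllib.request

def http_probe(url: str, expected_code: int, timeout: float) -> bool:
    """Return True if the endpoint answers with the expected status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == expected_code
    except urllib.error.HTTPError as err:
        return err.code == expected_code   # non-2xx responses still carry a code
    except OSError:
        return False                       # connection refused, timeout, DNS failure

# Stand-in for the service under test: a tiny local /health endpoint.
class Health(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200 if self.path == "/health" else 404)
        self.end_headers()
    def log_message(self, *args):          # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

ok  = http_probe(f"http://127.0.0.1:{port}/health", 200, timeout=5)
bad = http_probe(f"http://127.0.0.1:{port}/missing", 200, timeout=5)
print(ok, bad)  # True False
server.shutdown()
```

A real probe adds run properties (polling interval, retries, initial delay) on top of this single check.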

4. Execute Experiment

Execute the experiment in a non-production environment and observe system behavior.

5. Analyze Results

Evaluate system resilience based on probe results and monitoring data.

Best Practices

  1. Start Non-Production: Verify experiment scripts in test environments first
  2. Small Scope, Low Impact: Initially affect only a small number of resources
  3. Set Safety Guardrails: Configure automatic termination conditions to prevent fault spread
  4. Monitor Key Metrics: Define criteria for experiment success/failure
  5. Document Learnings: Record discovered issues and improvement measures
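Practices 3 and 4 (safety guardrails and key metrics) reduce to a simple decision rule: abort the experiment as soon as any monitored metric crosses a threshold. A minimal sketch of that rule, with hypothetical metric names and thresholds:

```python
# Sketch of an automatic-termination check, as in best practices 3 and 4.
# Metric names and thresholds are hypothetical, not a Harness API.

ABORT_THRESHOLDS = {
    "error_rate": 0.05,      # abort if more than 5% of requests fail
    "p99_latency_ms": 1500,  # abort if tail latency exceeds 1.5 s
}

def should_abort(metrics: dict) -> bool:
    """Return True if any monitored metric breaches its guardrail."""
    return any(
        metrics.get(name, 0) > limit
        for name, limit in ABORT_THRESHOLDS.items()
    )

healthy  = {"error_rate": 0.01, "p99_latency_ms": 900}
degraded = {"error_rate": 0.12, "p99_latency_ms": 2400}

print(should_abort(healthy))   # False
print(should_abort(degraded))  # True
```

In practice this check runs continuously during the experiment, and a breach triggers the configured rollback or halt action.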

Security Considerations

  • Be Cautious in Production: Production environment experiments require strict approval
  • Time Windows: Execute experiments during off-peak business hours
  • Rollback Plans: Prepare manual or automatic recovery measures
  • Team Notifications: Ensure relevant teams know about experiment plans