Executive Summary
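JIL Sovereign's dual-policy remediation model resolves a fundamental tension in automated fleet management: acting on every threat risks cascading failures (taking too many nodes offline at once), while holding back leaves a potentially compromised node in service. The model applies a different policy to each threat category: operational and performance issues are remediated only when quorum is preserved, while security threats trigger immediate isolation regardless of quorum.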
Problem Statement
Automated remediation systems face a fundamental design conflict. Widely used tools either do not remediate automatically or apply a single policy: always respect availability (and miss security threats) or always prioritize security (and risk availability cascades). Neither approach is adequate for networks securing billions in bridged assets.
- Prometheus + Alertmanager: No auto-remediation, no quorum awareness
- Kubernetes PDB: Single policy, always respects budget
- AWS Auto Scaling: No composite threat model, no category-dependent policy
Dual-Policy Architecture
Policy 1: Operational Threats
For operational issues (container down, high CPU, memory pressure, performance degradation), auto-remediation is permitted ONLY IF the action would not reduce healthy nodes below the quorum minimum: max(7, ceil(total_validators * 0.70)).
Policy 2: Security Threats
For cryptographic integrity violations (image digest mismatch indicating possible tampering), auto-remediation overrides quorum protection and executes immediately. A compromised node inside the network is a greater threat than the availability cost of removing it.
| Threat Category | Examples | Policy | Quorum Check |
|---|---|---|---|
| Operational | Container down, high CPU, memory pressure | Quorum-protected | Yes - blocked if it would drop healthy nodes below minimum |
| Performance | Latency spike, throughput drop | Quorum-protected | Yes - blocked if it would drop healthy nodes below minimum |
| Security | Image digest mismatch, key expiry | Override | No - immediate isolation |
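The table above can be read as a small decision function. The following is a minimal sketch, not the actual implementation: the names `ThreatCategory`, `Decision`, and `decide_remediation` are hypothetical, and it assumes that each remediation action takes exactly one currently healthy node out of service.

```python
from dataclasses import dataclass
from enum import Enum


class ThreatCategory(Enum):
    OPERATIONAL = "operational"
    PERFORMANCE = "performance"
    SECURITY = "security"


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def decide_remediation(category: ThreatCategory,
                       healthy_nodes: int,
                       quorum_minimum: int) -> Decision:
    """Apply the dual-policy rule to one proposed remediation action.

    Assumption (not from the source): the action removes exactly one
    healthy node from service while it runs.
    """
    # Policy 2: security threats bypass the quorum check entirely.
    if category is ThreatCategory.SECURITY:
        return Decision(True, "security override: immediate isolation")

    # Policy 1: operational and performance threats are quorum-protected.
    healthy_after = healthy_nodes - 1
    if healthy_after < quorum_minimum:
        return Decision(False, "blocked: would drop below quorum minimum")
    return Decision(True, "quorum-protected: quorum preserved")
```

With 20 validators (quorum minimum 14) and exactly 14 healthy nodes, an operational action is blocked while a security action still proceeds.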
Quorum Computation
The quorum minimum is dynamically computed based on the current validator set size:
quorum_minimum = max(7, ceil(total_validators * 0.70))
// With 20 validators: max(7, ceil(20 * 0.70)) = max(7, 14) = 14
// With 10 validators: max(7, ceil(10 * 0.70)) = max(7, 7) = 7
// With 5 validators: max(7, ceil(5 * 0.70)) = max(7, 4) = 7
The absolute minimum of 7 ensures that even with a small validator set, sufficient redundancy is maintained for consensus safety.
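As an illustration, the formula can be written as a short function. This is a sketch under one stated design choice: the function name `quorum_minimum` mirrors the formula, and the 70% ceiling is computed with integer arithmetic (`ceil(7n / 10)`) rather than floating point, purely to avoid rounding surprises.

```python
def quorum_minimum(total_validators: int) -> int:
    """Return max(7, ceil(total_validators * 0.70)).

    Integer arithmetic is an illustrative choice: ceil(7n / 10) is
    computed as (7n + 9) // 10 so no floating-point rounding is involved.
    """
    if total_validators < 0:
        raise ValueError("total_validators must be non-negative")
    seventy_percent_ceiling = (total_validators * 7 + 9) // 10
    return max(7, seventy_percent_ceiling)


# The three worked cases from the text.
for n in (20, 10, 5):
    print(n, quorum_minimum(n))
```

Note that when the validator set has fewer than 7 members, the floor of 7 exceeds the set size, so under the rule as stated every quorum-protected action would be blocked.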
Rate Limiting
Multi-level rate limiting prevents remediation storms:
- Per-node cooldown: Minimum 5-minute interval between actions on the same node
- Per-action burst limit: Maximum 3 of the same action type per inspection cycle
- Global fleet cap: Maximum 2 nodes remediated per 60-second inspection cycle