feat(cron): add automatic alert detector for anomaly monitoring

Implement AlertDetector background task that runs every minute to detect
and create alerts for various anomalies:

- Rate limit detection: monitors masters hitting rate limits
- Error spike detection: flags keys with >= 10% error rate
- Quota exceeded: warns when key quota usage >= 90%
- Provider down: alerts when an API key has >= 50% failure rate and at least 10 failures

Includes fingerprint-based deduplication with 5-minute cooldown to
prevent duplicate alerts for the same issue.
zenfun
2025-12-31 14:49:51 +08:00
parent 6cab7e257a
commit 85d91cdd2e
3 changed files with 359 additions and 0 deletions


@@ -524,6 +524,33 @@ curl -X POST http://localhost:8080/internal/alerts/report \
}
```
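### 6.6 Automatic Alert Detection (AlertDetector)
The CP side runs a background task that checks for anomalies once a minute and creates alerts automatically.
**Detection rules**
| Rule | Type | Severity | Description |
| :--- | :--- | :--- | :--- |
| Rate limit | `rate_limit` | warning | Detects Masters that are currently rate-limited in Redis |
| Error spike | `error_spike` | info/warning/critical | Error rate over the last 5 minutes >= 10% (>= 50% is critical) |
| Quota exceeded | `quota_exceeded` | warning/critical | Key quota usage >= 90% (critical once it reaches 100%) |
| Provider down | `provider_down` | critical | API Key failure rate >= 50% and failure count >= 10 |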
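
The threshold logic in the table can be sketched in Go as follows. This is a minimal illustration, not the detector's actual code: `quotaSeverity` and `providerDown` are hypothetical names, and treating a non-positive limit or total as "nothing to check" is an assumption. The `error_spike` rule is left out because the table does not state where `info` ends and `warning` begins.

```go
package main

import "fmt"

// quotaSeverity applies the quota_exceeded rule: warning at >= 90% usage,
// critical once usage reaches 100%. fire is false below 90%.
// Hypothetical helper; a non-positive limit is assumed to mean "no quota".
func quotaSeverity(used, limit int64) (severity string, fire bool) {
	if limit <= 0 {
		return "", false
	}
	ratio := float64(used) / float64(limit)
	switch {
	case ratio >= 1.0:
		return "critical", true
	case ratio >= 0.9:
		return "warning", true
	default:
		return "", false
	}
}

// providerDown applies the provider_down rule: failure rate >= 50%
// AND at least 10 failures, so low-traffic keys do not trigger it.
// Hypothetical helper; total == 0 is assumed to mean "no data".
func providerDown(failures, total int64) bool {
	if total <= 0 {
		return false
	}
	rate := float64(failures) / float64(total)
	return rate >= 0.5 && failures >= 10
}

func main() {
	sev, fire := quotaSeverity(95, 100)
	fmt.Println(sev, fire)            // warning true
	fmt.Println(providerDown(12, 20)) // true: 60% failures, 12 >= 10
	fmt.Println(providerDown(4, 5))   // false: 80% failures but only 4
}
```

The failure-count floor in `providerDown` is what keeps a key with a handful of requests from raising a critical alert on a single bad minute.
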
**Deduplication**
- Alerts are deduplicated by `fingerprint` (`type:related_type:related_id`)
- An active alert with the same fingerprint is not created again within 5 minutes
**Configuration defaults**
```go
Interval:              1 * time.Minute // detection interval
ErrorSpikeThreshold:   0.1             // error-rate threshold (10%)
ErrorSpikeWindow:      5 * time.Minute // window for error statistics
QuotaWarningThreshold: 0.9             // quota warning threshold (90%)
ProviderFailThreshold: 10              // provider failure-count threshold
DeduplicationCooldown: 5 * time.Minute // deduplication cooldown
```
---
## 7. Notes