mirror of
https://github.com/EZ-Api/ez-api.git
synced 2026-01-13 17:47:51 +00:00
feat(alerts): add traffic spike detection with configurable thresholds
Introduce traffic_spike alert type for monitoring system and per-master traffic levels with configurable thresholds stored in database.

- Add AlertThresholdConfig model for persistent threshold configuration
- Implement GET/PUT /admin/alerts/thresholds endpoints for threshold management
- Add traffic spike detection in alert detector cron job:
  - Global QPS monitoring across all masters
  - Per-master RPM/TPM checks with minimum sample thresholds
  - Per-master RPD/TPD checks for daily limits
- Use warning severity at threshold, critical at 2x threshold
- Include metric metadata (value, threshold, window) in alert details
- Update API documentation with new endpoints and alert type
70
docs/api.md
@@ -398,6 +398,7 @@ curl "http://localhost:8080/admin/api-keys?group_id=1&status=suspended" \
| `key_disabled` | Key disabled |
| `key_expired` | Key expired |
| `provider_down` | Upstream service unavailable |
| `traffic_spike` | Traffic threshold exceeded |

| Severity (`severity`) | Description |
| :--- | :--- |
@@ -422,6 +423,8 @@ curl "http://localhost:8080/admin/api-keys?group_id=1&status=suspended" \
| `/admin/alerts/:id/resolve` | POST | Resolve an alert |
| `/admin/alerts/:id` | DELETE | Dismiss an alert (soft delete) |
| `/admin/alerts/stats` | GET | Get alert statistics |
| `/admin/alerts/thresholds` | GET | Get the traffic threshold configuration |
| `/admin/alerts/thresholds` | PUT | Update the traffic threshold configuration |

#### Create an alert
```bash
@@ -536,10 +539,16 @@ The CP side runs a background task that checks for anomalies every minute and generates alerts.
| Error spike | `error_spike` | info/warning/critical | Error rate over the last 5 minutes >= 10% (>= 50% is critical) |
| Quota exceeded | `quota_exceeded` | warning/critical | Key quota usage >= 90% (100% is critical) |
| Upstream failure | `provider_down` | critical | API key failure rate >= 50% and failure count >= 10 |
| Global QPS | `traffic_spike` | warning/critical | System QPS >= threshold (>= 2x threshold is critical) |
| Master RPM | `traffic_spike` | warning/critical | Master requests per minute >= threshold (minimum sample met) |
| Master TPM | `traffic_spike` | warning/critical | Master tokens per minute >= threshold (minimum sample met) |
| Master RPD | `traffic_spike` | warning/critical | Master requests per day >= threshold |
| Master TPD | `traffic_spike` | warning/critical | Master tokens per day >= threshold |

**Deduplication**:
- Alerts are deduplicated on `fingerprint` (`type:related_type:related_id`)
- An active alert with the same fingerprint is not created again within 5 minutes
- `traffic_spike` alerts use the `type:metric:related_type:related_id` format

**Default configuration**:
```go
@@ -551,6 +560,67 @@ ProviderFailThreshold: 10 // upstream failure count threshold
DeduplicationCooldown: 5 * time.Minute // deduplication cooldown
```

### 6.7 Traffic threshold configuration

Administrators can configure traffic alert thresholds through the API without restarting the service.

#### Get threshold configuration
```bash
curl http://localhost:8080/admin/alerts/thresholds \
  -H "Authorization: Bearer <admin_token>"
```

**Example response**:
```json
{
  "global_qps": 100,
  "master_rpm": 20,
  "master_rpd": 1000,
  "master_tpm": 10000000,
  "master_tpd": 100000000,
  "min_rpm_requests_1m": 10,
  "min_tpm_tokens_1m": 1000000,
  "updated_at": 1704153600
}
```
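
For a Go client, the response above could be decoded along these lines. This is a sketch only: the `ThresholdConfig` type, its field names, and `parseThresholds` are hypothetical, and just the JSON keys are taken from the example response. Treating `updated_at` as Unix seconds is an assumption based on the sample value.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ThresholdConfig mirrors the JSON keys of the example response.
// The Go type and field names are hypothetical.
type ThresholdConfig struct {
	GlobalQPS        int64 `json:"global_qps"`
	MasterRPM        int64 `json:"master_rpm"`
	MasterRPD        int64 `json:"master_rpd"`
	MasterTPM        int64 `json:"master_tpm"`
	MasterTPD        int64 `json:"master_tpd"`
	MinRPMRequests1m int64 `json:"min_rpm_requests_1m"`
	MinTPMTokens1m   int64 `json:"min_tpm_tokens_1m"`
	UpdatedAt        int64 `json:"updated_at"` // assumed to be Unix seconds
}

// parseThresholds decodes a GET /admin/alerts/thresholds response body.
func parseThresholds(body []byte) (ThresholdConfig, error) {
	var cfg ThresholdConfig
	err := json.Unmarshal(body, &cfg)
	return cfg, err
}

// sampleResponse is the example response from the documentation.
const sampleResponse = `{
  "global_qps": 100,
  "master_rpm": 20,
  "master_rpd": 1000,
  "master_tpm": 10000000,
  "master_tpd": 100000000,
  "min_rpm_requests_1m": 10,
  "min_tpm_tokens_1m": 1000000,
  "updated_at": 1704153600
}`

func main() {
	cfg, err := parseThresholds([]byte(sampleResponse))
	if err != nil {
		panic(err)
	}
	fmt.Printf("global_qps=%d master_rpm=%d master_tpd=%d\n",
		cfg.GlobalQPS, cfg.MasterRPM, cfg.MasterTPD)
}
```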

#### Update threshold configuration
```bash
curl -X PUT http://localhost:8080/admin/alerts/thresholds \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "global_qps": 200,
    "master_rpm": 50
  }'
```
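
The same update can be built from Go with the standard `net/http` package. The sketch below mirrors the curl example and only constructs the request; `newUpdateRequest` is a hypothetical helper. Like the curl call, it sends just two fields, and whether omitted fields keep their previous values is not stated in this section.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newUpdateRequest builds a PUT /admin/alerts/thresholds request carrying the
// given threshold fields as a JSON object. Hypothetical helper for illustration.
func newUpdateRequest(baseURL, adminToken string, fields map[string]int64) (*http.Request, error) {
	body, err := json.Marshal(fields)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/admin/alerts/thresholds", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+adminToken)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newUpdateRequest("http://localhost:8080", "<admin_token>", map[string]int64{
		"global_qps": 200,
		"master_rpm": 50,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
	// To send it against a running server: resp, err := http.DefaultClient.Do(req)
}
```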

**Threshold fields**:

| Field | Default | Description |
| :--- | :--- | :--- |
| `global_qps` | 100 | System-wide QPS threshold |
| `master_rpm` | 20 | Requests-per-minute threshold per Master |
| `master_rpd` | 1000 | Requests-per-day threshold per Master |
| `master_tpm` | 10,000,000 | Tokens-per-minute threshold per Master |
| `master_tpd` | 100,000,000 | Tokens-per-day threshold per Master |
| `min_rpm_requests_1m` | 10 | Minimum sample for RPM detection (RPM detection is skipped below this value) |
| `min_tpm_tokens_1m` | 1,000,000 | Minimum sample for TPM detection (TPM detection is skipped below this value) |

**Severity rules**:
- `warning`: value >= threshold
- `critical`: value >= 2x threshold
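
The severity rules and the minimum-sample gate can be sketched as below. This is an illustration under assumptions, not the detector's code: `classify` and `classifyWithMinSample` are hypothetical names, how the sample is measured is not specified here, and treating a non-positive threshold as "disabled" is an added assumption.

```go
package main

import "fmt"

// classify maps a metric value to a severity: "critical" at 2x the threshold,
// "warning" at the threshold, and "" when no alert should fire.
func classify(value, threshold int64) string {
	switch {
	case threshold <= 0:
		return "" // assumption: a non-positive threshold disables the check
	case value >= 2*threshold:
		return "critical"
	case value >= threshold:
		return "warning"
	default:
		return ""
	}
}

// classifyWithMinSample skips the check when the observed sample is below the
// configured minimum (min_rpm_requests_1m or min_tpm_tokens_1m).
func classifyWithMinSample(value, threshold, sample, minSample int64) string {
	if sample < minSample {
		return ""
	}
	return classify(value, threshold)
}

func main() {
	fmt.Printf("%q\n", classify(150, 20))                     // "critical"
	fmt.Printf("%q\n", classify(25, 20))                      // "warning"
	fmt.Printf("%q\n", classify(10, 20))                      // ""
	fmt.Printf("%q\n", classifyWithMinSample(30, 20, 5, 10))  // ""
	fmt.Printf("%q\n", classifyWithMinSample(30, 20, 30, 10)) // "warning"
}
```

The critical case is tested first because any value at or above 2x the threshold also satisfies the warning condition.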

**`traffic_spike` alert metadata**:
```json
{
  "metric": "master_rpm",
  "value": 150,
  "threshold": 20,
  "window": "1m"
}
```

---

## 7. Notes