mirror of https://github.com/EZ-Api/ez-api.git synced 2026-01-13 17:47:51 +00:00


zenfun ba54abd424 feat(alerts): add traffic spike detection with configurable thresholds

Introduce traffic_spike alert type for monitoring system and per-master
traffic levels with configurable thresholds stored in database.

- Add AlertThresholdConfig model for persistent threshold configuration
- Implement GET/PUT /admin/alerts/thresholds endpoints for threshold management
- Add traffic spike detection in alert detector cron job:
  - Global QPS monitoring across all masters
  - Per-master RPM/TPM checks with minimum sample thresholds
  - Per-master RPD/TPD checks for daily limits
- Use warning severity at threshold, critical at 2x threshold
- Include metric metadata (value, threshold, window) in alert details
- Update API documentation with new endpoints and alert type

2025-12-31 15:56:17 +08:00


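EZ-API Control Plane Business Documentation

This document supplements the Swagger API definitions. It explains EZ-API's business logic, the relationships among its core models, and the meaning of its configuration options, so that front-end developers and operators can get started quickly. Swagger entry point: GET /swagger/index.html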

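1. Quick Start

1.1 Service Address (Base URL)

Default address: http://{host}:8080

1.2 Authentication

The system uses a three-tier authentication scheme (Admin, Master, Key), with each tier mapped to a different set of permissions, plus a separate Internal token for calls between internal components: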

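Role	Authentication	Configuration source	Description
Admin	`Authorization: Bearer <admin_token>`	Environment variable `EZ_ADMIN_TOKEN`	Global administrative access; can manage all tenants, providers, and models.
Master	`Authorization: Bearer <master_key>`	Returned when the Master is created	Tenant-level access; can only manage its own child Keys and view its own statistics and logs.
Key	`Authorization: Bearer <child_key>`	Issued by a Master or an Admin	Used to call the AI API; identity information can be queried via `/auth/whoami`.
Internal	`X-Internal-Token: <token>`	Environment variable `EZ_INTERNAL_STATS_TOKEN`	Used by internal components (such as the Data Plane) to sync metrics.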
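The header choice in the table above can be captured in a small helper. This is only a sketch: the function name `auth_headers` and the lowercase role strings are illustrative assumptions, not part of EZ-API.

```python
def auth_headers(role: str, token: str) -> dict:
    """Return the HTTP header that carries the credential for a given role.

    The role names ("admin", "master", "key", "internal") are hypothetical
    labels for the four rows of the table above.
    """
    if role == "internal":
        # Internal components authenticate with a dedicated header.
        return {"X-Internal-Token": token}
    if role in ("admin", "master", "key"):
        # Admin tokens, Master keys, and child Keys all use Bearer auth.
        return {"Authorization": f"Bearer {token}"}
    raise ValueError(f"unknown role: {role}")


print(auth_headers("admin", "<admin_token>"))
print(auth_headers("internal", "<token>"))
```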

1.3 Identity Endpoint

GET /auth/whoami identifies the caller's identity type from the token in the Authorization header:

Token type	Returned `type`	Description
Admin Token	`"admin"`	Opens the admin panel
Master Key	`"master"`	Opens the tenant self-service panel
Child Key	`"key"`	Shows the Key information page; the `issued_by` field identifies the issuer

Example responses:

// Admin Token
{"type": "admin", "role": "admin"}

// Master Key
{"type": "master", "id": 1, "name": "R&D Team", "group": "default", ...}

// Child Key
{"type": "key", "id": 5, "master_id": 1, "issued_by": "master", ...}

1.4 General Conventions

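Pagination:
- Admin list endpoints (GET /admin/*): use page (starting at 1) and limit (default 50, maximum 200).
- Log lists: use limit and offset.
Time formats:
- Query parameters (since/until): Unix seconds (integer).
- Log cleanup (before): an RFC3339 string (for example, 2025-01-01T00:00:00Z).
Fuzzy search: the search parameter usually matches names, descriptions, or key identifiers.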
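The two time formats and the pagination bounds above are easy to mix up, so a client-side helper may be useful. The following is a minimal sketch; the function names are hypothetical, and clamping values on the client (rather than letting the server reject them) is an assumption.

```python
from datetime import datetime, timezone

DEFAULT_LIMIT = 50   # default page size stated above
MAX_LIMIT = 200      # maximum page size stated above


def admin_page_params(page: int = 1, limit: int = DEFAULT_LIMIT) -> dict:
    """Clamp page and limit to the documented bounds (client-side assumption)."""
    return {"page": max(1, page), "limit": max(1, min(limit, MAX_LIMIT))}


def to_unix_seconds(dt: datetime) -> int:
    """Value for since/until: integer Unix seconds."""
    return int(dt.timestamp())


def to_rfc3339(dt: datetime) -> str:
    """Value for the log-cleanup `before` parameter: RFC3339 in UTC."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


new_year = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(admin_page_params(page=0, limit=500))
print(to_unix_seconds(new_year))
print(to_rfc3339(new_year))
```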

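2. Core Business Models

EZ-API's core logic revolves around the chain of tenant, token, route, and provider.

2.1 Resource Relationship Diagram

graph TD
    Admin[Admin] -- creates --> Master[Master tenant]
    Master -- issues --> Key[Key child token]
    Master -- associated with --> Namespace[Namespace]
    Namespace -- defines --> Binding[Binding routing rule]
    Binding -- maps to --> ProviderGroup[ProviderGroup upstream group]
    ProviderGroup -- contains --> APIKey[APIKey upstream credential]
    Binding -. related to .-> Model[Model capability table]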

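2.2 Identity and Permission Model

Master (tenant): the top-level account unit of the system.
- group: the routing group; determines which set of providers the tenant uses by default.
- epoch: a version number. When a Master is deleted or reset, the epoch is incremented and all previously issued child Keys become invalid immediately.
- global_qps: the tenant-wide rate limit; 0 means unlimited.
Key (child token): an API key that a Master issues to end users.
- scopes: a permission description, used only as a business label.
- quota_limit: the total quota (in tokens); -1 means unlimited.
- quota_reset_type: the quota reset period (for example, daily or monthly).
- allow_ips/deny_ips: IP allowlist/denylist; CIDR notation is supported.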
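To illustrate how CIDR-based allow_ips/deny_ips lists might be evaluated, here is a minimal sketch using Python's standard `ipaddress` module. The precedence rules are assumptions that this document does not state: deny entries win over allow entries, and an empty allow list admits any address. The function name `ip_allowed` is hypothetical.

```python
import ipaddress


def ip_allowed(client_ip: str, allow_ips: list, deny_ips: list) -> bool:
    """Return True if client_ip passes the allow/deny lists.

    Assumptions (not confirmed by the document): deny wins over allow,
    and an empty allow list means "allow everything not denied".
    """
    addr = ipaddress.ip_address(client_ip)

    def matches(entries: list) -> bool:
        # strict=False accepts entries with host bits set, and bare IPs
        # are parsed as single-address networks.
        return any(addr in ipaddress.ip_network(e, strict=False) for e in entries)

    if matches(deny_ips):
        return False
    return not allow_ips or matches(allow_ips)


print(ip_allowed("10.1.2.3", ["10.0.0.0/8"], ["10.1.0.0/16"]))
```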

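2.3 Routing and Provider Model

ProviderGroup (upstream group): a set of upstream definitions of the same type (such as OpenAI / Anthropic / Compatible).
- base_url and models are declared once at the Group level.
- name serves as the route's route_group (used internally).
APIKey (upstream credential): the pool of keys that are actually usable.
- weight: the load-balancing weight within the same group.
- auto_ban / ban_until: support automatic circuit breaking and manual disabling.
Model (model capability table): a global capability registry (context length, feature support, and so on).
Namespace: a logical isolation unit used to organize Binding rules.
Binding (routing rule): the core routing logic.
- Maps bindingKey = namespace.public_model to a group_id (ProviderGroup).
- Multiple Bindings (candidates) may exist for the same bindingKey; one is chosen by weighted selection using weight.
- selector_type + selector_value resolve the upstream true_model from the Group's models.
- Capability checks are keyed by bindingKey (the CP aggregates true_model capabilities onto the bindingKey).
- If a client request carries no namespace, the DP builds the bindingKey from the key's default_namespace.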
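The bindingKey construction and the weighted choice among candidates described above can be sketched as follows. This is an illustration only: it assumes the bindingKey is the literal dot-joined string, and the function names and candidate dictionaries are hypothetical rather than taken from the EZ-API code.

```python
import random
from typing import Optional


def binding_key(public_model: str, namespace: Optional[str], default_namespace: str) -> str:
    """Build the bindingKey, falling back to the key's default_namespace.

    Assumption: the key is the literal string "<namespace>.<public_model>".
    """
    return f"{namespace or default_namespace}.{public_model}"


def pick_binding(candidates: list, rng: Optional[random.Random] = None) -> dict:
    """Weighted random choice among Bindings that share one bindingKey."""
    rng = rng or random.Random()
    weights = [c["weight"] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


candidates = [
    {"group_id": 1, "weight": 3},  # selected about 75% of the time
    {"group_id": 2, "weight": 1},  # selected about 25% of the time
]
key = binding_key("gpt-4o", None, default_namespace="default")
print(key, pick_binding(candidates)["group_id"])
```

With weights 3 and 1, roughly three out of four requests for the same bindingKey would go to group 1.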

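2.4 Status Conventions

Status	Applies to	Meaning
`active`	All	Available and operating normally.
`suspended`	Master, Key, Namespace	Disabled; requests are blocked.
`auto_disabled`	APIKey	Automatically circuit-broken by the system because of failures.
`manual_disabled`	APIKey	Manually disabled by an administrator.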

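3. Feature Flags

The system supports adjusting runtime behavior dynamically through the API.

3.1 Storage

Regular flags: stored in the Redis hash meta:features.
Log policy: stored in separate keys because it involves cleanup logic.

3.2 Common Configuration Items

Item	Type	Description	Default
`dp_state_store_backend`	string	State store backend: `memory` (single node) / `redis` (cluster).	`memory`
`dp_claude_cross_upstream`	bool	Whether Claude-protocol requests may be routed to OpenAI-compatible and Google-family upstreams.	`true`
`log_request_body_enabled`	bool	Whether to record request bodies in logs (note the privacy risk).	`true`
`log_retention_days`	int	Log retention in days. Written to Redis `meta:log:retention_days`.	`30`
`log_max_records`	int	Maximum number of log records retained. Written to Redis `meta:log:max_records`.	`1000000`

Note: after a feature is updated, the Data Plane (DP) loads the new configuration automatically in its next sync cycle (usually within seconds).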

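4. API Module Overview

4.0 Public Endpoints - No Authentication Required

Endpoint	Description	Example response
`GET /health`	Health check (including dependency status)	`{"status": "ok", "database": "up", "redis": "up", "uptime": "1h30m"}`
`GET /status`	Public status summary	`{"status": "ok", "uptime": "1h30m", "version": "v0.1.0"}`
`GET /about`	System information	`{"name": "EZ-API Gateway", "version": "v0.1.0", ...}`
`GET /auth/whoami`	Identity lookup	Returns the identity type and details based on the token
`GET /swagger/*`	API documentation	Swagger UI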

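4.1 Admin API - Requires Admin Token

Tenant management: create Masters, issue child Keys, monitor real-time QPS, freeze/unfreeze.
Upstream management: CRUD and batch operations for ProviderGroup + APIKey.
Model registry: manage global model capabilities; supports refreshing from the remote models.dev and rolling back.
Log auditing: query site-wide request logs, bulk-delete logs by condition, configure webhook alerts.
Dashboard statistics: aggregated metric summaries, system-level real-time QPS, per-hour/per-minute statistics.
Alert management: view, acknowledge, resolve, and summarize system alerts.

4.2 Tenant API (Master API) - Requires Master Key

Self-service: view tenant information and manage the tenant's own child Keys.
Analytics: view the tenant's own request logs, usage statistics, and real-time QPS.

4.3 Internal API - Requires Internal Token

Metrics reporting: the Data Plane periodically calls /internal/stats/flush to sync token consumption and request counts.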

5. Typical Operations

5.1 Tenant and Token Management

Create a Master tenant:

curl -X POST http://localhost:8080/admin/masters \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "R&D Team",
    "group": "default",
    "max_child_keys": 10,
    "global_qps": 50
  }'

Issue a child Key for a Master (note: key_secret is returned only once, in this response):

curl -X POST http://localhost:8080/admin/masters/{master_id}/keys \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "scopes": "chat,embedding",
    "quota_limit": 5000000,
    "quota_reset_type": "monthly",
    "model_limits_enabled": true,
    "model_limits": "gpt-4,claude-3-opus"
  }'

5.2 Provider and Routing Configuration

Create a ProviderGroup:

curl -X POST http://localhost:8080/admin/provider-groups \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "openai-default",
    "type": "openai",
    "base_url": "https://api.openai.com/v1",
    "models": ["gpt-4o", "gpt-4o-mini"]
  }'

Create an APIKey:

curl -X POST http://localhost:8080/admin/api-keys \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "group_id": 1,
    "api_key": "sk-...",
    "weight": 10
  }'

Create a model route (Binding):

curl -X POST http://localhost:8080/admin/bindings \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "default",
    "public_model": "gpt-4o",
    "group_id": 1,
    "selector_type": "exact",
    "selector_value": "gpt-4o",
    "weight": 1
  }'

5.3 Operations and Monitoring

Adjust the global log retention policy:

curl -X PUT http://localhost:8080/admin/features \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "log_retention_days": 15,
    "log_max_records": 2000000
  }'

Query the request logs of a specific Key:

curl "http://localhost:8080/admin/logs?key_id=123&limit=20&status_code=200" \
  -H "Authorization: Bearer <admin_token>"

View a tenant's real-time statistics:

curl http://localhost:8080/admin/masters/{id}/realtime \
  -H "Authorization: Bearer <admin_token>"

5.4 Tenant Self-Service (Master API)

Issue a child Key as a tenant:

curl -X POST http://localhost:8080/v1/tokens \
  -H "Authorization: Bearer <master_key>" \
  -H "Content-Type: application/json" \
  -d '{
    "scopes": "app-1",
    "quota_limit": 100000
  }'

View the tenant's own usage statistics:

curl "http://localhost:8080/v1/stats?period=today" \
  -H "Authorization: Bearer <master_key>"

5.5 Internal and Advanced Operations

Manually refresh the model registry:

curl -X POST http://localhost:8080/admin/model-registry/refresh \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{"ref": "main"}'

Report internal metrics (Internal):

curl -X POST http://localhost:8080/internal/stats/flush \
  -H "X-Internal-Token: <internal_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "keys": [
      {
        "token_hash": "...",
        "requests": 100,
        "tokens": 5000,
        "last_accessed_at": 1734849600
      }
    ]
  }'

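6. Dashboard and Alert APIs

6.1 Dashboard Statistics Endpoints

Get the dashboard summary

GET /admin/dashboard/summary

Query parameters:

Parameter	Type	Description	Default
`period`	string	Preset period: `today`, `week`, `month`, `all`	None (returns all data)
`since`	int	Custom start time (Unix seconds)	-
`until`	int	Custom end time (Unix seconds)	-

Note: if no time parameter is passed, all data is returned (equivalent to period=all).

Response example: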

{
  "period": "today",
  "requests": {"total": 123456, "success": 120000, "failed": 3456, "error_rate": 0.028},
  "tokens": {"total": 9876543, "input": 4000000, "output": 5876543},
  "latency": {"avg_ms": 234.5},
  "masters": {"total": 10, "active": 8},
  "keys": {"total": 150, "active": 120},
  "provider_keys": {"total": 20, "active": 15, "suspended": 3, "auto_disabled": 2},
  "top_models": [{"model": "gpt-4o", "requests": 50000, "tokens": 2000000}],
  "updated_at": 1704153600
}
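The figures in the response example above are internally consistent, which suggests how the derived fields relate to the raw counters. The check below is a sketch under one assumption not stated in the document: that error_rate is failed / total rounded to three decimals.

```python
# Figures copied from the response example above.
req_stats = {"total": 123456, "success": 120000, "failed": 3456, "error_rate": 0.028}
token_stats = {"total": 9876543, "input": 4000000, "output": 5876543}

# Assumption: error_rate = failed / total, rounded to three decimals.
computed_error_rate = round(req_stats["failed"] / req_stats["total"], 3)

print(computed_error_rate)
print(req_stats["success"] + req_stats["failed"] == req_stats["total"])
print(token_stats["input"] + token_stats["output"] == token_stats["total"])
```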

Get system-level real-time statistics

GET /admin/realtime

Response example:

{
  "qps": 125,
  "rpm": 7500,
  "rate_limited_count": 2,
  "by_master": [{"master_id": 1, "qps": 50, "rate_limited": false}],
  "updated_at": 1704153600
}

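Log statistics aggregated by time

GET /admin/logs/stats?group_by=hour|minute

New group_by options:

Value	Description	Time range limit
`hour`	Aggregates by hour; returns an `hour` field (ISO 8601)	None
`minute`	Aggregates by minute; returns a `minute` field (ISO 8601)	`since` and `until` are both required; maximum span 6 hours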
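The time-range limits in the table above can be checked on the client before a request is sent. The following is a minimal sketch; the function name `logs_stats_params` is hypothetical, and the server's actual error behavior for invalid ranges is not described in this document.

```python
MAX_MINUTE_SPAN_SECONDS = 6 * 60 * 60  # 6-hour limit for group_by=minute


def logs_stats_params(group_by, since=None, until=None):
    """Build query parameters for GET /admin/logs/stats, checking limits client-side."""
    if group_by not in ("hour", "minute"):
        raise ValueError("group_by must be 'hour' or 'minute'")
    if group_by == "minute":
        # Minute-level aggregation needs an explicit, bounded time range.
        if since is None or until is None:
            raise ValueError("group_by=minute requires both since and until")
        if until - since > MAX_MINUTE_SPAN_SECONDS:
            raise ValueError("group_by=minute allows a span of at most 6 hours")
    params = {"group_by": group_by}
    if since is not None:
        params["since"] = since
    if until is not None:
        params["until"] = until
    return params


print(logs_stats_params("minute", since=1704150000, until=1704153600))
```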

Example requests:

# Hourly statistics (24-hour trend)
curl "http://localhost:8080/admin/logs/stats?group_by=hour&since=1704067200&until=1704153600" \
  -H "Authorization: Bearer <admin_token>"

# Per-minute statistics (fine-grained time series, at most 6 hours)
curl "http://localhost:8080/admin/logs/stats?group_by=minute&since=1704150000&until=1704153600" \
  -H "Authorization: Bearer <admin_token>"

Response example (hour):

{
  "items": [
    {"hour": "2025-01-01T10:00:00Z", "count": 1234, "tokens_in": 50000, "tokens_out": 80000, "avg_latency_ms": 234.5}
  ]
}

Response example (minute):

{
  "items": [
    {"minute": "2025-01-01T10:30:00Z", "count": 45, "tokens_in": 2000, "tokens_out": 3500, "avg_latency_ms": 180.2}
  ]
}

6.2 Filtering API Keys by Status

GET /admin/api-keys?status=active|suspended|auto_disabled|manual_disabled

New status parameter: filters upstream credentials by status and can be combined with group_id.

# Get all active upstream credentials
curl "http://localhost:8080/admin/api-keys?status=active" \
  -H "Authorization: Bearer <admin_token>"

# Get the suspended credentials in a given group
curl "http://localhost:8080/admin/api-keys?group_id=1&status=suspended" \
  -H "Authorization: Bearer <admin_token>"

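6.3 Alert Management Endpoints

Alert types and severities

Type (`type`)	Description
`rate_limit`	Rate limit triggered
`error_spike`	Error rate spike
`quota_exceeded`	Quota exceeded
`key_disabled`	Key disabled
`key_expired`	Key expired
`provider_down`	Upstream service unavailable
`traffic_spike`	Traffic threshold exceeded

Severity (`severity`)	Description
`info`	Informational notice
`warning`	Warning
`critical`	Critical alert

Status (`status`)	Description
`active`	Active alert
`acknowledged`	Acknowledged
`resolved`	Resolved
`dismissed`	Dismissed

Alert endpoints

Endpoint	Method	Description
`/admin/alerts`	GET	List alerts; supports filtering by `status`, `severity`, and `type`
`/admin/alerts`	POST	Create an alert
`/admin/alerts/:id`	GET	Get the details of a single alert
`/admin/alerts/:id/ack`	POST	Acknowledge an alert
`/admin/alerts/:id/resolve`	POST	Resolve an alert
`/admin/alerts/:id`	DELETE	Dismiss an alert (soft delete)
`/admin/alerts/stats`	GET	Get alert statistics
`/admin/alerts/thresholds`	GET	Get the traffic threshold configuration
`/admin/alerts/thresholds`	PUT	Update the traffic threshold configuration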

Create an alert

curl -X POST http://localhost:8080/admin/alerts \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "rate_limit",
    "severity": "warning",
    "title": "Rate limit triggered",
    "message": "Master research-team triggered the rate limit 100 times in the past 5 minutes",
    "related_id": 5,
    "related_type": "master",
    "related_name": "research-team"
  }'

Get alert statistics

curl http://localhost:8080/admin/alerts/stats \
  -H "Authorization: Bearer <admin_token>"

Response example:

{
  "total": 100,
  "active": 5,
  "acknowledged": 10,
  "resolved": 85,
  "critical": 2,
  "warning": 3,
  "info": 0
}

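6.4 Time Range for APIKey Success-Rate Statistics

GET /admin/apikey-stats/summary?since=<unix>&until=<unix>

New time range parameters: the optional since and until parameters return upstream credential statistics for a specific period.

# Get today's statistics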
curl "http://localhost:8080/admin/apikey-stats/summary?since=1704067200&until=1704153600" \
  -H "Authorization: Bearer <admin_token>"

# Get all-time statistics (no time parameters)
curl "http://localhost:8080/admin/apikey-stats/summary" \
  -H "Authorization: Bearer <admin_token>"

Response example:

{
  "total_requests": 50000,
  "success_requests": 48000,
  "failure_requests": 2000,
  "success_rate": 0.96,
  "failure_rate": 0.04
}

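6.5 DP Alert Reporting (Internal Endpoint)

POST /internal/alerts/report

Internal endpoint: lets the Data Plane report abnormal events to the Control Plane; requires X-Internal-Token authentication.

Deduplication:

- Each alert may carry a fingerprint field (for example, rate_limit:master:5).
- If an active alert with the same fingerprint exists within the past 5 minutes, the new alert is deduplicated.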

curl -X POST http://localhost:8080/internal/alerts/report \
  -H "X-Internal-Token: <internal_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "alerts": [
      {
        "type": "rate_limit",
        "severity": "warning",
        "title": "Rate limit triggered",
        "message": "Master production triggered the rate limit 50 times within 1 minute",
        "related_id": 5,
        "related_type": "master",
        "related_name": "production",
        "fingerprint": "rate_limit:master:5"
      }
    ]
  }'

Response example:

{
  "accepted": 1,
  "deduplicated": 0,
  "errors": []
}

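6.6 Automatic Alert Detection (AlertDetector)

The CP runs a background task that checks for anomalies every minute and generates alerts automatically.

Detection rules:

Rule	Type	Severity	Description
Rate limit	`rate_limit`	warning	Detects Masters that are rate-limited in Redis
Error spike	`error_spike`	info/warning/critical	Error rate over the last 5 minutes >= 10% (>= 50% is critical)
Quota exceeded	`quota_exceeded`	warning/critical	Key quota usage >= 90% (reaching 100% is critical)
Upstream failure	`provider_down`	critical	API Key failure rate >= 50% and failure count >= 10
Global QPS	`traffic_spike`	warning/critical	System QPS >= threshold (>= 2x threshold is critical)
Master RPM	`traffic_spike`	warning/critical	Master requests per minute >= threshold (minimum sample met)
Master TPM	`traffic_spike`	warning/critical	Master tokens per minute >= threshold (minimum sample met)
Master RPD	`traffic_spike`	warning/critical	Master requests per day >= threshold
Master TPD	`traffic_spike`	warning/critical	Master tokens per day >= threshold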
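Two of the rules in the table above, quota_exceeded and provider_down, can be expressed as small predicates. This is a sketch of the stated thresholds only, not the detector's actual code; the function names are hypothetical, and treating a quota_limit of 0 like the unlimited value -1 is an assumption.

```python
from typing import Optional

QUOTA_WARNING_THRESHOLD = 0.9   # 90% usage raises a warning
PROVIDER_FAIL_RATE = 0.5        # failure rate of at least 50%
PROVIDER_FAIL_COUNT = 10        # at least 10 failures


def quota_severity(used: int, quota_limit: int) -> Optional[str]:
    """Severity for quota_exceeded, or None when no alert applies."""
    if quota_limit <= 0:
        # -1 means unlimited (section 2.2); handling 0 the same way is an assumption.
        return None
    usage = used / quota_limit
    if usage >= 1.0:
        return "critical"
    if usage >= QUOTA_WARNING_THRESHOLD:
        return "warning"
    return None


def provider_down(failures: int, total: int) -> bool:
    """True when both provider_down conditions from the table hold."""
    if total <= 0:
        return False
    return failures >= PROVIDER_FAIL_COUNT and failures / total >= PROVIDER_FAIL_RATE


print(quota_severity(95, 100), provider_down(12, 20))
```

Requiring both a failure rate and a minimum failure count keeps a credential with only a handful of requests from being flagged after one or two errors.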

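Deduplication:

- Deduplication is based on the fingerprint (type:related_type:related_id).
- An active alert with the same fingerprint is not created again within 5 minutes.
- traffic_spike alerts use the format type:metric:related_type:related_id.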
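The fingerprint formats and the 5-minute cooldown described above can be sketched as follows. The `Deduplicator` class is a hypothetical in-memory stand-in: the real detector looks up active alerts, whereas this sketch only tracks the last creation time per fingerprint.

```python
from typing import Optional

COOLDOWN_SECONDS = 5 * 60  # 5-minute deduplication window


def fingerprint(alert_type: str, related_type: str, related_id: int,
                metric: Optional[str] = None) -> str:
    """Build type:related_type:related_id, inserting the metric for traffic_spike."""
    parts = [alert_type]
    if metric is not None:
        parts.append(metric)
    parts += [related_type, str(related_id)]
    return ":".join(parts)


class Deduplicator:
    """In-memory approximation of the active-alert lookup (hypothetical)."""

    def __init__(self) -> None:
        self._last_created = {}  # fingerprint -> creation time in seconds

    def should_create(self, fp: str, now: float) -> bool:
        last = self._last_created.get(fp)
        if last is not None and now - last < COOLDOWN_SECONDS:
            return False  # same fingerprint seen within the cooldown
        self._last_created[fp] = now
        return True


dedup = Deduplicator()
fp = fingerprint("traffic_spike", "master", 5, metric="master_rpm")
print(fp, dedup.should_create(fp, now=0.0), dedup.should_create(fp, now=60.0))
```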

Configuration defaults:

Interval:              1 * time.Minute   // detection interval
ErrorSpikeThreshold:   0.1               // error-rate threshold (10%)
ErrorSpikeWindow:      5 * time.Minute   // error statistics window
QuotaWarningThreshold: 0.9               // quota warning threshold (90%)
ProviderFailThreshold: 10                // upstream failure-count threshold
DeduplicationCooldown: 5 * time.Minute   // deduplication cooldown

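6.7 Traffic Threshold Configuration

Administrators can configure traffic alert thresholds through the API without restarting the service.

Get the threshold configuration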

curl http://localhost:8080/admin/alerts/thresholds \
  -H "Authorization: Bearer <admin_token>"

Response example:

{
  "global_qps": 100,
  "master_rpm": 20,
  "master_rpd": 1000,
  "master_tpm": 10000000,
  "master_tpd": 100000000,
  "min_rpm_requests_1m": 10,
  "min_tpm_tokens_1m": 1000000,
  "updated_at": 1704153600
}

Update the threshold configuration

curl -X PUT http://localhost:8080/admin/alerts/thresholds \
  -H "Authorization: Bearer <admin_token>" \
  -H "Content-Type: application/json" \
  -d '{
    "global_qps": 200,
    "master_rpm": 50
  }'

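Threshold fields:

Field	Default	Description
`global_qps`	100	System-wide QPS threshold
`master_rpm`	20	Requests-per-minute threshold per Master
`master_rpd`	1000	Requests-per-day threshold per Master
`master_tpm`	10,000,000	Tokens-per-minute threshold per Master
`master_tpd`	100,000,000	Tokens-per-day threshold per Master
`min_rpm_requests_1m`	10	Minimum sample for RPM detection (RPM detection is skipped below this value)
`min_tpm_tokens_1m`	1,000,000	Minimum sample for TPM detection (TPM detection is skipped below this value)

Severity rules:

- warning: value >= threshold
- critical: value >= 2x threshold

traffic_spike alert metadata: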

{
  "metric": "master_rpm",
  "value": 150,
  "threshold": 20,
  "window": "1m"
}
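Putting the severity rules, the minimum-sample check, and the metadata shape together, a Master RPM check might look like the sketch below. The function names and the returned dictionary layout are illustrative assumptions; only the thresholds, the 2x rule, and the four metadata fields come from this document.

```python
from typing import Optional


def traffic_severity(value: float, threshold: float) -> Optional[str]:
    """warning at the threshold, critical at twice the threshold."""
    if threshold <= 0:
        return None  # assumption: a non-positive threshold disables the check
    if value >= 2 * threshold:
        return "critical"
    if value >= threshold:
        return "warning"
    return None


def check_master_rpm(requests_1m: int, master_rpm: int = 20,
                     min_rpm_requests_1m: int = 10) -> Optional[dict]:
    """Return alert data for a Master RPM spike, or None if no alert applies."""
    if requests_1m < min_rpm_requests_1m:
        return None  # below the minimum sample: RPM detection is skipped
    severity = traffic_severity(requests_1m, master_rpm)
    if severity is None:
        return None
    return {
        "severity": severity,
        "details": {
            "metric": "master_rpm",
            "value": requests_1m,
            "threshold": master_rpm,
            "window": "1m",
        },
    }


print(check_master_rpm(150))
```

With the default threshold of 20, the value 150 from the metadata example is well above 2x the threshold, so this sketch classifies it as critical.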

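7. Notes

Data consistency: after the Control Plane (CP) changes configuration, the Data Plane (DP) reaches eventual consistency through Redis Pub/Sub or periodic polling.
Security: keep EZ_ADMIN_TOKEN safe. Master Keys and child Key secrets are stored in the database only as hashes; a lost key cannot be recovered and can only be reset.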
