
4 Commits

Author SHA1 Message Date
zenfun
4cda273f7b feat(alerts): add MasterID to log records and improve traffic spike detection
- Add MasterID field with index to LogRecord model for efficient queries
- Fix threshold config loading to use fixed ID=1 with FirstOrCreate
- Allow traffic spike detection to work without Redis for log-based checks
- Add traffic_spike to API documentation for alert type filter
- Add comprehensive tests for RPM/RPD/TPM spike detection scenarios
2025-12-31 18:01:09 +08:00
zenfun
f714a314a9 test(alerts): add comprehensive tests for alert handler and detector
Add unit tests for alert-related functionality:

- alert_handler_test.go: tests for threshold CRUD operations,
  alert creation with traffic_spike type, filtering, and stats
- alert_detector_test.go: tests for threshold config loading,
  traffic spike severity calculation, deduplication logic,
  error rate severity, and nil-safety checks

Also fix format string issues:
- Use %d instead of %.2f for integer QPS in alert messages
- Wrap error description with format directive to avoid linter warning
2025-12-31 16:09:02 +08:00
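
The two format-string fixes in f714a314a9 can be illustrated with a minimal sketch. The helper names `qpsMessage` and `describeError` and the message wording are hypothetical, not taken from the repository; only the use of `%d` for an integer QPS value and an explicit `%s` directive for the error description follow the commit message.

```go
package main

import "fmt"

// qpsMessage is a hypothetical helper. QPS is an integer here, so it is
// formatted with %d rather than a float verb such as %.2f.
func qpsMessage(qps, threshold int64) string {
	return fmt.Sprintf("global QPS %d exceeds threshold %d", qps, threshold)
}

// describeError is a hypothetical helper. The description is passed as an
// argument to an explicit %s directive instead of being used as the format
// string, so a literal '%' in desc is preserved and linters do not warn
// about a non-constant format string.
func describeError(desc string) error {
	return fmt.Errorf("%s", desc)
}

func main() {
	fmt.Println(qpsMessage(120, 100))
	fmt.Println(describeError("upstream returned 100% failures"))
}
```

Passing the description as data rather than as the format string also avoids misinterpreting any `%` characters it happens to contain.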
zenfun
ba54abd424 feat(alerts): add traffic spike detection with configurable thresholds
Introduce traffic_spike alert type for monitoring system and per-master
traffic levels with configurable thresholds stored in database.

- Add AlertThresholdConfig model for persistent threshold configuration
- Implement GET/PUT /admin/alerts/thresholds endpoints for threshold management
- Add traffic spike detection in alert detector cron job:
  - Global QPS monitoring across all masters
  - Per-master RPM/TPM checks with minimum sample thresholds
  - Per-master RPD/TPD checks for daily limits
- Use warning severity at threshold, critical at 2x threshold
- Include metric metadata (value, threshold, window) in alert details
- Update API documentation with new endpoints and alert type
2025-12-31 15:56:17 +08:00
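
The severity rule described in ba54abd424 (warning at the threshold, critical at 2x the threshold) might look roughly like the sketch below. The function name `spikeSeverity`, the severity string values, the inclusive comparisons, and the handling of a non-positive threshold are all assumptions made for illustration; the commit message does not specify them.

```go
package main

import "fmt"

// Severity values are assumptions; the commit only names "warning" and "critical".
const (
	SeverityNone     = ""
	SeverityWarning  = "warning"
	SeverityCritical = "critical"
)

// spikeSeverity is a hypothetical helper that maps a measured metric
// (QPS, RPM, TPM, RPD, or TPD) to a severity for a configured threshold.
func spikeSeverity(value, threshold int64) string {
	if threshold <= 0 {
		// Assumption: a non-positive threshold means the check is disabled.
		return SeverityNone
	}
	switch {
	case value >= 2*threshold:
		return SeverityCritical
	case value >= threshold:
		return SeverityWarning
	default:
		return SeverityNone
	}
}

func main() {
	for _, v := range []int64{50, 100, 199, 200} {
		fmt.Printf("value=%d severity=%q\n", v, spikeSeverity(v, 100))
	}
}
```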
zenfun
85d91cdd2e feat(cron): add automatic alert detector for anomaly monitoring
Implement AlertDetector background task that runs every minute to detect
and create alerts for various anomalies:

- Rate limit detection: monitors masters hitting rate limits
- Error spike detection: flags keys with >= 10% error rate
- Quota exceeded: warns when key quota usage >= 90%
- Provider down: alerts when API keys have >= 50% failure rate

Includes fingerprint-based deduplication with 5-minute cooldown to
prevent duplicate alerts for the same issue.
2025-12-31 14:49:51 +08:00