Add IPBanManager to handle periodic background jobs including:
- Expiring outdated bans
- Syncing hit counts from Redis to DB
- Performing full Redis state synchronization
Additionally, update the service expiration logic to use system time
and add unit tests for CIDR normalization and overlap checking.
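
For illustration, CIDR normalization and overlap checking could be sketched with the standard `net/netip` package. The helper names `NormalizeCIDR` and `Overlaps` are assumptions for this example, not the names used in the service:

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// NormalizeCIDR (hypothetical helper) parses an IP or CIDR string and returns
// its canonical network form. A bare IP becomes a single-host prefix
// (/32 for IPv4, /128 for IPv6); host bits of a CIDR are zeroed.
func NormalizeCIDR(s string) (netip.Prefix, error) {
	s = strings.TrimSpace(s)
	if !strings.Contains(s, "/") {
		addr, err := netip.ParseAddr(s)
		if err != nil {
			return netip.Prefix{}, fmt.Errorf("invalid IP %q: %w", s, err)
		}
		return netip.PrefixFrom(addr, addr.BitLen()), nil
	}
	p, err := netip.ParsePrefix(s)
	if err != nil {
		return netip.Prefix{}, fmt.Errorf("invalid CIDR %q: %w", s, err)
	}
	return p.Masked(), nil
}

// Overlaps (hypothetical helper) reports whether two normalized prefixes
// share at least one address.
func Overlaps(a, b netip.Prefix) bool {
	return a.Overlaps(b)
}

func main() {
	a, _ := NormalizeCIDR("10.1.2.3/8")
	b, _ := NormalizeCIDR("10.2.0.1")
	fmt.Println(a, b, Overlaps(a, b))
}
```

Normalizing before comparing means two spellings of the same network are treated as one ban entry.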
Remove the DailyStatsJob, DailyStat model, and associated database
migrations. This eliminates the pre-aggregation layer and updates the
dashboard handler to remove dependencies on the daily_stats table.
- Create `DailyStat` model for immutable daily metrics including
  request counts, tokens, latency, and top models.
- Implement `DailyStatsJob` to aggregate `log_records` from the previous
  day, running daily at 00:05 UTC.
- Register database migrations and schedule the job in the server.
- Add `last7d` and `last30d` period support to stats handler.
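
A rough sketch of the time windows involved, assuming the job aggregates the half-open UTC day before its run time and that `last7d`/`last30d` mean "the 7 or 30 days ending now". The function names and the exact window semantics are assumptions, not taken from the handler:

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is an illustrative run time: 00:05 UTC on 2024-03-01.
var exampleNow = time.Date(2024, time.March, 1, 0, 5, 0, 0, time.UTC)

// PreviousDayUTC (hypothetical) returns the half-open interval [start, end)
// covering the UTC calendar day before now.
func PreviousDayUTC(now time.Time) (start, end time.Time) {
	y, m, d := now.UTC().Date()
	end = time.Date(y, m, d, 0, 0, 0, 0, time.UTC)
	return end.AddDate(0, 0, -1), end
}

// PeriodStart (hypothetical) maps a period name to the start of its window.
// The second return value is false for unknown periods.
func PeriodStart(period string, now time.Time) (time.Time, bool) {
	switch period {
	case "last7d":
		return now.AddDate(0, 0, -7), true
	case "last30d":
		return now.AddDate(0, 0, -30), true
	}
	return time.Time{}, false
}

func main() {
	start, end := PreviousDayUTC(exampleNow)
	fmt.Println(start, end)
	from, _ := PeriodStart("last7d", exampleNow)
	fmt.Println(from)
}
```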
Replace custom goroutine-based scheduling in cron jobs with the
centralized foundation scheduler. Each cron job now exposes a RunOnce
method that the scheduler calls, instead of managing its own ticker loop.
Changes:
- Remove interval/enabled config from cron job structs
- Convert Start() methods to RunOnce() for all cron jobs
- Add scheduler setup in main.go with configurable intervals
- Update foundation dependency to v0.6.0 for scheduler support
- Update tests to validate RunOnce nil-safety
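
As a sketch of the nil-safety contract, a job might guard its receiver and dependencies before doing any work. The `Job` interface, the `ExampleJob` type, and the `RunOnce() error` signature below are assumptions for illustration; the real foundation scheduler API may differ:

```go
package main

import (
	"errors"
	"fmt"
)

// Job is a hypothetical stand-in for whatever the scheduler expects.
type Job interface {
	RunOnce() error
}

// ExampleJob is an illustrative cron job with one injected dependency.
type ExampleJob struct {
	work func() error // the actual unit of work; may be nil if unconfigured
	Runs int          // number of times work was actually invoked
}

// RunOnce performs a single pass. It is safe to call on a nil receiver or
// on a job whose dependency was never set; both cases are no-ops.
func (j *ExampleJob) RunOnce() error {
	if j == nil || j.work == nil {
		return nil
	}
	j.Runs++
	if err := j.work(); err != nil {
		return fmt.Errorf("example job: %w", err)
	}
	return nil
}

func main() {
	var missing *ExampleJob
	failing := &ExampleJob{work: func() error { return errors.New("db unavailable") }}
	for _, j := range []Job{missing, failing} {
		fmt.Println(j.RunOnce())
	}
}
```

Because the job no longer owns a ticker, the scheduler alone decides when and how often `RunOnce` is invoked.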
- Add MasterID field with index to LogRecord model for efficient queries
- Fix threshold config loading to use fixed ID=1 with FirstOrCreate
- Allow traffic spike detection to work without Redis for log-based checks
- Add traffic_spike to API documentation for alert type filter
- Add comprehensive tests for RPM/RPD/TPM spike detection scenarios
Add unit tests for alert-related functionality:
- alert_handler_test.go: tests for threshold CRUD operations,
  alert creation with traffic_spike type, filtering, and stats
- alert_detector_test.go: tests for threshold config loading,
  traffic spike severity calculation, deduplication logic,
  error rate severity, and nil-safety checks
Also fix format string issues:
- Use %d instead of %.2f for integer QPS in alert messages
- Pass the error description through an explicit format directive instead
  of using it as the format string, to avoid a linter warning
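
The two format string fixes can be illustrated as follows. The message wording and function names are made up for this example; only the verbs matter:

```go
package main

import "fmt"

// QPSMessage formats an integer QPS value. Using %.2f with an int64 would
// not print a number; fmt emits a bad-verb placeholder instead, so %d is
// the correct verb here.
func QPSMessage(qps, threshold int64) string {
	return fmt.Sprintf("global QPS %d exceeds threshold %d", qps, threshold)
}

// DescribeError turns a free-form description into an error. Calling
// fmt.Errorf(desc) directly would interpret any '%' inside desc as a verb
// and is flagged by vet-style linters as a non-constant format string.
func DescribeError(desc string) error {
	return fmt.Errorf("%s", desc)
}

func main() {
	fmt.Println(QPSMessage(120, 100))
	fmt.Println(DescribeError("disk 100% full"))
}
```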
Introduce traffic_spike alert type for monitoring system and per-master
traffic levels with configurable thresholds stored in database.
- Add AlertThresholdConfig model for persistent threshold configuration
- Implement GET/PUT /admin/alerts/thresholds endpoints for threshold management
- Add traffic spike detection in alert detector cron job:
  - Global QPS monitoring across all masters
  - Per-master RPM/TPM checks with minimum sample thresholds
  - Per-master RPD/TPD checks for daily limits
- Use warning severity at threshold, critical at 2x threshold
- Include metric metadata (value, threshold, window) in alert details
- Update API documentation with new endpoints and alert type
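
A minimal sketch of the severity rule and alert details described above. The type and function names are hypothetical, and treating both boundaries as inclusive (warning at exactly the threshold, critical at exactly 2x) is an assumption:

```go
package main

import "fmt"

// SpikeAlert is a hypothetical container for the alert details.
type SpikeAlert struct {
	Metric    string // e.g. "rpm", "tpm", "qps"
	Value     int64
	Threshold int64
	Window    string // e.g. "1m", "24h"
	Severity  string
}

// Severity returns "" below the threshold, "warning" from the threshold up
// to (but excluding) twice the threshold, and "critical" from 2x upward.
// A non-positive threshold disables the check.
func Severity(value, threshold int64) string {
	switch {
	case threshold <= 0 || value < threshold:
		return ""
	case value >= 2*threshold:
		return "critical"
	default:
		return "warning"
	}
}

// Evaluate builds the alert if the metric crosses its threshold.
func Evaluate(metric string, value, threshold int64, window string) (SpikeAlert, bool) {
	sev := Severity(value, threshold)
	if sev == "" {
		return SpikeAlert{}, false
	}
	return SpikeAlert{Metric: metric, Value: value, Threshold: threshold, Window: window, Severity: sev}, true
}

func main() {
	if alert, ok := Evaluate("rpm", 250, 100, "1m"); ok {
		fmt.Printf("%+v\n", alert)
	}
}
```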
Implement AlertDetector background task that runs every minute to detect
and create alerts for various anomalies:
- Rate limit detection: monitors masters hitting rate limits
- Error spike detection: flags keys with >= 10% error rate
- Quota exceeded: warns when key quota usage >= 90%
- Provider down: alerts when API keys have >= 50% failure rate
Includes fingerprint-based deduplication with 5-minute cooldown to
prevent duplicate alerts for the same issue.
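
The deduplication idea can be sketched as below. This version keeps state in memory and hashes the alert type plus resource; both choices, along with all names, are assumptions for illustration, and the detector itself may derive fingerprints and store them differently:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

// Cooldown is the window during which a repeated fingerprint is suppressed.
const Cooldown = 5 * time.Minute

// exampleStart is an arbitrary fixed instant used by the demo.
var exampleStart = time.Date(2024, time.January, 1, 12, 0, 0, 0, time.UTC)

// Fingerprint (hypothetical) derives a short stable key for an alert.
func Fingerprint(alertType, resource string) string {
	sum := sha256.Sum256([]byte(alertType + "|" + resource))
	return hex.EncodeToString(sum[:8])
}

// Deduper remembers when each fingerprint last produced an alert.
type Deduper struct {
	mu       sync.Mutex
	lastSent map[string]time.Time
}

func NewDeduper() *Deduper {
	return &Deduper{lastSent: make(map[string]time.Time)}
}

// ShouldCreate reports whether an alert with this fingerprint may be created
// at time now, and records the creation time if so. Suppressed attempts do
// not extend the cooldown.
func (d *Deduper) ShouldCreate(fp string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if last, ok := d.lastSent[fp]; ok && now.Sub(last) < Cooldown {
		return false
	}
	d.lastSent[fp] = now
	return true
}

func main() {
	d := NewDeduper()
	fp := Fingerprint("rate_limit", "master:42")
	fmt.Println(d.ShouldCreate(fp, exampleStart))
	fmt.Println(d.ShouldCreate(fp, exampleStart.Add(time.Minute)))
}
```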
Implement automatic token refresh mechanism for CPA providers (Codex,
GeminiCLI, Antigravity, ClaudeCode) with the following features:
- Periodic refresh of expiring tokens based on configurable interval
- Redis event queue processing for on-demand token refresh
- Retry logic with exponential backoff for transient failures
- Automatic key deactivation on non-retryable errors
- Provider-specific OAuth token refresh implementations
- Sync service integration to update providers after refresh
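
The retry and deactivation behaviour might look roughly like this. The delay values, error sentinels, and function names are illustrative assumptions; the actual refresher's configuration and error classification are not shown in this message:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative backoff bounds (assumed values).
const (
	BaseDelay = time.Second
	MaxDelay  = 30 * time.Second
)

// Hypothetical error classes used to decide between retry and deactivation.
var (
	ErrTransient    = errors.New("transient failure")
	ErrNonRetryable = errors.New("non-retryable failure")
)

// BackoffDelay doubles BaseDelay once per prior attempt, capped at MaxDelay.
func BackoffDelay(attempt int) time.Duration {
	d := BaseDelay
	for i := 0; i < attempt && d < MaxDelay; i++ {
		d *= 2
	}
	if d > MaxDelay {
		return MaxDelay
	}
	return d
}

// RefreshWithRetry calls refresh up to maxAttempts times, sleeping with
// exponential backoff between attempts. It stops early on success or on a
// non-retryable error; in the latter case deactivate is true so the caller
// can disable the key.
func RefreshWithRetry(refresh func() error, maxAttempts int, sleep func(time.Duration)) (deactivate bool, err error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = refresh(); err == nil {
			return false, nil
		}
		if errors.Is(err, ErrNonRetryable) {
			return true, err
		}
		if attempt < maxAttempts-1 {
			sleep(BackoffDelay(attempt))
		}
	}
	return false, err
}

// noSleep lets demos and tests skip real waiting.
func noSleep(time.Duration) {}

func main() {
	calls := 0
	deactivate, err := RefreshWithRetry(func() error {
		calls++
		if calls < 3 {
			return ErrTransient
		}
		return nil
	}, 5, noSleep)
	fmt.Println(calls, deactivate, err)
}
```

Injecting the sleep function keeps the retry loop testable without waiting for real delays.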
Implement LogCleaner cron job to automatically clean up old log records
based on configurable retention period and maximum record count.
- Add LogCleaner with retention_days and max_records configuration
- Add EZ_LOG_RETENTION_DAYS and EZ_LOG_MAX_RECORDS environment variables
- Default to 30 days retention and 1,000,000 max records
- Include unit tests for log cleaner functionality
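
A sketch of how the two settings could be loaded and applied. The struct, the function names, and the choice to fall back to defaults on missing or invalid values are assumptions; only the environment variable names and default values come from this change:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// LogCleanerConfig is a hypothetical holder for the two limits.
type LogCleanerConfig struct {
	RetentionDays int
	MaxRecords    int64
}

// LoadConfig reads the limits through getenv, keeping the defaults
// (30 days, 1,000,000 records) when a value is unset, unparsable, or
// not positive.
func LoadConfig(getenv func(string) string) LogCleanerConfig {
	cfg := LogCleanerConfig{RetentionDays: 30, MaxRecords: 1_000_000}
	if v, err := strconv.Atoi(getenv("EZ_LOG_RETENTION_DAYS")); err == nil && v > 0 {
		cfg.RetentionDays = v
	}
	if v, err := strconv.ParseInt(getenv("EZ_LOG_MAX_RECORDS"), 10, 64); err == nil && v > 0 {
		cfg.MaxRecords = v
	}
	return cfg
}

// Cutoff returns the instant before which records are considered expired.
func (c LogCleanerConfig) Cutoff(now time.Time) time.Time {
	return now.AddDate(0, 0, -c.RetentionDays)
}

// Excess returns how many of the oldest records exceed the count limit.
func (c LogCleanerConfig) Excess(total int64) int64 {
	if total > c.MaxRecords {
		return total - c.MaxRecords
	}
	return 0
}

func main() {
	cfg := LoadConfig(os.Getenv)
	fmt.Println(cfg.RetentionDays, cfg.MaxRecords)
	fmt.Println(cfg.Cutoff(time.Now().UTC()), cfg.Excess(1_000_250))
}
```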