VLog exposes Prometheus metrics for comprehensive observability of the video platform.
| Service | Endpoint | Port | Description |
|---|---|---|---|
| Admin API | /metrics | 9001 | Application metrics (videos, uploads, transcoding) |
| Worker API | /api/metrics | 9002 | Worker and job queue metrics |
# Admin API metrics
curl -s http://localhost:9001/metrics | head -50
# Worker API metrics
curl -s http://localhost:9002/api/metrics | head -50

Add VLog targets to your prometheus.yml:

scrape_configs:
  - job_name: 'vlog-admin'
    static_configs:
      - targets: ['your-vlog-server:9001']
    metrics_path: /metrics
    scrape_interval: 15s
  - job_name: 'vlog-worker-api'
    static_configs:
      - targets: ['your-vlog-server:9002']
    metrics_path: /api/metrics
    scrape_interval: 15s

If using Prometheus Operator in Kubernetes:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vlog-monitor
  namespace: vlog
spec:
  selector:
    matchLabels:
      app: vlog
  endpoints:
    - port: admin
      path: /metrics
      interval: 15s
    - port: worker-api
      path: /api/metrics
      interval: 15s

| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_http_requests_total | Counter | method, endpoint, status_code | Total HTTP requests |
| vlog_http_request_duration_seconds | Histogram | method, endpoint | Request latency distribution |
Example queries:
# Request rate by endpoint
rate(vlog_http_requests_total[5m])
# 95th percentile latency
histogram_quantile(0.95, rate(vlog_http_request_duration_seconds_bucket[5m]))
# Error rate
sum(rate(vlog_http_requests_total{status_code=~"5.."}[5m])) / sum(rate(vlog_http_requests_total[5m]))
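
The 95th percentile query above is an estimate, not an exact value: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The following is a minimal Python sketch of that idea for classic cumulative buckets. The function name and the bucket values are hypothetical and are not taken from VLog; the real PromQL function works on per-second bucket rates, but the interpolation is the same.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, ending with (math.inf, total_count).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets (seconds): 50 requests <= 0.1s, 90 <= 0.5s, ...
example = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(histogram_quantile(0.95, example))
```

Because the result can never exceed the highest finite bucket bound, a threshold such as "p95 above 2 seconds" only works if the histogram has bucket boundaries at or above that value.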
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_videos_total | Gauge | status | Total videos by status (pending, processing, ready, failed) |
| vlog_video_uploads_total | Counter | result | Upload attempts (success, failed) |
| vlog_video_views_total | Counter | - | Total video views |
Example queries:
# Videos by status
vlog_videos_total
# Upload success rate
sum(rate(vlog_video_uploads_total{result="success"}[1h])) / sum(rate(vlog_video_uploads_total[1h]))
# Views per hour
rate(vlog_video_views_total[1h]) * 3600
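
The "views per hour" query multiplies by 3600 because `rate()` returns a per-second value. As a rough illustration of the arithmetic, here is a simplified Python sketch of a counter rate that tolerates counter resets. It is an approximation under stated assumptions: the real `rate()` also extrapolates to the window boundaries, and the function name and sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a restart); count from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Hypothetical view counter scraped at 0s, 1800s and 3600s.
views = [(0, 100), (1800, 160), (3600, 220)]
print(simple_rate(views) * 3600)  # views per hour over the window
```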
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_transcoding_jobs_total | Counter | status | Jobs by status (started, completed, failed, retried) |
| vlog_transcoding_jobs_active | Gauge | - | Currently active transcoding jobs |
| vlog_transcoding_job_duration_seconds | Histogram | quality | Job duration by quality level |
| vlog_transcoding_queue_size | Gauge | - | Jobs waiting in queue |
Example queries:
# Active jobs
vlog_transcoding_jobs_active
# Job failure rate
sum(rate(vlog_transcoding_jobs_total{status="failed"}[1h])) / sum(rate(vlog_transcoding_jobs_total{status="started"}[1h]))
# Median transcoding time by quality (the 0.5 quantile; divide the _sum rate by the _count rate for the mean)
histogram_quantile(0.5, rate(vlog_transcoding_job_duration_seconds_bucket[1h]))
# Queue depth
vlog_transcoding_queue_size
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_workers_total | Gauge | status | Workers by status (online, offline) |
| vlog_worker_heartbeat_total | Counter | worker_id, result | Heartbeat attempts per worker |
Example queries:
# Online workers
vlog_workers_total{status="online"}
# Workers with failed heartbeats
increase(vlog_worker_heartbeat_total{result="failed"}[5m])
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_reencode_queue_size | Gauge | status | Re-encode jobs by status |
| vlog_reencode_jobs_total | Counter | status | Total re-encode jobs processed |
Example queries:
# Pending re-encode jobs
vlog_reencode_queue_size{status="pending"}
# Re-encode completion rate
rate(vlog_reencode_jobs_total{status="completed"}[1h])
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_db_connections_active | Gauge | - | Active database connections |
| vlog_db_query_retries_total | Counter | - | Query retries due to transient errors |
| vlog_db_query_duration_seconds | Histogram | operation | Query latency by operation type |
Example queries:
# Connection pool usage
vlog_db_connections_active
# Slow queries (>100ms)
histogram_quantile(0.99, rate(vlog_db_query_duration_seconds_bucket[5m])) > 0.1
# Retry rate (indicates database issues)
rate(vlog_db_query_retries_total[5m])
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_redis_operations_total | Counter | operation, result | Redis operations by type |
| vlog_redis_circuit_breaker_state | Gauge | - | Circuit breaker (0=closed, 1=open) |
Example queries:
# Redis availability
vlog_redis_circuit_breaker_state == 0
# Redis error rate
sum(rate(vlog_redis_operations_total{result="failed"}[5m])) / sum(rate(vlog_redis_operations_total[5m]))
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_storage_operations_total | Counter | operation, result | Storage operations |
| vlog_storage_bytes_written_total | Counter | - | Total bytes written |
| Metric | Type | Labels | Description |
|---|---|---|---|
| vlog_playback_sessions_active | Gauge | - | Active playback sessions |
| vlog_video_views_total | Counter | - | Total video views |
Create vlog-alerts.yml:
groups:
  - name: vlog
    rules:
      # High error rate on API
      - alert: VLogHighErrorRate
        expr: |
          sum(rate(vlog_http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(vlog_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "VLog API error rate above 5%"
          description: "Error rate is {{ $value | humanizePercentage }}"
      # Transcoding queue backing up
      - alert: VLogTranscodingQueueBacklog
        expr: vlog_transcoding_queue_size > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Transcoding queue has {{ $value }} pending jobs"
      # No active workers
      - alert: VLogNoActiveWorkers
        expr: vlog_workers_total{status="online"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No VLog workers are online"
      # Worker heartbeat failures
      - alert: VLogWorkerHeartbeatFailures
        expr: increase(vlog_worker_heartbeat_total{result="failed"}[5m]) > 3
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker_id }} has heartbeat failures"
      # Database connection issues
      - alert: VLogDatabaseRetries
        expr: rate(vlog_db_query_retries_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database experiencing transient errors"
      # Redis circuit breaker open
      - alert: VLogRedisDown
        expr: vlog_redis_circuit_breaker_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis circuit breaker is open"
      # High transcoding failure rate
      - alert: VLogTranscodingFailures
        expr: |
          sum(rate(vlog_transcoding_jobs_total{status="failed"}[1h]))
          / sum(rate(vlog_transcoding_jobs_total{status="started"}[1h])) > 0.1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Transcoding failure rate above 10%"
      # Slow API responses
      - alert: VLogSlowResponses
        expr: |
          histogram_quantile(0.95, rate(vlog_http_request_duration_seconds_bucket[5m])) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile response time above 2 seconds"

A sample Grafana dashboard is provided below. Import via Grafana UI (Dashboards > Import > Paste JSON).
{
"title": "VLog Overview",
"uid": "vlog-overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(vlog_http_requests_total[5m])) by (endpoint)",
"legendFormat": "{{ endpoint }}"
}
]
},
{
"title": "Videos by Status",
"type": "piechart",
"targets": [
{
"expr": "vlog_videos_total",
"legendFormat": "{{ status }}"
}
]
},
{
"title": "Active Transcoding Jobs",
"type": "stat",
"targets": [
{
"expr": "vlog_transcoding_jobs_active"
}
]
},
{
"title": "Queue Depth",
"type": "stat",
"targets": [
{
"expr": "vlog_transcoding_queue_size"
}
]
},
{
"title": "Online Workers",
"type": "stat",
"targets": [
{
"expr": "vlog_workers_total{status=\"online\"}"
}
]
},
{
"title": "Transcoding Duration (p95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (le, quality) (rate(vlog_transcoding_job_duration_seconds_bucket[1h])))",
"legendFormat": "{{ quality }}"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(vlog_http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(vlog_http_requests_total[5m]))",
"legendFormat": "Error Rate"
}
]
},
{
"title": "Database Query Latency (p95)",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (le, operation) (rate(vlog_db_query_duration_seconds_bucket[5m])))",
"legendFormat": "{{ operation }}"
}
]
}
]
}

In addition to metrics, VLog exposes health check endpoints:
| Service | Endpoint | Description |
|---|---|---|
| Public API | /health | Returns 200 if service is up |
| Admin API | /health | Returns 200 with DB and storage status |
| Worker API | /api/health | Returns 200 if service is up |
| Worker Pod | /health (port 8080) | Kubernetes liveness probe |
| Worker Pod | /ready (port 8080) | Kubernetes readiness probe |
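
The Admin API health response carries `status`, `database`, and `storage` fields, so a monitoring script can report which component is unhealthy rather than only checking for HTTP 200. The sketch below is illustrative only: the function name is hypothetical, and it assumes that `healthy`, `connected`, and `accessible` are the only healthy values, which this document does not confirm.

```python
import json

# Assumed healthy values, taken from the documented example response.
EXPECTED = {"status": "healthy", "database": "connected", "storage": "accessible"}


def unhealthy_components(body):
    """Return the sorted names of fields that do not report a healthy value."""
    payload = json.loads(body)
    return sorted(key for key, value in EXPECTED.items() if payload.get(key) != value)


ok = '{"status": "healthy", "database": "connected", "storage": "accessible"}'
print(unhealthy_components(ok))  # empty list when every component is healthy
```

A missing field is treated as unhealthy, which errs on the side of raising an alert.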
# Full health check
curl -s http://localhost:9001/health | jq .
# Response:
{
"status": "healthy",
"database": "connected",
"storage": "accessible"
}

| Use Case | Recommended Interval |
|---|---|
| Real-time dashboards | 15s |
| General monitoring | 30s |
| Long-term trending | 60s |
- Short-term (high resolution): 15 days at 15s intervals
- Medium-term: 90 days at 1m downsampled
- Long-term: 1 year at 5m downsampled
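
To compare the storage cost of these tiers, it helps to count the samples kept per time series. The following Python sketch does only that arithmetic; the function and tier names are hypothetical. Note that Prometheus itself does not downsample, so the medium and long-term tiers assume recording rules or an external long-term store.

```python
def samples_per_series(retention_days, interval_seconds):
    """Number of samples one series accumulates over the retention period."""
    return retention_days * 86_400 // interval_seconds


# (retention in days, sample interval in seconds) for each tier listed above.
tiers = {"short-term": (15, 15), "medium-term": (90, 60), "long-term": (365, 300)}

for name, (days, interval) in tiers.items():
    print(f"{name}: {samples_per_series(days, interval)} samples per series")
```

Under these numbers, the three tiers hold a similar order of magnitude of samples per series (roughly 86,000 to 130,000), even though the time spans differ by a factor of about 24.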
Avoid high-cardinality labels that can cause metric explosion:
- Video IDs (use aggregated metrics instead)
- User IDs
- Request paths with variable segments
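
For request paths with variable segments, a common approach is to normalize the path before using it as a label value, so that every video ID maps to the same label. The sketch below shows one way to do this in Python. It is not VLog's implementation: the function name, the `:id` placeholder, and the example paths are all hypothetical.

```python
import re

# Matches a canonical UUID such as 550e8400-e29b-41d4-a716-446655440000.
_UUID = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def normalize_path(path):
    """Replace numeric and UUID path segments with a fixed placeholder."""
    path = path.split("?", 1)[0]  # drop the query string
    segments = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            segments.append(":id")
        else:
            segments.append(segment)
    return "/".join(segments)


print(normalize_path("/api/videos/123/thumbnail"))
```

With this normalization the `endpoint` label grows with the number of routes rather than the number of videos.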
- Start with the Golden Signals:
  - Latency: vlog_http_request_duration_seconds
  - Traffic: vlog_http_requests_total
  - Errors: vlog_http_requests_total{status_code=~"5.."}
  - Saturation: vlog_transcoding_queue_size, vlog_db_connections_active
- Include business metrics:
  - Videos uploaded per day
  - Transcoding throughput
  - Views per hour
- Add annotations for deployments:
  - Mark deployment times on graphs
  - Correlate performance changes with releases
- Verify endpoint is accessible:
  curl -v http://localhost:9001/metrics
- Check Prometheus targets:
  - Navigate to Prometheus UI > Status > Targets
  - Verify VLog targets are "UP"
- Check firewall rules:
  sudo firewall-cmd --list-ports # Should include 9001/tcp and 9002/tcp

If Prometheus warns about high cardinality:
- Check for high-cardinality labels
- Consider aggregating metrics
- Adjust retention policies

- Verify Worker API is running:
  systemctl status vlog-worker-api
- Check if workers are registered:
  vlog worker list
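
When the endpoint responds but specific metrics still seem to be absent, it can help to check the scraped text directly. The following Python sketch parses Prometheus text exposition output (for example, saved from the `curl` commands in this document) and reports which expected metric names are missing. The function names and the sample output are hypothetical. Histograms appear under their `_bucket`, `_sum`, and `_count` names, so those are the names to look for.

```python
def metric_names(exposition_text):
    """Collect the metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected metric names that are not present, sorted."""
    present = metric_names(exposition_text)
    return sorted(name for name in expected if name not in present)


# Hypothetical scrape output for illustration.
sample = """\
# HELP vlog_workers_total Workers by status
# TYPE vlog_workers_total gauge
vlog_workers_total{status="online"} 3
vlog_transcoding_queue_size 0
"""

print(missing_metrics(sample, ["vlog_workers_total", "vlog_worker_heartbeat_total"]))
```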