# Observability Toolkit
Operators need continuous feedback on proof-of-work drift, announce pressure, relay saturation, and TTL guarantees. This toolkit consolidates the telemetry surfaces shipped with EphemeralNet and explains how to wire them into Prometheus, Grafana, and ad-hoc diagnostics.
## Telemetry surfaces
- CLI snapshots: `eph status`, `eph defaults`, `eph list`, `eph diagnostics`, and `eph ttl-audit` provide on-demand JSON or key/value snapshots. Capture them in automation to track advertised endpoints, NAT diagnostics, TTL remaining, and PoW hints (see the capture sketch after this list).
- Structured logs: Running `eph serve` in the foreground prints JSON-like events (command, exit code, latency, PoW attempts). Ship them to your log pipeline to correlate with metrics.
- Prometheus metrics: Use `eph metrics` (or call `COMMAND:METRICS` over the control socket) to scrape gauges and counters:
  - `ephemeralnet_*_pow_difficulty_bits` for handshake/announce/store targets.
  - `ephemeralnet_handshake_pow_{success,failure}_total`, `ephemeralnet_announce_pow_{success,failure}_total`, and `ephemeralnet_command_store_pow_failures_total` for drift monitoring.
  - `ephemeralnet_command_{store,fetch}_requests_total` and `_duration_seconds` histograms for saturation.
  - Storage/TTL counters such as `ephemeralnet_chunk_expirations_total`.
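The CLI snapshots above are easiest to use in incident reviews when they are archived on a schedule. The sketch below is illustrative only: it assumes the subcommands print usable output on stdout with no extra flags, and the archive path and error handling are placeholders to adapt to your environment.

```bash
#!/usr/bin/env bash
# Illustrative snapshot capture; output formats, flags, and paths are assumptions.
set -euo pipefail

ts="$(date -u +%Y%m%dT%H%M%SZ)"
out_dir="/var/lib/ephemeralnet/snapshots/${ts}"   # placeholder archive location
mkdir -p "${out_dir}"

# Subcommands named in this document; archive each snapshot so later runbooks can
# diff advertised endpoints, NAT diagnostics, TTL remaining, and PoW hints.
for cmd in status defaults list diagnostics ttl-audit; do
  if ! eph "${cmd}" > "${out_dir}/${cmd}.out" 2> "${out_dir}/${cmd}.err"; then
    echo "eph ${cmd} exited non-zero; see ${out_dir}/${cmd}.err" >&2
  fi
done

# Grab the Prometheus text exposition alongside the snapshots for correlation.
eph metrics > "${out_dir}/metrics.prom" || true
```

Running this from cron or a systemd timer at roughly your Prometheus scrape interval keeps snapshots and metrics aligned in time.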
## Provided assets
This folder replaces the ad-hoc "Observability Toolkit" markdown with concrete assets:
| File | Purpose |
|---|---|
| `docs/observability/pow-alerts.yml` | Prometheus alert rules that watch PoW failure ratios (warn at 5–7%, critical at 10–12%) and difficulty drift. Validate with `promtool check rules` (an illustrative rule sketch follows this table). |
| `docs/observability/grafana-pow-dashboard.json` | Importable Grafana dashboard plotting PoW counters, difficulty gauges, and ratios. Add more panels for announce throttling or relay usage as needed. |
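For orientation, the block below sketches the kind of failure-ratio rule `pow-alerts.yml` covers; it is not a copy of the shipped file. The alert name, the 5m rate window, the 10m hold, and the 5% cut-off (the low end of the warn band above) are assumptions; the metric names come from the telemetry list earlier in this document.

```bash
# Write an illustrative rule file and validate it the same way as the shipped asset.
cat > /tmp/pow-alerts-sketch.yml <<'EOF'
groups:
  - name: ephemeralnet-pow-sketch
    rules:
      - alert: HandshakePowFailureRatioHigh        # hypothetical alert name
        expr: |
          rate(ephemeralnet_handshake_pow_failure_total[5m])
            / clamp_min(
                rate(ephemeralnet_handshake_pow_failure_total[5m])
                  + rate(ephemeralnet_handshake_pow_success_total[5m]),
                1e-9)
            > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Handshake PoW failure ratio above 5% for 10 minutes"
EOF

promtool check rules /tmp/pow-alerts-sketch.yml
```

A critical tier would repeat the same expression with a higher threshold (around 0.10, matching the 10–12% band) and `severity: critical`.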
## Quick start
- Copy `pow-alerts.yml` into your Prometheus `rule_files` path and reload Prometheus (HUP or HTTP reload); see the command sketch after this list.
- Import the Grafana dashboard, point it at the Prometheus datasource, and adjust the dashboard variables (`job`, `instance`) to match your scrape targets.
- Expose the daemon's metrics endpoint securely (SSH tunnel, reverse proxy with auth) if scraping from outside localhost.
- Combine metrics with CLI snapshots in runbooks; for example, when `ANNOUNCE_RATE_LIMITED` spikes, run `eph diagnostics` to inspect auto-advertise conflicts.
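A minimal command-line walk-through of the first and third steps, assuming Prometheus reads rules from `/etc/prometheus/rules/` and the daemon exposes metrics on port 9464; both are placeholders, and the HTTP reload route only works when Prometheus runs with `--web.enable-lifecycle`.

```bash
# Install the shipped rules where prometheus.yml's rule_files glob can see them (path assumed).
sudo cp docs/observability/pow-alerts.yml /etc/prometheus/rules/pow-alerts.yml
promtool check rules /etc/prometheus/rules/pow-alerts.yml

# Reload Prometheus without a restart: signal it, or use the lifecycle endpoint.
sudo pkill -HUP prometheus
# curl -X POST http://localhost:9090/-/reload   # requires --web.enable-lifecycle

# Scraping a remote node? Keep the metrics endpoint bound to localhost and tunnel it
# instead of exposing it publicly (9464 and the hostname are placeholders).
ssh -N -L 9464:127.0.0.1:9464 operator@node.example.net
```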
## Operational checklist
- Alert coverage: Ensure PoW failure alerts, announce saturation, relay fairness (upload choking counters), and chunk expiration backlog all have thresholds tied to your SLA.
- Dashboards: Pair the provided PoW board with storage (chunk count, TTL backlog), networking (active sessions, NAT diagnostics), and governance (abuse reports, token lockouts) panels.
- Incident capture: Store recent metrics/logs for at least 30 days so response teams can replay incidents per the governance guide (a retention sketch follows this checklist).
- Regression tests: Keep `tests/pow_monitoring.cpp`, `tests/ttl_audit.cpp`, and `tests/advertise_discovery.cpp` in your CI signal; their failures often precede observability regressions.
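One way to satisfy the 30-day incident-capture item when you run Prometheus directly; the flags are standard Prometheus options, but the paths are placeholders, and a systemd unit or container deployment would carry the same settings in its own config.

```bash
# Keep at least 30 days of samples so incidents can be replayed per the governance guide.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.enable-lifecycle
```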
Observability is most useful when it mirrors the architecture: control-plane snapshots explain intent, data-plane metrics quantify work, and governance logs capture policy changes. Leverage all three to keep the swarm healthy.