Introduction to the Sveltos Grafana Dashboard

The Sveltos Dashboard helps users monitor key operational metrics and the status of their sveltosclusters in real time. Grafana visualizes this data so that users can make faster, better-informed operational decisions.


Getting Started

With the latest Sveltos release, users can take full advantage of the Sveltos Grafana dashboard. Before using it, ensure Grafana and Prometheus are deployed on the Sveltos management cluster.

To allow Prometheus to collect metrics from the Sveltos management cluster, run the following command if Sveltos was installed using the Helm chart.

Helm Chart

$ helm upgrade <your release name> projectsveltos/projectsveltos -n projectsveltos --set prometheus.enabled=true

Once Grafana and Prometheus are available, add the Prometheus data source to Grafana and then import the Grafana dashboard below.

https://raw.githubusercontent.com/projectsveltos/sveltos/main/docs/assets/sveltosgrafanadashboard.json

Note

Depending on the Grafana/Prometheus installation, identify the serviceMonitorSelector label of the Prometheus instance and add it as a label to the Sveltos servicemonitor resources. The example below lists the servicemonitors and patches one of them.

$ kubectl get servicemonitor -n projectsveltos
$ kubectl patch servicemonitor addon-controller -n projectsveltos -p '{"metadata":{"labels":{"prometheus":"example-label"}}}' --type=merge
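The same label usually needs to go on every Sveltos servicemonitor, not just addon-controller. A minimal sketch of how that could be scripted is shown below. The helper names build_label_patch and label_servicemonitors are hypothetical, the label key and value are placeholders, and the commented kubectl get prometheus command assumes Prometheus is managed by the Prometheus Operator.

```shell
# Hypothetical helper: build the merge-patch body for one label key/value pair.
build_label_patch() {
  printf '{"metadata":{"labels":{"%s":"%s"}}}' "$1" "$2"
}

# Hypothetical helper: apply that label to every ServiceMonitor in a namespace.
# Requires kubectl access to the management cluster.
label_servicemonitors() {
  ns="$1"; key="$2"; value="$3"
  patch="$(build_label_patch "$key" "$value")"
  kubectl get servicemonitor -n "$ns" -o name |
    while read -r sm; do
      kubectl patch "$sm" -n "$ns" --type=merge -p "$patch"
    done
}

# Example (assumes a Prometheus Operator install; key/value are placeholders):
#   kubectl get prometheus -A -o jsonpath='{.items[*].spec.serviceMonitorSelector}'
#   label_servicemonitors projectsveltos prometheus example-label
```

Only function definitions are executed when this snippet is sourced; nothing is patched until label_servicemonitors is called explicitly.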

Confirm that all metrics are linked to their corresponding panels. The dashboard should automatically detect data connections from Prometheus.

Refresh the dashboard to begin plotting tracked metrics. To get more out of it, customize the dashboard by updating thresholds, adding, removing, or editing panels, and transforming the tracked metrics.

Note

Some metrics only appear on Grafana when their value is non-zero, e.g. projectsveltos_reconcile_operations_total and projectsveltos_total_drifts. As long as Prometheus and Grafana have been configured correctly, an empty panel for these metrics is expected until the first reconcile operation or drift is recorded.
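To tell "no data yet" apart from a scrape problem, one option is to query Prometheus directly through its HTTP API and check whether any series is returned. The sketch below is illustrative only. The helper names prom_query and has_series are hypothetical, and the Prometheus URL is an assumption (for example, a local port-forward to the Prometheus service).

```shell
# Hypothetical helper: run an instant query against the Prometheus HTTP API.
# $1 = Prometheus base URL (assumed reachable, e.g. through a port-forward)
# $2 = PromQL expression
prom_query() {
  curl -sG "$1/api/v1/query" --data-urlencode "query=$2"
}

# Hypothetical helper: succeed when a successful query response contains at
# least one series. An empty instant-query result looks like:
#   {"status":"success","data":{"resultType":"vector","result":[]}}
has_series() {
  case "$1" in
    *'"result":[]'*) return 1 ;;
    *'"status":"success"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (the URL is a placeholder):
#   resp="$(prom_query http://localhost:9090 projectsveltos_total_drifts)"
#   if has_series "$resp"; then echo "series present"; else echo "no series yet"; fi
```

If a metric that is always exported, such as projectsveltos_cluster_connectivity_status, also returns no series, the scrape configuration is the more likely culprit.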

Detailed descriptions of the panels available on the dashboard, and the tracked metrics, are listed below.

Available Metrics

Sveltos lets users track and visualize a number of key operational metrics, which include:

projectsveltos_cluster_connectivity_status: Gauge indicating the connectivity status of each cluster, where 0 means healthy and 1 means disconnected.
projectsveltos_kubernetes_version_info: Gauge providing the Kubernetes version (major.minor.patch) of each cluster.
projectsveltos_program_charts_time_seconds_count: Counter of the total number of Helm charts deployed.
projectsveltos_program_charts_time_seconds_bucket: Histogram of the durations taken to deploy Helm charts on workload clusters.
projectsveltos_program_resources_time_seconds_count: Counter of the total number of resources deployed.
projectsveltos_program_resources_time_seconds_bucket: Histogram of the durations taken to deploy resources on workload clusters.
projectsveltos_reconcile_operations_total: Counter of the total number of reconcile operations performed for Helm charts, Resources, and Kustomizations across clusters.
projectsveltos_total_drifts: Counter of the total number of configuration drifts detected in clusters, categorized by cluster and feature.
Per-Cluster program_resources_time_seconds Histograms: Histograms (per cluster) of durations taken to deploy resources, indexed by cluster information.
Per-Cluster program_charts_time_seconds Histograms: Histograms (per cluster) of durations taken to deploy Helm charts, indexed by cluster information.

Dashboard Panels

1. Cluster Connectivity Status

Type: Gauge
Purpose: Displays the connectivity status of each Kubernetes cluster managed by Sveltos.
Query Used: projectsveltos_cluster_connectivity_status
Interpretation: A "Healthy" cluster is connected (projectsveltos_cluster_connectivity_status: 0) and shown in green. A "Disconnected" cluster (projectsveltos_cluster_connectivity_status: 1) is shown in red, helping users quickly identify and address connectivity issues.

2. Cluster Kubernetes Version

Type: Table
Purpose: Lists the Kubernetes version deployed in each sveltoscluster.
Query Used: projectsveltos_kubernetes_version_info
Interpretation: The table displays clusters with their respective Kubernetes versions, helping users identify clusters in need of updates and check version compatibility across clusters.

3. Total Helm Charts Deployments

Type: Stat
Purpose: Counts the number of Helm chart deployments.
Query Used: projectsveltos_program_charts_time_seconds_count
Interpretation: Displays the number of Helm charts deployed across all sveltosclusters. This helps users assess the workload managed by Sveltos, track deployment activity, correlate any change in application performance with deployments, and optimize deployment strategies accordingly.

4. Total Resources Deployments

Type: Stat
Purpose: Counts the number of resource deployments.
Query Used: projectsveltos_program_resources_time_seconds_count
Interpretation: Displays the total count of resources deployed across all sveltosclusters. This helps users assess the workload managed by Sveltos, track deployment activity, correlate any change in application performance with deployments, and optimize deployment strategies accordingly.

5. Time to Deploy Helm Charts in a Profile

Type: Bar Chart
Purpose: Depicts the time required for deploying Helm charts, by visualizing the 50th and 90th percentile of deployment times.
Queries Used:
histogram_quantile(0.90, projectsveltos_program_charts_time_seconds_bucket)
histogram_quantile(0.50, projectsveltos_program_charts_time_seconds_bucket)
Interpretation: Provides deeper insights into Helm chart deployment times. By plotting both the 50th and the 90th percentile, this chart helps users gauge performance consistency and distribution, and update their deployment strategies accordingly.

6. Time to Deploy Resources in a Profile

Type: Bar Chart
Purpose: Depicts the time required for deploying resources, by visualizing the 50th and 90th percentile of deployment times.
Queries Used:
histogram_quantile(0.90, projectsveltos_program_resources_time_seconds_bucket)
histogram_quantile(0.50, projectsveltos_program_resources_time_seconds_bucket)
Interpretation: Provides deeper insights into the resource deployment times. By plotting both the 50th and the 90th percentile, this chart intends to help users gauge performance consistency and distribution, and update their deployment strategies accordingly.
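As a rough illustration of what the percentile queries in panels 5 and 6 compute, the sketch below reproduces the linear interpolation that histogram_quantile() applies inside a bucket of a classic Prometheus histogram. The function name estimate_quantile and the sample bucket values are hypothetical; this is a simplified sketch for building intuition, not the exact Prometheus implementation.

```shell
# Estimate a quantile from cumulative histogram buckets using linear
# interpolation within the bucket that contains the target rank.
# Usage: estimate_quantile <q> <le:count> [<le:count> ...]   (0 < q <= 1)
# Buckets must be given in ascending "le" order with cumulative counts,
# and the last one is expected to be "+Inf:<total>".
estimate_quantile() {
  q="$1"; shift
  printf '%s\n' "$@" | awk -F: -v q="$q" '
    { le[NR] = $1; cum[NR] = $2 }
    END {
      # An empty histogram has no meaningful quantile.
      if (cum[NR] == 0) { print "NaN"; exit }
      rank = q * cum[NR]
      # Find the first bucket whose cumulative count reaches the rank.
      for (i = 1; i <= NR; i++) if (cum[i] >= rank) break
      # If the rank falls in the +Inf bucket, report the highest finite bound.
      if (le[i] == "+Inf") { print le[i-1] + 0; exit }
      # The first bucket is assumed to start at 0.
      lower = (i > 1) ? le[i-1] : 0
      below = (i > 1) ? cum[i-1] : 0
      printf "%g\n", lower + (le[i] - lower) * (rank - below) / (cum[i] - below)
    }'
}

# Example with made-up buckets (2 deployments <= 1s, 8 <= 5s, 10 <= 10s):
#   estimate_quantile 0.5 1:2 5:8 10:10 +Inf:10   # prints 3
#   estimate_quantile 0.9 1:2 5:8 10:10 +Inf:10   # prints 7.5
```

Because the estimate is interpolated, its accuracy depends on how finely the bucket boundaries cover the range of observed deployment times.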

7. Time to Deploy Helm Charts in a Profile - Histogram

Type: Bar Gauge
Purpose: Provides a histogram view of deployment times for Helm charts.
Query Used: projectsveltos_program_charts_time_seconds_bucket
Interpretation: Captures the distribution of deployment times for Helm charts, and allows users to track and address long-tail latencies.

8. Time to Deploy Resources in a Profile - Histogram

Type: Bar Gauge
Purpose: Provides a histogram view of deployment times for resources.
Query Used: projectsveltos_program_resources_time_seconds_bucket
Interpretation: Captures the distribution of deployment times for resources, and allows users to track and address long-tail latencies.

9. Deploy Helm Charts in a Profile - Latency Heatmap

Type: Heatmap
Purpose: Provides a heatmap of Helm chart deployment latencies.
Query Used: sum(rate(projectsveltos_program_charts_time_seconds_bucket[5m]))
Interpretation: Highlights the frequency and duration of Helm chart deployment latencies to help users identify patterns and optimize deployment management.

10. Deploy Resources in a Profile - Latency Heatmap

Type: Heatmap
Purpose: Provides a heatmap of resource deployment latencies.
Query Used: sum(rate(projectsveltos_program_resources_time_seconds_bucket[5m]))
Interpretation: Highlights the frequency and duration of resource deployment latencies to help users identify patterns and optimize deployment management.

11. Reconciliation Operations

Type: Time Series
Purpose: Shows the number of reconciliation operations performed, categorized by cluster (type, namespace, name) and feature.
Query Used: projectsveltos_reconcile_operations_total
Interpretation: Helps users monitor reconciliation processes triggered by Sveltos across clusters, to ensure operational stability.

12. Drifts

Type: Time Series
Purpose: Tracks and displays drifts, categorized by cluster (type, namespace, name) and feature.
Query Used: projectsveltos_total_drifts
Interpretation: Allows users to monitor configuration drifts so they can detect and rectify discrepancies in workload clusters, which is crucial for maintaining consistency and compliance across sveltosclusters.