Metrics for NetApp SolidFire backup-to-S3 in InfluxDB and Grafana

24 Apr 2024 -

11 minute read

Introduction

In a previous post I wrote about Grafana 11 (Preview), with which the good old SolidFire Collector and Graphite work like a charm.

On that occasion I also tested the InfluxDB v1 Data Source in Grafana 11 - that also worked great.

Once I had SolidFire Collector, InfluxDB, and Grafana all running, I thought I should also look at the next steps.

Some two years ago I mentioned that moving SolidFire Collector to InfluxDB would be a good idea, because InfluxDB is more widely used (and I use it in E-Series Performance Analyzer as well).

While I still don’t have enough time to do that for SolidFire Collector, there’s an easier target - my PowerShell script for parallel SolidFire backup to S3. Why?

  • SolidFire Collector doesn’t do any related monitoring
  • Backups are long-running jobs, and none of the scripts I wrote had any logging or monitoring integration (I've blogged about third-party tools such as Velero and Kopia, which do have metrics, but here I'm talking about my own scripts)

That’s an opportunity I wouldn’t miss!

Initiating and monitoring SolidFire backup-to-S3 jobs

I’ve blogged about SolidFire’s backup-to-S3 in many posts (example), so I’ll skip that and focus on the task at hand.

First, we need to create a backup job. Then we want to monitor it and finally we want to know whether it was or wasn’t successful. Simple!

But we need to stay within the maximum number of job slots per node, so it can’t be one giant “for each” loop - which is why I wrote those scripts.
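
As an illustration only, a rough throttling sketch could look like the one below. The S3 parameters are omitted here and shown in full in the next section; $maxSlots and the volume list are hypothetical, and I'm assuming Get-SFBulkVolumeJob with no parameters lists all currently active bulk volume jobs.

# Throttling sketch (not the actual script's logic)
$volumeIds = @(139, 140, 141)   # hypothetical list of volumes to back up
$maxSlots  = 2                  # adjust to whatever job slot limit applies to your nodes

foreach ($volId in $volumeIds) {
    # Wait for a free job slot before submitting the next backup
    while (@(Get-SFBulkVolumeJob).Count -ge $maxSlots) {
        Start-Sleep -Seconds 60
    }
    Invoke-SFApi -Method StartBulkVolumeRead -Params @{
        "volumeID" = $volId;
        "format"   = "native";
        "script"   = "bv_internal.py";
        "scriptParameters" = @{ "write" = @{ } };  # S3 parameters as shown in the next section
    }
}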

Initiation

We’re after StartBulkVolumeRead.

Let’s back up volume ID 139.

Invoke-SFApi -Method StartBulkVolumeRead -Params @{
  "volumeID" = "139";
  "format" = "native";
  "script" = "bv_internal.py";
  "scriptParameters" = @{
    "write" = @{
      "awsAccessKeyID" = "";
      "awsSecretAccessKey" = "";
      "bucket" = "solidfire-backup";
      "prefix" = "PROD-wcwb/pvc-d793176f-2484-48ea-9255-f70215a7c5f7";
      "endpoint" = "s3";
      "hostname" = "s3.com.org.ai";
    };
  };
}

We get back an async job ID handle (176) and a key that appears in the output of other API calls that we’ll see later.

Name                           Value
----                           -----
key                            88d5819d96dfda7190b16a1426fcded0
url                            https://192.168.105.29:8443/
asyncHandle                    176

Status

Once a job has been submitted, its job handle can be queried for progress and outcome.

We use “-KeepResult:$True” in the first query so that the API retains the result longer (so it doesn’t disappear before the job is over).

PS> Get-SFASyncResult -ASyncResultID 176 -KeepResult:$True | ConvertTo-Json
{
  "status": "running",
  "details": {
    "volumeID": 139,
    "message": "",
    "bvID": 101
  },
  "resultType": "BulkVolume",
  "lastUpdateTime": "2024-04-24T06:44:32Z",
  "createTime": "2024-04-24T06:44:32Z"
}

We can run this in a loop every few minutes until the job is complete.
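
A minimal polling sketch (assuming the session is already connected to the cluster; the handle and interval are just examples):

# Poll the async handle every couple of minutes until the job is no longer running
$handle = 176
do {
    Start-Sleep -Seconds 120
    $res = Get-SFASyncResult -ASyncResultID $handle -KeepResult:$True
    Write-Host ("{0} status: {1}" -f (Get-Date -Format s), $res.status)
} while ($res.status -eq "running")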

During that time we’d see something like this in the Running Tasks tab of the SolidFire UI:

solidfire-backup-job-monitoring-influxdb-01-submit.png

All images here can be opened in a new tab, by the way.

Progress

As far as job progress monitoring is concerned, Get-SFBulkVolumeJob is the cmdlet to use. Notice “bvID”: 101 in the output above, by the way - we’ll see that value here, too.

PS> Get-SFBulkVolumeJob -VolumeId 139 | ConvertTo-Json
{
  "BulkVolumeID": 101,
  "CreateTime": "2024-04-24T06:44:32Z",
  "ElapsedTime": 40,
  "Format": "native",
  "Key": "88d5819d96dfda7190b16a1426fcded0",
  "PercentComplete": 0,
  "RemainingTime": 3960,
  "SrcVolumeID": 139,
  "Status": "running",
  "Script": "bv_internal.py",
  "SnapshotID": null,
  "Type": "read",
  "Attributes": {}
}

PS> Get-SFBulkVolumeJob -VolumeId 139 | ConvertTo-Json
{
  "BulkVolumeID": 101,
  "CreateTime": "2024-04-24T06:44:32Z",
  "ElapsedTime": 152,
  "Format": "native",
  "Key": "88d5819d96dfda7190b16a1426fcded0",
  "PercentComplete": 48,
  "RemainingTime": 164,
  "SrcVolumeID": 139,
  "Status": "running",
  "Script": "bv_internal.py",
  "SnapshotID": null,
  "Type": "read",
  "Attributes": {
 "nextLba": 128000,
 "firstPendingLba": 118784,
 "startLba": 0,
 "blocksPerTransfer": 1024,
 "pendingLbas": "[122880, 123904, 124928, 118784, 119808, 125952, 120832, 126976, 121856]",
 "nLbas": 244140,
 "percentComplete": 48
  }
}

BulkVolumeID 101 maps to bvID 101 from the async job handle. Since only two concurrent bulk volume jobs are supported per volume in the first place, it’s unlikely you’d get confused - in practice you wouldn’t have more than one backup job on the same volume at the same time.

Either way, cross-referencing by using bvID is possible, so we know which is which.

A couple of important points here:

  • BulkVolumeID and Key reference the async job we submitted
  • RemainingTime can go up as well as down (beware if doing your own math on current and previous values)
  • percentComplete never reaches 100%. For example, you may get 95% as the last reading before the job exits; it’s the empty result (“$res -eq $null”) that tells you the job is done (see the sketch after this list). Don’t expect to see 100% here!
  • SnapshotID will be non-zero if we specify one. In the script I mentioned there’s a switch that can automatically use the latest snapshot for the volume, if available.
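
Here’s a minimal sketch of such a progress-polling loop for volume 139 (just printing to the console; in practice you’d send the value to InfluxDB as shown later):

# Poll bulk volume job progress for volume 139; an empty result means the job has finished
do {
    $res = Get-SFBulkVolumeJob -VolumeId 139
    if ($res -eq $null) {
        Write-Host "Job no longer listed - query Get-SFASyncResult for the outcome"
    } else {
        Write-Host ("percentComplete: {0}" -f $res.PercentComplete)
        Start-Sleep -Seconds 60
    }
} while ($res -ne $null)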

One reminder: we don’t necessarily have to obsess over job progress: we can simply reference that bulk job ID (bv101) and check its entries in the SolidFire Event Log later to see whether it succeeded.

solidfire-backup-job-monitoring-influxdb-02-completed.png

The reason we can obsess over it is that it’s inexpensive to get and send those metrics to InfluxDB. But it’s not mandatory for knowing if a job succeeded.

I suppose users with smaller volumes wouldn’t care about job progress, but users with large (10TB, for example) volumes would.

Completion and result

Eventually jobs complete and their outcome is either success or error.

Get-SFBulkVolumeJob tells us nothing about the job outcome. Once a job is finished, you get nothing.

PS> Get-SFBulkVolumeJob -VolumeId 139 | ConvertTo-Json
PS>

Get-SFASyncResult is how we learn of the outcome. Example of a successful job (notice it references async job handle, not volume ID):

PS> Get-SFASyncResult -ASyncResultID 176 -KeepResult:$True | ConvertTo-Json
{
  "status": "complete",
  "resultType": "BulkVolume",
  "lastUpdateTime": "2024-04-24T06:49:14Z",
  "result": {
    "volumeID": 139,
    "message": "Bulk volume job succeeded",
    "bvID": 101
  },
  "createTime": "2024-04-24T06:44:32Z"
}

Example of a failed job:

PS> Get-SFASyncResult -ASyncResultID 173 -KeepResult:$True | ConvertTo-Json
{
  "status": "complete",
  "resultType": "BulkVolume",
  "lastUpdateTime": "2024-04-24T06:22:43Z",
  "error": {
    "name": "xBulkVolumeScriptFailure",
    "message": "Bulk volume job failed",
    "volumeID": 139,
    "bvID": 98
  },
  "createTime": "2024-04-24T06:20:57Z"
}
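
A minimal sketch of how a script might turn that async result into a pass/fail value (assuming the returned object exposes the same properties shown in the JSON above; the “ok”/“ng” values mirror the ones used further below):

# Evaluate the outcome of async handle 176 once its status is "complete"
$res = Get-SFASyncResult -ASyncResultID 176 -KeepResult:$True
if ($res.status -eq "complete") {
    if ($res.error) {
        Write-Host ("Backup of volume {0} (bvID {1}) failed: {2}" -f $res.error.volumeID, $res.error.bvID, $res.error.name)
        $result = "ng"
    } else {
        Write-Host ("Backup of volume {0} (bvID {1}) succeeded" -f $res.result.volumeID, $res.result.bvID)
        $result = "ok"
    }
}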

My backup-to-S3 script v2 in Awesome SolidFire already checks these things. We just need to send that output to InfluxDB v1 and then visualize it with Grafana 11.

Sending metrics to InfluxDB v1

While you may send to InfluxDB whatever you want, I’d start with just two metrics: backup job and backup progress.

I haven’t thought about it a lot, but I’d start with something simple.

Backup job:

  • Tags: cluster name, volume name, volume ID, bulk volume job ID (maybe more, e.g. K8s PVC, namespace)
  • One data field: status (one of: submitted, running, complete)

Backup progress:

  • Tags: similar tags as for backup job
  • Two data fields: percentDone (int), status

We can simulate that by manually inserting data into InfluxDB.

For backup jobs:

PS> Write-Influx -Measure backup -Tags @{cluster="PROD";volName="pvc-d793176f-2484-48ea-9255-f70215a7c5f7";volId=139;bvId=101} -Metrics @{status="submitted"} -Database example -Server http://192.168.50.184:32290 -Verbose
PS> Write-Influx -Measure backup -Tags @{cluster="PROD";volName="pvc-d793176f-2484-48ea-9255-f70215a7c5f7";volId=139;bvId=101} -Metrics @{status="running"} -Database example -Server http://192.168.50.184:32290 -Verbose
PS> Write-Influx -Measure backup -Tags @{cluster="PROD";volName="pvc-d793176f-2484-48ea-9255-f70215a7c5f7";volId=139;bvId=101} -Metrics @{status="running"} -Database example -Server http://192.168.50.184:32290 -Verbose
PS> Write-Influx -Measure backup -Tags @{cluster="PROD";volName="pvc-d793176f-2484-48ea-9255-f70215a7c5f7";volId=139;bvId=101;result="ok"} -Metrics @{status="complete"} -Database example -Server http://192.168.50.184:32290 -Verbose

How you want to visualize or show that in Grafana is up to you. If we back up a volume every 24 hours, the query can be something as simple as a table that shows the last 24 hours of records (submitted, running, complete) for all volumes we care about.

solidfire-backup-job-monitoring-influxdb-03-influx-backup-metric.png

Grafana 11 can conditionally format table rows, so we can show a bunch of these on a page and emphasize only those that failed (i.e. result="ng") or create Grafana alerts for that.
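
If you want to check the same data outside Grafana, the InfluxDB v1 HTTP query API works too; here is a rough sketch using the database and server from the examples above:

# Pull the last 24 hours of "backup" records straight from InfluxDB v1
$server = "http://192.168.50.184:32290"
$query  = 'SELECT * FROM "backup" WHERE time > now() - 24h'
$resp   = Invoke-RestMethod -Uri "$server/query?db=example&q=$([uri]::EscapeDataString($query))"
$resp.results.series | Format-Table -AutoSize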

For backup progress:

PS> Write-Influx -Measure backupprogress -Tags @{cluster="PROD";srcVolId=139;snapId=0;bvId=101} -Metrics @{pctDone=0;status="running"} -Database example -Server http://192.168.50.184:32290 -Verbose
PS> Write-Influx -Measure backupprogress -Tags @{cluster="PROD";srcVolId=139;snapId=0;bvId=101} -Metrics @{pctDone=50;status="running"} -Database example -Server http://192.168.50.184:32290 -Verbose

Regarding percentComplete (which is unlikely to ever be 100%) - nothing prevents us from creating our own final data point with pctDone=100; status="complete". If we’ve got status="complete" and result="ok" from the async job for the same bvId, it’s fair to assume its progress is 100%.

PS> Write-Influx -Measure backupprogress -Tags @{cluster="PROD";srcVolId=139;snapId=0;bvId=101} -Metrics @{pctDone=100;status="complete"} -Database example -Server http://192.168.50.184:32290 -Verbose

Again, if we back up to S3 once a day, we wouldn’t see a bunch of jobs for each volume. We’d query the last 24 hours and maybe just show results (failed = red, success = green); since we’d presumably have a lot of volumes, we’d watch them in a table, potentially focusing just on those with “successful backups in last 24 hours = 0”.

But let’s say we have one “pet” volume which we like to watch because it often fails to back up to the public cloud.

In that case we may want to watch its progress over time, like this (where we see two recent backups). Progress starts at close to zero percent complete and moves up to 100% (I insert the last value myself once I know the job completed successfully, as explained above).

solidfire-backup-job-monitoring-influxdb-04-influx-backupprogress-metric.png

Data points representing backup job progress are red at the beginning, change to orange and yellow as they go up, and become green at over 50%. That way I can visualize progress of jobs from different volumes.

If the query shows data points from multiple backups over time, I can see that the newer backup was smoother, while the one on the left looks like it almost got stuck at one point.

We can also use other visualization types such as gauge (see Appendix) which may be neat for folks with 10-50 volumes. Maybe try heatmap for more?

To submit data to InfluxDB v1 I used this community module (license: MIT), but you can use another (there are several) or simply write your own. Or use Python for all of the above.

The little DB that could

Under this very light load, my InfluxDB v1 container used 105 MiB of RAM. View from my Kubernetes worker:

 PID USER      PR  NI    VIRT    RES  %CPU  %MEM     TIME+ S COMMAND  
 161292 root      20   0 2588.6m 230.8m   0.0   2.9   0:22.96 S pwsh                                                                            
 318616 472       20   0 1463.7m 148.8m   0.0   1.9   0:33.72 S grafana server --homepath=/usr/share/grafana --config=/etc/grafana/grafana.ini+ 
  67599 root      20   0 1004.7m 104.5m   0.0   1.3   0:44.48 S influxd 

The solidshell (PowerShell + SolidFire Core) container, from which I ran the SolidFire and InfluxDB clients, is at the top and uses more than twice as much RAM.

Kubernetes jobs

solidshell reminds me that my jobs were dispatched from a Kubernetes pod.

I put it in the same namespace as InfluxDB.

You don’t have to back up Kubernetes volumes (PVs) - any volumes can be backed up, since the backup runs on SolidFire, which knows nothing about volume contents - but I mention this because you don’t have to stand up a VM just to run PowerShell.

Also, securing your SolidFire administrator credentials may be easier on Kubernetes than in a VM.

But if you want to run that script from some Windows workstation, that will work as well, as long as you can reach both the SolidFire MVIP and the InfluxDB API endpoint.

Conclusion

If you have a smaller, less bureaucratic IT environment with up to 100 volumes, maybe you use SolidFire’s backup-to-S3.

While most packaged backup solutions - including Velero - have job monitoring built in, SolidFire’s backup-to-S3 needs some extra effort, but can be monitored just as effectively (in fact, even better, as currently Velero still has some bugs in this area).

InfluxDB v1 is far from dead: it can run in 128 MiB of RAM, and it’s easier to use and work with than most other DBs. There are Python and PowerShell clients, which makes integration easy.

Having backup job details in a cloud-based VM makes it very easy to find backups in no time, which helps in DR as well. You’ll be better off with a commercial solution, but if you don’t have one, or want to take an independent backup to S3 once a week, this may be the right approach for you.

If you start with that script I wrote, adding logging to InfluxDB may take just an hour of work.

Appendix A - Gauge visualization

Simply switching to gauge will work, but it may be hard to identify the struggling volumes.

solidfire-backup-job-monitoring-influxdb-05-influx-backupprogress-gauges-small.png

With custom threshold formatting, it’s easier to spot which ones are far from being complete.

solidfire-backup-job-monitoring-influxdb-06-influx-backupprogress-gauge-thresholds.png

For some reason, even though I had srcVolId in my data, I couldn’t show the volume ID on the gauges (the best I could do was the bvId tag), but I didn’t want to spend more time on this. The point is that with proper visualization it’s easy to monitor dozens of volumes this way.

solidfire-backup-job-monitoring-influxdb-07-influx-backupprogress-gauge-100pct-complete.png
