The VMSSPI story - Making the Most of a Legacy Asset

Back in April 2022, I was on a call with a customer that I have had the pleasure of working with for many years. We were talking about a variety of topics, and one of the things they mentioned was that, because the HP OpenView Smart Plug-In (SPI) for OpenVMS (VMSSPI) was no longer supported, they were establishing a project to design and build a replacement solution for monitoring their OpenVMS systems. The replacement would integrate with Dynatrace, which they had recently adopted as their new enterprise-wide incident management and monitoring solution. I have talked previously about the unfortunate decline in the number of ISV solutions available for OpenVMS, but in this case the situation was somewhat different: the customer was looking to replace not only the VMSSPI OpenView agent but OpenView as a whole across their entire business, encompassing multiple operating systems, with an arguably more modern monitoring solution.

When it was first created, it is fair to say that HP OpenView was somewhat ahead of its time, with agents for just about everything conceivable and a centralised view of events across the enterprise. However, with the advent of cloud computing and some of the technologies surrounding it, a new breed of such solutions has emerged, including offerings such as Splunk, Dynatrace, PagerDuty, and Datadog, to name just a few. These newer solutions have in recent years (and for generally good reason) seen very considerable adoption, particularly by enterprise customers, and irrespective of the VMSSPI OpenVMS agent no longer being supported, I could fully understand the customer's decision to move away from OpenView to one of them.

At around the same time as the customer conversation, we had internally been starting to discuss developing the various service offerings that we provide today, including the provision of managed services to customers requiring help looking after their OpenVMS systems, and while there was total agreement that offering such a service was a great idea and that we most definitely had the skills to do the job, there was a valid concern that we were not well-placed from a tools perspective with regard to proactive and remote monitoring of the customer systems in question. With the customer conversation fresh in my mind and possessing a general reluctance to accept "no" as an answer, I decided to take a look at the VMSSPI agent code (which was provided to us as part of our agreement with HP) to see how it interfaced to OpenView and whether it could be adapted in some way to work with something other than OpenView.

Upon tracking down the VMSSPI code, it did not take very long to determine that the interface to OpenView was little more than three relatively simple function calls that could indeed be readily replaced with an alternative interface to allow VMSSPI to send alerts to other incident management and monitoring systems. Good news so far, and I could sense that a cunning plan was starting to come together. However, before we could get too excited and claim any sort of victory, there were one or two other matters that also needed to be addressed. In particular, a mechanism needed to be devised to convert the rather curious-looking and presumably OpenView-specific file containing the definitions and details of all messages used by the OpenView VMSSPI module into something that could be used in conjunction with any new interface mechanism we might devise, without having to make any wholesale changes to the VMSSPI code. Thankfully this too was straightforward to resolve. In relatively short order we had all messages converted into the readily parsed and loadable textual format illustrated below, and we were able to implement a simple interface to this information that did not require any particular changes to the VMSSPI code itself. The resultant file is now commonly referred to as the VMSSPI “messages file”.

Figure 1. Sample VMSSPI message definition. In addition to specifying general details regarding the scenario in question such as a description of the problem, severity (possibly based on one or more thresholds), and classification (operational area), it is also possible to specify actions to be taken (which may be different, depending on the alert threshold and severity), details of email notifications (if any), and so on.

We also took the liberty here of adding the ability to optionally specify various additional details for each message, the most significant of these being as follows (see above example):

  • The action field can be used to specify an action to be taken when the associated event is detected by VMSSPI, where the specified action is a string containing any valid DCL command (such as running a DCL command procedure). An obvious use for this facility could be to restart a failed process, for example.
  • The throttle field can be used to control the frequency with which alerts will be reported by VMSSPI for the particular event in question. This can be very useful for preventing message storms, as I learned the hard way many years ago when implementing a somewhat similar solution for a government department, and ended up taking down their email server due to a massive flood of messages (although in all honesty I can say that I was not responsible for causing the problem that precipitated the storm in the first place).
  • The notes field can be used to specify arbitrary notes for the event in question, and any such text will be included in the body of emails sent by VMSSPI, along with other message details. Such notes might for example be used to describe actions to be taken when the event in question occurs, to describe any history related to the event, and so on.
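To make the idea behind the throttle field a little more concrete, the following is a minimal Python sketch of per-event alert throttling. It is not the VMSSPI implementation: the class name, the use of a minimum interval expressed in seconds, and the event identifiers are all assumptions made purely for illustration.

```python
import time


class AlertThrottle:
    """Suppress repeat alerts for the same event within a minimum interval.

    Hypothetical illustration only; the interval units and the event
    identifiers are assumptions, not the VMSSPI messages file syntax.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock  # injectable so the logic can be tested
        self._last_sent = {}  # event id -> time the last alert was let through

    def allow(self, event_id):
        """Return True if an alert for event_id may be sent now."""
        now = self.clock()
        last = self._last_sent.get(event_id)
        if last is not None and now - last < self.min_interval:
            return False  # still inside the throttle window: suppress
        self._last_sent[event_id] = now
        return True
```

With a throttle of this kind in front of the email sender, a fault that fires hundreds of times a minute results in at most one message per interval for that event, while unrelated events are unaffected.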

Other fields such as message severity, facility, and message classification provide various ways to categorize events that have meaning to one or more of the incident management platforms to which VMSSPI can now send alerts (as will be discussed later), and the mailing list field provides a mechanism to send alerts via email to all email addresses specified in the named list (defined elsewhere in the messages file). Additional details about all of these fields and their usage can be found in the VMSSPI User Guide.

In addition to the new messages file, VMSSPI uses a CLI-based configuration file to define details of the items to monitor and (for some items) when to monitor them. There was no need to modify this file in any way for our VMSSPI modernization initiative (although we have now added one or two new optional features), making it possible for current OpenView VMSSPI users to continue using their existing VMSSPI configurations with the updated VMSSPI product.

With regard to being able to send alerts via email, there was some discussion about this at Bootcamp in terms of why we felt this was necessary. The decision to always provide the option of being able to send alerts via email (SMTP) was in fact made fairly early on in terms of our overall thinking about what a final revamped VMSSPI solution might look like. Obviously, the primary objective of the exercise was to facilitate sending alerts to Dynatrace and similar such services, however these services are more often than not cloud-based and off-premises, and for a non-trivial subset of our customers it is not permissible (or even possible in some cases) for their OpenVMS systems to interact directly with any such external systems. To not preclude those customers from being able to use our rejuvenated VMSSPI product and to provide a generally ubiquitous method of sending alert notifications, it seemed reasonable to provide a simple email interface, and it was straightforward to do so. Clearly using email in this way has its limitations and is not necessarily ideal in all cases, but for those situations where other options are not possible it is better than nothing, and indeed for many scenarios it is perfectly adequate. And as noted previously, it is possible to be selective in terms of which specific events will trigger emails to be sent, and the frequency of those emails can be throttled to prevent bringing down your email service due to an event storm. Fundamentally, it is always good to have options!

Figure 2. Example of a typical VMSSPI email message. The information provided can be mapped to the relevant message definition in the messages file and any variable details (such as the threshold that triggered the alert and the observed metric value) will be included in the email text. Organization details are defined in the header section of the messages file, with the organization code serving to uniquely identify the organization in question for situations where a managed service provider might be monitoring systems for multiple customers.

Interestingly, the Datadog service supports the ability to receive alert messages via email, as described in Datadog documentation, and indeed our updated VMSSPI product fully supports this interface mechanism.

Getting back to the customer conversation that precipitated this series of events, having successfully decoupled VMSSPI from OpenView it was now time to start thinking about integrating it with Dynatrace. But why stop at Dynatrace? Given the well-defined and simple nature of the interface between VMSSPI and the underlying alerting infrastructure combined with our new and highly flexible message definition facility, there was absolutely nothing to preclude the notion of creating a set of dynamically loadable interface modules (shareable images implementing a standard API that could be called from VMSSPI) to integrate with an assortment of incident management and monitoring services, or indeed to send alerts to chat services, message queues, or whatever. With this notion in mind, it did not take much effort to design a standard API and to test the general hypothesis by developing an interface module that could send alerts to one of our internal Slack channels, and based on the success of this proof-of-concept we then set about creating interface modules for Dynatrace and Splunk, which had been mentioned to us in some context or another by various customers, and for Datadog and PagerDuty, which are somewhat less commonly used by our customers, but are similarly excellent services with their own unique features.

Figure 3. Sample alert sent by VMSSPI into one of our internal Slack channels. The Slack REST API provided a simple test case for our plans before embarking on the slightly more ambitious goal of integrating VMSSPI with services such as Splunk and Dynatrace. The Slack interface remains popular internally, providing a convenient method for keeping an eye on the status of various systems.

Glossing over the finer technical details for the sake of brevity, it should be noted that all of these cloud-based services (and Slack for that matter) provide fairly simple HTTP-based REST APIs that can be used for integration with custom-developed agents such as VMSSPI. The net result is that there is considerable commonality from a code perspective between the new VMSSPI interface modules for all of these services, with the main difference between the modules being how information provided by VMSSPI (as largely determined by the messages file) is mapped to what is expected by each of the RESTful interfaces. Arguably the most challenging aspect of both implementing and configuring these modules is dealing with authentication, which invariably comes down to obtaining some form of authentication token that can be used by the module in question to authenticate against the REST API endpoint. Configuration details (such as the shareable image name, the API endpoint, and the authentication token) for the interface module(s) to be loaded by VMSSPI on startup are specified in the header section of the messages file, which also includes definitions for any mailing lists and the SMTP gateway address to be used for sending emails.
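The commonality between the interface modules can be illustrated with a small Python sketch in which the request-building code is shared and only the payload mapping differs per service. This is an illustration only and not the VMSSPI module API: the alert dictionary keys, the function names, the generic payload shape, and the use of a bearer token are all assumptions. The one detail taken from a real service is that Slack incoming webhooks accept a JSON body with a "text" field (the actual VMSSPI Slack module may well use a different Slack API).

```python
import json
import urllib.request


def to_slack_payload(alert):
    """Map an alert to the JSON body accepted by a Slack incoming webhook."""
    return {"text": f"[{alert['severity']}] {alert['node']}: {alert['text']}"}


def to_generic_payload(alert):
    """Map an alert to a HYPOTHETICAL event shape for an imaginary service."""
    return {
        "title": alert["text"],
        "severity": alert["severity"].lower(),
        "source": alert["node"],
        "tags": {
            "facility": alert["facility"],
            "class": alert["classification"],
        },
    }


def build_request(endpoint, token, payload):
    """Build (but do not send) a JSON POST request for a REST endpoint."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumed scheme; real services differ in how tokens are presented.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        endpoint, data=body, headers=headers, method="POST"
    )
```

Passing the result of `build_request` to `urllib.request.urlopen` would send the alert. Everything specific to a given service is confined to its mapping function, plus the endpoint and token, which in VMSSPI's case come from the header section of the messages file.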

About now we were feeling somewhat pleased with the situation. We had taken what was essentially an end-of-life product that implemented some rather useful functionality, and with relatively little effort we had rejuvenated and modernized it to work with a number of popular modern incident management and monitoring systems. More importantly we had addressed both the original customer requirement and the concerns of our Managed Services team, and we had created something that would likely be of use to others. In short, we had given VMSSPI a new lease of life; we were making the most of a legacy asset.

But the story in terms of interface options doesn't quite end there (yes, this seems to be shaping up to be another one of those lengthy Brett blog posts). Somewhat ironically, for the first piece of managed services work where we utilised VMSSPI (to send alerts to the VSI team via email), after all the work we had done implementing interfaces to various fancy cloud-based services, our customer wanted to be able to receive alerts internally via a centralized Linux-based syslog interface. Not to be thwarted by this perfectly reasonable requirement, we worked with the customer to develop a new interface module for VMSSPI that could be used for this purpose, conforming to RFC 5424, and resulting in another useful VMSSPI integration option.
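For readers unfamiliar with RFC 5424, the following Python sketch shows the general shape of a message in that format. It is not the VMSSPI syslog module: the function name, the choice of facility and application name, and the omission of structured data are assumptions for illustration. The layout itself (priority computed as facility times eight plus severity, version 1, an ISO 8601 timestamp, and "-" as the nil value for absent fields) follows the RFC.

```python
from datetime import datetime, timezone


def format_rfc5424(facility, severity, hostname, app_name, msg,
                   procid="-", msgid="-", timestamp=None):
    """Format one syslog line in the RFC 5424 layout (no structured data).

    timestamp, if given, must be a timezone-aware datetime so that the
    rendered value carries the UTC offset the RFC requires.
    """
    if not (0 <= facility <= 23 and 0 <= severity <= 7):
        raise ValueError("facility must be 0-23 and severity 0-7")
    pri = facility * 8 + severity  # PRI value as defined by the RFC
    when = timestamp or datetime.now(timezone.utc)
    ts = when.isoformat(timespec="milliseconds")
    # Field order: <PRI>VERSION TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD MSG
    # The lone "-" before the message is the nil value for STRUCTURED-DATA.
    return f"<{pri}>1 {ts} {hostname} {app_name} {procid} {msgid} - {msg}"
```

For example, facility 16 (local0) with severity 3 (error) yields a priority of 131, so a line produced by this sketch begins with `<131>1` followed by the timestamp and host name.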

At this point in VMSSPI's road to recovery and reinvention, things had largely stabilised in terms of functionality, and we were now finally at the point where the vision could be fully realised, and we could turn it into a new and hopefully useful product offering!

This is all very interesting, but what can VMSSPI actually do, and why should you care? In brief, VMSSPI is a cluster-aware OpenVMS software package comprised of three main modules that can be used to monitor for and report on system, performance, and security-related events, providing comprehensive coverage across each of these areas. The system module monitors and reports on system-related items such as processes, disks, shadow sets, queues, and so on; the security module continuously monitors and reports on (selected) security events recorded by the audit server; and the performance module monitors various resources whose usage can affect system and process performance. A complete list of the various resources that are monitored can be found in the VMSSPI User Guide and is summarised in Table 1 below.

Table 1. Summary of items that can be monitored by the VMSSPI system, performance, and security modules. Note that for various resources such as processes, queues, disks, and shadow sets it is possible to specify periods in which they should or should not be monitored.

It would be reasonable to conclude that the original creators of the OpenView VMSSPI agent for OpenVMS were most probably highly experienced OpenVMS practitioners with an excellent understanding of what is important from an operational perspective and what is not so important, resulting in a piece of software that does just what it needs to do without including any unnecessary bells and whistles that would serve only to complicate the software for no particular benefit. That being said, while our reinvented and rejuvenated VMSSPI is straightforward to install and get running (in keeping with the original design philosophy), we will generally recommend that customers engage some time from our Professional Services team to help with setup in order to ensure best results.

But the story doesn’t quite end there!

While VMSSPI is fundamentally designed to be used for the detection of operational problems of one form or another and to raise alerts when any such problems are detected, in order to identify some problem scenarios VMSSPI monitors and uses various OpenVMS system and process metrics. For example, to identify looping processes an algorithm is used that takes into consideration factors such as process CPU usage and process I/O over some time interval (with high CPU usage and no I/O during the time interval indicating a possible problem). While not wanting to create yet another performance monitoring tool for OpenVMS (we have a few of those already), it didn’t seem entirely unreasonable to take advantage of the fact that VMSSPI could provide users with some useful data from a performance monitoring perspective, if the provision of such data could be done in an optional way and without having to perform major surgery on the VMSSPI code. The merits of this notion were reinforced by several customer conversations, and a new experimental VMSSPI module was consequently created that makes available via the Prometheus wire protocol a relatively small subset of metrics used by VMSSPI, including average CPU utilization, network bandwidth utilization (per monitored network adapter), disk queue lengths, disk space utilization (per monitored disk), memory utilisation, page and swap file utilisation, buffered and direct I/O rates, and system page fault rate. Additionally, various per-process metrics (CPU, I/O, and page faults) can be recorded for selected processes that are explicitly monitored by VMSSPI. Metrics are provided by other VMSSPI modules to the performance module via a simple mailbox interface, and use of the performance module is entirely optional. 
It is anticipated that some additional metrics will be added to this interface over time. However, as noted, the primary function of VMSSPI is to detect and report problems rather than to provide a comprehensive set of performance metrics, and in general it is intended that the performance monitoring function of the product would be used in conjunction with other tools such as T4 to provide a combination of real-time and historical insights into system operation. Prometheus was chosen in part due to its wide popularity, but also because of its relative simplicity and ability to be integrated with other products, including services such as Dynatrace and Datadog (with some caveats) and powerful graphical tools such as Grafana.
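As a rough illustration of how simple the Prometheus side can be, the Python sketch below renders a gauge in the Prometheus text exposition format, which is what a Prometheus server reads when it scrapes a target over HTTP. It assumes that this text format is what the VMSSPI performance module serves, and the metric and label names are invented for the example; neither is taken from the product.

```python
def escape_label(value):
    """Escape backslash, double quote and newline in a label value."""
    return (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )


def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    samples is a list of (labels, value) pairs, where labels is a dict of
    label name to label value (possibly empty).
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Calling `render_gauge("vmsspi_disk_used_percent", "Disk space utilisation", [({"disk": "DKA0"}, 71.5)])` produces a HELP line, a TYPE line, and one sample line per monitored disk, which is all a Prometheus scraper needs in order to ingest the value.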

The Prometheus-Grafana interface also makes possible various other interesting integration scenarios. For example, one customer has been testing utilising the Atlassian Opsgenie Prometheus alerting capability to raise alerts when thresholds were met and have them automatically close when normal service is resumed, and to hook Grafana into their Opsgenie environment for on-call alerting.

Figure 4. Sample performance metrics provided in real time by the VMSSPI Prometheus interface and displayed using Grafana. My apologies for the very simplistic nature of this dashboard!

In summary, VMSSPI for VSI OpenVMS provides a powerful, flexible, and comprehensive monitoring and alerting solution for VSI OpenVMS that can be easily enhanced to support new integrations, and is based on a proven solution. The fundamental aim of the work described in this post was to make the most of a useful legacy asset to provide a mechanism for OpenVMS systems to seamlessly and efficiently share incident and performance data with modern centralized (and possibly cloud-based) services, making it possible for organizations to monitor their OpenVMS systems in much the same way as they monitor their other operating system platforms, and as an added bonus provide a tool that could be used by our Managed Services team.

There is no point in reinventing the wheel, and as I have noted many times there is very often value in old code that should not be ignored!


Brett Cameron

Jan 30th, 2025

Director of Application Services