Lincoln DeCoursey

Custom Monitoring with Net-SNMP and SolarWinds Universal Device Pollers

2025-01-13T03:00:00-05:00

Background

Early in my career, I was on the front lines in a new war against spam email. The problem was challenging and largely unsolved. Few players were in the market, and success was measured by simply having any solution, even if it wasn’t stable or fully polished. Customers tolerated “war stories” of failures because the focus was on solving the problem at all.

Over time, as the spam problem became more solvable, the industry matured, and expectations rose. A shake-out occurred, where companies that couldn’t execute well were left behind. Larger players (e.g., big tech) consolidated the leaders, bundling spam filtering as part of broader offerings. This commoditization negatively impacted smaller, independent players.

While consolidation can be bad news for mid-sized companies, it also creates opportunities. Big tech tends to bundle “good enough” solutions, leaving space for smaller companies to compete with niche, best-in-class offerings that target customers seeking premium solutions.

Overview

As it turns out, best in class is about more than feature set. Features need to be backed up by solid execution. Big tech sets a high bar; they have tons of resources to dedicate to site reliability.

Key Risks

Customer-Discovered Outages: Being unaware of your own outage and failing to promptly spin up on it is a problem. When you self-discover an outage early, you have a chance to engage with it so that your customer doesn’t have to. If your customer is forced to call your hotline, it’s hard to avoid the perception that you’re asleep at the switch.
Timeline Analysis: Once the customer reports the issue, the first thing they will demand is “when did this start?” You will have to go back through your logs and detail records to isolate the start of the incident, and it’s this incident start time—not the time that you became aware of the issue—that will form the basis of the Incident Timeline included in the post-outage write-up.
Executive Visibility: Outages reach the highest levels of scrutiny within an organization. Root Cause Analysis (RCA) and Reason For Outage (RFO) write-ups are reviewed by top leadership, both within your company and at the customer side. Seemingly little things like allowing a few hours to slip between customer updates—even if you are working the issue diligently—become big questions in an escalated incident review, providing ammunition for a narrative to be crafted against you and impacting the perceived handling of the incident.
SLA Impact: Service Level Agreements (SLAs) often include penalty clauses for downtime. Delayed detection means logging significant downtime, which can lead to financial penalties with no chance for recovery.
Customer Churn and Reputational Damage: Inadequate monitoring leading to undetected outages can severely damage your organization’s reputation. Customers lose trust and confidence when they feel their service provider is unstable. This erosion of trust can impact referenceability and result in increased customer churn.

Aspects of a Successful Monitoring Operation

Aim for Excellence: Monitoring is like any other aspect of a product launch—it involves detailed work to discover requirements, build solutions, and validate their effectiveness. To achieve excellence, monitoring must be embedded in the project, not treated as an afterthought. This requires participating in project teams on equal footing, ensuring monitoring is planned and implemented collaboratively with input from all stakeholders.
Technical Acumen: Monitoring requires a wide array of insights and skills, spanning collaboration, problem-solving, and strategic thinking. However, at its core, it demands deep technical expertise. The metrics being asked for aren’t easy to reach, and there’s no pre-canned integration to them—otherwise, they would have been collected already.
Stack Deep Dive: Total failures are relatively easy to catch, but silent failures—subtle issues that don’t cause outright crashes—can be just as damaging. These problems often degrade performance gradually, creating a complex mess once discovered. Consider scenarios like one system in a cluster running an outdated configuration or calls silently taking a PSTN failover path for weeks, resulting in unexpected costs. These issues often hide in the details—replicated configurations, middleware communications, or subtle misconfigurations.
Continuous Improvement: Conducting postmortems after every incident to discuss what worked, what didn’t, and where there’s room for improvement is essential. Beyond these formal reviews, always stay vigilant for conversations and clues where enhanced monitoring could make a difference. Be the monitoring and alerting champion, proactively offering improvements even when others might not see the opportunity.
Follow Through: Improvements are often identified during customer escalations and promised as part of resolving support engagements. However, once an engagement is closed, the customer may let the issue drop, leading to less accountability for delivering on the promise. To prevent this, it’s vital to demonstrate end-to-end ownership and ensure all commitments are met.
Art Not Science / Balancing Act: Creating monitoring alerts involves a delicate balance. You’re writing code that could wake someone up in the middle of the night, so it’s not always about adding more alerts—it’s about refining and sometimes subtracting. Always seek feedback to distinguish what is useful from what is noise. Sometimes, criteria like an unusual absence of volume might be necessary to catch issues, but you need to adjust for false positives. Some alerts are crucial but don’t require 24/7 paging. Remember, your first responders are your closest business partners; respect their work-life balance and ensure critical alerts aren’t lost in a flood of unnecessary ones.

Simple Network Management Protocol

SNMP, a protocol developed for managing devices, originated in the late 1980s through a collaborative effort involving multiple institutions.

SNMP supports both pull (GET) and push (TRAP) modes of signaling, making it highly versatile for a wide range of monitoring needs. Major hardware vendors have standardized around SNMP, effectively compelling its adoption for monitoring network equipment. While network vendors often provide their own proprietary solutions, these are typically closed systems, and SNMP remains the common denominator for interoperability. Due to this widespread standardization, SNMP has become a must-implement protocol in traditional IT environments, offering the unique ability to monitor every part of a network infrastructure. Despite the robust competition from platform-specific agents designed for Windows and Linux servers, SNMP’s universal applicability ensures its continued relevance as the single least common denominator in network monitoring.

Introduction to Net-SNMP and snmpd

Net-SNMP is the most prominent implementation of SNMP agents and tools for UNIX and Linux environments. Its roots trace back to the CMU SNMP project, developed at Carnegie Mellon University. Building on this foundation, significant development occurred at UC Davis, where the project transitioned into what we now know as Net-SNMP. Wes Hardaker played a pivotal role during this phase, overseeing substantial refinements and expansions that transformed it into a robust suite, including the widely used snmpd agent. By the early 2000s, Net-SNMP had firmly established itself as a leading implementation, synonymous with SNMP on Linux systems.

Net-SNMP is available for all major Linux distributions. I already chose Debian for the SBCs because FreeSWITCH prefers this distribution. I’ll install Net-SNMP on the Debian-based SBCs alongside FreeSWITCH for custom monitoring.

Enable additional repositories (optional)

The snmp-mibs-downloader package is part of the non-free repository because it downloads non-open source Management Information Base (MIB) files. Net-SNMP itself is free, and while we may or may not use the MIBs in our exercise, it’s beneficial to install the MIBs together with the user tools. MIBs are a useful aid to SNMP software, but there is no hard requirement to use them.

# Configure the contrib and non-free repos
sed -i 's/main/main contrib non-free/' /etc/apt/sources.list
# Update the package list
apt-get update

Install the software

apt install snmpd snmp smitools snmp-mibs-downloader

Configuring SNMP Agent for Non-Localhost Connections

It’s typical for server software to ship with a default localhost-only configuration as a safety measure to ensure the services are only externally reachable once you intend them to be.

To expose the SNMP agent (daemon) to non-localhost connections, you need to edit the /etc/snmp/snmpd.conf configuration file. Adjust the agentaddress line to specify the desired IP addresses.

Considerations:

Binding to All Interfaces:
- To bind the SNMP agent to all interfaces, specify 0.0.0.0 as the IP address. This is the most common configuration.
```
agentaddress 0.0.0.0
```
Binding to a Specific IP Address:
- If you prefer to bind to a specific IP address, you can do so. However, note that if the system’s IP address changes (e.g., via DHCP or manual re-IP), the SNMP agent will fail to start.
- It’s recommended to keep 127.0.0.1 in the list to allow localhost connections.
```
agentaddress 127.0.0.1,192.168.252.221
```
Security Considerations:
- On multi-homed systems connected to SIP Trunk networks, ensure that SNMP and SSH are not exposed to business partners. These protocols are intended for internal management only.

Configuring SNMP Community Strings and Access

To avoid frustrations during your project work, it’s crucial to address authentication and permissions from the beginning. While you can get basic SNMP queries working out of the box, more advanced tasks require proper access configuration.

Default Configuration: The default SNMP configuration restricts the public community string to a systemonly view:

view   systemonly  included   .1.3.6.1.2.1.1
view   systemonly  included   .1.3.6.1.2.1.25.1
rocommunity  public default -V systemonly
rocommunity6 public default -V systemonly
rouser authPrivUser authpriv -V systemonly

To access more than just the systemonly view, you need to create a new view and reconfigure the public community string.

Steps to Reconfigure the Public Community String for Read-Only All Access:

Define the “all” View:
- Edit the /etc/snmp/snmpd.conf file to include a new view that encompasses everything.
```
view    all    included   .1
```
Reconfigure the Public Community String:
- Modify the community string to use the new all view.
```
rocommunity  public  default  -V all
```
Restart the SNMP Service:
- After making the changes, restart the SNMP service to apply the new configuration.
```
sudo systemctl restart snmpd
```

Security Considerations:

For the scope of this project, I will use SNMPv2c for simplicity, avoiding the additional complexity of SNMPv3.
Be cautious with the public community string. In a production environment, it’s recommended to use a non-default community string or SNMPv3 for better security.
While SNMP was originally designed to allow both monitoring and configuration (using the SET method), it is primarily used for monitoring. For this project, I recommend configuring SNMP for read-only access.

Your First SNMP Walk

To begin interacting with your SNMP agent, you can use the snmpwalk command. This command allows you to query a range of information from the SNMP agent, providing a detailed view of the system’s status and configuration.

In the example below, we use the head command to display only the first few lines of output to keep it brief for the sake of the write-up. As you can see from the line count, there are over 5,000 lines of output.

root@LA-SBC:/etc/snmp# snmpwalk -v2c -c public localhost | head -n 5
iso.3.6.1.2.1.1.1.0 = STRING: "Linux LA-SBC 5.10.0-33-amd64 #1 SMP Debian 5.10.226-1 (2024-10-03) x86_64"
iso.3.6.1.2.1.1.2.0 = OID: iso.3.6.1.4.1.8072.3.2.10
iso.3.6.1.2.1.1.3.0 = Timeticks: (6913) 0:01:09.13
iso.3.6.1.2.1.1.4.0 = STRING: "Me "
iso.3.6.1.2.1.1.5.0 = STRING: "LA-SBC"
root@LA-SBC:/etc/snmp#
root@LA-SBC:/etc/snmp#
root@LA-SBC:/etc/snmp# snmpwalk -v2c -c public localhost | wc -l
5588
root@LA-SBC:/etc/snmp#

Extending Net-SNMP Agent with Custom Metrics

When it comes to extending the Net-SNMP agent with custom metrics, there are several approaches you can take. Here we’ll discuss these options, starting with MIB Modules:

MIB Modules

MIB Modules are essentially the standard way of exposing data through Net-SNMP. They represent the full integration pathway used to collect and expose metrics such as NIC, filesystem, and other core Linux metrics. Here are some key points about MIB Modules:

Language and Integration: MIB modules are generally written in the C programming language, which allows for close-to-the-metal performance and fine-grained control. These modules are compiled and then loaded by the Net-SNMP agent.
Usage and Documentation: There is extensive documentation provided by Net-SNMP on how to write, compile, and integrate these modules. This method is highly detailed and customizable, making it suitable for complex and large-scale integrations.
Typical Use Cases: Given the level of complexity and the integration effort required, this approach is often adopted by major hardware manufacturers or large organizations that need to integrate comprehensive monitoring capabilities across their products or infrastructure.
Overkill for Customizations: For localized customizations or simpler monitoring needs, MIB Modules are overkill. They require significant development resources and expertise in C programming.

In summary, while MIB Modules provide a powerful and flexible way to extend Net-SNMP, they are beyond the needs of smaller projects.

AgentX

AgentX is a protocol for delegating parts of the SNMP OID address space to sub-agents, enabling distributed management of SNMP metrics. It is:

A standard approach for mature software projects to expose SNMP metrics. Before building custom solutions, check whether the software you need to monitor already supports SNMP via AgentX.
A powerful tool for developing custom metrics, particularly when other SNMP extension mechanisms (e.g., extend, pass, or pass_persist) are insufficient. However, using AgentX for custom extensions is considered an advanced undertaking.

Below are additional considerations:

Zero-Config Delegation: AgentX enables nearly zero-configuration delegation of OID ranges to sub-agents via a Linux socket file or configurable UDP/TCP communication.
Custom Sub-Agents: If extending Net-SNMP through simpler mechanisms today, consider AgentX as your next step. Middleware solutions, such as those built with Python, can extract, transform, and manage metrics programmatically while integrating with AgentX.

Ownership and Permissions

Proper ownership and permissions for the AgentX socket file are critical for successful integration:

Ensure your application has write access to the socket file. Below is an example snmpd.conf configuration that sets group ownership and write permissions for the group freeswitch (the group under which our sub-agent runs).
Note the execute bit (x) is set on the directory, including for others. In UNIX, directory execute permissions allow traversal, which is essential.

Example `snmpd.conf` Configuration

# Set up AgentX socket with group ownership and permissions
agentXSocket /var/agentx/master agentXPerms 0770 0711 root freeswitch

pass_persist

The pass_persist directive in Net-SNMP offers a dynamic and flexible way to handle SNMP data by delegating control of a specific OID subtree to an external script. Here’s what you need to know:

Requires Developing a Script: The pass_persist protocol requires developing a script that can speak the Net-SNMP pass_persist protocol. Your script must handle more than just simple lookups; it must navigate the OID tree using SNMP semantics, responding to get and getnext requests, and maintain control over the subtree. This is essential for dynamic metrics, especially in tabular format.
Learning Opportunity: This hands-on approach provides a valuable learning experience: it forces you to understand and implement the SNMP protocol’s semantics yourself.
UNIX inetd Concept: This approach is similar to the UNIX inetd concept, which allows the implementation of network services merely by interfacing with standard input and output.
Most Accessible Method: pass_persist is the most accessible method for delegating portions of the OID tree to a sub-agent, which is critical for adding new metrics without needing to update the snmpd configuration.
Strategic Middle Ground: This method is ideal for those requiring precise control over which OIDs are used and how the hierarchy is defined. It allows for true space delegation, enabling dynamic metric data shipping without revisiting snmpd configuration for each individual metric.

Example snmpd.conf Configuration:

pass_persist .1.3.6.1.4.1.2021.255 /path/to/your_script.py

In this example, the pass_persist directive assigns the OID subtree rooted at .1.3.6.1.4.1.2021.255 to the specified script. This script is now responsible for handling all SNMP requests within that subtree.

extend

The extend directive in Net-SNMP is a straightforward method to integrate custom metrics into the SNMP agent, allowing you to extend its capabilities. Here’s what you need to know:

Good for Unsupported Metrics: The extend directive is ideal for when a metric is not directly supported by SNMP but can be retrieved and printed out via a command-line shell script.
Simple Implementation: Unlike more complex methods such as pass_persist or MIB Modules, the extend directive is relatively simple to implement. You just need to specify the command to be executed.
Limited Control and Flexibility: Each custom metric must be individually specified in the snmpd.conf file, and you don’t get to customize the entire OID, only the final part of it.
Generally Adequate: This method is best suited for exporting a few well-established, stable custom metrics that do not have dynamic or frequently changing requirements.

Example snmpd.conf Configuration:

# Extend SNMP with custom script
extend myCustomScript /path/to/my_script.sh

In this example, the extend directive assigns the script located at /path/to/my_script.sh to the identifier myCustomScript. The SNMP agent will execute this script whenever an SNMP request is made to that identifier.

The full OID will be structured as:

.1.3.6.1.4.1.8072.2.3.1.1.[index]

where [index] is a unique identifier for each extend instance. For example, if myCustomScript is the first instance, the OID might be:

.1.3.6.1.4.1.8072.2.3.1.1.1

Understanding Tabular Data in SNMP

The concept of tabular data in SNMP began with RFC 1066 (Aug 1988), laying the groundwork for structuring managed objects with tables. RFC 1213 (Mar 1991), expanded these definitions in MIB-II, providing a comprehensive framework for network management. Additionally, RFC 1155 (May 1990), known as SMI (Structure of Management Information), formalized the structure of management information, contributing to the standardization of tabular data in SNMP.

These RFCs collectively established the foundation for using tabular data in SNMP.

The Logical Structure of SNMP Tables

At the core of an SNMP table is its base OID, which identifies the table itself. Columns within the table are further defined as offsets from this base OID, and each row is indexed by a unique identifier appended to these column-specific OIDs. For example, in the ifTable (interface table) defined in the IF-MIB, we see:

Base OID: .1.3.6.1.2.1.2.2.1
Column OIDs: Each column has a specific suffix, such as .1 for ifIndex, .2 for ifDescr, .7 for ifAdminStatus, etc.
Row Indexing: Rows are identified by appending a row index to the column OID, e.g., .1.3.6.1.2.1.2.2.1.2.1 refers to the ifDescr (description) of the first interface.

This hierarchical structure is key to navigating SNMP tables programmatically and visually.

Visualizing the Hierarchy

The following example from an SNMP walk demonstrates the canonical structure of the ifTable, focusing on three key columns: ifIndex (index), ifDescr (description), and ifAdminStatus (administrative status).

SNMP Walk Output:

.1.3.6.1.2.1.2.2.1.1.1 = INTEGER: 1
.1.3.6.1.2.1.2.2.1.1.2 = INTEGER: 2
.1.3.6.1.2.1.2.2.1.2.1 = STRING: lo
.1.3.6.1.2.1.2.2.1.2.2 = STRING: Red Hat, Inc. Device 0001
.1.3.6.1.2.1.2.2.1.7.1 = INTEGER: up(1)
.1.3.6.1.2.1.2.2.1.7.2 = INTEGER: up(1)

Breaking this down:

Base OID: .1.3.6.1.2.1.2.2.1
Columns:
- .1 (ifIndex) identifies the interface.
- .2 (ifDescr) provides a textual description of the interface.
- .7 (ifAdminStatus) indicates whether the interface is administratively up or down.
Rows: The index at the end (e.g., 1, 2) corresponds to a specific interface.

Tabular Representation:

Index (.1)	Description (.2)	Admin Status (.7)
1	lo	up (1)
2	Red Hat, Inc. Device 0001	up (1)

Key Takeaways

Established Standards: The structure of SNMP tables follows standards defined in RFCs and MIBs, ensuring predictable and consistent data access.
Hierarchical Mapping: Base OIDs anchor tables, while column and row indices extend these anchors to form a complete data path.

Application to Modern Monitoring

Modern monitoring platforms leverage this hierarchical SNMP model to automatically discover and integrate system resources, both at initial deployment and over time. This dynamic discovery is crucial for ensuring continuous and accurate monitoring, as it eliminates the need for manual configuration updates when systems change.

For example:

Scenario 1: A new filesystem is created on a server. The monitoring platform should automatically detect the addition of the filesystem and begin applying the correct monitoring policies (e.g., disk space usage, inode usage).
Scenario 2: A new network interface card (NIC) is added. The system should detect the NIC, fetch its status, and monitor traffic accordingly.

Proof of Concept: Custom Metrics Integration

Building on insights from my initial exploration of Net-SNMP, I am now set to embark on a proof of concept (PoC) for integrating custom metrics into and through Net-SNMP to SolarWinds.

Choosing an extension mechanism: pass_persist

When evaluating ways to extend Net-SNMP with custom metrics, I chose the pass_persist option. It provides the necessary control over the OID space, supports hierarchical structures for tables, and allows dynamic updates to items without requiring changes to the snmpd configuration.

Initial Considerations

The main challenge is developing a script that adheres to SNMP semantics. To address this, I’ll begin with a prototyping approach, focusing on establishing a reliable method for passing data to Net-SNMP. Using dummy data in the initial phase ensures the framework is solid and minimizes the risk of wasted effort from false starts.

Prototyping with Dummy Data

Prototyping is a practical strategy when requirements are evolving or unclear. In this case, I need to integrate custom metrics into Net-SNMP while preparing for future SolarWinds integration. Given the uncertainty and investigative nature of the SolarWinds phase, prototyping ensures flexibility and minimizes wasted effort if I have to circle back. It also helps to break the problem into manageable chunks, allowing me to start with minimal effort and adapt as needed.

To prototype effectively, I’ll use the filesystem as a simple storage backend. This decision allows me to focus on two key tasks:

Implementing the pass_persist Protocol: I’ll concentrate on refining the pass_persist protocol—writing, debugging, and iterating the script, with log files and packet captures guiding the process. The goal is to stabilize the script’s functionality, ensuring it can parse requests, navigate the tree structure, and respond appropriately with detailed logs for debugging.
Mastering the OID Hierarchy for Tabular Data: Once the script is solid, attention will shift to optimizing the OID hierarchy. Using basic disk files enables rapid iteration on the data structure without disrupting the script’s stability, ensuring efficient experimentation with different hierarchies.

Ultimately, the entire approach is a prototype solution aimed at solidifying my methodology before deploying the actual custom metrics.

Implementing the pass_persist Protocol

The pass_persist directive in Net-SNMP is a powerful tool that delegates control over an OID subtree to an external script. Unlike the simpler pass, which executes a new script for each SNMP request, pass_persist keeps the script running continuously, reducing overhead and benefiting dynamic or frequently updated metrics.

A Bash Implementation

The following Bash script implements the pass_persist protocol. It dynamically traverses an OID tree and supports PING, get, getnext, and getbulk commands. Logs are written to /tmp/read_oid_persist.log for debugging purposes.

#!/bin/bash
LOG_FILE="/tmp/read_oid_persist.log"

echo "$(date) - Starting read_oid_persist.sh" >> "$LOG_FILE"

while true; do
    read CMD
    echo "$(date) - Command: $CMD" >> "$LOG_FILE"

    if [ "$CMD" == "PING" ]; then
        echo "PONG"
        echo "$(date) - Responding to PING with PONG" >> "$LOG_FILE"
    else
        read OID
        echo "$(date) - OID: $OID" >> "$LOG_FILE"

        if [ "$CMD" == "get" ]; then
            if [ -f "/oids/$OID" ]; then
                VALUE=$(cat "/oids/$OID")
                TYPE="STRING"
                if [[ "$VALUE" =~ ^[0-9]+$ ]]; then
                    TYPE="INTEGER"
                fi
                echo "$OID"
                echo "$TYPE"
                echo "$VALUE"
                echo "$(date) - Returning value: $VALUE for OID: $OID, Type: $TYPE" >> "$LOG_FILE"
            else
                echo "NONE"
                echo "$(date) - OID not found: $OID" >> "$LOG_FILE"
            fi
        elif [ "$CMD" == "getnext" ]; then
            NEXT_OID=$(ls -A /oids | grep -A 1 "^$OID\$" | tail -n 1)
            if [ -f "/oids/$NEXT_OID" ]; then
                VALUE=$(cat "/oids/$NEXT_OID")
                TYPE="STRING"
                if [[ "$VALUE" =~ ^[0-9]+$ ]]; then
                    TYPE="INTEGER"
                fi
                echo "$NEXT_OID"
                echo "$TYPE"
                echo "$VALUE"
                echo "$(date) - Returning next OID: $NEXT_OID with value: $VALUE, Type: $TYPE" >> "$LOG_FILE"
            else
                echo "NONE"
                echo "$(date) - Next OID not found after: $OID" >> "$LOG_FILE"
            fi
        elif [ "$CMD" == "getbulk" ]; then
            read NON_REPEATERS MAX_REPETITIONS
            RESULTS=()
            CURRENT_OID=$OID

            for (( i=0; i<$NON_REPEATERS; i++ )); do
                if [ -f "/oids/$CURRENT_OID" ]; then
                    VALUE=$(cat "/oids/$CURRENT_OID")
                    TYPE="STRING"
                    if [[ "$VALUE" =~ ^[0-9]+$ ]]; then
                        TYPE="INTEGER"
                    fi
                    RESULTS+=("$CURRENT_OID")
                    RESULTS+=("$TYPE")
                    RESULTS+=("$VALUE")
                fi
                CURRENT_OID=$(ls -A /oids | grep -A 1 "^$CURRENT_OID\$" | tail -n 1)
            done

            for (( i=0; i<$MAX_REPETITIONS; i++ )); do
                if [ -f "/oids/$CURRENT_OID" ]; then
                    VALUE=$(cat "/oids/$CURRENT_OID")
                    TYPE="STRING"
                    if [[ "$VALUE" =~ ^[0-9]+$ ]]; then
                        TYPE="INTEGER"
                    fi
                    RESULTS+=("$CURRENT_OID")
                    RESULTS+=("$TYPE")
                    RESULTS+=("$VALUE")
                fi
                CURRENT_OID=$(ls -A /oids | grep -A 1 "^$CURRENT_OID\$" | tail -n 1)
            done

            for RESULT in "${RESULTS[@]}"; do
                echo "$RESULT"
            done

            echo "$(date) - Returning bulk results for OID: $OID" >> "$LOG_FILE"
        else
            echo "NONE"
            echo "$(date) - Unknown command: $CMD" >> "$LOG_FILE"
        fi
    fi
done

Introducing the Dummy Data Script

This script uses plain disk files for SNMP responses, ensuring the data structure is self-documenting and easy to replicate. It allows for quick iterations and clear debugging, validating the pass_persist implementation without the need for live metrics.

#!/bin/bash

# --------------------------------------------------------------------
# SNMP Table Prototype Script: Gateway Status and Call Counts
#
# This script creates a single SNMP table with the following structure:
# - Base OID: .1.3.6.1.4.1.9999.10701.1 (gatewayTable)
# - Columns:
#     .1 -> gatewayIndex     (INTEGER: Index of the gateway)
#     .2 -> gatewayDescr     (STRING: Description of the gateway)
#     .3 -> gatewayStatus    (INTEGER: 1=UP, 2=DOWN)
#     .4 -> gatewayCalls     (INTEGER: Number of active calls)
#
# Data rows:
#   - Index 1: chi-sbc, UP, 23 calls
#   - Index 2: la-sbc, DOWN, 15 calls
#
# This layout mimics the structure of IF-MIB tables and ensures all anchor
# OIDs are explicitly created for proper operation with pass_persist.
# --------------------------------------------------------------------

# Step 1: Create the anchor for the table base OID
mkdir -p /oids
echo "gatewayTable" > /oids/.1.3.6.1.4.1.9999.10701.1

# --------------------------------------------------------------------
# Column Anchors
# --------------------------------------------------------------------
echo "gatewayIndex"     > /oids/.1.3.6.1.4.1.9999.10701.1.1
echo "gatewayDescr"     > /oids/.1.3.6.1.4.1.9999.10701.1.2
echo "gatewayStatus"    > /oids/.1.3.6.1.4.1.9999.10701.1.3
echo "gatewayCalls"     > /oids/.1.3.6.1.4.1.9999.10701.1.4

# --------------------------------------------------------------------
# Row Definitions
# --------------------------------------------------------------------

# Row 1: chi-sbc
echo "1"                > /oids/.1.3.6.1.4.1.9999.10701.1.1.1    # Row 1, Column 1 (Index)
echo "chi-sbc"          > /oids/.1.3.6.1.4.1.9999.10701.1.2.1    # Row 1, Column 2 (Description)
echo "1"                > /oids/.1.3.6.1.4.1.9999.10701.1.3.1    # Row 1, Column 3 (Status: UP)
echo "23"               > /oids/.1.3.6.1.4.1.9999.10701.1.4.1    # Row 1, Column 4 (Calls)

# Row 2: la-sbc
echo "2"                > /oids/.1.3.6.1.4.1.9999.10701.1.1.2    # Row 2, Column 1 (Index)
echo "la-sbc"           > /oids/.1.3.6.1.4.1.9999.10701.1.2.2    # Row 2, Column 2 (Description)
echo "2"                > /oids/.1.3.6.1.4.1.9999.10701.1.3.2    # Row 2, Column 3 (Status: DOWN)
echo "15"               > /oids/.1.3.6.1.4.1.9999.10701.1.4.2    # Row 2, Column 4 (Calls)

# Completion Message
echo "SNMP table structure created successfully."

# --------------------------------------------------------------------
# Notes:
# - This table uses OIDs under the private enterprise tree (1.3.6.1.4.1).
# - Example `snmpwalk` command for testing:
#   snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
# --------------------------------------------------------------------

Validating the pass_persist Implementation

A tremendous amount of developer testing went into this, involving hundreds of test invocations, detailed log analysis, and live packet captures. This was a rigorous process aimed at ensuring everything worked as expected across different scenarios.

The following section presents just the final validation steps, which confirm that the implementation is functioning as intended. Keep in mind, this is the culmination of extensive testing and iterative improvements.

Here’s the configured snmpd.conf line for the pass_persist directive:

root@NY-SBC:/# tail -n 1 /etc/snmp/snmpd.conf
pass_persist .1.3.6.1.4.1.9999.10701.1 /usr/bin/bash /usr/local/scripts/read_oid_persist.sh
root@NY-SBC:/#

Let’s try some SNMP walks of the whole table as well as each column:

root@NY-SBC:/#  snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 23
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 15
root@NY-SBC:/#
root@NY-SBC:/#
root@NY-SBC:/#  snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.1
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
root@NY-SBC:/#
root@NY-SBC:/#
root@NY-SBC:/#  snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.2
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "la-sbc"
root@NY-SBC:/#
root@NY-SBC:/#  snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.3
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 2
root@NY-SBC:/#
root@NY-SBC:/#  snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.4
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 23
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 15
root@NY-SBC:/#

I also tried SNMP bulk walks, which correspond to the ‘GET TABLE’ idea we’ll eventually encounter in SolarWinds. These utilize a different mechanism at the SNMP protocol level and exercise a different code path in our pass_persist script:

root@NY-SBC:/# snmpbulkwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 23
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 15
root@NY-SBC:/#
root@NY-SBC:/# snmpbulkwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.1
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
root@NY-SBC:/#
root@NY-SBC:/# snmpbulkwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.2
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "la-sbc"
root@NY-SBC:/#
root@NY-SBC:/# snmpbulkwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.3
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 2
root@NY-SBC:/#

Finally let’s try some random gets, walks, etc. against individual OIDs, just to see if we can uncover any issues:

root@NY-SBC:/# snmpbulkwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.3.1
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
root@NY-SBC:/#
root@NY-SBC:/# snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.3.1
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
root@NY-SBC:/#
root@NY-SBC:/# snmpget -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.3.1
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
root@NY-SBC:/#
root@NY-SBC:/# snmpget -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.3.2
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 2
root@NY-SBC:/#
root@NY-SBC:/# snmpget -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1.4.2
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 15
root@NY-SBC:/#

Integration with SolarWinds

As part of the integration process for monitoring custom metrics from the SBCs into SolarWinds, I utilized SolarWinds’ UnDP (Universal Device Poller) tool to import the necessary SNMP data. This approach enabled me to retrieve a variety of metrics, including the dynamic discovery of connected SIP trunks.

UnDP Tool Overview

The UnDP tool offers flexibility for importing custom SNMP data, yet neither it nor SolarWinds’ main web UI support importing custom MIB files. To add a custom MIB to the monitoring system, you must submit the MIB file through SolarWinds support channels. They will then package it into a fleet-wide “database update,” as there is no supported way to perform a custom MIB import independently.

For this project, I leveraged the tool’s existing capabilities to import table-based data from the SBC using the “GET TABLE” functionality. Although the UnDP tool allows the definition of custom metrics without a MIB definition, there are limitations when dealing with custom metrics not covered by an imported MIB.

Challenges and Solutions

One of the main challenges I encountered was the lack of granular control over table formatting and labeling when using the “GET TABLE” feature. The UnDP tool does not offer per-column formatting options, which impacted how the data was displayed in the SolarWinds UI. Despite this limitation, I was able to configure table imports and implement essential monitoring functions, such as a working test alert for trunk status changes.

Although the tool does not fully align with the SNMP community’s standard for tabular data, it still allows for column-at-once polling, providing a reasonable compromise. With further adjustments, such as refining metric labeling and adjusting the display format, these limitations can be addressed to enhance the overall integration.

Conclusion

This proof of concept (PoC) demonstrates the successful integration of custom SNMP metrics with SolarWinds, meeting the immediate need for monitoring SIP trunk statuses. While it does not yet showcase advanced dashboards or fully customized metric displays, it sets a solid foundation for future enhancements.

I am eager to further explore SolarWinds’ capabilities and refine the configuration to create a more robust and dynamic monitoring solution. Given the opportunity to join your team, I am committed to mastering SolarWinds and delivering a polished, comprehensive integration.

Screenshots

All of the stock FreeSWITCH metrics are being polled (scalars, via AgentX, but defined in UnDP).

A sample of the tabular data imports via pass_persist, using column-at-once strategy (UnDP).

Status metric is successfully populating to SolarWinds UI as an actionable item, was able to define an alert around it.

Successfully triggered the alert on SIP Trunk Down.

Final Chapter: Consolidating Lessons Learned and Delivering a Robust Solution

To complete the live custom metric implementation promised in the PoC, I needed to implement and plumb the actual live metric lookups from FreeSWITCH—swapping out the fake metrics for real ones.

I was thrilled with the simplicity of my prototype script, and that it actually worked. It was well-suited for the PoC, so I wanted to carry it forward as much as possible, in order to avoid introducing a lot of regressions.

How to tackle the redesign? I had the insight that the prototype back-end was implemented through the filesystem. I made a strategic decision to surgically modify the prototype script at the exact points where it interfaced with the filesystem, doing a 1:1 swap of the filesystem-targeting mechanism with a call-out to an external script that I would write to mimic the interface fully.

I identified three key mechanisms where my prototype pass_persist script was interacting with the filesystem: -z (does file exist) checks in conditional logic, ls -A (directory listing of OIDs), and cat (retrieval of values).

I also decided to do the refactor in two phases: in the first phase, I migrated the traversal/tree navigational aspects of the pass_persist script (e.g., does an OID exist, what is the “next” OID following this one) but left the actual metric retrieval logic targeting the dummy back-end. Once I had this aspect locked in and validated, I made a backup copy of my work and started on the live metric lookups as the final phase.

Here’s the final directory layout and the scripts in their complete form:

root@NY-SBC:/usr/local/scripts# pwd
/usr/local/scripts
root@NY-SBC:/usr/local/scripts# ls -la
total 28
drwxr-xr-x  2 root root 4096 Jan 13 01:48 .
drwxr-xr-x 11 root root 4096 Jan  5 18:26 ..
-rwxr-xr-x  1 root root 3095 Jan 13 01:48 get_value.sh
-rwxr-xr-x  1 root root  519 Jan 12 21:58 list_oids.sh
-rwxr-xr-x  1 root root  214 Jan 13 01:24 oid_exists.sh
-rwxr-xr-x  1 root root 4386 Jan 13 01:23 read_oid_persist.sh
root@NY-SBC:/usr/local/scripts#

read_oid_persist.sh (main script)

#!/bin/bash
LOG_FILE="/tmp/read_oid_persist.log"

echo "$(date) - Starting read_oid_persist.sh" >> "$LOG_FILE"

# Define external function interfaces
oid_exists() {
    # Call an external script that checks if the OID exists
    /usr/local/scripts/oid_exists.sh "$1"
}

list_oids() {
    # Call an external script that lists OIDs in order
    /usr/local/scripts/list_oids.sh
}

get_value() {
    # Call an external script to get the value for the given OID
    /usr/local/scripts/get_value.sh "$1"
}

while true; do
    read CMD
    echo "$(date) - Command: $CMD" >> "$LOG_FILE"

    if [ "$CMD" == "PING" ]; then
        echo "PONG"
        echo "$(date) - Responding to PING with PONG" >> "$LOG_FILE"
    else
        read OID
        echo "$(date) - OID: $OID" >> "$LOG_FILE"

        if [ "$CMD" == "get" ]; then
            if oid_exists "$OID"; then
                VALUE=$(get_value "$OID")
                TYPE="STRING"
                if [[ "$VALUE" =~ ^[0-9][0-9]*$ ]]; then
                    TYPE="INTEGER"
                fi
                echo "$OID"
                echo "$TYPE"
                echo "$VALUE"
                echo "$(date) - Returning value: $VALUE for OID: $OID, Type: $TYPE" >> "$LOG_FILE"
            else
                echo "NONE"
                echo "$(date) - OID not found: $OID" >> "$LOG_FILE"
            fi
        elif [ "$CMD" == "getnext" ]; then
            NEXT_OID=$(list_oids | grep -A 1 "^$OID\$" | tail -n 1)

            # Safety net: Check if NEXT_OID is the same as the original OID
            if [ "$NEXT_OID" == "$OID" ] || [ -z "$NEXT_OID" ]; then
                NEXT_OID=""
            fi

            if oid_exists "$NEXT_OID"; then
                VALUE=$(get_value "$NEXT_OID")
                TYPE="STRING"
                if [[ "$VALUE" =~ ^[0-9][0-9]*$ ]]; then
                    TYPE="INTEGER"
                fi
                echo "$NEXT_OID"
                echo "$TYPE"
                echo "$VALUE"
                echo "$(date) - Returning next OID: $NEXT_OID with value: $VALUE, Type: $TYPE" >> "$LOG_FILE"
            else
                echo "NONE"
                echo "$(date) - Next OID not found after: $OID" >> "$LOG_FILE"
            fi
        elif [ "$CMD" == "getbulk" ]; then
            read NON_REPEATERS MAX_REPETITIONS
            echo "$(date) - Non-repeaters: $NON_REPEATERS, Max-repetitions: $MAX_REPETITIONS" >> "$LOG_FILE"

            RESULTS=()
            CURRENT_OID=$OID

            for (( i=0; i<$NON_REPEATERS; i++ )); do
                if oid_exists "$CURRENT_OID"; then
                    VALUE=$(get_value "$CURRENT_OID")
                    TYPE="STRING"
                    if [[ "$VALUE" =~ ^[0-9][0-9]*$ ]]; then
                        TYPE="INTEGER"
                    fi
                    RESULTS+=("$CURRENT_OID")
                    RESULTS+=("$TYPE")
                    RESULTS+=("$VALUE")
                else
                    RESULTS+=("NONE")
                fi
                CURRENT_OID=$(list_oids | grep -A 1 "^$CURRENT_OID\$" | tail -n 1)

                # Safety net for non-repeaters
                if [ "$CURRENT_OID" == "$OID" ] || [ -z "$CURRENT_OID" ]; then
                    CURRENT_OID=""
                fi
            done

            for (( i=0; i<$MAX_REPETITIONS; i++ )); do
                if oid_exists "$CURRENT_OID"; then
                    VALUE=$(get_value "$CURRENT_OID")
                    TYPE="STRING"
                    if [[ "$VALUE" =~ ^[0-9][0-9]*$ ]]; then
                        TYPE="INTEGER"
                    fi
                    RESULTS+=("$CURRENT_OID")
                    RESULTS+=("$TYPE")
                    RESULTS+=("$VALUE")
                else
                    RESULTS+=("NONE")
                fi
                CURRENT_OID=$(list_oids | grep -A 1 "^$CURRENT_OID\$" | tail -n 1)

                # Safety net for max repetitions
                if [ "$CURRENT_OID" == "$OID" ] || [ -z "$CURRENT_OID" ]; then
                    CURRENT_OID=""
                fi
            done

            for RESULT in "${RESULTS[@]}"; do
                echo "$RESULT"
            done

            echo "$(date) - Returning bulk results for OID: $OID" >> "$LOG_FILE"
        else
            echo "NONE"
            echo "$(date) - Unknown command: $CMD" >> "$LOG_FILE"
        fi
    fi
done

list_oids.sh (my ‘ls -A’ drop-in replacement)

#!/bin/bash

# Replacement for "ls" command to mimic directory contents with correct order

# Base OID
BASE_OID=".1.3.6.1.4.1.9999.10701.1"

# Get the number of gateways directly
NUM_GATEWAYS=$(sudo /usr/bin/fs_cli -x 'sofia status gateway' | grep gateways: | awk '{print $1}')

# Print OIDs in the required order
echo "$BASE_OID"

# Loop through static entries and dynamic rows
for ((i=1; i<=4; i++)); do
    echo "$BASE_OID.$i"

    for ((j=1; j<=NUM_GATEWAYS; j++)); do
        echo "$BASE_OID.$i.$j"
    done
done

oid_exists.sh (my “-z” bash if-file-exists conditional drop-in)

#!/bin/bash

OID="$1"

# Check if OID exists by looking for it in the output of list_oids.sh
if /usr/local/scripts/list_oids.sh | grep -q "^$OID$"; then
    exit 0  # OID found
else
    exit 1  # OID not found
fi

get_value.sh (my “cat” drop-in replacement)

#!/bin/bash

# Ensure the OID is provided as an argument
if [ -z "$1" ]; then
  echo "Usage: $0 "
  exit 1
fi

# Define a function to retrieve the Nth gateway
get_nth_gateway() {
  local N=$1
  sudo fs_cli -x 'sofia status gateway' | grep '@' | awk '{print $1}' | sed -n "${N}p"
}

# Unpack the OID to determine which gateway and which metric
OID=$1

# Example OID structure: .1.3.6.1.4.1.9999.10701.1..
# Extract the metric type from the second-to-last part of the OID
METRIC_INDEX=$(echo "$OID" | awk -F'.' '{print $(NF-1)}')

# Extract the gateway index from the last part of the OID
GATEWAY_INDEX=$(echo "$OID" | awk -F'.' '{print $NF}')

# Define the anchor OIDs and their hard-coded return values
ANCHOR_OIDS=(
    ".1.3.6.1.4.1.9999.10701.1"
    ".1.3.6.1.4.1.9999.10701.1.1"
    ".1.3.6.1.4.1.9999.10701.1.2"
    ".1.3.6.1.4.1.9999.10701.1.3"
    ".1.3.6.1.4.1.9999.10701.1.4"
)

# Define the corresponding hard-coded values for the anchor OIDs
ANCHOR_VALUES=(
    "gatewayTable"
    "gatewayIndex"
    "gatewayDescr"
    "gatewayStatus"
    "gatewayCalls"
)

# Check if the OID is one of the anchor OIDs
for i in "${!ANCHOR_OIDS[@]}"; do
    if [ "$OID" == "${ANCHOR_OIDS[$i]}" ]; then
        # Return the hard-coded value corresponding to the anchor OID
        echo "${ANCHOR_VALUES[$i]}"
        exit 0
    fi
done

# Get the gateway name using the embedded function
GATEWAY=$(get_nth_gateway "$GATEWAY_INDEX")

# Extract the specific metric for this gateway
case "$METRIC_INDEX" in
    "1")
        # gatewayIndex (just return the gateway index)
        METRIC_NAME="gatewayIndex"
        METRIC_VALUE="$GATEWAY_INDEX"
        ;;
    "2")
        # gatewayDescr (use the DESCR for the description)
        METRIC_NAME="gatewayDescr"
        # Extract the description (the part after the "::")
        DESCR=$(echo "$GATEWAY" | sed 's/.*:://')
        METRIC_VALUE="$DESCR"
        ;;
    "3")
        # gatewayStatus (1=UP, 2=DOWN)
        METRIC_NAME="gatewayStatus"
        METRIC_VALUE=$(sudo /usr/bin/fs_cli -x "sofia status gateway $GATEWAY" | awk '/^Status/ {print $2}')

        # Convert numeric status to string (1=UP, 2=DOWN), but return numeric value
        if [ "$METRIC_VALUE" == "UP" ]; then
            METRIC_VALUE="1"
        elif [ "$METRIC_VALUE" == "DOWN" ]; then
            METRIC_VALUE="2"
        else
            METRIC_VALUE="UNKNOWN"  # Handle unexpected values
        fi
        ;;
    "4")
        # gatewayCalls (fetching active calls from sofia status)
        METRIC_NAME="gatewayCalls"
        # Extract the profile (the part before the "::")
        PROFILE=$(echo "$GATEWAY" | sed 's/::.*//')
        METRIC_VALUE=$(sudo /usr/bin/fs_cli -x "sofia status" | awk -v profile="$PROFILE" '$1 == profile {gsub(/[()]/, "", $5); print $5}')

        # If we did not find a value, set it to 0 (or handle as needed)
        if [ -z "$METRIC_VALUE" ]; then
            METRIC_VALUE="0"
        fi
        ;;
    *)
        echo "Unsupported metric type: $METRIC_INDEX"
        exit 2
        ;;
esac

echo "$METRIC_VALUE"

Validation

To ensure the robustness of the solution, I conducted extensive testing, covering a wide range of edge cases. The following examples are just a few key queries demonstrating that the solution is working as intended. This section reflects a fraction of the comprehensive validation efforts undertaken.

Idle system

root@NY-SBC:~# snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 0
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 0
root@NY-SBC:~#

With a call placed on the trunk to chi-sbc

root@NY-SBC:~# snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 0

With the SIP trunk NIC administratively forced to “DOWN” status on chi-sbc, after 30 seconds:

root@NY-SBC:~# snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 0
root@NY-SBC:~#
root@NY-SBC:~#

After defining a third SIP Trunk targeting a new peer SBC “slc-sbc” (which is not online):

root@NY-SBC:~# snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.1.3 = INTEGER: 3
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "slc-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.3 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.3.3 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 0
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 0
SNMPv2-SMI::enterprises.9999.10701.1.4.3 = INTEGER: 0
root@NY-SBC:~#

After reverting the downed NIC back to “UP” state on chi-sbc:

root@NY-SBC:~# snmpwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.1.3 = INTEGER: 3
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "slc-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.3 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.3 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 0
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 0
SNMPv2-SMI::enterprises.9999.10701.1.4.3 = INTEGER: 0
root@NY-SBC:~#

Test via the snmpbulkwalk utility to exercise GET BULK code path:

root@NY-SBC:~# snmpbulkwalk -v2c -c public localhost .1.3.6.1.4.1.9999.10701.1
SNMPv2-SMI::enterprises.9999.10701.1.1 = STRING: "gatewayIndex"
SNMPv2-SMI::enterprises.9999.10701.1.1.1 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.1.2 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.1.3 = INTEGER: 3
SNMPv2-SMI::enterprises.9999.10701.1.2 = STRING: "gatewayDescr"
SNMPv2-SMI::enterprises.9999.10701.1.2.1 = STRING: "slc-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.2 = STRING: "chi-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.2.3 = STRING: "la-sbc"
SNMPv2-SMI::enterprises.9999.10701.1.3 = STRING: "gatewayStatus"
SNMPv2-SMI::enterprises.9999.10701.1.3.1 = INTEGER: 2
SNMPv2-SMI::enterprises.9999.10701.1.3.2 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.3.3 = INTEGER: 1
SNMPv2-SMI::enterprises.9999.10701.1.4 = STRING: "gatewayCalls"
SNMPv2-SMI::enterprises.9999.10701.1.4.1 = INTEGER: 0
SNMPv2-SMI::enterprises.9999.10701.1.4.2 = INTEGER: 0
SNMPv2-SMI::enterprises.9999.10701.1.4.3 = INTEGER: 0
root@NY-SBC:~#

Closing Thoughts

This project demonstrates my ability to design and implement a custom SNMP monitoring solution for FreeSWITCH metrics, emphasizing a strategic and modular approach. From initial prototyping with dummy data to integrating live metrics, it showcases my dedication to building effective, real-world solutions.

The thorough testing and validation efforts reflect my focus on ensuring robustness and reliability. My determination and commitment to follow-through have been crucial in delivering this project.

I look forward to discussing the project further and appreciate the opportunity to demonstrate my approach to systems administration challenges with creativity and precision.

Multi-City VoIP Network Implementation

2025-01-06T03:00:00-05:00

Background

In a previous project, I deployed a barebones PBX on Metaswitch Rhino TAS. While the Metaswitch platform itself was impressive, its bundled sample applications were rudimentary and intended only as source code examples. These sample applications lacked logs, metrics endpoints, and documentation. The key takeaway? Metaswitch Rhino TAS is a platform to build on, not anything flight-ready.

Overview

Using FreeSWITCH and Asterisk—both well-regarded platforms—I’ve designed a multi-city VoIP network connecting New York, Los Angeles, and Chicago. Each site combines an Asterisk PBX for local functionality with a dedicated FreeSWITCH-based SBC to interconnect the sites over SIP trunks.

My intention is to mimic a real-world setup, emphasizing modularity and control over call flow using Back-to-Back User Agent (B2BUA) principles. Although this system operates in a closed environment without PSTN integration, the SBC lays a solid foundation for external connectivity if needed. This setup demonstrates industry-relevant skills in telephony, systems administration, and network design.

Focus

This project isn’t about presenting a polished, Solutions Engineer-level blueprint, nor is it an attempt to teach industry professionals about their field. It’s about me devising and building an environment with enough complexity to showcase my hands-on approach to solving real-world systems administration challenges—and to prove that I’m someone you can trust to run your systems.

In my earlier project, I set up a barebones PBX on a single network segment—simple and straightforward. Now, I’m diving into something considerably more challenging. This project is designed to test and demonstrate my skills, showcasing my commitment to craft and determination to succeed.

Platform

Commercial SBC solutions are designed for enterprise environments, focusing on reliability, performance, and predictable features. Open-source projects, on the other hand, often aim to cover diverse telephony use cases rather than specializing exclusively in SBC functionality.

In my search for an open-source SBC, I found projects with SBC capabilities that varied in focus and maturity. After careful consideration, I chose FreeSWITCH for its powerful feature set, active community, and robust developer support. It exceeded my requirements, offering a dependable and comprehensive solution without compromise.

For PBX functionality, Asterisk was the natural choice. Its proven reliability allowed me to separate PBX and SBC roles effectively, ensuring a straightforward and dependable solution for local telephony.

Layout

Local Infrastructure

Location	Network	SBC IP	PBX IP
NY	192.168.254.0/24	192.168.254.221	192.168.254.222
CHI	192.168.253.0/24	192.168.253.221	192.168.253.222
LA	192.168.252.0/24	192.168.252.221	192.168.252.222

Dedicated Circuits

SIP Trunk	Trunk ID	Network	NY-SBC IP	LA-SBC IP	CHI-SBC IP
NY ↔ LA	Trunk 10	10.10.10.0/30	10.10.10.1	10.10.10.2
NY ↔ CHI	Trunk 20	10.10.20.0/30	10.10.20.1		10.10.20.2
LA ↔ CHI	Trunk 30	10.10.30.0/30		10.10.30.1	10.10.30.2

In case of a SIP trunk failure, calls are automatically rerouted through the alternate city.

Trunk Attachment

Determining whether to route the SIP trunks through the internal network via a firewall and NAT, or directly attach them to the SBC using a multi-homing strategy requires careful consideration of security, performance, network architecture, management complexity, and troubleshooting efficiency.

The Standard Approach: Routed Network with NAT

In the standard configuration, the SIP trunk is routed through the corporate network, typically via a firewall or router, with NAT applied. In this setup, the provider’s SIP traffic is directed to an external IP address on the firewall’s interface, which may be a private IP address if using a dedicated circuit (such as a point-to-point connection). The firewall then performs NAT to forward this traffic to the SBC’s internal IP address. Similarly, for outbound SIP traffic, the SBC uses the internal network to reach the provider, with the firewall translating the source IP to its external IP (again, typically a private IP in the case of a dedicated circuit), allowing the response to be correctly routed back.

This method is widely used because it acknowledges the firewall’s traditional role in perimeter security, effectively isolating the SBC from direct exposure to the external SIP trunk network. It also simplifies management by leveraging existing network infrastructure and firewall policies already in place for other services. The primary benefit is that the firewall handles both inbound and outbound traffic, ensuring security, address translation, and routing. The SBC does not need to sit on multiple networks and communicates to and from its peers via a single IP address.

However, deployment and troubleshooting can be delayed when different teams need to coordinate, especially when vendor support is required.

The Multi-Homing Approach: Direct Attachment to the SBC

Alternatively, the multi-homing strategy involves directly attaching the SIP trunk to the SBC, effectively treating the SBC as the gateway to the external network. This approach consolidates control within the SBC, positioning it as the central component for all SIP traffic, with fewer dependencies on external routers or firewalls for SIP management.

The appeal of multi-homing lies in the potential for simplified operations. By routing SIP traffic directly through the SBC, you centralize control, ensuring all troubleshooting and management can be done from a single system. This eliminates the need to coordinate with other network devices and reduces the complexity of dealing with multiple points of failure. Moreover, because the SBC handles the routing internally, it can provide a more streamlined and direct path for SIP communication, enhancing performance by reducing intermediary network hops.

However, this approach is not without its complexities. The SBC now takes on responsibilities traditionally handled by routers and firewalls. While this offers full control, it also increases the configuration complexity, requiring deeper engagement with FreeSWITCH to ensure proper setup. Additionally, this approach demands a higher level of attention to the SBC’s security and performance, as it becomes directly connected to the external network.

Decision: Streamlined Operations through Multi-Homing

While the multi-homing strategy requires more effort initially, it ultimately streamlines operations, providing a more efficient operation. This approach simplifies management by consolidating control into the SBC, where all key configurations can be handled from a single point.

With this setup, only one skill—managing the SBC—needs to be mastered. There’s no need to hand off tasks between teams or bounce between the SBC, firewall, router, and other devices. Deployments, moves, adds, and changes can be done directly within the SBC, eliminating delays and inefficiencies associated with multiple handoffs. Troubleshooting is also simplified, as issues can be diagnosed and resolved from a single system, reducing complexity and speeding up resolution times by minimizing the need for cross-team coordination.

Key considerations for adopting the multi-homing strategy include:

Consolidated Operations:
- Centralized control within the SBC.
- Streamlined management and troubleshooting into a single system.
- Simplified deployments, moves, adds, and changes within the SBC, reducing delays and inefficiencies.
Coordination Delays:
- Eliminated the need for handoffs between different teams or devices, minimizing potential deployment and troubleshooting delays.
- Reduced complexity by having a single point of control and responsibility.
Finger-Pointing:
- Reduced the likelihood of finger-pointing and evasive responses during troubleshooting.
- Simplified issue resolution with a clear, single point of focus, speeding up response times and minimizing miscommunication.

In summary, while the multi-homing strategy requires more effort initially, it brings significant long-term benefits by consolidating operations, minimizing coordination delays, and reducing the potential for finger-pointing. This results in a more efficient and streamlined operation, providing a cohesive VoIP infrastructure with minimal external dependencies.

Network bindings

root@NY-SBC:~# ip addr
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens18:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether bc:24:11:e6:67:38 brd ff:ff:ff:ff:ff:ff
    altname enp0s18
    inet 192.168.254.221/24 brd 192.168.254.255 scope global ens18
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:11ff:fee6:6738/64 scope link
       valid_lft forever preferred_lft forever
3: ens19:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether bc:24:11:03:1f:36 brd ff:ff:ff:ff:ff:ff
    altname enp0s19
    inet 10.10.10.1/30 brd 10.10.10.3 scope global ens19
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:11ff:fe03:1f36/64 scope link
       valid_lft forever preferred_lft forever
4: ens20:  mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether bc:24:11:42:e4:ad brd ff:ff:ff:ff:ff:ff
    altname enp0s20
    inet 10.10.20.1/30 brd 10.10.20.3 scope global ens20
       valid_lft forever preferred_lft forever
    inet6 fe80::be24:11ff:fe42:e4ad/64 scope link
       valid_lft forever preferred_lft forever
root@NY-SBC:~#

SIP Profiles

freeswitch@NY-SBC> sofia status
                     Name          Type                                       Data      State
=================================================================================================
            external-ipv6       profile                   sip:mod_sofia@[::1]:5080      RUNNING (0)
                  trunk20       profile              sip:mod_sofia@10.10.20.1:5060      RUNNING (0)
         trunk20::chi-sbc       gateway                      sip:ny-sbc@10.10.20.2      REGED
                 external       profile           sip:mod_sofia@50.48.215.192:5080      RUNNING (0)
    external::example.com       gateway                    sip:joeuser@example.com      NOREG
       lab4.decoursey.com         alias                                    trunk10      ALIASED
            internal-ipv6       profile                   sip:mod_sofia@[::1]:5060      RUNNING (0)
                  trunk10       profile              sip:mod_sofia@10.10.10.1:5060      RUNNING (0)
          trunk10::la-sbc       gateway                      sip:ny-sbc@10.10.10.2      REGED
                 internal       profile         sip:mod_sofia@192.168.254.221:5060      RUNNING (0)
=================================================================================================
6 profiles 1 alias

freeswitch@NY-SBC>

FreeSWITCH Multi-Homed Configuration

My initial approach to FreeSWITCH’s multi-homed setup involved configuring the NICs for the internal and SIP trunk networks, and then modifying the default FreeSWITCH configuration to define the required endpoints and parameters. However, this strategy didn’t work as expected.

There are several places in the default configuration where items like source IP and signaling IP addresses are hand-configured. In a multi-homed scenario, what’s correct in one context can be wrong in another. Therefore, simply editing the default configuration files (SIP profiles) isn’t sufficient.

The breakthrough came in defining distinct SIP profiles for each network segment and ensuring that remote gateways are affiliated with the correct network they are reachable through.

SIP Profiles

In FreeSWITCH, SIP profiles define how FreeSWITCH communicates with devices on specific network segments. A SIP profile is tied to a network interface (or IP address) and dictates how SIP traffic is handled on that segment. For example:

Internal Profile: Handles communication with internal devices or PBXs, usually bound to a private IP address on the local network.
SIP Trunk Profiles: Each SIP trunk network requires its own unique SIP profile, which directly binds to the IP assigned to that SIP trunk network. This profile is aligned with the network facing the specific SIP trunk and ensures that the traffic is routed correctly.

Gateway Definitions

Gateway Definitions in FreeSWITCH specify how the system connects to other SIP endpoints, such as remote PBXs, other SBCs, or SIP providers. Each gateway needs to be associated with a SIP profile, and this relationship ensures traffic is routed through the correct network interface. For example, a gateway might be defined to point to a remote SBC, specifying the SBC’s IP address or hostname, the necessary authentication credentials, and the associated SIP profile.

The Key Relationship: SIP Profiles and Gateways

The most important aspect of working with multiple network segments in FreeSWITCH is the relationship between SIP profiles and Gateway Definitions. Each gateway must be associated with the appropriate SIP profile that defines the network interface it should use. This is crucial because SIP traffic must be routed through the correct network interface—whether it’s the internal network or a dedicated SIP trunk network.

If you’re connecting to a remote SIP trunk over a secondary NIC, you would:

Create a SIP Trunk Profile bound to the IP of the secondary NIC.
Define a gateway within that profile, pointing to the remote endpoint (e.g., the remote SBC or SIP provider).
Ensure the gateway’s definition matches the SIP Trunk Profile’s network interface to correctly route the traffic.

Conclusion

In a multi-homed environment, especially when facing both internal and SIP trunk networks, FreeSWITCH must be explicitly configured to handle each network segment. This involves creating a unique SIP profile for each network interface and ensuring that each Gateway Definition points to the appropriate profile.

Dialplan

The FreeSWITCH dialplans are organized around where the traffic is presenting from.

Inbound arrivals

The below NY-SBC configuration is for handling calls arriving via Trunk 10, the circuit from LA. In general, it’s expected this should be calls destined for our NY-PBX extensions, and the plan is to hand these to NY-PBX. However, these might also be CHI-PBX destined calls that were failed over, and in that eventuality, we route those calls to CHI-SBC.

root@NY-SBC:/etc/freeswitch/dialplan# cat trunk10.xml

root@NY-SBC:/etc/freeswitch/dialplan#

Outbound dials

For outbound dials I ended up calling out to a Lua script.

root@NY-SBC:/etc/freeswitch/dialplan# cat public/01_route_to_la.xml

root@NY-SBC:/etc/freeswitch/dialplan#

This is that script.

root@NY-SBC:/etc/freeswitch/dialplan# cat /usr/share/freeswitch/scripts/route_to_la.lua
-- Initialize the FreeSWITCH API object
local api = freeswitch.API()

-- Function to check gateway status
local function check_gateway_status(gateway_name)
    local status = api:execute("sofia", "status gateway " .. gateway_name)
    if string.match(status, "Status%s+UP") then
        return "UP"
    else
        return "DOWN"
    end
end

-- Retrieve the destination number from the session
local destination_number = session:getVariable("destination_number")

if not destination_number then
    freeswitch.consoleLog("ERROR", "Destination number is nil. Unable to proceed.\n")
    session:hangup("NORMAL_TEMPORARY_FAILURE")
    return
end

-- Define the primary and secondary gateways
local primary_gateway = { name = "la-sbc", dialstring = "sofia/gateway/la-sbc/" .. destination_number }
local secondary_gateway = { name = "chi-sbc", dialstring = "sofia/gateway/chi-sbc/" .. destination_number }

-- Check the status of the primary gateway
local primary_status = check_gateway_status(primary_gateway.name)
freeswitch.consoleLog("INFO", "Primary gateway '" .. primary_gateway.name .. "' status: " .. primary_status .. "\n")

if primary_status == "UP" then
    -- Route the call through the primary gateway
    freeswitch.consoleLog("INFO", "Routing through primary gateway: " .. primary_gateway.name .. "\n")
    session:execute("bridge", primary_gateway.dialstring)
else
    -- If primary is down, check the secondary gateway
    local secondary_status = check_gateway_status(secondary_gateway.name)
    freeswitch.consoleLog("INFO", "Secondary gateway '" .. secondary_gateway.name .. "' status: " .. secondary_status .. "\n")

    if secondary_status == "UP" then
        -- Route the call through the secondary gateway
        freeswitch.consoleLog("INFO", "Routing through secondary gateway: " .. secondary_gateway.name .. "\n")
        session:execute("bridge", secondary_gateway.dialstring)
    else
        -- Neither gateway is up; try the primary and be done with it
        freeswitch.consoleLog("WARNING", "Both gateways are down. Attempting primary as a last resort.\n")
        session:execute("bridge", primary_gateway.dialstring)
    end
end

-- If no gateways succeed, hang up
if session:getVariable("originate_disposition") ~= "SUCCESS" then
    freeswitch.consoleLog("ERROR", "All attempts failed. Hanging up.\n")
    session:hangup("NORMAL_TEMPORARY_FAILURE")
end
root@NY-SBC:/etc/freeswitch/dialplan#

Failover to Backup SIP Trunks

During the PoC, automatic backup failover was right at my fingertips when setting up the SIP trunks, so it got pulled in. While it was an easy win, any push to production will start with a discussion of likely failure points, and the actual first-pass redundancy design may focus elsewhere.

Each SBC is responsible for routing calls to a remote city via its SIP trunk. However, if the trunk is down, should we route the call through another city?

The challenge here is that we can’t know for certain whether this secondary routing will help. It ultimately depends on the root cause of the problem—whether it’s a network issue affecting the entire site or just a specific SIP trunk that’s down.

Guiding Principles and Analysis

Failover engineering has some guiding principles, with one of the most important being caution around blindly failing over to another site. That other site may not be in any better posture to handle the traffic, and routing requests without understanding the underlying issue can lead to further complications. The nuances and gotchas in failover design are varied, but the general takeaway is to approach such decisions thoughtfully and with an understanding of potential risks.

After analysis, SIP’s end-to-end setup design offers an advantage in this case. If the call can’t be completed all the way to the callee, regardless of how many hops are involved, the failure will propagate all the way back to the original bridging attempt. It’s not as if the call would be accepted at an intermediary site, only to be mishandled there.

FreeSWITCH Failover Implementation

FreeSWITCH’s documentation provides a straightforward approach for failover, using the | separator to attempt a call on a secondary trunk if the primary one fails:

 application="bridge" data="sofia/gateway/primary/dialstring|sofia/gateway/secondary/dialstring"/>

During testing, this mechanism works well when the primary trunk is down. However, if the primary trunk is active and the callee rejects the call, FreeSWITCH’s failover mechanism attempts to bridge the call again, causing an unwanted re-ring for the party who just rejected it. I tried using scripting constructs to address this by evaluating the failure reason, but faced a race condition where I couldn’t get the failure reason in time to make an informed decision. While I haven’t ruled out a workable solution with developers or configuration experts, I have shifted focus for now.

Monitoring and Fail-Open Strategy

I was committed to implementing some form of failover, even if not perfect. Fortunately, my systems constantly monitor trunk status via SIP OPTIONS—I have the pings configured to occur every 15 seconds, providing real-time status feedback. Using this data, I can check via API call whether a trunk is up or down before attempting to route the call through it and then prioritize relay attempts likely to succeed. I even have some logic to try to “fail open” in the event of a status lookup failure, in which case the call is attempted down the primary route.

Conclusion: Effective Failover Mechanism

While this approach isn’t perfect, it’s reliable in all scenarios I’ve tested and provides an effective failover mechanism that avoids unnecessary retries or misrouting. This solution ensures that calls are handled efficiently, even in the event of a trunk outage, without introducing any significant delays or side effects.

SolarWinds Integration Progress Update

The integration of SNMP monitoring into the SBCs is an ongoing effort, and significant progress has been made. While this feature isn’t yet complete, the groundwork laid so far demonstrates a clear path forward. Here’s what has been accomplished:

Net-SNMP Integration

Successfully installed and configured the Linux Net-SNMP daemon on the SBCs.
Integrated FreeSWITCH’s built-in metrics using the AgentX protocol, enabling initial SNMP data collection.

SolarWinds Universal Device Pollers (UnDP)

Verified integration of FreeSWITCH metrics into SolarWinds through the Universal Device Poller feature.
Ensured stock metrics are now visible and trackable within the SolarWinds dashboard.

Dynamic SIP Trunk Discovery

Designed a strategy to dynamically discover SIP trunks defined on an SBC using table-based SNMP lookups.
This approach automates the addition of SIP trunk data into SolarWinds, eliminating the need for manual definitions.

Command Line and Shell Scripting

Extensively utilized the fs_cli command-line interface for FreeSWITCH to explore available metrics.
Developed and tested shell scripts using Net-SNMP’s “extend” and “pass” mechanisms to integrate these metrics into SolarWinds.

This progress marks a strong foundation for a fully functional monitoring solution. The project remains active, and I anticipate delivering a further update by this Friday. The next steps include finalizing the dynamic discovery mechanism and refining the data presentation in SolarWinds.

FreeSWITCH Logging

As a systems administrator, understanding and managing FreeSWITCH logging is crucial for maintaining system health and troubleshooting issues efficiently. This guide aims to provide a detailed overview of FreeSWITCH logging, including log file locations, rotation policies, retention periods, log details, and verbosity adjustments.

Out-of-the-Box Logging

FreeSWITCH logs a variety of information by default, including system events, errors, and call handling details. The primary log file is /var/log/freeswitch/freeswitch.log.

Configuring Log Storage, Rotation, Retention, and Verbosity

Log storage location, rotation, retention, and verbosity are all configured within the /etc/freeswitch/autoload_configs/logfile.conf.xml file. Here’s how you can adjust these settings to meet your needs:

 name="logfile.conf.xml">
  
     name="logfile" value="/var/log/freeswitch/freeswitch.log"/>

     name="logrotate-size" value="104857600"/> 
     name="logrotate-count" value="10"/> 

     name="loglevel" value="INFO"/> 
  
     name="all" value="console,info,notice,warning,err,crit,alert"/>
    
   name="uuid" value="true" />

Rotation and Retention: Adjust the configuration settings to meet your retention requirements. For example, you might increase the file size limit or the number of retained logs if necessary.

Verbosity: The default log verbosity in FreeSWITCH was observed to be DEBUG, which may be excessive for regular operation. For my deployment, I have dialed this back to INFO, which provides information suitable for normal operation. Higher verbosity levels, such as DEBUG, are generally used when actively troubleshooting detailed issues. Adjust the verbosity level to your preference.

Log Details for Call Handling

The freeswitch.log file contains detailed information about call handling, including timestamps, log levels, source file names, line numbers, function names, and log data. Here is an example of a sample log line:

12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:32.021541 99.33% [NOTICE] switch_channel.c:1142 New Channel sofia/internal/25301@192.168.253.222 [12812bf9-98c8-4307-8fca-ca542bad93e6]
12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:32.021541 99.33% [INFO] sofia.c:10460 sofia/internal/25301@192.168.253.222 receiving invite from 192.168.253.222:5060 version: 1.10.12 -release-10222002881-a88d069d6fgit a88d069 2024-08-02 21:02:27Z 64bit call-id: f5ec2cfe-6b6c-43b3-b162-f22216eb5219
12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:32.041536 99.33% [INFO] sofia.c:10460 sofia/internal/25301@192.168.253.222 receiving invite from 192.168.253.222:5060 version: 1.10.12 -release-10222002881-a88d069d6fgit a88d069 2024-08-02 21:02:27Z 64bit call-id: f5ec2cfe-6b6c-43b3-b162-f22216eb5219
12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:32.041536 99.33% [INFO] mod_dialplan_xml.c:639 Processing 25301 <25301>->25401 in context public
12812bf9-98c8-4307-8fca-ca542bad93e6 EXECUTE [depth=0] sofia/internal/25301@192.168.253.222 set(outside_call=true)
12812bf9-98c8-4307-8fca-ca542bad93e6 EXECUTE [depth=0] sofia/internal/25301@192.168.253.222 export(RFC2822_DATE=Sun, 05 Jan 2025 04:02:32 -0500)
12812bf9-98c8-4307-8fca-ca542bad93e6 EXECUTE [depth=0] sofia/internal/25301@192.168.253.222 lua(route_to_ny.lua)
2025-01-05 04:02:32.041536 99.33% [INFO] switch_cpp.cpp:1466 Primary gateway 'ny-sbc' status: UP
2025-01-05 04:02:32.041536 99.33% [INFO] switch_cpp.cpp:1466 Routing through primary gateway: ny-sbc
12812bf9-98c8-4307-8fca-ca542bad93e6 EXECUTE [depth=0] sofia/internal/25301@192.168.253.222 bridge(sofia/gateway/ny-sbc/25401)
91777dc4-8d81-492b-bae1-af495eea7a97 2025-01-05 04:02:32.041536 99.33% [NOTICE] switch_channel.c:1142 New Channel sofia/trunk20/25401 [91777dc4-8d81-492b-bae1-af495eea7a97]
91777dc4-8d81-492b-bae1-af495eea7a97 2025-01-05 04:02:32.041536 99.33% [INFO] sofia_glue.c:1659 sofia/trunk20/25401 sending invite call-id: (null)
2025-01-05 04:02:32.641541 99.33% [INFO] sofia.c:1348 sofia/trunk20/25401 Update Callee ID to "Outbound Call" 
91777dc4-8d81-492b-bae1-af495eea7a97 2025-01-05 04:02:32.641541 99.33% [NOTICE] sofia.c:7604 Ring-Ready sofia/trunk20/25401!
12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:32.641541 99.33% [NOTICE] mod_sofia.c:2514 Ring-Ready sofia/internal/25301@192.168.253.222!
12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:32.641541 99.33% [NOTICE] switch_ivr_originate.c:572 Ring Ready sofia/internal/25301@192.168.253.222!
2025-01-05 04:02:41.441459 99.23% [INFO] sofia.c:1348 sofia/trunk20/25401 Update Callee ID to "Outbound Call" 
91777dc4-8d81-492b-bae1-af495eea7a97 2025-01-05 04:02:41.441459 99.23% [NOTICE] sofia.c:8681 Channel [sofia/trunk20/25401] has been answered
12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:41.461440 99.23% [NOTICE] sofia_media.c:90 Pre-Answer sofia/internal/25301@192.168.253.222!
12812bf9-98c8-4307-8fca-ca542bad93e6 2025-01-05 04:02:41.461440 99.23% [NOTICE] switch_ivr_originate.c:3855 Channel [sofia/internal/25301@192.168.253.222] has been answered

Remote Syslog

FreeSWITCH supports remote syslog logging through the mod_syslog module. This feature allows you to send log messages to a remote syslog server, which is useful for centralized logging and monitoring.

Load the mod_syslog Module: If the mod_syslog module is not already loaded, you need to load it by adding it to the FreeSWITCH modules configuration file. Open the /etc/freeswitch/autoload_configs/modules.conf.xml file and add the following line:
```
 module="mod_syslog"/>
```

Edit /etc/freeswitch/autoload_configs/switch.conf.xml:

Add the remote syslog configuration:

 name="remote_syslog" priority="1">
   application="log" loglevel="debug"/>
   application="syslog" loglevel="debug" server="syslog.example.com" port="514"/>

It looks to be possible to condition the send based on call criteria if desired via advanced configuration syntax.

Quick Tips and Tricks

Grep by UUID: Use grep /var/log/freeswitch/freeswitch.log to filter log entries by the session’s UUID. This is useful for isolating log details related to a specific call session, even with interleaved calls.
Grep by Value then UUID: First, grep for a specific value, such as a phone number, to identify the UUID of a call attempt. For example: grep "25301" /var/log/freeswitch/freeswitch.log. Then, use the UUID to grep for all related log entries: grep /var/log/freeswitch/freeswitch.log.
Tail the Log File: Use tail -f /var/log/freeswitch/freeswitch.log to continuously monitor the log file in real-time. This is helpful for observing live events and troubleshooting as they happen.
Count Errors: Use grep -c ERROR /var/log/freeswitch/freeswitch.log.* to count the number of error entries in the log files. This can give you a quick overview of the system’s health and highlight recurring issues.

Final Thoughts on Logging

Effective logging practices are essential for proactive system administration. Here are some final tips to help you get the most out of FreeSWITCH logging:

Practice Pulling Log Details:
- Regularly pull log details for test calls and interpret the entries, even if there are no issues. This practice helps develop muscle memory around log file locations and interpretation, ensuring the skills are readily available when needed urgently.
Routinely Examine Log Files:
- Examine log files regularly for benign or unnoticed errors. Try to clean up these errors if possible. If not, be aware of exactly what these benign errors are, so they do not become misleading distractions during emergency troubleshooting.
Keep Debug Logging in Mind:
- Always consider enabling debug logging if the standard logs are not providing enough information. Attempt to replicate the issue or request a fresh example to be performed while debug logging is enabled. This approach can provide deeper insights and help pinpoint the problem.

By incorporating these practices into your regular system maintenance routine, you can ensure that you are well-prepared to handle any issues that arise and maintain a healthy, well-functioning system.

Using `fs_cli` for Troubleshooting FreeSWITCH

When troubleshooting FreeSWITCH, fs_cli is your go-to tool for real-time management and debugging. It connects directly to the core FreeSWITCH process and provides a fast, interactive way to manage logs, monitor SIP traffic, and check profile status.

Getting Started

To connect interactively:

fs_cli

For one-off commands (good for bash scripting):

fs_cli -x "command"

Adjusting Logs

Set the logging level to debug for detailed troubleshooting:

log level debug

Note: Log verbosity changes here are not persistent.

Reduce verbosity when done:

log level info

SIP Tracing

Enable SIP signaling traces for live traffic monitoring:

sofia global siptrace on

Disable it once you’re finished:

sofia global siptrace off

Checking Profile Status

View an overview of all SIP profiles:

sofia status

Dive into the details of a specific profile:

sofia status profile

Beyond the Essentials

fs_cli is packed with commands for deeper troubleshooting:

Show active calls and channels
List registered endpoints
Inspect call variables by UUID

Stick with logs, traces, and profiles for most scenarios, but know the tool can handle much more when needed.

Closing Thoughts

This project highlights my ability to build a practical, multi-city VoIP network that balances complexity with real-world relevance. From SIP profile design to failover logic, it demonstrates a hands-on approach to solving problems and creating systems that work.

While the monitoring component and full validation details aren’t finished, the foundation is solid, and progress is on track for an update by Friday. This reflects my focus on delivering meaningful results rather than rushing to check boxes.

I look forward to discussing the project further and appreciate the opportunity to showcase how I approach systems administration challenges with thoughtfulness and follow-through.

Excellence in Mid-Market UCaaS Delivery

2024-12-22T03:00:00-05:00

Overview

The evolution of unified communications (UC) has presented significant opportunities and challenges for mid-market operators. As technologies like Microsoft Teams disrupt traditional models, operators must adapt to remain competitive. Metaswitch, with its Rhino Telephone Application Server (TAS), offers a platform for telephony software development that enables differentiation in the marketplace. By exploring its deployment, this project aims to demonstrate the potential for mid-market operators to leverage this tool in addressing evolving customer needs.

This technical exploration not only evaluates the Rhino TAS SDK as a platform but also showcases the critical role systems administration plays in deploying and managing customer-facing services effectively.

Background

The telecom industry has long been on a journey to modernize its infrastructure. The shift from aging circuit-switched systems to flexible, packet-switched networks has facilitated the retirement of legacy hardware, with software-based platforms taking over the functions of physical switches. Technologies like VoIP, Voice over LTE (VoLTE), and the IP Multimedia Subsystem (IMS) have been at the heart of these efforts, enabling greater efficiency and scalability.

Throughout this transformation, the importance of maintaining the reliability of traditional telephony was clear. Telephones continue to be indispensable, trusted devices that people rely on, particularly in emergencies, with an expectation that they will simply work without fail. Therefore, while the telecom landscape was modernizing, the industry needed to ensure that this shift did not compromise the bulletproof reliability of dial tone services.

To meet these challenges, Metaswitch’s Rhino Telephone Application Server (TAS) has become a widely adopted platform for both mobile and traditional carriers. Its design prioritizes stability while supporting a broad range of modern protocols, empowering carriers to deliver reliable, high-quality voice and multimedia services over advanced IP networks.

While Rhino TAS is predominantly known for its role in carrier infrastructure, its architecture and capabilities make it highly adaptable for other sectors, including mid-market UCaaS operators. This makes it an ideal platform for those operators seeking to differentiate themselves in the increasingly competitive space. Rhino TAS offers the flexibility and power to support complex, customized telephony services, such as advanced routing, logic, and integrations—capabilities that are key for mid-market operators looking to offer more than just basic PSTN integration services.

Microsoft Teams software addresses many communication needs, but it leaves opportunities for enhancement—both in terms of services and product functionality—that operators can fill. Operators, particularly those with a long-standing presence in the telecom industry—firms that developed robust cloud PBX and CCaaS offerings before Microsoft Teams reshaped the market—are well-positioned to continue creating advanced software for telephony systems. These operators are not newcomers merely meeting the minimum certification requirements for Operator Connect; they bring a deep history of telecom expertise and a proven track record of excellence.

The UCaaS market is rapidly expanding. While Teams is the centerpiece of this transformation, the true value for mid-market UCaaS operators lies in their ability to deliver value-added services that Microsoft doesn’t directly provide. As channel partners of Microsoft, operators play a critical role in offering services such as PSTN integration, legacy system migration, and ongoing support, helping businesses transition smoothly to modern communication solutions.

Opportunities Left for Mid-Market Operators

Big telecom carriers and tech giants often overlook opportunities in the mid-market UCaaS space. Mid-market customers require the same high-touch, detailed migration and integration work as enterprise customers, yet they don’t offer the same large-scale revenue potential that drives the focus of bigger players.

This is where mid-market operators come in. They deliver tailored solutions that address the needs of peer-sized businesses—needs that larger providers are unwilling to fulfill. This creates a distinct niche in the UCaaS market, where success hinges not only on the ability to integrate the right partnerships, technologies, and operational expertise, but also on delivering comprehensive service. The best operators distinguish themselves through superior service delivery—advising, implementing, and supporting their customers in ways the tech giants can’t match.

Differentiation within the Mid-Market Segment

While exceptional service delivery is key to success within the UCaaS mid-market, strategic platform selection plays a role in helping operators enhance that service. While all mid-market operators aim to fill gaps left by larger players, it is the most successful operators who make deliberate platform choices to differentiate themselves. The choice of platform directly influences an operator’s ability to deliver value, foster innovation, and meet the specific needs of their customers.

Broadly speaking, operators can select from three distinct classes of solutions. The first category consists of barebones, “SBC-only” systems, designed to meet the minimum requirements for market entry with minimal investment. The second category adds configurable PBX software alongside the SBC. These platforms offer significant flexibility but are limited by the inherent constraints of off-the-shelf software. The third category encompasses fully custom solutions built on highly adaptable platforms, where the SBC remains foundational, but off-the-shelf PBX software is replaced with custom software development.

To illustrate the range of approaches available to mid-market operators, consider men’s suiting:

Off-the-shelf suits: aligns with basic SBCs and call routing—adequate for meeting minimum qualifications like those required for Operator Connect but offering no meaningful technical differentiation.
Made-to-measure suits: represents the integration of commercially available, configurable PBX software—offering a better fit for customer needs, but still constrained by predefined templates.
Fully bespoke suits: corresponds to custom-developed software on platforms like Metaswitch Rhino TAS. This approach enables operators to break free from predefined limits and craft solutions uniquely tailored to their customers.

Build Rather Than Buy: Market Leadership through Custom Solutions

Operators have a unique opportunity to establish market leadership by positioning themselves as software developers. By doing so, they can move beyond the limitations of off-the-shelf solutions and create offerings that are uniquely suited to their customers’ needs.

Step One: Clearly articulate the gaps left by Teams and design a product to sit ahead of Teams in the call flow. This product should implement the exact features needed to fill those gaps. Think of this as your base product—sleek, purpose-built, and foundational to your service offering.

Step Two: During the sales phase, identify customer needs and customization opportunities, documenting these requirements in the contract or SOW. In the onboarding phase, these requirements are implemented through a focused development sprint, creating custom middleware that integrates with customer end systems, along with other code enhancements. This approach ensures each deployment is tailored precisely to the customer’s needs.

By choosing Metaswitch Rhino TAS, operators can:

Address Gaps Left by Teams: Deliver tailored solutions that broadly enhance and complement Microsoft Teams’ functionality, ensuring they meet the wider market needs.
Tackle Specific Customer Requirements: Provide custom solutions that precisely address the unique and individual demands of each customer.
Enable Future Vision: Empower product management to roadmap and develop new features, never worrying about off-the-shelf PBX software limitations.
Adapt Quickly: Address specific customer needs on your timeline, not an upstream vendor’s.
Deepen Customer Relationships: Offer tailored solutions that position the operator as a long-term partner, not just a service provider.

Calling Your Own Number: A Solution Architecture Proposal

Enhanced call handling begins by placing yourself into the call flow. While SBCs typically route calls based on static identifiers like Dialed Number Identification Service (DNIS), a more agile approach moves the routing configurtion entirely out of the SBC. Instead, this logic resides in a dedicated call-handling application—perhaps called the Routing Manager—a small but powerful system running alongside the SBC.

The SBC is not instructed about the call’s final destination; instead, its configuration hands every incoming call to the Routing Manager. This step represents a key transition in the call flow, providing an opportunity to establish control and implement solutions that address gaps left by downstream systems like Teams or accommodate customer-specific requirements.

The Routing Manager can perform real-time lookups using external systems, such as CRMs or customer databases, whether on-premises or in the cloud, to inform its routing decision. Once the decision is made, the Routing Manager forwards the call to its final destination—such as Teams or another endpoint—via SIP REFER.

For example, imagine a call arriving for a customer with specific routing rules. The Routing Manager could:

Query a CRM or third-party API for live data.
Dynamically adjust routing based on the caller’s history, preferences, or current status.
Apply advanced logic tailored to compliance, business workflows, or other requirements.

Reality Check: Stability as the Cornerstone of Success

While differentiation through advanced features is a compelling sales story, it’s important to recognize that many customers do not have unique technical requirements. For these customers, the decision to partner with a UCaaS provider often hinges on factors like reputation, references, alignment in company size, and the appeal of consolidating telecom needs with a single, trusted partner. This approach provides one point of accountability, eliminates the risk of finger-pointing between vendors, and ensures access to a knowledgeable advisor who understands the market landscape and product offerings.

Retention is driven by stable, dependable service and consistent execution in key areas like billing accuracy, responsive and effective support, efficient move-add-change, and account management. While issues are inevitable, how they are handled can make all the difference. When challenges arise, prompt and transparent resolution not only helps maintain trust but can strengthen the customer relationship by demonstrating a commitment to their success and minimizing disruption. Stability remains the top priority, as technical issues or service interruptions can quickly undermine trust and jeopardize the relationship.

As Microsoft Teams with Calling Plans continues to address its limitations, mid-market operators face growing competition from both big tech giants and peer competitors. In this competitive landscape, technical execution is mandatory—and the role of systems administrators in ensuring solid execution cannot be overstated. At renewal time, the goal is for technical execution and NOC support to be viewed as an asset to retention, rather than an obstacle.

Managing Service Quality and Reliability in UCaaS Operations

Successfully delivering PSTN-integrated UCaaS solutions requires not just robust infrastructure, but also the ability to manage complex relationships and responsibilities across multiple service layers.

Operators manage the call flow from the PSTN, often bundling the telecom carrier into the opportunity. This dual role means operators not only maintain their own SBCs, which sit within the call flow, but also rely on third-party carriers, who are equally vulnerable to outages.

Ultimately, the operator bears responsibility for the service’s overall quality. Any service disruption—whether from the operator’s internal systems or from the third-party telecom carrier—can damage the operator’s reputation: customers see the operator as the single point of contact and accountability.

Outages at the carrier level often provide no visibility to the operator. The first—and sometimes only—indication of an issue may be zero call volume, a metric that can be hard to interpret accurately. Is it a carrier outage, or just a lull? Without proactive visibility into carrier systems, operators must either rely on basic monitoring methods (where zero call volume triggers alerts), which means dealing with false positives, or wait for customer complaints.

In UCaaS environments, balancing call flow resilience with reliable Internet access is crucial. While voice paths can remain functional during an outage, relying on a single carrier for Internet access can disrupt real-time app integrations, like customer data lookups, affecting the user experience. Using blended Internet connections with multiple carriers ensures both call flow and application performance remain stable, even during carrier-specific disruptions. This approach is vital for maintaining service quality and avoiding performance degradation in real-time integrations.

The Role of Expert Systems Administrators in UCaaS

Flawless execution across all operational areas is critical, but systems administrators play an especially vital role in maintaining the infrastructure that powers essential services like call handling, routing, monitoring, and failover mechanisms. This role requires a deep understanding of the systems at play and a proactive approach to ensuring reliability.

Effective monitoring begins with an intimate understanding of your infrastructure stack. This includes knowing the processes (e.g., Jetty, Apache, MySQL) that should be running, their expected quantities, and their roles. It also involves identifying health check and status-oriented URLs. Health check URLs provide basic up/down status, while status-oriented endpoints often expose metrics that can be scraped for deeper insights.

Metrics form the backbone of two critical monitoring tools: alerts and dashboards. Alerts are your early warning system, auto-detecting problems and triggering alarms. They rely entirely on the metrics you’ve collected, so unearthing the right data points is essential. Dashboards, on the other hand, are your daily touchpoint with the system. Regularly reviewing these graphs helps you internalize what “normal” looks like, making it easier to spot anomalies at a glance. This combination of proactive alerts and intuitive dashboards strengthens your ability to detect and respond to issues swiftly.

Logs are another indispensable resource in monitoring. Knowing where they are, how they’re structured, and what types of errors to expect is fundamental. Proactive log analysis can identify problems before they escalate, while post-incident reviews often reveal gaps in detection or response. These insights guide iterative improvements, enhancing system resilience over time.

When incidents occur, a systems administrator’s ability to remain composed and methodical is critical. Troubleshooting demands clear analysis, isolating root causes, and devising immediate workarounds or solutions. This process ensures disruptions are addressed with minimal impact.

Failover mechanisms are another cornerstone of reliability, but their effectiveness depends on realistic design. Too often, failover systems are built for simplistic failure modes, such as complete hardware outages, while more nuanced edge cases—like partial failures or miscommunication between components—are overlooked. For instance, a failure might not propagate properly through the stack, allowing a phone call to proceed along a broken path while bypassing a redundant system that could handle it. Thoughtful failover design, informed by real-world failure scenarios, ensures these mechanisms respond effectively to diverse conditions.

Capacity management is equally important, particularly during failover events. Shifting loads between sites can strain shared resources like SIP trunks and voice channels. Expert administrators anticipate these demands, balancing capacity to prevent overloads and maintain service continuity.

Operational reliability hinges on continuous refinement. Each incident provides lessons—whether from overlooked metrics, delayed alerts, or unforeseen failure modes. Regular evaluations of what worked and what didn’t drive improvements, closing gaps and addressing vulnerabilities. This iterative process ensures systems remain robust, adaptable, and aligned with evolving challenges. By fostering a culture of learning and improvement, systems administrators help maintain customer trust and strengthen the backbone of critical services.

Unboxing Metaswitch Rhino TAS SDK

With the clear market need for advanced, customizable UCaaS solutions in mind, I embarked on a Proof-of-Concept deployment of the Metaswitch Rhino Telephony Application Server (TAS). This PoC serves as the foundational first step in demonstrating the capabilities of Rhino TAS and its potential to address the complex requirements of mid-market operators.

This Proof-of-Concept Deployment outlines my efforts to deploy Rhino TAS software, including its SIP Resource Adapter and sample applications, to create a basic SIP PBX. While this represents only a small portion of Rhino TAS’s full capabilities, it serves as an entry point for engaging with the platform and laying the groundwork for broader deployment. By demonstrating this foundational setup, I aim to showcase the platform’s potential and illustrate the critical role of skilled administrators in leveraging its capabilities effectively.

Reviewing Vendor Documentation and Requirements

In preparation for this project, I carefully reviewed Metaswitch’s documentation to ensure alignment with their stated requirements and best practices. As an experienced IT professional, I understand the importance of thorough preparation and adhering to vendor specifications, especially in environments as complex as telecom systems. Below are the critical takeaways from the documentation and their implications for my project:

Rhino TAS is available in two installer flavors: the “SDK” version, which bundles a free license key for developer testing and evaluation purposes, and the production version, which requires a paid license.
The Rhino TAS SDK version has a built-in rate limit to prevent production use, making it strictly for development and testing purposes.
The Rhino TAS SDK is supported on Red Hat, CentOS, and Ubuntu Linux distributions. Windows is mentioned but not seriously. For production deployments, only Red Hat Enterprise Linux (RHEL) 8 and 9 are supported.
The Rhino TAS SDK supports both Oracle JDK (version 11 only) and OpenJDK (versions 11 or 17). However, OpenJDK must be the version packaged for Red Hat repositories to be supported.
Rhino TAS includes an admin web UI called Rhino Element Manager (REM), with two deployment options: embedded and standalone. While JDK 17 support was recently introduced in the latest version, it is currently only available for the standalone REM; the embedded REM still requires JDK 11.

Analysis and Decision on Rhino TAS Software, OS and JDK Selection

After reviewing Metaswitch documentation and considering my requirements, I have decided to deploy the Rhino TAS SDK version on top of Rocky Linux 9.5 with OpenJDK 17 for the PoC. Standalone REM. Below is the rationale behind this decision:

The SDK version is the only available version of Rhino TAS for my PoC, making it well-suited for evaluation purposes. While the production version may be available for evaluation to qualified prospects, I did not inquire further as the SDK version meets the requirements for this PoC.
Metaswitch’s preference for Red Hat platforms aligns with my decision to use Rocky Linux as a reliable, free RHEL alternative.
I prefer OpenJDK over Oracle JDK because OpenJDK remains free for commercial use, unlike Oracle JDK, which typically requires a paid subscription for commercial environments. A well-known caveat in the industry is that Oracle JDK updates are not available through standard package repositories like dnf or apt, requiring manual updates—a time-consuming process that adds unnecessary hassle to system maintenance.
OpenJDK 17 offers a clear, long-term lifecycle with public updates until 2029, ensuring stability for the next several years. In contrast, OpenJDK 11, though an LTS release, is already past its End of Public Updates (Sept 2023), making it less viable for long-term use without commercial support. Deploying JDK 11 today risks requiring early re-deployment, while JDK 17 provides more extended support, reducing the need for revisits.
Metaswitch has removed Oracle as a supported vendor starting with JDK 17, further supporting the move away from Oracle and reinforcing OpenJDK as the preferred option for future-proofing deployments.
Red Hat Enterprise Linux (RHEL) is not available to me and is unlikely to be accessible to hobbyists who might want to follow along with or replicate this project. CentOS is supported, but Rocky Linux is today considered the successor to CentOS as a reliable RHEL clone.
Rocky Linux includes the same OpenJDK packages created for Red Hat, making it an excellent choice that aligns with Metaswitch’s recommendations.
In telecom, business partners sometimes inquire about underlying systems, especially when the nature of the partnership involves placing a box into the customer’s network. For years, Red Hat has been the best-accepted answer.
While it is generally advisable to stick with supported platforms, this is less critical in a PoC environment where vendor support is not accessible. In Metaswitch’s case, even community support is gated behind paid access. Despite this, I chose a configuration that aligns closely with their documented guidance to maintain best practices and ensure compatibility in the absence of external support.

Preparing the Operating System

The starting point is a basic install from Rocky-9.5-x86_64-minimal.iso with defaults taken at install time. I assigned 2 vCPU and 4 GB RAM. Become the root user or use a privilege escalation mechanism for the commands in this section.

Set a hostname.

hostnamectl set-hostname RhinoTAS-SDK.lab4.decoursey.com

Create a definition of SIP traffic for the OS-level software firewall.

cat << EOF > /etc/firewalld/services/sip.xml

    SIP
    Session Initiation Protocol
    
EOF

Create a definition of REM traffic for the OS-level software firewall.

cat << EOF > /etc/firewalld/services/rem.xml

    rem
    Rhino Element Manager
    
EOF

Reconfigure and reload the OS-level software firewall.
```
firewall-cmd --permanent --add-service=sip
firewall-cmd --permanent --add-service=rem
firewall-cmd --reload
```
Note: Disabling the firewall temporarily during evaluations is a common shortcut to avoid gathering the system’s connectivity requirements. However, this “temporary” measure often becomes permanent. As the system moves into live use, the temptation to skip proper configuration increases, as administrators are reluctant to make changes in production environments to avoid disruption. The best practice is to define and configure firewall rules early in the deployment process.
Run any available updates and, if a kernel update is installed, reboot the machine.
```
dnf -qy update
shutdown -r now
```
Install OpenJDK 17.
```
dnf -qy install java-17-openjdk-devel
```
Install the unzip utility.
```
dnf -qy install unzip
```
Install the tcpdump utility.
```
dnf -qy install tcpdump
```

Creating the rhino-tas User Account

Systems administrators often find themselves building lab systems to hand off to software engineers, application administrators, or other IT professionals. Without clear guidance, these systems can vary widely in configuration, leading to unexpected issues down the line.

Although the Rhino TAS SDK documentation doesn’t explicitly mention it, best practices strongly recommend deploying software under a dedicated, unprivileged user account. Installing it under the root user or, even worse, a personal login account—sometimes mistakenly done—can lead to significant security and operational risks, particularly during the rushed process of deleting login accounts when an employee separates from the company.

Create a rhino-tas user account with a home directory to house and execute the Rhino TAS software.

useradd -r -m -d /usr/local/rhino-tas -s /bin/bash -c "Rhino TAS Service Account" rhino-tas

In professional software packaging, service accounts are given an unusable shell, such as /usr/sbin/nologin. Metaswitch, however, distributes their software as a ZIP file, bypassing the full packaging process and proper daemonization. Given this, and the practical need for on-the-ground operational work and troubleshooting, assigning the interactive shell is a reasonable tradeoff. Locking the account discourages password-based access, mitigating the risk.

Lock the rhino-tas user account.
```
passwd -l rhino-tas
```

Preparing the User’s Java Environment

Switch to the rhino-tas user account.
```
su - rhino-tas
```
Setting JAVA_HOME to /usr/lib/jvm/java-17-openjdk ensures the Rhino-TAS shell environment remains stable and survives minor upgrades by leveraging the JDK package’s symbolic links, which automatically point to the correct version. This approach avoids the risks of hard-coded versions, preventing the environment variable from becoming outdated or pointing to a non-existent location during updates.

Update ~/.bashrc.
```
echo "export JAVA_HOME=/usr/lib/jvm/java-17-openjdk" >> ~/.bashrc
echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> ~/.bashrc
```
Apply the changes.
```
source ~/.bashrc
```
Note: The Rhino TAS SDK auto-generates its configuration file on first startup, capturing the current value of JAVA_HOME at that time. As a result, simply updating JAVA_HOME in the shell environment won’t change the JDK used by the SDK.

Note: The recommendation in the Rhino SDK Getting Started Guide to update ~/.profile is not suitable for bash-based systems and will typically not work as expected.

Installing Rhino TAS Software to the User Environment

Perform these steps, also, as the rhino-tas user, situated in the rhino-tas user’s home directory.

Download the installer files from https://docs.rhino.metaswitch.com/ocdoc/books/devportal-downloads/1.0/downloads-index/

You need three things:
- Rhino TAS SDK installer
- SIP Resource Adapter
- REM Standalone Version
Transfer the archives to the server and place to the rhino-tas user account’s home directory.
```
ls -l
```
Note: If you’re transferring software files to a server, it’s often easiest to first download them to your desktop. Then, use a file transfer tool like WinSCP or FileZilla to upload the files via SFTP or SCP. Upload them initially to a directory where you have write access, such as your home directory or the /tmp directory. Afterward, you can SSH into the server, elevate to root, and use commands like mv and chown to move the ZIP files into the ~rhino-tas directory and ensure proper ownership.

Alternatively, you can download the ZIP files directly to the server using command-line tools like lynx (which may need to be installed). Unlike wget or curl, lynx is recommended because the Metaswitch download site requires you to interactively accept a license agreement, making direct download links elusive.

Unzip the Rhino TAS SDK installer

unzip -q rhino-sdk-install-3.2.13.zip
unzip -q sip-connectivity-3.1.15.zip -d RhinoSDK/

cd into the RhinoSDK directory
```
cd RhinoSDK/
```
Start the service.
```
./start-rhino.sh
```
Note: The startup will produce pages of output, and once the service is fully initialized this will show.
Open a duplicate SSH session and, again, become the rhino-tas user.

cd into the RhinoSDK directory
```
cd RhinoSDK/
```
Drill down two more directories into the cd rhino-connectivity/sip-3.1.15/ directory
```
cd rhino-connectivity/sip-3.1.15/
```
Note: You must navigate to this specific directory because the deployexamples.sh script relies on relative path names for its internal commands. This makes the current working directory (CWD) at the time of script execution critical for proper functionality. Skipping this step will result in errors.
Edit the sip.properties file to specify a SIP domain name for use in your lab.
```
sed -i 's/^PROXY_DOMAINS=.*/PROXY_DOMAINS=lab4.decoursey.com/' sip.properties
```
Note: The sed one-liner is clearer to demonstrate, or use a text editor of your choice.
Execute the deployexamples script
```
./deployexamples.sh
```
Note: this goes on for pages. It’s done when the prompt returns. Should end like this:

Rhino Element Manager

The Rhino TAS SDK does not provide an administrative web UI, but REM (Rhino Element Manager) is a separate component. While its direct relevance to our PoC is unclear, getting it running and connected provides useful familiarity with the platform.

Even if the system is typically managed through other mechanisms, having an admin web UI available can be invaluable when urgent, workaround-type changes are needed outside of normal processes. This sort of flexibility often proves necessary, even when the tool isn’t initially critical to the project.

Using the embedded version would have required downgrading the whole project to JDK 11, so I opted for a standalone setup.

This will be a quick look, and we won’t be bulletproofing the setup since REM won’t be in the call path.

Retrieve the file ~rhino-tas/RhinoSDK/rhino-trust.cert from the server and make it available on your desktop PC.

This file contains the self-signed server certificate generated during the initial startup of the Rhino TAS SDK. REM enforces HTTPS certificate validation for its connection to Rhino TAS SDK, so this certificate must be imported into REM’s trust store via its web UI.

The file is in DER (Distinguished Encoding Rules) format, a binary format commonly used for storing X.509 certificates. To transfer it, use a tool like WinSCP or FileZilla, as DER files cannot be copied through the clipboard like PEM format. If you’re curious about its details, you can inspect the certificate using the following command: openssl x509 -inform DER -in rhino-trust.cert -text -noout
Retrieve the Rhino TAS SDK password by running the following command:
```
grep ^rhino.remote.password ~rhino-tas/RhinoSDK/client/etc/client.properties
```
This password was randomly generated during the initial startup of the Rhino TAS SDK and is used to authenticate REM (the external web UI) with the Rhino TAS SDK backend. Once retrieved, you’ll need to supply this password to REM to establish the connection.

Unzip the REM software.

unzip -q rhino-element-manager-3.2.10.zip

cd into the REM directory.
```
cd rhino-element-manager-3.2.10
```
Execute the rem.sh script.
```
./rem.sh
```
Navigate to http://RhinoTAS-SDK:8080/rem in your desktop web browser. u: emadm p: password

Note: Substitute the IP address or (if resolvable) hostname of the server where you’re installing Rhino TAS SDK.
Once logged into the REM admin web UI, select Edit Element Manager Users and Roles and set a secure password for both the “emadm” user as well as for the “user” user. Document your new passwords. Do not leave the default passwords.
Once logged into the REM admin web UI, select Edit Rhino Instances, click on the “Local” instance, and use the “+” button under Server Cert to upload the server certificate.
Navigate back to the REM main menu for example by clicking the home icon at the top right of the interface. From here, select Manage a Rhino Element. Select Connect To: and then Local. Use the option to edit the saved admin credential and supply the password obtained from step 2 above. Now, retry the Local connetion, which should succeed.

Testing and Validation of the Rhino TAS Install

Spoiler Upfront: As it turns out, the Rhino TAS platform supports logging, but the sample SIP applications I’m using for my lab do not. These applications also lack endpoints for exposing metrics. While I had initially aimed to showcase log analysis, monitoring and alerting as a key part of this project, I’ve had to adjust my approach. My new plan is to deliver monitoring and alerting as a separate deep-dive project.

Testing Focus:

Registration Verification: Can we successfully register soft phones?

Call Placement: Can we place test calls?

Unfortunately, I can’t showcase my log analysis skills with these samples, as they neither log registration nor calls. However, this isn’t a major issue for the platform itself, as the sample applications are not intended for actual use. Any operator using the platform would develop their applications and incorporate necessary logging and metrics.

Setting up to monitor the registration attempt.

The Rhino TAS SDK’s main log file is RhinoSDK/work/log/rhino.log. Let’s open a new SSH connection and start tailing the log file so that we’ll see immediately once there is activity.
```
cd RhinoSDK/work/log
```
```
tail -f rhino.log
```
Note: The tail command displays the last few lines of a text file. The -f option keeps the command running, updating the display with new lines as they are added to the file in real time. This process, called “tailing,” continues until you press [CTRL]-[C] to stop it. For system administrators, tail -f is an essential tool for monitoring live log file activity, making it easier to diagnose issues as they occur.
Let’s get a couple more sessions open to the server and, as root, start some packet captures so that once the registration attempt appears on the network, we’ll see real time.

First, we’ll use tshark (the CLI version of Wireshark) to do a realtime decoding of SIP traffic and dump the analysis to the terminal.
```
tshark -i enp0s3 -f "host 192.168.254.12 and port 5060" -Y "sip" -O sip -V
```
Second, we’ll simultaneously use tcpdump (in yet another root shell) to get a raw pcap. This permits later offline analysis e.g. using the GUI version of Wireshark, or for sharing with a vendor or partner support or with a customer to request help or to illustrate your point.
```
tcpdump -i enp0s3 -s0 -w Dec20-0405pm.pcap
```
Note: Packet capture is a key troubleshooting tool, usually used when diagnosing issues. The tshark command works well for live analysis on the server, giving immediate visibility into SIP traffic. A full pcap is better for doing offline analysis after replicating issues, or sharing with a partner, vendor, or customer for help. If you plan to share the capture, ensure sensitive information is avoided or removed. In this case, I’m setting up the pcap ahead of time to show how it’s done. The IP address 192.168.254.12 is the host where I’ll run the first softphone.

Testing the registration service.

Let’s try registering from a softphone client application. I will use MicroSIP.

My tshark packet trace session comes alive with this detail.

Frame 556: 605 bytes on wire (4840 bits), 605 bytes captured (4840 bits)
Ethernet II, Src: Chongqin_11:42:5f (8c:c8:4b:11:42:5f), Dst: PcsCompu_0e:6e:65 (08:00:27:0e:6e:65)
Internet Protocol Version 4, Src: 192.168.254.12, Dst: 192.168.254.51
User Datagram Protocol, Src Port: 59000, Dst Port: 5060
Session Initiation Protocol (REGISTER)
    Request-Line: REGISTER sip:192.168.254.51 SIP/2.0
        Method: REGISTER
        Request-URI: sip:192.168.254.51
            Request-URI Host Part: 192.168.254.51
        [Resent Packet: False]
    Message Header
        Via: SIP/2.0/UDP 192.168.254.12:59000;rport;branch=z9hG4bKPj4c75b6e28e1949cd97d27a5022f94541
            Transport: UDP
            Sent-by Address: 192.168.254.12
            Sent-by port: 59000
            RPort: rport
            Branch: z9hG4bKPj4c75b6e28e1949cd97d27a5022f94541
        Route: 
            Route URI: sip:192.168.254.51;lr
                Route Host Part: 192.168.254.51
                Route URI parameter: lr
        Max-Forwards: 70
        From: ;tag=bbf84b96e8334f678d44bc05366381ef
            SIP from address: sip:1051@lab4.decoursey.com
                SIP from address User Part: 1051
                SIP from address Host Part: lab4.decoursey.com
            SIP from tag: bbf84b96e8334f678d44bc05366381ef
        To: 
            SIP to address: sip:1051@lab4.decoursey.com
                SIP to address User Part: 1051
                SIP to address Host Part: lab4.decoursey.com
        Call-ID: eb8da9f55e3d43bd8d29f48365424ca9
        [Generated Call-ID: eb8da9f55e3d43bd8d29f48365424ca9]
        CSeq: 19789 REGISTER
            Sequence Number: 19789
            Method: REGISTER
        User-Agent: MicroSIP/3.21.5
        Contact: 
            Contact URI: sip:1051@192.168.254.12:59000;ob
                Contact URI User Part: 1051
                Contact URI Host Part: 192.168.254.12
                Contact URI Host Port: 59000
                Contact URI parameter: ob
        Expires: 300
        Allow: PRACK, INVITE, ACK, BYE, CANCEL, UPDATE, INFO, SUBSCRIBE, NOTIFY, REFER, MESSAGE, OPTIONS
        Content-Length:  0
    
Frame 559: 453 bytes on wire (3624 bits), 453 bytes captured (3624 bits)
Ethernet II, Src: PcsCompu_0e:6e:65 (08:00:27:0e:6e:65), Dst: Chongqin_11:42:5f (8c:c8:4b:11:42:5f)
Internet Protocol Version 4, Src: 192.168.254.51, Dst: 192.168.254.12
User Datagram Protocol, Src Port: 5060, Dst Port: 59000
Session Initiation Protocol (200)
    Status-Line: SIP/2.0 200 OK
        Status-Code: 200
        [Resent Packet: False]
        [Request Frame: 556]
        [Response Time (ms): 234]
    Message Header
        Via: SIP/2.0/UDP 192.168.254.12:59000;rport=59000;branch=z9hG4bKPj4c75b6e28e1949cd97d27a5022f94541
            Transport: UDP
            Sent-by Address: 192.168.254.12
            Sent-by port: 59000
            RPort: 59000
            Branch: z9hG4bKPj4c75b6e28e1949cd97d27a5022f94541
        From: ;tag=bbf84b96e8334f678d44bc05366381ef
            SIP from address: sip:1051@lab4.decoursey.com
                SIP from address User Part: 1051
                SIP from address Host Part: lab4.decoursey.com
            SIP from tag: bbf84b96e8334f678d44bc05366381ef
        To: 
            SIP to address: sip:1051@lab4.decoursey.com
                SIP to address User Part: 1051
                SIP to address Host Part: lab4.decoursey.com
        Call-ID: eb8da9f55e3d43bd8d29f48365424ca9
        [Generated Call-ID: eb8da9f55e3d43bd8d29f48365424ca9]
        CSeq: 19789 REGISTER
            Sequence Number: 19789
            Method: REGISTER
        Contact: ;expires=300;q=0.0
            Contact URI: sip:1051@192.168.254.12:59000;ob
                Contact URI User Part: 1051
                Contact URI Host Part: 192.168.254.12
                Contact URI Host Port: 59000
                Contact URI parameter: ob
            Contact parameter: expires=300
            Contact parameter: q=0.0
        Date: Fri, 20 Dec 2024 21:30:53 GMT
        Content-Length: 0

This is a successful SIP registration. The system allowed registration without any configuration for the user or extension. Keep in mind, this is a sample application designed to help developers get started with SIP. Authentication is not implemented to keep the setup simple for initial development.

The source code for the registration functionality can be found in the file RhinoSDK/rhino-connectivity/sip-3.1.15/src/com/opencloud/slee/services/sip/registrar/RegistrarSbb.java. There is a placeholder comment where authentication and authorization would be added.

Note: The registration process typically follows a challenge-response strategy. The client sends a REGISTER request without credentials, and the server responds with a 401 Unauthorized message, prompting the client to provide the correct credentials. Seeing the 401 Unauthorized response is normal and simply indicates the server is prompting for credentials.

Error observed in the platform logs

Recall I was tailing the log file during the registration process. I did notice an error.

2024-12-21 02:30:18.812-0500 Warning [trace.sipra.sip.fail]  {} incoming message from /192.168.254.12:57898 rejected: parse failed, drop message
message buffer contents:

java.text.ParseException: no character matching rule token found at current index, buf=""
         at com.opencloud.slee.resources.sip.parser.Lexer.makeParseException(Lexer.java:799)
         at com.opencloud.slee.resources.sip.parser.Lexer.getCharacterSequence(Lexer.java:777)
         at com.opencloud.slee.resources.sip.parser.Lexer.getCharacterSequence(Lexer.java:742)
         at com.opencloud.slee.resources.sip.parser.Lexer.getNextToken(Lexer.java:630)
         at com.opencloud.slee.resources.sip.parser.SipParser.parseFirstLine(SipParser.java:345)
         at com.opencloud.slee.resources.sip.parser.SipParser.parseFirstLine(SipParser.java:241)
         at com.opencloud.slee.resources.sip.transport.handler.SipMessageDecoder.decodeStep(SipMessageDecoder.java:137)
         at com.opencloud.slee.resources.sip.transport.handler.SipMessageDecoder.decode(SipMessageDecoder.java:105)
         at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)            at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
         at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
         at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
         at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
         at com.opencloud.slee.resources.sip.transport.handler.NetworkEventHandler.channelRead(NetworkEventHandler.java:155)
[ ... snipped for brevity ... ]

This is a Java stack trace, which is a standardized format for error messages in Java applications. Stack traces are human-readable to some extent and often provide clues as to what went wrong. In this case, it appears that the server encountered an issue while trying to parse a SIP message, indicating that the message was invalid.

Stack traces are especially useful because they usually include the filename and line number from the source code where the error occurred. If you have access to the source code, this can be extremely helpful for troubleshooting. When analyzing a stack trace, start from the top.

The registration was successful, which suggests a benign (noise) error — one that appears in the logs but doesn’t affect normal usage. From tailing the logs in real time, it was clear that the error occurred about every 15 seconds, which didn’t even align with the registration interval, but rather aligned with the keep-alives. This pointed to the error being caused by MicroSIP not properly formatting the keep-alive packets. Even though these packets don’t need to be processed by the server — they’re only meant to maintain the network path through stateful firewalls or NAT gateways — more care should have been taken in their implementation. It appears to be a minor bug on the MicroSIP side, causing invalid requests.

My solution was to disable the keep-alives and reduce the registration interval to 120 seconds, which should be plenty frequent enough to keep the connection alive.

Testing the proxy service

In setting up for the test call, I have again started packet captures. This time, I am doing simultaneous packet capture on the PBX server as well as on the callee’s machine, covering the duration of the test call.

 tcpdump -i enp0s3 -s0 -w Dec21-0409pm.pcap

Note: When using tcpdump on Linux, including a file extension like .pcap is good practice. Though not required by Linux, it ensures the file is easily recognizable and opens correctly in tools like Wireshark on other systems. Without the extension, the file might need to be opened manually or renamed for proper recognition.

I have registered two soft phones: ext. 1050 is situated on 192.168.254.52 and ext. 1051 situated on 192.168.254.12.

The Rhino PBX SDK server is 192.168.254.51.

The test will involve ext. 1050 calling ext. 1051.

The packet capture taken during the test call reveals the dual nature of the call flow from the PBX’s perspective. Positioned as the middlebox, the PBX handles both the inbound leg from the caller and the outbound leg toward the callee in a unified manner.

The SIP protocol governs the signaling phase of this communication. In this exercise, I captured the full handshake sequence, showcasing both the SIP INVITE and its corresponding 200 OK responses from each leg of the call.

The INVITE Process

Here’s the INVITE from the caller’s softphone to the PBX:

INVITE sip:1051@lab4.decoursey.com SIP/2.0
Via: SIP/2.0/UDP 192.168.254.52:63999;rport;branch=z9hG4bKPjad86691ceb7f487983238fe4d8da55b0
From: "Laptop Softphone" ;tag=e778c0943dee4a86a1dac5cd95729d32
To: 
Contact: "Laptop Softphone" 
Call-ID: 7111c5ccb078431d94b2fa1ddc3f1c7a
CSeq: 7081 INVITE
Content-Type: application/sdp
Content-Length: 346

v=0
o=- 3943786649 3943786649 IN IP4 192.168.254.52
s=pjmedia
c=IN IP4 192.168.254.52
t=0 0
m=audio 4010 RTP/AVP 8 0 101
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv

The PBX processes and relays the INVITE to the callee:

INVITE sip:1051@192.168.254.12:62181;ob SIP/2.0
Via: SIP/2.0/UDP 192.168.254.51:5060;branch=z9hG4bKa246d7f899643f59a1937bd5941b1e73;rport
From: "Laptop Softphone" ;tag=e778c0943dee4a86a1dac5cd95729d32
To: 
Contact: "Laptop Softphone" 
Call-ID: 7111c5ccb078431d94b2fa1ddc3f1c7a
CSeq: 7081 INVITE
Record-Route: 
Content-Type: application/sdp
Content-Length: 346

v=0
o=- 3943786649 3943786649 IN IP4 192.168.254.52
s=pjmedia
c=IN IP4 192.168.254.52
t=0 0
m=audio 4010 RTP/AVP 8 0 101
a=rtpmap:8 PCMA/8000
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv

The 200 OK Responses

The callee responds with a 200 OK:

SIP/2.0 200 OK
Via: SIP/2.0/UDP 192.168.254.51:5060;branch=z9hG4bKa246d7f899643f59a1937bd5941b1e73;rport
From: "Laptop Softphone" ;tag=e778c0943dee4a86a1dac5cd95729d32
To: ;tag=e12935090f894164886200e94f61a828
Call-ID: 7111c5ccb078431d94b2fa1ddc3f1c7a
CSeq: 7081 INVITE
Contact: 
Content-Type: application/sdp
Content-Length: 321

v=0
o=- 3943786647 3943786648 IN IP4 192.168.254.12
s=pjmedia
c=IN IP4 192.168.254.12
t=0 0
m=audio 4018 RTP/AVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv

The PBX forwards the 200 OK back to the caller:

SIP/2.0 200 OK
Via: SIP/2.0/UDP 192.168.254.52:63999;branch=z9hG4bKPjad86691ceb7f487983238fe4d8da55b0;rport
From: "Laptop Softphone" ;tag=e778c0943dee4a86a1dac5cd95729d32
To: ;tag=e12935090f894164886200e94f61a828
Call-ID: 7111c5ccb078431d94b2fa1ddc3f1c7a
CSeq: 7081 INVITE
Contact: 
Content-Type: application/sdp
Content-Length: 321

v=0
o=- 3943786647 3943786648 IN IP4 192.168.254.12
s=pjmedia
c=IN IP4 192.168.254.12
t=0 0
m=audio 4018 RTP/AVP 8 101
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16
a=sendrecv

Media Stream Prediction

The 200 OK response is more than just an acknowledgment; it also informs the caller about where and how to set up the media stream. Specifically, the SDP in the 200 OK indicates the callee’s media connection details:

Media IP: 192.168.254.12
Media Port: 4018

Given this, we predict the RTP stream from the caller to the callee will originate from 192.168.254.52:4010 and terminate at 192.168.254.12:4018. In the reverse direction, the RTP stream will start at 192.168.254.12:4018 and end at 192.168.254.52:4010.

Validating with Callee’s PCAP

Using the callee’s packet capture to analyze RTP streams:

lincoln@DESKTOP-LINCOLN:~$ tshark -r callee.pcapng -q -z rtp,streams
========================= RTP Streams ========================
   Start time      End time     Src IP addr  Port    Dest IP addr  Port       SSRC          Payload  Pkts         Lost   Min Delta(ms)  Mean Delta(ms)   Max Delta(ms)  Min Jitter(ms) Mean Jitter(ms)  Max Jitter(ms) Problems?
    29.685967     35.505983  192.168.254.12  4018  192.168.254.52  4010 0x2CD504B0            g711A   292     0 (0.0%)           9.838          20.000          30.307           0.004           1.312           3.256
    29.681122     35.581229  192.168.254.52  4010  192.168.254.12  4018 0x4B7129E9            g711A   296     0 (0.0%)          11.710          20.000          28.212           0.014           0.388           1.507
==============================================================
lincoln@DESKTOP-LINCOLN:~$

As predicted, the RTP streams flow directly between the endpoints without PBX intervention, confirming the separation of signaling and media typical in SIP.

The PBX’s role as a signaling intermediary is evident in its use of Record-Route headers and relaying of SIP messages. However, it does not act as a media proxy; the RTP flows directly between endpoints. This separation of signaling and media, characteristic of SIP, underscores its efficiency in network resource usage.

Conclusion

Reflecting on this project, I am eager to bring my skills and experience to a team committed to operational excellence. The challenges of ensuring system reliability and delivering seamless service are ones I approach with respect and enthusiasm. I look forward to stepping into a role where I can contribute to building robust, well-monitored systems that support the highest standards of performance and reliability.

This proof-of-concept lab underscores my proactive mindset and readiness to engage with the real-world complexities of systems administration. While I know there is always more to learn, I am excited by the opportunity to grow alongside a team that values precision, stability, and continuous improvement.

My focus is on delivering excellence—leveraging my skills in monitoring, troubleshooting, and maintaining critical infrastructure, while embracing new challenges with humility and determination. I am eager to collaborate, learn, and contribute to ensuring the success of the systems and the people who rely on them. Together, we can tackle the demands of this ever-evolving field and achieve outstanding results.

VyOS HA in Vultr with BGP and VRRP

2024-03-04T03:00:00-05:00

Introduction

In a previous post, I introduced VyOS and Vultr and teased follow-up posts that would show the deployment of VyOS as an edge router and perimeter firewall in front of an internal network, which would amount to a novel solution for the VPS space, where nothing like that is typically considered.

The technical work is complete to proof-of-concept standards. My personal domain’s core services are now self-operated by me, at Vultr, behind paired VyOS edge routers, with firewalling, network segmentation, dual-stack IPv4 & IPv6, and DNSSEC signing. This includes authoritative DNS, email, and web service for this blog.

In each of my next several posts, I’ll pick one aspect of the solution and deep dive it, in order to stay true to my IT philosophy and my plan (see my “About” page) for this blog:

Just about anybody can slam something in quickly and haul butt. That’s not how to succeed in IT, and that’s not where I’m at in this stage of life. I want to set myself apart by doing it well, and by using the blog to document along the way.

Topics will include redundancy, firewall, network segmentation (WAN, DMZ and intranet), remote access VPN, dual-stack IPv4 and IPv6, provisioning automation, configuration management, logging, monitoring and alerting. Beyond the VyOS network devices themselves, afterward, I’m apt to move on to talk about the core Internet services, and how I operate these within the DMZ, in a Linux environment.

Today’s topic

The first deep dive topic centers around redundancy and fail-over. I will start with some background and guiding principles, then focus in on network device redundancy generally, talk about where VyOS and Vultr come in on this, then select, implement, and validate my solution for VyOS HA in Vultr.

Background

In professional IT, going to production involves addressing redundancy. You will see terms like active/active, active/standby, manual and automatic fail-over, and you will encounter redundancy deployed both within the data center as well as schemes involving two or more data centers.

Engineering for redundancy adds cost and introduces complexity, all of which must be weighed, prioritized, and expertly balanced.

Dos and Don’ts:

Do have an idea of the SLAs and SLOs to be hit before proceeding to design.
Do brainstorm likely failure scenarios to address in your initial design.
Do pay special attention to failure modes where a server is merely off in the weeds, or has lost connectivity to a backend, not just the case where the server dies outright.
Do test and validate your fail-over mechanisms to the best of your ability.
Do incorporate alerting to be notified of the problem: if fail-over works, you won’t see an outage.
Do understand and document any caveats of the failover mechanism, such as degraded UI, or if users will need to log back in. Be up-front with stakeholders.
Do postmortems to discuss what worked, what didn’t, and where you find room to improve, do so.
Don’t obsess over the first-pass design. What’s important is that you have some HA story to go to market. The rest will come from hard-won experience. “Why didn’t the fail-over work?” is a question executives will ask. “It was an edge case and we’re fixing it.”
Don’t think about how clever you can look today. Think 9 months down the line when whatever you put in today will need to be manipulated and leveraged during an emergency. Can you be back up to speed with it in 5 minutes, and teach it to somebody else in 10 minutes?
Don’t allow in excess complexity to the point it becomes counter-productive. You can trip over your own feet, causing the outage you had aimed to prevent.

Flashy isn’t the goal; adopting a solution that meets your company’s needs and that your team can effectively implement and manage is.

Network device redundancy

Enterprise-grade network gear has a reputation of being extremely reliable. Failures are not expected to occur within the useful life of a device. Because both the up-front and ongoing maintenance costs for enterprise network devices are considerable, and because these devices have to compete for attention with other, more-likely failure points in contingency planning, not all projects will deploy redundant network hardware.

That said, the failure of a core piece of network gear, apart from a redundant counterpart, will mean at least hours of site downtime, especially when equipment is placed in third-party data centers where external partners, namely remote-hands technicians and vendor service personnel, are relied upon to respond.

What is more, a secondary device stands by not only for the failure of the primary device, but also to cover during maintenance, permitting advisory-only (“no impact is anticipated”) maintenance window notifications. Finally, when unexpected problems do occur with maintenance, these can oftentimes be detected before progressing to the secondary firewall, avoiding outages in scenarios of bad updates or configuration mistakes.

Today’s challenge

The Virtual Router Redundancy Protocol (VRRP) became a standard approach for ensuring network redundancy in production environments during the mid-2000s. Through VRRP, primary and secondary routers use heartbeats to check each other’s status, facilitating automatic fail-over to ensure continuous network operation and redundancy.

Border Gateway Protocol (BGP) can work in tandem with VRRP to manage WAN IP addressing across edge routers. With VRRP establishing primary and secondary roles for routers, ensuring internal network redundancy, BGP handles the external routing, allowing the WAN IP to “float” or transition seamlessly between the edge routers during fail-over events.

The VyOS manual’s chapter on HA is built around VRRP entirely, and Vultr’s KB has an article called “High Availability on Vultr with Floating IP and BGP” with sample configuration that we should be able to port to VyOS, since VyOS will definitely have robust support for BGP.

Architecture of the solution

Each edge router will have a dedicated external IP unique to it, but my domain’s core Internet services will be advertised on separate floating “service” IP addresses that are routed to my edge devices on their primary IPs using BGP ECMP/anycast, but with my primary edge device prioritized using AS path prepend.

I will have two public IPs (a primary and a secondary) for each of my three core services (DNS, SMTP, HTTPS) for a total of six floating public IPs. I will prioritize the same edge router in the VRRP configuration and hope that BGP and VRRP stay in sync; roughly, they should. I will also use a conntrack sync mechanism in attempt to cover some of the gray area and to avoid session drops during failover.

VRRP is a first hop redundancy protocol, and does not address whether the active router is actually capable of forwarding packets further. A dynamic interior protocol like OSPF could potentially do a better job to integrate with BGP, but I don’t have a comfort level to push anything like that to production, and all of the community documentation, which I will rely on, says VRRP. I would rather accept risk of an unlikely-to-encounter edge case for nonconvergence than to put in a too-complex solution that I’m not the master of: it would be a security risk and counter-productive to stability.

OSPF is something I would love to play with in the near future and I will do that in a lab environment where I plan to operate a routing core in addition to edge routers.

BGP peering with Vultr

BGP prefix-lists

set policy prefix-list VULTR-NJ-v4 rule 10 action 'permit'
set policy prefix-list VULTR-NJ-v4 rule 10 prefix '45.76.4.167/32'
set policy prefix-list VULTR-NJ-v4 rule 20 action 'permit'
set policy prefix-list VULTR-NJ-v4 rule 20 prefix '45.76.6.22/32'
set policy prefix-list VULTR-NJ-v4 rule 30 action 'permit'
set policy prefix-list VULTR-NJ-v4 rule 30 prefix '45.76.10.33/32'
set policy prefix-list VULTR-NJ-v4 rule 40 action 'permit'
set policy prefix-list VULTR-NJ-v4 rule 40 prefix '45.76.6.7/32'
set policy prefix-list VULTR-NJ-v4 rule 50 action 'permit'
set policy prefix-list VULTR-NJ-v4 rule 50 prefix '45.76.6.121/32'
set policy prefix-list VULTR-NJ-v4 rule 60 action 'permit'
set policy prefix-list VULTR-NJ-v4 rule 60 prefix '45.76.11.196/32'
set policy prefix-list VULTR-NJ-v4 rule 70 action 'permit'
set policy prefix-list VULTR-NJ-v4 rule 70 prefix '45.63.21.196/32'
set policy prefix-list6 VULTR-NJ-v6 rule 10 action 'permit'
set policy prefix-list6 VULTR-NJ-v6 rule 10 prefix '2001:19f0:5:416::/64'
set policy prefix-list6 VULTR-NJ-v6 rule 20 action 'permit'
set policy prefix-list6 VULTR-NJ-v6 rule 20 prefix '2001:19f0:1000:6946::/64'
set policy prefix-list6 VULTR-NJ-v6 rule 30 action 'permit'
set policy prefix-list6 VULTR-NJ-v6 rule 30 prefix '2001:19f0:5:34cd::/64'

BGP route maps (don’t be that guy)

set policy route-map 64515v4-IN rule 10 action 'deny'
set policy route-map 64515v4-OUT rule 10 action 'permit'
set policy route-map 64515v4-OUT rule 10 match ip address prefix-list 'VULTR-NJ-v4'
set policy route-map 64515v6-IN rule 10 action 'deny'
set policy route-map 64515v6-OUT rule 10 action 'permit'
set policy route-map 64515v6-OUT rule 10 match ipv6 address prefix-list 'VULTR-NJ-v6'

Private AS peer to Vultr

set protocols bgp 4288000595 address-family ipv4-unicast network 45.63.21.196/32
set protocols bgp 4288000595 address-family ipv4-unicast network 45.76.4.167/32
set protocols bgp 4288000595 address-family ipv4-unicast network 45.76.6.7/32
set protocols bgp 4288000595 address-family ipv4-unicast network 45.76.6.22/32
set protocols bgp 4288000595 address-family ipv4-unicast network 45.76.6.121/32
set protocols bgp 4288000595 address-family ipv4-unicast network 45.76.10.33/32
set protocols bgp 4288000595 address-family ipv4-unicast network 45.76.11.196/32
set protocols bgp 4288000595 address-family ipv6-unicast network 2001:19f0:5:34cd::/64
set protocols bgp 4288000595 address-family ipv6-unicast network 2001:19f0:5:416::/64
set protocols bgp 4288000595 address-family ipv6-unicast network 2001:19f0:1000:6946::/64
set protocols bgp 4288000595 neighbor 169.254.169.254 address-family ipv4-unicast nexthop-self force
set protocols bgp 4288000595 neighbor 169.254.169.254 address-family ipv4-unicast remove-private-as
set protocols bgp 4288000595 neighbor 169.254.169.254 address-family ipv4-unicast route-map export '64515v4-OUT'
set protocols bgp 4288000595 neighbor 169.254.169.254 address-family ipv4-unicast route-map import '64515v4-IN'
set protocols bgp 4288000595 neighbor 169.254.169.254 ebgp-multihop '2'
set protocols bgp 4288000595 neighbor 169.254.169.254 password 'redactedP@ssw0rd'
set protocols bgp 4288000595 neighbor 169.254.169.254 remote-as '64515'
set protocols bgp 4288000595 neighbor 2001:19f0:ffff::1 address-family ipv6-unicast nexthop-self force
set protocols bgp 4288000595 neighbor 2001:19f0:ffff::1 address-family ipv6-unicast remove-private-as
set protocols bgp 4288000595 neighbor 2001:19f0:ffff::1 address-family ipv6-unicast route-map export '64515v6-OUT'
set protocols bgp 4288000595 neighbor 2001:19f0:ffff::1 address-family ipv6-unicast route-map import '64515v6-IN'
set protocols bgp 4288000595 neighbor 2001:19f0:ffff::1 ebgp-multihop '2'
set protocols bgp 4288000595 neighbor 2001:19f0:ffff::1 password 'redactedP@ssw0rd'
set protocols bgp 4288000595 neighbor 2001:19f0:ffff::1 remote-as '64515'
set protocols bgp 4288000595 parameters router-id '45.76.0.255'

VRRP configuration inside

set high-availability vrrp group vpc-nj-dmz-v4vip interface 'eth1'
set high-availability vrrp group vpc-nj-dmz-v4vip priority '200'
set high-availability vrrp group vpc-nj-dmz-v4vip virtual-address 10.76.2.1/24
set high-availability vrrp group vpc-nj-dmz-v4vip vrid '21'
set high-availability vrrp group vpc-nj-dmz-v6vip interface 'eth1'
set high-availability vrrp group vpc-nj-dmz-v6vip priority '200'
set high-availability vrrp group vpc-nj-dmz-v6vip virtual-address 2001:19f0:5:416::1/64
set high-availability vrrp group vpc-nj-dmz-v6vip vrid '22'
set high-availability vrrp group vpc-nj-intranet-v4vip interface 'eth2'
set high-availability vrrp group vpc-nj-intranet-v4vip priority '200'
set high-availability vrrp group vpc-nj-intranet-v4vip virtual-address 10.76.4.1/24
set high-availability vrrp group vpc-nj-intranet-v4vip vrid '41'
set high-availability vrrp group vpc-nj-intranet-v6vip interface 'eth2'
set high-availability vrrp group vpc-nj-intranet-v6vip priority '200'
set high-availability vrrp group vpc-nj-intranet-v6vip virtual-address 2001:19f0:5:34cd::1/64
set high-availability vrrp group vpc-nj-intranet-v6vip vrid '42'
set high-availability vrrp sync-group MAIN member 'vpc-nj-dmz-v4vip'
set high-availability vrrp sync-group MAIN member 'vpc-nj-dmz-v6vip'
set high-availability vrrp sync-group MAIN member 'vpc-nj-intranet-v4vip'
set high-availability vrrp sync-group MAIN member 'vpc-nj-intranet-v6vip'

conntrack sync

set service conntrack-sync failover-mechanism vrrp sync-group 'MAIN'
set service conntrack-sync interface eth1
set system conntrack modules ftp
set system conntrack modules h323
set system conntrack modules nfs
set system conntrack modules pptp
set system conntrack modules sip
set system conntrack modules sqlnet
set system conntrack modules tftp

VyOS Platform build for Vultr Cloud

2024-01-11T21:00:00-05:00

Introduction

Virtual Private Server (VPS) offerings have, forever, remained popular with hobbyists due to their unmatched accessibility. Compared to how IT professionals approach entry to any new data center, however, deployment to a VPS provider involves serious compromises. Both internal switching and the perimeter firewall are absent, precluding the involvement of logical network designs from modern IT.

Public cloud offerings, situated adjacent to VPS, do offer solutions, but cloud engineering, while accepted as competing well in the marketplace of ideas, is a distinct IT practice area. The presence of outsourcing as an alternative to infrastructure ownership does not constitute a solution to the problem of a missing low-barrier sandbox for learning and practicing traditional IT skills such as systems and network administration.

Let’s introduce two key players: VyOS, and Vultr, and propose these as partners in a potential solution.

VyOS is an open-source network operating system for x86-64 architecture. VyOS is directly comparable to Cisco and Juniper in terms of protocol support and configuration syntax. VyOS looks, feels and plays like an enterprise-grade router, and skills learned deploying and managing VyOS are enterprise skills.

VyOS differentiates itself in the marketplace on two key points:

instead of a physical device needing to be purchased and racked, it is a software solution, and
instead of closed-source or open-core model software, it is fully open-source software.

Vultr, unlike a bare-bones VPS provider, does tackle the modern IaaS market, but it retains a pricing structure and user interface that are each recognizable to the traditional VPS consumer. Vultr is well-regarded, and highly performant, and it offers a free trial for new signups.

Here are what I consider to be Vultr’s key features:

presence in the Terraform registry,
KVM-based virtualization with cloud-init support,
internal “VPC” networking, and
BGP and IPv6 support.

Today’s challenge

As a commercial open-source project, VyOS restricts download access to its official releases to paying subscribers, and it’s priced for enterprises. While there are a few ways you might qualify for free access (see https://vyos.net/get/), most people will not.

The solution is to build our own VyOS release. The VyOS project provides a combination of good documentation and excellent tooling which makes this easy.

In this blog post, I will show how to build VyOS 1.3.x equuleus, at this time the latest VyOS LTS release, for deployment to the Vultr Cloud platform. Follow-up blog posts will complete the deployment of VyOS as an edge router and perimeter firewall in front of a robust, multi-segmented internal network on the Vultr Cloud platform.

Outline of the solution

Deploy a Cloud Instance (the “build instance”) via Vultr portal
SSH to the build instance as the root user
Install Docker Engine on the build instance
Execute the VyOS ISO build procedure
Configure web server software to host the ISO
Use Vultr portal to pull the ISO into the Vultr account
Save a copy of the ISO and destroy the build instance
Intermission and additional background
Deploy a second Cloud Instance (the “template instance”) via Vultr portal
Access the template instance by its virtual console and install VyOS
Snapshot the template instance
Destroy the template instance
Validation
Credits

Deploy a Cloud Instance (the “build instance”) via Vultr portal

In your Vultr portal, under Products > Compute, select Deploy > Deploy New Server.
Fill out the form to specify details about your new instance.
1. Cloud Compute > Regular Performance (AMD or Intel) server is fine.
2. Debian 12 x64
3. Select an instance type with 25 GB SSD.
4. Specify a hostname, vyos-build.
5. Optionally add an SSH key, or just plan to SSH using root password.
6. Deploy Now
Observe your vyos-build instance running in Vultr portal. Note its IP address. Also, drill in to retrieve the root user’s credential unless you pushed your own SSH key.

SSH to the build instance as the root user

You will need your vyos-build instance’s IP address and credentials from the Deploy step above.
Use any SSH client (such as PuTTY) to connect to your vyos-build instance. The username is root regardless of whether you are using the root user’s password or have pushed an SSH key; the key, if provided, was installed to the root user’s account, no named user account was created.
You are ready to move forward once you have obtained a root shell:

Install Docker Engine on the build instance

There are different ways you can build VyOS. Building using a Docker container is the approach I will cover.

You will need to have Docker Engine installed. The version in Debian’s package repository is adequate.

It’s a one-line install:
```
apt -y install docker.io
```
You’re on track if your kick-off of the command looks roughly like this:

Execute the VyOS ISO build procedure

To recap, you should currently be logged into a 25 GB SSD Cloud Instance on Vultr, have Docker Engine installed, and be sitting at a root prompt in the root user’s home directory. If that’s where you are, you’re ready to move forward.

This is what you need to do to build your VyOS 1.3 LTS release ISO.

Pull the Docker image that will be used to build the ISO:
```
docker pull vyos/vyos-build:equuleus
```
Successful completion should look like this (after pages of output):

Clone the repository:

git clone -b equuleus --single-branch https://github.com/vyos/vyos-build vyos-build-1.3

This is how it should look in the shell:

Switch into the cloned repository:
```
cd vyos-build-1.3/
```
Copy in the Vultr apt repository signing key (we will integrate Vultr’s cloud-init):
```
cp /etc/apt/trusted.gpg.d/vultr-apprepo.gpg .
```
Run the build container:
```
docker run --rm -it --privileged -v $(pwd):/vyos -w /vyos vyos/vyos-build:equuleus bash
```
This switches into the build environment. Notice how the prompt changes:

Run the configure script in the container:

./configure \
  --architecture amd64 \
  --build-by lincoln@decoursey.com \
  --build-type release \
  --version "1.3-$(date +'%Y-%m-%d')" \
  --custom-apt-entry "deb [arch=amd64] https://apprepo.vultr.com/debian universal main" \
  --custom-apt-key /vyos/vultr-apprepo.gpg \
  --custom-package cloud-init

How it should look:

Create the ISO:
```
make iso
```
This takes a while so feel free to step away. When you do get your prompt back it should look like this:
Once the above step is completed, you should be able to see your ISO file in the filesystem.
```
ls -ltr build
```
Make note of your ISO image filename, as you will need to substitute it into some later commands.
Exit the Docker container & return you to the host OS.
```
exit
```
Notice the prompt changes back:
Place a copy of the ISO file into the root user’s home directory before moving on. This is just to be foolproof. You need to substitute your actual ISO filename into the sample command below:
```
cp build/vyos-1.3-2024-01-07-amd64.iso ~
```

Configure web server software to host the ISO

Besides building the VyOS ISO, we will also need to arrange web hosting for it. Vultr’s custom ISO support is based around us initially hosting the custom ISO image, providing Vultr with a download URL for it, and then Vultr imports it from there to its storage.

Install web server software.
```
apt -y install nginx
```
Allow inbound http access through the host-based firewall.
```
ufw allow http
```
Copy the new ISO image into the base content directory for the web server software. Substitute your actual ISO filename in place of vyos-1.3-2024-01-07-amd64.iso.
```
cp ~/vyos-1.3-2024-01-07-amd64.iso /var/www/html/
```
The ISO file should now be web accessible, via the build instance. To validate, work up the access URL using the IP address of your build instance (that shows up in your Vultr portal) and the hostname of the ISO file from step 3. Download the ISO file using a web browser.
```
http://[your vyos-build instance's IP]/[your ISO filename]
```

Use Vultr portal to pull the ISO into the Vultr account

In the Vultr portal, under Products > Orchestration > ISOs, select Add ISO.
Paste the URL for your ISO image being hosted by your build instance on Vultr
Click the Upload button. You should see an “ISO downloading” status.
After a while, navigate back to Productions > Orchestration > ISOs. You should now see your ISO available.

Save a copy of the ISO and destroy the build instance

At some point we will want this same VyOS ISO for use elsewhere, and the Vultr portal will not give us back a copy. Let’s make sure to have retrieved a full copy of the ISO from the build instance to some safekeeping location (e.g. Downloads directory).
Now that the VyOS ISO build is complete, the build instance is no longer required. Stop the build instance, via Vultr’s portal, and destroy it. Products > Compute > vyos-build > Server Stop, Server Destroy.

Intermission and additional background

So far we have created a VyOS ISO image, which is a VyOS live CD environment and installer. It is, in a nutshell, bootable VyOS installation media.

Bootable installation media is a major way for baremetal servers to be OS-installed and remains a viable option for installing virtual machines, too. Drawbacks of this method include the use of a live person to drive an OS installation wizard, which forecloses online provisioning, the subtle inconsistencies that result, and the extensive amount of time the package-by-package installation process can take. While mitigations exist for these drawbacks, no matter how much engineering is added, unavoidably this server provisioning strategy involves a ton of moving parts.

Image-based provisioning has emerged as a standard in enterprises for eliminating OS software installation from the provision-time process. Instead, an OS install process is completed just once, on a workbench. A snapshot is then taken of the installed system to serve as a base (or “golden”) image from which additional servers of the same type will be cloned. Cloud instances and larger VM fleets under modern hypervisors are deployed almost exclusively using this strategy.

Let’s convert our ISO to a Vultr snapshot so that provisioning can happen in a modern way.

Deploy a second Cloud Instance (the “template instance”) via Vultr portal

In your Vultr portal, under Products > Compute, select Deploy > Deploy New Server.
Fill out the form to specify details about your new instance.
1. Cloud Compute > Regular Performance (AMD or Intel) server is fine.
2. Upload ISO > select your VyOS 1.3 ISO
3. Select an instance type with at least 1 GB RAM
4. Specify a hostname e.g. vyos-template.
5. Click Deploy Now, under Products > Compute, watch for instance startup

Access the template instance by its virtual console and install VyOS

Find your instance in the Vultr portal at Products > Compute > vyos-template
At the right, open the three-dot menu and select the option to View Console
Once the virtual console opens, you should notice a login prompt:
Log into the console using the default credentials:
```
u: vyos
p: vyos
```

Execute these few “show configuration commands” commands one at a time, observe output:

show configuration commands | match hw-id
show configuration commands | match host-name
show configuration commands | match name-server

Enter configuration mode:
```
configure
```

Based on the output from step 5 above, work up and execute corresponding commands to delete each of those configuration items. “set” becomes “delete” for each item:

delete interfaces ethernet eth0 hw-id '56:00:04:b7:fe:d4'
delete system host-name 'vyos-template'
delete system name-server '108.61.10.10'
delete system name-server 'eth0'

Commit those configuration changes:
```
commit
```
Save those configuration changes:
```
save
```
Exit from configuration mode:
```
exit
```
Execute the VyOS install-to-disk command and take the defaults (just hit Enter) up until the “Continue: (Yes/No) [No]” prompt:
```
install image
```
This prompt you must explicitly respond to with “Yes” to confirm the wipe/repartition of the virtual HDD:
```
Yes
```
Resume taking the defaults (just hit Enter) until you are prompted about the vyos password. This is asking you to assign a new password for the vyos user that will carry over into the snapshot image.
```
1mYarqCY3MHbE69     # example, pick your own!
```
The installation is wrapping up now. Just hit Enter. Install completes & normal prompt returns:
It is best practice to log out of any server’s console once you are finished using it, to avoid leaving a shell prompt that an unexpected person may otherwise stumble onto. Following best practices for proper exit and clean shutdown even during the decommissioning process is a sign of respect, and covers all bases in case plans change.
```
exit
```

Snapshot the template instance

Products > Compute, find the vyos-template image, open its three-dot menu, and let’s power the instance off for good measure. Select the option Server Stop.
From Products > Compute, again drill into the vyos-template instance and, on the Snapshots tab, use the option to take a snapshot.
Click the Take Snapshot button. You should see a snapshot in progress result
Products > Orchestration > Snapshots, watch for the snapshot to become available after a fair while

Destroy the template instance

Once the shapshot is available, the vyos-template cloud instance is no longer needed.
Destroy that cloud instance now. Products > Compute > vyos-template > Server Destroy.

Validation

It is important to validate your work. For example, after deploying a backup solution, test restoring from it. After configuring an alert related to PSU redundancy, pull one of the redundant PSUs. Does the alert come through? If you have engineered a failover mechanism, think about how you might trigger it in order to validate the solution.

In this case, we need to test-deploy a VyOS instance to be sure it comes up cleanly and looks good. And this is just a quick sanity check. Fuller checks and acceptance tests will be performed as part of an actual Proof of Concept to be covered in subsequent posts.

In your Vultr portal, under Products > Compute, select Deploy > Deploy New Server.
Fill out the form to specify details about your new instance.
1. Cloud Compute > Regular Performance (AMD or Intel) server is fine.
2. Snapshot > Select your vyos-template snapshot
3. Select an instance type with at least 1 GB RAM
4. Specify a hostname e.g. vyos-test-1
5. Deploy Now
Once you see your vyos-test-1 instance running in Vultr portal, wait a few minutes for the system to complete booting and for cloud-init to have a chance to initialize the configuration. Then, use any SSH client to check it out. If you encounter problems with SSH access, fall back to the virtual console to investigate.
Once you’re done testing, go ahead and destroy your test server. Keeping track of test/dev resources that have been allocated to you, or that you have spun for yourself, and returning or deleting them when no longer needed is a good practice and will set you apart in most workplaces.

Credits

Official VyOS build documentation – https://docs.vyos.io/en/equuleus/contributing/build-vyos.html
Helped me, copied some steps too – https://wiki.gbe0.com/networking/vyos/docker-build