TOC
GNMI for Network Telemetry
For those readers who build and manage Network Telemetry stacks, GNMI (GRPC Network Management Interface) is probably one of the hottest topics being discussed right now, and for good reason. The industry has relied far too long on vendor specific mechanisms for obtaining telemetry data from network devices. Its been a viscious cycle of vendors heavily promoting their own telemetry stacks thereby increasing the dependency on those stacks while further de-incentivizing them to work on anything collaborative and standardized for the benefit of end customers. Thankfully, the recent push from our good friends at Google et.al has gotten some of the major vendors (Juniper, Arista, Cisco, Nokia etc.) to put some real effort towards implementing GNMI on their platforms.
However, just because vendors support GNMI does not mean things will automatically work like magic out of the box ! Vendor implementations of GNMI are still very much in their infancy, although some vendors have a more robust implementation than others. Although GNMI can be used as a unified way of managing a network device holistically for tasks like fetching operational data and setting configuration parameters, using GNMI for telemetry data is a good starting point to discover its true benefits. This means data pertaining to the main indicators of network observability such as interface and routing protocol statistics. Thanks to the beauty of Protobuf encoding and HTTP/2 streaming, this mechanism is blazing fast compared to things like Netconf and SNMP. This data is also modelled according to YANG specifications and for vendors supporting OpenConfig models, you get unified telemetry straight out of the box which translates to easier monitoring and dashboarding.
This blog post will attempt to highlight some pratical aspects to consider when using GNMI’s streaming telemetry to set up a Network telemetry stack. It assumes some basic knowledge of GNMI and GRPC. Although there are plenty of resources available on these topics online, the best resource is the GNMI Specification itself.
To GET or to Subscribe ?
GNMI supports both a GET operation as well as a Subscribe operation so it may sometimes be confusing to determine which mechanism to adopt for quering telemetry data from devices. GET is best reserved for operations that involves using the obtained data for semi real-time management tasks such as configuration management, testing frameworks and other forms of automation. For example you can query the operational state of an interface using GET, “marshal” that data into a Go struct and then use the struct for writing network tests. On the other hand, Subsribe is more inherently suited to telemetry streaming. Network devices can periodically stream real-time operational statistics thereby eliminating the drawbacks of traditional polling such as lack of scalability and precision. This fits perfectly into a monitoring application involving time series.
The GNMI spec also allows different subscription modes such as SAMPLE
, ON_CHANGE
and TARGET_DEFINED
. For most cases, this is best left at the default value of SAMPLE
unless the vendor specifically supports ON_CHANGE
for certain paths (for e.g. it may be useful to create an alerting pipeline based on BGP states changing.)
Not all implementations are created equal
Although the GNMI specification aims to provide a standardized interface for operational data, implementations across vendors can and will differ when it comes to practical use cases. The specification allows some flexibility in how GNMI servers ( vendor devices) can structure the data that is sent in response to a subscribe or get request.
As an example, Arista chooses to return every leaf value for the same path in a different GNMI Notification
message as illustrated below for the openconfig path /interfaces/interface/state/counters
(fetched via gnmic
and dumped as JSON for clarity):
{
"source": "1.1.1.1:50010",
"subscription-name": "default-1640129309",
"timestamp": 1635813813518825447,
"time": "2021-11-02T00:43:33.518825447Z",
"updates": [
{
"Path": "interfaces/interface[name=Ethernet1]/state/counters/in-broadcast-pkts",
"values": {
"interfaces/interface/state/counters/in-broadcast-pkts": 0
}
}
]
}
{
"source": "1.1.1.1:50010",
"subscription-name": "default-1640129309",
"timestamp": 1635813813518825447,
"time": "2021-11-02T00:43:33.518825447Z",
"updates": [
{
"Path": "interfaces/interface[name=Ethernet49/1]/state/counters/in-discards",
"values": {
"interfaces/interface/state/counters/in-discards": 0
}
}
]
}
{
"source": "1.1.1.1:50010",
"subscription-name": "default-1640129309",
"timestamp": 1635813813518825447,
"time": "2021-11-02T00:43:33.518825447Z",
"updates": [
{
"Path": "interfaces/interface[name=Ethernet49/1]/state/counters/in-errors",
"values": {
"interfaces/interface/state/counters/in-errors": 0
}
}
]
}
You can see here that each counter value is sent as part of a separate GNMI Notification. Compare this with NXOS that sends all counter values (leafs) in the same Notification message encoded as multiple path elements with a common prefix:
{
"source": "2.2.2.2:50001",
"subscription-name": "default-1640130441",
"timestamp": 1640130550620000000,
"time": "2021-12-21T23:49:10.62Z",
"prefix": "interfaces/interface[name=eth1/1]",
"updates": [
{
"Path": "state/counters/in-pkts",
"values": {
"state/counters/in-pkts": 9188610
}
},
{
"Path": "state/counters/in-octets",
"values": {
"state/counters/in-octets": 950152356
}
},
{
"Path": "state/counters/in-unicast-pkts",
"values": {
"state/counters/in-unicast-pkts": 9149409
}
},
<--snip-->
]
}
{
"source": "2.2.2.2:50001",
"subscription-name": "default-1640130441",
"timestamp": 1640130550620000000,
"time": "2021-12-21T23:50:10.62Z",
"prefix": "",
"updates": [
]
}
Both are acceptable mechanisms of sending the requested data according to the GNMI specification, although the NXOS way definitely is more user friendly and makes more logical sense. After all, why not group all counters for the same interface into the same notification message with a common path prefix, especially since the protocol allows you to do so ?! Then there can also be cases where vendors deviate away from what the protocol specifies. Notice the response from the NXOS box contains an empty notification message with no updates. Turns out that Cisco decided to implement its own “keepalive” mechanism by sending empty messages periodically. Perhaps a useful feature to have, but nowhere to be found in the GNMI spec. This kind of stuff would be ok if you were to use an off the shelf product which takes care of parsing these updates for you, but in a lot of client implementations this would be need to be accounted for.
Test, test and test some more
The last thing you would want is to set up a sophisticated telemetry stack based on GNMI only to find out that the data returned is not what you expect or has key pieces of information missing. Use open source tools like gnmic
to test out all the GNMI paths across all vendors that you wish to collect in production. Chances are rather high that you will find inconsistencies in the data returned that will force you to adapt your tooling to accomodate each special snowflake. Be prepared to file plenty of bug reports !
Think holisticially and plan ahead
When trying to adopt GNMI, you need to make sure you consider the full end to end picture.
- What GNMI Paths do you want to subscribe to for specific telemetry data and how do you make that configurable on a per device basis ?
- How do you want the returned data to be encoded ?
- How do you expect to parse the received data and translate it into a format that is suitable for a Time Series Database (TSDB) such as Prometheus ?
- What values from the returned data should actually end up inside of the TSDB and how should they be represented inside of the TSDB ?
- Does the vendor implementation return all the data that is relevant to your monitoring needs ? These are all valid questions that would need to be answered for a proper production deployment. Be prepared to use a combination of open-config models as well as vendor specific models in order to get all the data you need. There is also a high chance that you would still need to use traditional monitoring methods (Netconf/REST etc.) side by side to get data that has not yet been ported over to GNMI.
Dont worry too much about YANG
It is defintely important to understand the fact that telemetry data that is returned from the devices is modelled in YANG format. What does this imply when it comes to building a telemetry stack using GNMI ? Do you really need to memorize the YANG specification, including stuff like containers, leafs and leaf-lists?
From a server (network device) perspective, this is imperative to know since it dictates the way the data would be represented (modelled) before it is encoded. From a client’s perspective, the YANG data model is useful in figuring out the paths ( or XPATHS ) that you would use to query specific values along the YANG tree. For example if you want to know the in-errors
value for a specific interface, you would traverse the path /interfaces/interface[name=Eth1]/state/counters/in-errors
. Once the data is returned, the client needs to understand how the data is encoded more than how it is modelled, since it needs to parse the returned data and convert it into a format that is meaningful (perhaps to a backend datastore). For example if JSON encoding was requested, the client may implement a JSON parser such as JSONPATH to query fields in the returned JSON. Of course this changes if you intend to use GNMI for things like configuration modelling in which case some YANG knowledge would be required.
Be prepared to write code
Although there are a few high quality open source tools that allow you to quickly get a GNMI stack up and running, there is usually a large gap that needs to be filled - one of integrating these tools into your business logic or business requirements. How do you integrate into a Source of Truth for device inventory or a credentials management system for device credentials and certificates ? Is it ok to simply subscribe to a telemetry stream and freely dump all possible values returned into your TSDB ? Needless to say, backend resources are not free (nor cheap)! How do you filter out or specify data that is truly meaningful to you from an observability standpoint ? These kind of tasks will either require plugins that are pre-created or some amount of coding effort.
Pay special attention to failures and redundancy
GNMI relies on GRPC which uses a persistent long-running TCP (HTTP/2) connection to the targets for subscribe operations. As a result, its important to ensure that network failures and device unreachability is prompty detected. For a subscribe operation, the client has no way of rapidly detecting an underlying TCP connection failure if no messages are being sent (often relying on the underlying system TCP timeout which can be as high as 30 minutes). This is in contrast to traditional polling operations where failures can be easily detected via timeouts. Things like GRPC keepalives can help mitigate this. Similarly, using clustering or grouping mechanisms on the client side ensures continuos telemetry in the event of client node failures.
GNMI is a major step towards unifying the network management interfaces across different vendors. Most vendor GNMI implementations, while still in their infancy, are ok enough that customers now have a realistic chance of creating a true unified telemetry and network management stack. However, the real benefits can only be reaped when all the major vendors collaborate towards future development and improvements of the standard while pledging to keep ease of adoption and usage in mind !
comments powered by Disqus