White Paper: How to Automate Network Operating System Device Upgrades

Introduction

Network device OS upgrades are a routine requirement in any modern enterprise. The OS may be affected by known or unknown bugs or by security vulnerabilities, or the operator may want to upgrade in order to activate a new feature. Whatever the reason, on a fixed form factor device with a single CPU (or supervisor), the reload that is part of the upgrade will cause some period of outage.

Here is an experience that many network operators will recognize. A device upgrade is scheduled, and all application owners are informed of the planned outage. The operator connects to the device via SSH, copies the image file, changes the bootloader command to point to the new image, and issues the reboot command. Immediately, every phone in the NOC begins ringing. The unfortunate operator typed .16 instead of .15 when setting up the SSH session and has rebooted the wrong device, one that had a large amount of traffic flowing through it at that moment. To make matters worse, the device returned from the reboot with new default settings in its configuration and rejected several existing commands that were critical for service. The Ethernet management port cannot be reached, and the console port is not responding. The operator now has to grab a laptop and a console cable, run down to the datacenter as fast as possible, connect to the device at 9600 bps through a terminal emulator, and check the large configuration by hand. Running through the building, the operator can hear shouting and complaints from every open door and cubicle. Like the device itself, the business is “down hard”.

The example in this story is what we often refer to as a “resume generating event”. But it is not really the operator’s fault: humans make mistakes, and typos occur every few sentences. The process failed in two ways. Nothing protected the operator from making changes to a device that was in service, and there was no pre- or post-upgrade validation to confirm that the upgrade would succeed or had succeeded.
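To make the idea of pre- and post-validation concrete, here is a minimal Python sketch. It is not taken from AOS or from any vendor tool: the DeviceFacts structure and the precheck_target and config_drift functions are hypothetical names introduced only for illustration, and the sketch assumes the hostname, serial number, and running configuration have already been collected as plain strings. The pre-check refuses to proceed if the device that answered is not the intended one, and the post-check lists every configuration line that appeared or disappeared across the reload.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceFacts:
    """Identity facts read back from the device (hypothetical structure)."""
    hostname: str
    serial: str


def precheck_target(expected: DeviceFacts, actual: DeviceFacts) -> list[str]:
    """Return reasons to abort; an empty list means the target is the intended device."""
    problems = []
    if actual.hostname != expected.hostname:
        problems.append(
            f"hostname: expected {expected.hostname!r}, got {actual.hostname!r}"
        )
    if actual.serial != expected.serial:
        problems.append(
            f"serial: expected {expected.serial!r}, got {actual.serial!r}"
        )
    return problems


def config_drift(before: str, after: str) -> list[str]:
    """Return the configuration lines removed ('- ') or added ('+ ') by the upgrade."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; keep only real changes.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

In the story above, the pre-check would have returned a non-empty list for the .16 device before any reboot was issued, and the post-check would have listed the new default settings and the rejected commands without anyone reading the configuration by hand.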

RECOMMENDATION 1:
Use AOS Maintenance Mode (Drain/Undrain) to remove all application traffic from each device you wish to upgrade. After the drain succeeds, change the device’s deploy mode to Ready, which removes the Service Config from the device (effectively eliminating it as a router in the topology). Placing the device in the Ready state also eliminates the possibility of anomalies when the device reloads.
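The ordering in this recommendation can be sketched in Python as follows. This is an illustration only: the FabricController interface, its method names, and the "ready" mode string are hypothetical and are not the AOS API. The point of the sketch is the sequence: drain first, confirm that the drain succeeded, move the device to Ready, and only then start the upgrade. A small in-memory stand-in is included so that the sequence can be exercised without a real controller.

```python
from typing import Protocol


class FabricController(Protocol):
    """Hypothetical controller interface; these are NOT real AOS method names."""

    def drain(self, device_id: str) -> None: ...
    def traffic_drained(self, device_id: str) -> bool: ...
    def set_deploy_mode(self, device_id: str, mode: str) -> None: ...
    def start_upgrade(self, device_id: str, image: str) -> None: ...


def upgrade_with_drain(ctl: FabricController, device_id: str, image: str) -> None:
    """Drain, verify, set Ready, then upgrade; stop before any change if the drain fails."""
    ctl.drain(device_id)
    # A real workflow would poll with a timeout; a single check keeps the sketch short.
    if not ctl.traffic_drained(device_id):
        raise RuntimeError(f"{device_id} still carries traffic; upgrade aborted")
    ctl.set_deploy_mode(device_id, "ready")  # assumed mode label
    ctl.start_upgrade(device_id, image)


class RecordingController:
    """In-memory stand-in that records calls, used only to exercise the ordering."""

    def __init__(self, drains_ok: bool = True) -> None:
        self.calls: list[tuple[str, ...]] = []
        self.drains_ok = drains_ok

    def drain(self, device_id: str) -> None:
        self.calls.append(("drain", device_id))

    def traffic_drained(self, device_id: str) -> bool:
        return self.drains_ok

    def set_deploy_mode(self, device_id: str, mode: str) -> None:
        self.calls.append(("deploy_mode", device_id, mode))

    def start_upgrade(self, device_id: str, image: str) -> None:
        self.calls.append(("upgrade", device_id, image))
```

The sketch stops at the upgrade step. Returning the device to its previous deploy mode and undraining it afterward would follow the same pattern in reverse and is omitted here.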

Even though a network device upgrade is neatly packaged as a single file, the upgrade process is frequently problematic because of the number of services that can be affected by upgrading a single device. For example, without adequate planning and organization, the upgrade of a core router can interrupt service for all businesses at once.

Network operators want a simple process that is guaranteed to work, with a higher-level workflow that can manage simultaneous upgrades as well as upgrades across multiple vendors. Because most vendors have a different procedure for the upgrade/downgrade process (for example POAP, ZTP, or ONIE), that can be a tall order.

AOS supports Device Operating System (DOS) Upgrades for managed switches, allowing the operator to upgrade devices directly from the AOS Server within a consistent workflow process.