When working with Virtual Machines (VM’s) on Azure that have been deployed using a Linux Operating System (Ubuntu, Debian, Red Hat etc.), you would be forgiven for assuming that the experience would be rocky when compared with working with Windows VM’s. Far from it - you can expect an equivalent level of support for features such as full disk encryption, backup and administrator credentials management directly within the Azure portal, to name but a few. Granted, there may be some difference in the availability of certain extensions when compared with a Windows VM, but the experience on balance is comparable and makes the management of Linux VM’s a less onerous journey if your background is firmly rooted with Windows operating systems.
To ensure that the Azure platform can effectively do some of the tasks listed above, the Azure Linux Agent is deployed to all newly created VM’s. Written in Python and supporting a wide variety of common Linux distributable OS’s, I would recommend never removing this from your Linux VM; no matter how tempting this may be. The tradeoff could cause potentially debilitating consequences within Production environments. A good example of this would be if you ever needed to add an additional disk to your VM. This task would become impossible to achieve without the Agent present on your VM. As well as having the proper reverence for the Agent, it’s important to keep an eye on how the service is performing, even more so if you are using Recovery Services vaults to take snapshots of your VM and back up to another location. Otherwise, you may start to see errors like the one below being generated whenever a backup job attempts to complete:
It’s highly possible that many administrators are seeing this error at the time of writing this post (January 2018). The recent disclosures around Intel CPU vulnerabilities have prompted many cloud vendors, including Microsoft, to roll out emergency patches across their entire data centres to address the underlying vulnerability. Whilst it is commendable that cloud vendors have acted promptly to address this flaw, the pace of the work and the dizzying array of resources affected has doubtless led to some issues that could not have been foreseen from the outset. I believe, based on some of the available evidence (more on this later), that one of the unintended consequences of this work is the rise of the above error message with the Azure Linux Agent.
When attempting to deal with this error myself, there were a few different steps I had to try before the backup started working correctly. I have collected all of these below and, if you find yourself in the same boat, one or all of them should resolve the issue for you.
First, ensure that you are on the latest version of the Agent
This one probably goes without saying as it’s generally the most common answer to any error. 😁 However, the steps for deploying updates out onto a Linux machine can be more complex, particularly if your primary method of accessing the machine is via an SSH terminal. The process for updating your Agent depends on your Linux version - this article provides instructions for the most common variants. In addition, it is highly recommended that the auto-update feature is enabled, as described in the article.
Verify that the Agent service starts successfully.
It’s possible that the service is not running on your Linux VM. This can be confirmed by executing the following command:
service walinuxagent start
Be sure to have your administrator credentials handy, as you will need to authenticate to successfully start the service. You can then check that the service is running by executing the ps -e command.
Clear out the Agent cache files
The Agent stores a lot of XML cache files within the /var/lib/waagent/ folder, which can sometimes cause issues with the Agent. Microsoft has specifically recommended that the following command is executed on your machine after the 4th January if you are experiencing issues:
sudo rm -f /var/lib/waagent/*.[0-9]*.xml
The command will delete all “old” XML files within the above folder. A restart of the service should not be required. The date mentioned above links back to the theory suggested earlier in this post that the Intel chipset issues and this error message are linked in some way, as the dates seem to tie with when news first broke regarding the vulnerability.
If you are still having problems…
Read through the entirety of this article, and try all of the steps suggested - including digging around in the log files, if required. If all else fails, open a support request directly with Microsoft.
Hopefully, by following the above steps, your backups are now working again without issue 🙂