AIOps During COVID-19

Use data and machine learning to anticipate hardware failures and software upgrades to improve capacity and performance remotely.



I had the opportunity to speak with Ross Ackerman, Senior Director of Active IQ Data Science and Enablement, and Marty Mayer, Director of Product Management, at NetApp.


AIOps has received a lot of attention during COVID-19, with enterprises looking to do more with less while their employees work from home.


Storage owners have a lot going on, which makes it tough to thrive, but data science can help on several fronts: managing service requests, documentation, blogs, training, and experience; monitoring and troubleshooting; and provisioning and configuring storage.


Active IQ is a digital advisor that uses AIOps to simplify and automate the proactive care and optimization of your infrastructure environment to improve its health and availability. It provides actionable intelligence for optimal storage health and simplified management.


Active IQ exposes risk factors and prevents problems before they affect the enterprise. It uncovers and predicts systems reaching capacity or performance limits so you can stay ahead of growth. Use prescriptive guidance and estimated actions to reduce the time spent on day-to-day operations. Users can be confident about the benefits and effectiveness of software upgrades before installing them.
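To make the idea of predicting when a system will reach its capacity limit concrete, here is a minimal sketch in Python. It is not NetApp's model: it assumes a plain least-squares trend line fitted to daily used-capacity samples, and the function names (`fit_line`, `days_until_full`) and sample numbers are hypothetical, chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(used_tb: Sequence[float], capacity_tb: float) -> Optional[float]:
    """Days after the last sample until the fitted trend reaches capacity_tb.

    Returns None when usage is flat or shrinking (no projected fill date),
    and 0.0 when the trend has already reached capacity.
    """
    slope, intercept = fit_line(used_tb)
    if slope <= 0:
        return None
    last_day = len(used_tb) - 1
    day_full = (capacity_tb - intercept) / slope
    return max(0.0, day_full - last_day)


# Hypothetical example: usage grows 2 TB per day on a 70 TB system.
print(days_until_full([50, 52, 54, 56, 58], 70))  # 6.0
```

A straight-line fit is the simplest possible choice; a production system would likely account for seasonality, bursts, and uncertainty in the forecast.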


The strategy is to use data and science for insights, guidance, and action. Community wisdom-driven AIOps feeds actionable insights into an Ansible playbook from which automation can be executed. Active IQ receives telemetry data from hundreds of thousands of systems every day. Predictive analytics and machine learning (ML) algorithms identify applications to improve system health and availability. Insights, guidance, and action are delivered through the web user interface, mobile apps, API services, and Ansible playbooks.


As enterprises needed more hands and help during COVID-19, they also needed more capacity and performance. NetApp changed its ML models to anticipate the need for faster performance. During the first four days of the pandemic lockdown, NetApp talked to 100% of its high-capacity customers and mitigated 25% of capacity issues. The company has also been helping clients transition to the cloud more quickly.


Working from home has forced transitions to the cloud and virtual desktop infrastructure (VDI) to happen more quickly. Enterprises need to reach a VDI footprint quickly to support 100% remote work, and they need to know what performance and capacity issues are forthcoming. NetApp has seen a 30% increase in the use of digital tools to monitor capacity and usage.


NetApp tech support teams have COVID-19 war rooms to listen to customer support calls. They’ve been working from home and have adopted best practices for camaraderie and teamwork: building collaboration pods and pairing engineers to recreate the office environment, where groups of people work together to solve customer problems.


Listening to customers has resulted in improvements to the interface, with new workflows and a better user experience (UX). The customizable dashboard reflects the systems and information most important to users. Wellness cards show categorized and prioritized “best next actions” that will have the greatest positive impact on environmental health. The inventory card provides an overview of the environment with links to system-level details. Planning cards reflect what’s ahead that could impact the budget. The upgrade cards show recommended software upgrades and kick off a new workflow that makes it easy to plan and execute an upgrade. The valuable insights card summarizes the impact of the support contract.


Customers are using more digital tools. Account managers are talking to customers and suggesting ways for them to be more digitally driven and to accelerate their move to the cloud, including how to make changes and monitor remotely.


The mobile app lets users view storage environment details, proactive recommendations, and customer support content, and it enables one-click engagement with NetApp for capacity additions and renewals.


By applying predictive risk analysis and automated responses, users can predict hardware failures in time for replacement and make the necessary software adjustments. Everyone is learning how to use telemetry to determine whether hardware has failed without being onsite. If it has failed, NetApp sends the correct part so that it can be fixed in a single visit. The software side is automatically monitored, and Ansible is used to automate the management of hundreds or thousands of controllers.
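As a rough illustration of using telemetry to spot failing hardware remotely, the sketch below flags disks whose error counter is climbing quickly. It is a simple threshold rule rather than a machine learning model, and it is not NetApp's method: the `DiskTelemetry` record, the `media_errors` field, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiskTelemetry:
    """Hypothetical telemetry record for one disk."""
    disk_id: str
    media_errors: Sequence[int]  # cumulative error counter, one value per daily report


def at_risk_disks(
    reports: Sequence[DiskTelemetry],
    window: int = 3,
    max_new_errors: int = 5,
) -> List[str]:
    """Return IDs of disks that logged more than max_new_errors over the last `window` days."""
    flagged = []
    for report in reports:
        # window + 1 samples span `window` days of counter growth.
        recent = report.media_errors[-(window + 1):]
        if len(recent) < 2:
            continue  # not enough history to measure growth
        new_errors = recent[-1] - recent[0]
        if new_errors > max_new_errors:
            flagged.append(report.disk_id)
    return flagged


# Hypothetical example: d1 is stable, d2 is accumulating errors fast.
reports = [
    DiskTelemetry("d1", [0, 0, 1, 1, 2]),
    DiskTelemetry("d2", [3, 4, 9, 15, 30]),
]
print(at_risk_disks(reports))  # ['d2']
```

The list of flagged IDs is the kind of output that could drive a parts shipment or an automation run, so that a single site visit is enough.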


NetApp has been able to help its clients deal with COVID-19 thanks to the experience it has gained over the years dealing with other disasters like 9/11, hurricanes, and earthquakes.



© 2020 by Tom Smith | ctsmithiii@gmail.com | @ctsmithiii