From self-driving cars to medical assistants, businesses have found a multitude of ways to leverage machine learning and artificial intelligence to develop smart and responsive products. However, I am often intrigued to notice that the application of machine learning at the infrastructure level is almost nonexistent. For instance, while we have models to predict weather patterns, predicting hardware requirements for the next purchase cycle still happens in Excel sheets. And how often have you seen machine learning used to determine optimal data models? AI2 is my vision of an adaptive, intelligent, and robust infrastructure that uses artificial intelligence (and machine learning, data science, etc.) to evolve with the changing nature of the business. Below are some challenges and opportunities for AI2.
Challenges for AI2:
It is not that machine learning is never used in the infrastructure domain, but it has not been embraced enough, and for good reason. The biggest challenge is how to deal with wrong predictions. You laugh if Siri or Google Home doesn’t understand your voice command. But if a machine learning model at the infrastructure level makes a wrong decision, it can cause catastrophic failures. For instance, an intelligent load balancer (something more than a random assigner) that starts assigning all requests to a single machine can bring the whole website or service down. There can be many reasons for a wrong decision but, unlike in end-user products, a wrong decision at the infrastructure level can be disastrous.
There is another challenge related to resolution latency. Siri can evolve. If it gets the wrong result, you think it’s your problem, or you retry. Data scientists and machine learning engineers working behind Siri will observe these retries and eventually use the data to make Siri better. Infrastructure engineers don’t have the same luxury of time. If the infrastructure is down, they have to fix it right now. Given the time pressure, it’s better to have a deterministic system, or a system whose decision process you can easily trace. Machine learning models are often black boxes and extremely difficult to debug. Hence, from an engineering standpoint, it’s often better to trade some efficiency for reliability and the ability to resolve issues quickly.
Working at Uber, I realized that not all infrastructure problems can be solved using machine learning. At the same time, I found many problems where machine learning provides the right solution. Below is a rundown of some of these problems:
Application of ML/AI In Data Infrastructure:
Data Quality: There is no debate about how valuable data is to a company. Machine learning solutions rely on good-quality data to generate reasonable predictions. But rarely is there an infrastructure-level effort to ensure the quality of all the data. More often than not, it is left up to data engineers and/or machine learning engineers to determine the quality of the data. You might want to read about how my team at Uber solved this problem.
Data Model: Having a good data model is extremely important for getting the speed and agility needed to analyze the data and find valuable insights. However, building a data model is often left up to a data engineer with no help from the system. Machine learning techniques can be used to analyze SQL queries and extract patterns, such as which columns from different tables are used together and could therefore be combined during ETL. Historical SQL query patterns can be leveraged in many other ways as well. See the following posts on how we used historical SQL queries to inform database management.
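To make the idea of mining query history concrete, here is a minimal sketch, not the system we built at Uber. It assumes every column is written as a fully qualified `table.column` reference with no aliases, and it uses a regular expression where a real implementation would use a proper SQL parser. The table names and the `cross_table_pairs` helper are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Matches qualified references such as "trips.city_id".
# Assumption: queries never use table aliases or quoted identifiers.
QUALIFIED_COLUMN = re.compile(r"\b([a-z_][a-z0-9_]*)\.([a-z_][a-z0-9_]*)\b", re.IGNORECASE)


def cross_table_pairs(queries):
    """Count how often columns from two different tables appear in the same query."""
    counts = Counter()
    for query in queries:
        columns = sorted({(t.lower(), c.lower()) for t, c in QUALIFIED_COLUMN.findall(query)})
        for left, right in combinations(columns, 2):
            if left[0] != right[0]:  # keep only pairs that span two tables
                counts[(".".join(left), ".".join(right))] += 1
    return counts


# Hypothetical query history.
history = [
    "SELECT trips.fare, cities.name FROM trips JOIN cities ON trips.city_id = cities.id",
    "SELECT cities.name, trips.fare FROM trips JOIN cities ON trips.city_id = cities.id "
    "WHERE trips.fare > 10",
    "SELECT drivers.rating FROM drivers WHERE drivers.rating > 4.5",
]

pair_counts = cross_table_pairs(history)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Pairs that show up again and again, such as `cities.name` with `trips.fare` here, are candidates for being pre-joined into one table during ETL.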
Hot/Warm/Cold Segmentation: To minimize storage cost, heavily used data is typically stored on low-latency disks and rarely used data on cheaper, slower storage. However, the criterion for deciding whether data is hot, warm, or cold is often just its recency. For instance, one might come up with a rule that places the last week of data in hot storage, data from one week to one quarter old in warm storage, and anything older in cold storage. However, not all data has the same usage, and a general rule like this is not an optimal solution. Using access logs and SQL queries, one can quickly identify usage patterns and build a predictive model of how different data products will be used over time. Such a model can help in two ways. First, it can lead to more savings by reducing the amount of data that needs to be kept in hot storage. Second, it can improve the performance of queries that need a longer range of data and would otherwise have to pull from slow warm or cold storage continually.
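As a rough illustration of usage-driven tiering, the sketch below forecasts accesses with an exponentially weighted moving average and assigns a tier from the forecast. The moving average stands in for whatever predictive model one would actually train. The dataset names, the smoothing factor, and the thresholds are all assumptions made up for the example.

```python
def forecast_accesses(daily_counts, alpha=0.5):
    """Exponentially weighted moving average over daily access counts (oldest first)."""
    forecast = float(daily_counts[0])
    for count in daily_counts[1:]:
        forecast = alpha * count + (1 - alpha) * forecast
    return forecast


def assign_tier(forecast, hot_threshold=50.0, warm_threshold=5.0):
    """Map a forecast of daily accesses to a storage tier (thresholds are hypothetical)."""
    if forecast >= hot_threshold:
        return "hot"
    if forecast >= warm_threshold:
        return "warm"
    return "cold"


# Hypothetical access counts per day, extracted from access logs.
access_history = {
    "trips_2016": [120, 110, 130, 125],  # old data that is still heavily queried
    "trips_last_week": [3, 2, 1, 0],     # recent data that is rarely read
    "billing_q1": [10, 8, 12, 9],
}

tiers = {
    name: assign_tier(forecast_accesses(counts))
    for name, counts in access_history.items()
}
print(tiers)
```

A recency rule would put `trips_last_week` in hot storage and `trips_2016` in cold storage. The usage-based forecast does the opposite for this made-up data.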
Intelligent scheduling: Many tasks (such as training a machine learning model or generating reports) are scheduled in an ad hoc manner. But one can identify the times when the ad hoc query load on the cluster is minimal and pick those as the optimal times for running such scheduled tasks.
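A minimal sketch of this idea: given an average ad hoc query load per hour of the day, find the contiguous window with the lowest total load. The load profile below is invented, and `quietest_window` is a hypothetical helper. A real scheduler would also account for day-of-week effects and task dependencies.

```python
def quietest_window(hourly_load, duration_hours):
    """Return (start_hour, total_load) of the lowest-load window, wrapping past midnight."""
    hours = len(hourly_load)
    best_start, best_load = 0, float("inf")
    for start in range(hours):
        load = sum(hourly_load[(start + offset) % hours] for offset in range(duration_hours))
        if load < best_load:  # ties keep the earliest start hour
            best_start, best_load = start, load
    return best_start, best_load


# Hypothetical average number of ad hoc queries per hour, index 0 = midnight.
hourly_queries = [
    12, 8, 5, 3, 2, 4, 15, 40, 80, 95, 90, 85,
    70, 88, 92, 86, 75, 60, 45, 35, 30, 25, 20, 15,
]

start_hour, load = quietest_window(hourly_queries, duration_hours=3)
print(f"Schedule the 3-hour job at {start_hour:02d}:00 (expected load {load})")
```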
Application of ML/AI In Infrastructure:
Staged Rollout: CI/CD (Continuous Integration and Continuous Deployment) is the new mantra in many companies. However, this creates a challenge for site reliability engineers. Ideally, you want to detect a bad build before it is deployed, and hence you have tons of unit and integration tests. Nevertheless, not all bugs can be caught by unit or integration tests, and therefore you have dashboards that engineers monitor while they are deploying a change. At Uber, we looked at the different metrics that engineers use as early indicators of issues with a deployment. We then leveraged our staged rollout mechanism to detect service degradation automatically. It was interesting that something as simple as applying a t-test on top of the existing infrastructure now saves thousands of dollars in engineering productivity.
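To show what a t-test on rollout metrics can look like, here is a small sketch using only the Python standard library. It is an illustration under assumptions, not Uber's implementation: it uses Welch's variant of the t-test, the metric is a made-up error rate sampled once per minute, and the fixed `critical_t` threshold is hypothetical. In practice the threshold would be derived from the t distribution for a chosen significance level.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(baseline, canary):
    """Welch's t statistic and approximate degrees of freedom for two samples.

    Assumes each sample has at least two points and non-zero variance overall.
    """
    n1, n2 = len(baseline), len(canary)
    v1, v2 = variance(baseline) / n1, variance(canary) / n2
    t = (mean(canary) - mean(baseline)) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof


def is_degraded(baseline, canary, critical_t=2.5):
    """Flag the rollout when the canary metric is significantly higher (worse)."""
    t, _ = welch_t(baseline, canary)
    return t > critical_t


# Hypothetical error rates (%) per minute on the old build and on two new builds.
baseline = [0.8, 1.1, 0.9, 1.0, 1.2, 0.9, 1.0, 1.1]
bad_canary = [1.9, 2.2, 2.0, 2.4, 1.8, 2.1, 2.3, 2.0]
healthy_canary = [1.0, 0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9]

print("bad build degraded:", is_degraded(baseline, bad_canary))
print("healthy build degraded:", is_degraded(baseline, healthy_canary))
```

The test is one-sided on purpose: for an error rate, only an increase should halt the rollout.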
Determine Machine Type: While selecting a machine, an engineer has to think about several parameters: CPU, memory, network bandwidth, GPU, etc. Why not let data science drive these decisions and optimize them over time? The engineer can provide initial guidelines (in data science terms, a prior), and based on usage patterns across the above parameters, an intelligent system can be built to select the right type of machine for a given service.
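One simple way to combine an engineer's prior with observed usage is sketched below. The prior is treated as a number of pseudo-observations, so the estimate is a weighted average that moves toward the data as more usage is logged. The machine catalog, the headroom factor, and all numbers are hypothetical, and only CPU and memory are considered to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineType:
    name: str
    cpus: int
    memory_gb: int


# Hypothetical catalog, ordered from smallest to largest.
CATALOG = [
    MachineType("small", 2, 8),
    MachineType("medium", 4, 16),
    MachineType("large", 8, 32),
    MachineType("xlarge", 16, 64),
]


def posterior_mean(prior_mean, prior_weight, observations):
    """Blend the engineer's prior with observed usage.

    prior_weight is how many observations the prior is worth.
    """
    return (prior_weight * prior_mean + sum(observations)) / (prior_weight + len(observations))


def recommend(cpu_estimate, memory_estimate, headroom=1.25):
    """Smallest catalog entry covering both estimates with headroom, or None."""
    for machine in CATALOG:
        if (machine.cpus >= cpu_estimate * headroom
                and machine.memory_gb >= memory_estimate * headroom):
            return machine
    return None


# The engineer's initial guess: 6 CPUs and 24 GB, worth two observations.
prior_choice = recommend(6, 24)

# Hypothetical daily peak usage logged after launch.
cpu_peaks = [2.0, 2.5, 2.2, 2.4, 2.1, 2.3, 2.5, 2.0]
memory_peaks = [8, 9, 10, 9, 8, 9, 9, 8]

choice = recommend(
    posterior_mean(6, 2, cpu_peaks),
    posterior_mean(24, 2, memory_peaks),
)
print("from prior alone:", prior_choice.name)
print("after observing usage:", choice.name)
```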
Where To Start:
The above list is not exhaustive. It is mainly meant to spark some ideas on how ML/AI can be effectively utilized in the infrastructure domain. A lot of this work is already being done in many big companies. However, what’s missing is a comprehensive strategy for moving to AI2. Most of these efforts are made in an ad hoc manner in separate pockets of the company. To truly embrace ML/AI in the infrastructure domain, we need a comprehensive strategy, one that starts with logging data.
One of my biggest challenges in leading the infrastructure data science team at Uber has always been a lack of data. Machine learning engineers working on the product side of the business have the luxury of petabytes of beautifully modeled data. On the infrastructure side, either the data isn’t being logged or, if it is logged, it cannot be joined to data from other parts of the infrastructure because there is no common identifier. The vision of AI2 can only be realized once we start treating data as a first-class citizen.
There are many more things that have to come along, but that might be a topic for some other time.