MLOps: Understanding the Framework and Its Key Challenges
Overview of MLOps
As Machine Learning (ML) models become more integrated into software systems, the new domain of MLOps—short for ML Operations—has started to take shape. MLOps encompasses practices aimed at ensuring the reliable and efficient deployment and maintenance of ML models in production environments. While there is a consensus that MLOps presents significant challenges, the reasons behind these difficulties often remain ambiguous. This article seeks to clarify MLOps by outlining its typical components across various organizations and ML applications.
MLOps Stack
To kick off, we will examine the MLOps tech stack that Machine Learning Engineers (MLEs) utilize to create and deploy machine learning models. The stack consists of four distinct layers:
- Run Layer: This layer tracks hyperparameters, data, and experiment runs. A run is essentially a log of one execution of an ML or data pipeline. Data at this level is typically managed through data catalogs, model registries, and training dashboards; a minimal run-logging sketch follows this overview. Examples include Weights & Biases, MLFlow, and AWS Glue.
- Pipeline Layer: Tailored for the development, training, deployment, and oversight of ML models, this layer is more granular than the run layer and details the dependencies among artifacts and computations. Pipelines can execute on-demand or follow a schedule, changing less frequently than runs but more than components. Notable examples include Sagemaker and Airflow.
- Component Layer: This layer is where the actual ML model is constructed, encompassing model training and feature selection. Components are individual computational nodes in a pipeline, often scripts executed within a managed environment. Some organizations maintain a centralized library of common components for reuse. Examples include TensorFlow and Spark.
- Infrastructure Layer: This layer provides the necessary computational resources to store and utilize ML models, including cloud storage and GPU-enabled computing. Changes in this layer are less frequent but carry substantial implications. Key examples include AWS and Google Cloud Platform.
Each layer plays a critical role in managing the ML lifecycle, from execution to infrastructure support. As you delve deeper into the stack, alterations become less common: while training jobs may be conducted daily, modifications to Dockerfiles occur infrequently.
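To make the run layer concrete, here is a minimal sketch of run-level tracking with MLflow. The experiment name, hyperparameter values, metric, and file paths are illustrative assumptions, not details from the interview study.

```python
import mlflow

# Run-layer bookkeeping: each pipeline execution is logged as a "run"
# together with its hyperparameters, metrics, and artifacts.
# All names and values below are illustrative.
mlflow.set_experiment("churn-model")

with mlflow.start_run(run_name="daily-retrain"):
    # Hyperparameters and data references for this execution
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("training_data", "s3://example-bucket/churn/latest/")  # hypothetical path

    # ... training happens here ...

    # Metrics computed on a held-out set
    mlflow.log_metric("val_accuracy", 0.91)

    # Artifacts such as a serialized model or an evaluation report
    mlflow.log_artifact("model.pkl")
```

A pipeline orchestrator such as Airflow or SageMaker would typically schedule the script containing this run, while the component layer supplies the training code itself.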
MLOps Tasks
To operationalize ML across this stack, MLEs work in a continuous loop of four main tasks:
- Data Collection: This entails sourcing data from multiple origins, centralizing it, and cleaning it; a small cleaning sketch follows this list. Proper labeling of data points is crucial, whether done in-house or outsourced.
- Model Experimentation: MLEs work to refine ML model performance by testing various features and architectures. This process often involves creating new features, altering existing ones, or revising the model architecture itself, with performance assessed using metrics like accuracy.
- Model Evaluation: Following training, a model must be evaluated to confirm its performance meets established criteria, requiring the computation of metrics on previously unseen data. If satisfactory, the model moves to deployment, which includes rigorous reviews and potential A/B testing.
- Model Maintenance: Post-deployment, continuous monitoring of ML pipelines is vital to identify any issues or irregularities. This involves real-time metric tracking, analyzing prediction quality, and addressing failures as they arise.
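As a rough illustration of the centralize-and-clean step, here is a small pandas sketch. The source files, join key, columns, and binary labels are all hypothetical.

```python
import pandas as pd

# Pull raw exports from multiple (hypothetical) sources and centralize them.
clicks = pd.read_csv("exports/clicks.csv")
purchases = pd.read_csv("exports/purchases.csv")
raw = clicks.merge(purchases, on="user_id", how="left")

# Basic cleaning: drop exact duplicates and rows with a missing label.
clean = raw.drop_duplicates().dropna(subset=["label"])

# Sanity-check labels before handing the dataset to training,
# whether they came from an in-house team or an external labeling service.
assert set(clean["label"].unique()) <= {0, 1}, "unexpected label values"
clean.to_csv("datasets/training_v1.csv", index=False)
```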
The ensuing sections will delve into effective strategies for Model Experimentation, Evaluation, and Maintenance employed by MLEs.
Model Experimentation
ML Engineering is inherently experimental and iterative, contrasting with traditional software engineering. Many experiments may not transition to production, emphasizing the need for rapid prototyping and validation. Here are common strategies MLEs adopt to cultivate successful experiment ideas:
- Collaboration and Idea Validation: Effective project ideas often emerge from teamwork with domain experts and data analysts. Collaborative exchanges and inter-team communication serve to refine concepts.
- Iterating on Data: MLEs iterate on the data at least as much as on the model itself. Improving performance often comes from introducing new features and repeatedly refining the dataset; a small feature-iteration sketch appears at the end of this section.
- Accounting for Diminishing Returns: ML projects typically follow a staged deployment model, with initial ideas validated offline before broader production deployment, focusing on high-impact experiments early on.
- Preferring Small, Incremental Changes: Following software engineering best practices, MLEs aim for small, incremental modifications to the codebase, which speeds up reviews and minimizes merge conflicts.
These strategies enhance the success of ML experiments by fostering collaboration, optimizing data use, prioritizing impactful ideas, and ensuring code quality.
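To show what iterating on data can look like in practice, here is a minimal sketch that compares a baseline feature set against the same set plus one candidate feature using cross-validation. The dataset path, column names, and choice of logistic regression are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("datasets/training_v1.csv")  # hypothetical dataset

# Compare a baseline feature set against the same set plus one candidate feature.
feature_sets = {
    "baseline": ["tenure", "num_purchases"],
    "baseline+recency": ["tenure", "num_purchases", "days_since_last_visit"],
}

for name, cols in feature_sets.items():
    scores = cross_val_score(
        LogisticRegression(max_iter=1000),
        df[cols], df["label"],
        cv=5, scoring="accuracy",
    )
    print(f"{name}: mean cross-validated accuracy {scores.mean():.3f}")
```

Small, cheap comparisons like this make it easier to discard unpromising ideas offline before they consume a deployment slot.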
Model Evaluation
Model evaluation processes must adapt to evolving data and business needs, ensuring that subpar models do not reach production. Here are some strategies organizations employ for effective evaluation:
- Dynamic Validation Datasets: Engineers regularly analyze live failures and fold them back into validation datasets so the same mistakes are caught before the next release, keeping evaluation in step with shifts in the data distribution; a minimal sketch appears at the end of this section.
- Standardized Validation Systems: Organizations strive to standardize evaluation processes to maintain consistency and minimize errors stemming from variable criteria.
- Multi-stage Deployment and Evaluation: Many organizations implement multi-stage deployment protocols, assessing models progressively to catch issues early.
- Tying Evaluation Metrics to Product Metrics: Evaluating models based on metrics aligned with product success—like user engagement—ensures their real-world impact is measured effectively.
Collectively, these efforts aim to guarantee that models are thoroughly assessed and aligned with organizational goals prior to production deployment.
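One way to implement a dynamic validation set together with a standardized offline gate is sketched below. The file paths, `example_id` column, feature list, and 0.90 threshold are assumptions; the candidate model is passed in from the experimentation step.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

FEATURES = ["tenure", "num_purchases", "days_since_last_visit"]  # hypothetical

def refresh_validation_set(validation_path: str, failures_path: str) -> pd.DataFrame:
    """Fold recent live failures back into the validation set so the same
    mistakes are caught before the next deployment (deduplicated by id)."""
    validation = pd.read_csv(validation_path)
    live_failures = pd.read_csv(failures_path)
    refreshed = pd.concat([validation, live_failures]).drop_duplicates(subset=["example_id"])
    refreshed.to_csv(validation_path, index=False)
    return refreshed

def offline_gate(model, validation: pd.DataFrame, threshold: float = 0.90) -> bool:
    """Standardized check: only models that clear the threshold move on."""
    preds = model.predict(validation[FEATURES])
    return accuracy_score(validation["label"], preds) >= threshold
```

In a multi-stage setup, a candidate that passes this offline gate would then move to a shadow or A/B stage before full rollout.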
Model Maintenance
Maintaining high-performance models necessitates intentional software engineering and organizational practices. Common strategies MLEs use during monitoring and debugging include:
- Creating New Versions: Regularly retraining models on live data helps combat data staleness, with retraining cycles ranging from hourly to monthly.
- Maintaining Old Versions as Fallback Models: To minimize downtime during issues, engineers preserve older or simpler models for quick reversion.
- Maintaining Layers of Heuristics: Models are augmented with rule-based layers to ensure stable predictions, filtering out inaccuracies based on domain knowledge.
- Validating Data Going In and Out of Pipelines: Continuous monitoring of features and predictions is essential, with validation checks in place to maintain data quality; a combined sketch of validation, fallbacks, and heuristics appears at the end of this section.
- Keeping it Simple: MLEs favor simplicity in models and algorithms to ease post-deployment maintenance.
- Organizational Support: Implementing processes that support MLEs, such as on-call rotations and tracking systems for production bugs, helps sustain model performance.
These strategies collectively assist MLEs in ensuring that production ML pipelines maintain accuracy and reliability.
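The sketch below combines three of the ideas above: validating inputs, reverting to an older fallback model, and filtering outputs with a heuristic layer. The models, feature bounds, and clamp values are hypothetical.

```python
import math

# Hypothetical per-feature bounds derived from domain knowledge.
FEATURE_BOUNDS = {"tenure": (0, 600), "num_purchases": (0, 10_000)}

def validate_features(features: dict) -> bool:
    """Reject requests whose features are missing, non-numeric, NaN, or out of range."""
    for name, (low, high) in FEATURE_BOUNDS.items():
        value = features.get(name)
        if not isinstance(value, (int, float)) or math.isnan(value) or not (low <= value <= high):
            return False
    return True

def predict(features: dict, primary_model, fallback_model) -> float:
    """Serve one prediction with validation, a fallback model, and a heuristic filter."""
    if not validate_features(features):
        # Invalid inputs never reach the model; return a safe default score.
        return 0.0
    row = [[features[name] for name in FEATURE_BOUNDS]]
    try:
        score = primary_model.predict_proba(row)[0][1]
    except Exception:
        # Revert to an older, simpler model rather than failing the request.
        score = fallback_model.predict_proba(row)[0][1]
    # Heuristic layer: clamp scores to a plausible range so obviously
    # inaccurate outputs are filtered before reaching downstream consumers.
    return min(max(score, 0.01), 0.99)
```

Keeping both the primary and the fallback model behind one simple interface like this also makes it easier to revert quickly when monitoring flags a problem.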
Conclusions
Within the MLOps tech stack—comprising run, pipeline, component, and infrastructure layers—three common factors significantly influence the success of ML model deployment:
- Velocity: The speed at which MLEs can prototype and develop new models is crucial for efficient hypothesis testing and development cycles.
- Validation: Proactive testing of changes and early identification of issues is essential to minimize costs and disruptions.
- Versioning: Effective management of multiple model versions allows for better query and debugging capabilities, facilitating quick responses to problems.
MLOps tools at each layer should strive to enhance user experience concerning the three Vs: Velocity, Validation, and Versioning. For instance, experiment tracking tools can expedite the iteration process, while feature stores improve model debugging through reproducibility.
Improvements
Despite these best practices, there remain significant areas for enhancement within MLOps. Common pain points include:
- Mismatch Between Development and Production Environments: Issues such as data leakage and inconsistent coding practices hinder deployment efficacy.
- Handling a Spectrum of Data Errors: Engineers contend with hard, soft, and drift errors, which leads to alert fatigue and makes it difficult to create meaningful data alerts.
- Unique Nature of ML Bugs: Debugging in ML differs markedly from traditional software engineering, complicating bug categorization.
- Prolonged Multi-Staged Deployments: The lengthy nature of end-to-end ML experimentation may lead to uncertainty and the abandonment of promising projects.
Even More Improvements
Further areas requiring attention include:
- Industry-Classroom Mismatch: A gap exists between academic training and industry needs, leaving practitioners unprepared for real-world challenges.
- Keeping GPUs Warm: The drive to utilize all computational resources often leads to running numerous experiments in parallel rather than focusing on impactful ones.
- Retrofitting Explanations: The pressure to justify successful experiments often results in explanations crafted post hoc, rather than based on initial principles.
- Undocumented Tribal Knowledge: Rapid learning outpaces documentation updates, creating challenges in knowledge sharing.
Addressing these anti-patterns requires educational initiatives, tool advancements, and process improvements to enhance MLOps practices' effectiveness and efficiency.
If you found this article informative, please share it to help others discover it!
Special thanks to Shreya Shankar for conducting the interviews that inspired this article. For more details, refer to the original paper, "Operationalizing Machine Learning: An Interview Study."