Top 10 Practices to Operationalize Your Data Science Projects in the Real World
Introduction
In data science projects, the derivation of business value follows something akin to the Pareto Principle: the vast majority of the business value is generated not from planning, scoping, or even producing a viable machine learning model, but from the final few steps, the operationalization of that project. More often than not, there is a disconnect between the worlds of development and production. Some teams choose to re-code everything in an entirely different language, while others change core elements of the project along the way, such as testing procedures, backup plans, and programming languages.
Operationalizing analytics products can become complicated as different opinions and methods vie for supremacy, resulting in projects that needlessly drag on for months beyond promised deadlines. Successfully building an analytics product and then operationalizing it is not an easy task — it becomes twice as hard when teams are isolated and playing by their own rules. Recently, I have been reading and learning more about the unique challenges of deploying machine learning models into production, challenges I also observed first-hand during my summer internship. In this post, I want to share the top 10 practices to operationalize your data science projects in the real world.
1 - Consistent Packaging and Release
In the process of operationalization, there are multiple workflows: some internal flows correspond to production, while other external or referential flows relate to specific environments. Moreover, data science projects are composed not only of application code but also of data transformation code, configuration, data schemas, and public or internal referential data.
That’s why, to support the reliable transport of code and data from one environment to the next, they need to be packaged together. Without proper packaging, code and data drift out of sync during operationalization, and these inconsistencies are particularly dangerous during training or when applying a predictive model. They can turn deployment to production into a significant challenge.
The first step toward consistent packaging and release for operationalization is to establish a versioning tool, such as Git, to manage all of the code versioning within your product. The next step is to package the code and data together: create packaging scripts that generate snapshots, in the form of a ZIP file, of both the code and the data, consistent with the model (or model parameters) that you need to ship, and then deploy that ZIP file to production. Lastly, be vigilant about situations where data files are too large to package (e.g., > 1 GB); in these scenarios, snapshot and version the required data files in a dedicated storage system.
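As a rough illustration, here is a minimal Python packaging sketch along these lines. The `src`, `config`, and `models` directories and the `releases` output folder are hypothetical placeholders for your own project layout; the only real assumption is that the project lives in a Git repository.

```python
# Minimal packaging sketch: bundle project code, configuration, and model
# artifacts into a single versioned ZIP, tagged with the current Git commit
# so the release can be traced back to its source.
import subprocess
import zipfile
from pathlib import Path

def package_release(output_dir: str = "releases") -> Path:
    # Use the short Git commit hash as the release identifier.
    commit = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    Path(output_dir).mkdir(exist_ok=True)
    archive = Path(output_dir) / f"release_{commit}.zip"

    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # Package code, configuration, and the trained model together so that
        # what runs in production matches what was validated in development.
        for folder in ("src", "config", "models"):
            for path in Path(folder).rglob("*"):
                if path.is_file():
                    zf.write(path, path.as_posix())
    return archive

if __name__ == "__main__":
    print(f"Packaged release: {package_release()}")
```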
2 - Continuous Retraining of Models
It is critical to implement an efficient strategy for the re-training, validation, and deployment of models. The process needs to follow an internal workflow loop, be validated, and then be passed on to production (with an API to log real outcomes). In data science projects, predictive models need to be updated regularly: in a competitive environment, models must be continuously enhanced, adjusted, and updated, and the environment and the underlying data are always changing.
The cost of not deploying a re-trained model can be huge. In a typical real-life situation, a scoring model’s AUC could degrade by 0.01 per week due to the natural drift of the input data — remember, Internet user behavior changes, and its related data changes with it. This means that a hard-won 0.05 performance gain that was painstakingly tuned during project setup could disappear within about five weeks.
The solution to the re-training challenge lies in the data science production workflow. This means that you need to implement a dedicated command for your workflow that does the following: (1) Re-trains the new predictive model candidate; (2) Re-scores and re-validates the model (this step produces the required metrics for your model); and (3) Swaps the old predictive model with the new one.
Regarding implementation, the re-train/re-score/re-validate steps should be automated and executed every week. The final swap is then manually executed by a human operator who performs a final consistency check. This approach provides a good balance: it automates most of the work and reduces re-training cost while keeping a human-in-the-loop consistency check.
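A hedged sketch of what such a workflow command could look like in Python with scikit-learn is shown below. The data-loading arguments, file paths, model class, and AUC threshold are all hypothetical placeholders; the final swap is deliberately kept as a separate, manually triggered step.

```python
# Minimal sketch of the re-train / re-score / re-validate / swap loop described
# above, using scikit-learn. Data loading, the model class, the file paths, and
# the AUC threshold are hypothetical placeholders.
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def retrain_candidate(X_train, y_train, X_val, y_val,
                      candidate_path="model_candidate.pkl"):
    # (1) Re-train a new predictive model candidate on fresh data.
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)

    # (2) Re-score and re-validate: produce the metrics needed to judge the candidate.
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    joblib.dump(model, candidate_path)
    return auc

def swap_if_approved(auc, threshold=0.80,
                     candidate_path="model_candidate.pkl",
                     production_path="model_production.pkl"):
    # (3) Swap the old model for the new one. In practice this step is triggered
    # by a human operator after a final consistency check, not run automatically.
    if auc >= threshold:
        joblib.dump(joblib.load(candidate_path), production_path)
        return True
    return False
```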
3 - From A/B Testing to Multivariate Optimization
The purpose of A/B testing different models is to be able to evaluate multiple models in parallel and then compare expected model performance to actual results. Offline testing is not sufficient when validating the performance of a data product. In use cases such as credit scoring and fraud detection, only real-world tests can provide the actual data output required; offline tests are simply unable to capture real-time events, such as credit authorizations. Furthermore, the real-world production setup may differ from your development setup; as mentioned above, data inconsistency is a major issue that results in misaligned production deployments. Finally, if the underlying data and its behavior are evolving rapidly, it will be difficult to validate the models fast enough to cope with the rate of change.
There are 3 levels of A/B testing that can be used to test the validity of models: (1) simple A/B testing, (2) multi-armed bandit testing, and (3) multi-variable armed bandit testing with optimization. The first, simple A/B testing, is required for most companies engaged in digital activities, while the last is used primarily in advanced, competitive real-time use cases (e.g., real-time bidding/advertising and algorithmic trading).
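To make the middle level concrete, here is a minimal epsilon-greedy bandit sketch for splitting live traffic across several model variants instead of a fixed 50/50 split. The model names and the reward signal (for example, a conversion or a confirmed correct prediction) are assumptions you would wire up to your own feedback loop.

```python
# Minimal epsilon-greedy router: send most traffic to the best-performing model
# so far, but keep exploring the alternatives with a small probability.
import random

class EpsilonGreedyRouter:
    def __init__(self, model_names, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {name: 0 for name in model_names}
        self.rewards = {name: 0.0 for name in model_names}

    def choose_model(self):
        # Explore with probability epsilon, otherwise exploit the current best model.
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))
        return max(
            self.counts,
            key=lambda m: self.rewards[m] / self.counts[m] if self.counts[m] else 0.0,
        )

    def record_outcome(self, model_name, reward):
        # Update running statistics once the real-world outcome is known.
        self.counts[model_name] += 1
        self.rewards[model_name] += reward

# Usage: router = EpsilonGreedyRouter(["model_a", "model_b", "model_c"])
```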
4 - Functional Monitoring
Functional monitoring is used to convey the business-facing performance of the model to the business sponsors/owners. From a business perspective, functional monitoring is critical because it provides an opportunity to demonstrate the end results of your predictive model and how it impacts the product. The kind of functional information that can be conveyed varies and depends largely on the industry and use case. Some applications of functional monitoring based on industry or functional need include:
Fraud - The number of predicted fraudulent events, the evolution of the prediction’s likelihood, the number of false-positive predictions, and rolling fraud figures.
Churn Reduction - The number of predicted churn events, key variables for churn prediction, and the efficiency of marketing strategies towards churners.
Pricing - Key variables of the pricing model, pricing drift, pricing variation over time, pricing variation across products, the evolution of margin evaluations per day/year, and average transformation ratios.
To ensure efficient functional monitoring, knowledge must be constantly shared and evangelized throughout the organization at every opportunity. A successful communication strategy lies at the heart of any effective organization, and it typically combines multiple channels (a minimal notification sketch follows this list):
A channel for the quick and continuous communication of events, such as a new model in production, outliers in production, or a drop or increase in model performance over the last 24 hours.
An e-mail-based channel with a daily report of key data, such as a subject line with core metrics, the top n customers matching specific model criteria, or a handful of key model metrics.
A real-time notification platform, such as Slack, is a popular option that provides flexible subscription options to stakeholders. If building a monitoring dashboard, visualization tools such as Tableau and Qlik are popular as well.
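As a rough sketch of the quick-notification channel, the snippet below pushes a simple model-performance message to a Slack incoming webhook. The webhook URL, the model name, and the AUC figures are hypothetical placeholders.

```python
# Push a functional monitoring alert to a Slack incoming webhook.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def notify_model_event(model_name: str, auc_today: float, auc_yesterday: float) -> None:
    delta = auc_today - auc_yesterday
    message = (
        f"Model `{model_name}`: AUC {auc_today:.3f} "
        f"({delta:+.3f} vs. previous 24 hours)"
    )
    # Slack incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
```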
5 - IT Environment Consistency
The smooth flow of the modeling process relies heavily on a consistent IT environment across development and production. Modern data science commonly uses technologies such as Python, R, Spark, and Scala, along with open-source frameworks/libraries such as H2O, scikit-learn, and MLlib.
In the past, data scientists used technologies that were already available in the production environment, such as SQL databases, Java, and .NET. In today’s predictive technology environment, it is not practical to translate a data science project into these older technologies — doing so incurs substantial re-write costs. Consequently, 80% of companies involved with predictive modeling use newer technologies such as Python and R.
Putting Python or R into production poses its own set of unique challenges in terms of environment and package management. This is due to the large number of packages typically involved; data science projects rely on an average of 100 R packages, 40 Python packages, and several hundred Java/Scala packages (most of them pulled in as Hadoop dependencies). Another challenge is keeping versions consistent with the development environment; for example, scikit-learn receives a significant update about twice a year.
Fortunately, there are multiple options available when establishing a consistent IT environment, and they come down to three main choices.
First, you can use the built-in mechanisms of the open-source distributions (e.g., virtualenv and pip for Python) or rely on third-party software (e.g., Anaconda for Python). Anaconda is becoming an increasingly popular choice among Python users, with one-third of our respondents indicating usage; for Spark, Scala, and R, the vast majority of the data science community relies solely on open-source options.
Second, you can build packages from source (e.g., pip installing from source) or use a binary mechanism (e.g., wheels). In the scientific community, binary distributions are enjoying increased popularity, partly because of the difficulty involved in building an optimized library that leverages all of the capabilities of scientific computing packages such as NumPy.
Finally, you can rely on a stable release with a common package list across all of your systems, or build a virtual environment for each project. In the former, IT maintains a common list of “trusted” packages and pushes those packages to development; in the latter, each data project has its own dedicated environment. Remember that the first significant migration or new product delivery may require you to maintain several environments in order to support the transition.
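Whatever option you choose, it helps to fail fast when the production environment drifts from the one the model was trained in. Below is a small, hypothetical sketch that checks installed package versions against a hand-maintained list of pins at startup; the package names and versions are examples only, and in practice the pins would come from your lockfile or environment specification.

```python
# Assert at startup that the packages loaded in production match the versions
# the model was trained against.
from importlib.metadata import version, PackageNotFoundError

EXPECTED_VERSIONS = {
    "scikit-learn": "1.3.2",  # example pin; use your own lockfile
    "numpy": "1.26.4",
}

def check_environment() -> None:
    for package, expected in EXPECTED_VERSIONS.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            raise RuntimeError(f"{package} is not installed in this environment")
        if installed != expected:
            raise RuntimeError(
                f"{package} version mismatch: expected {expected}, found {installed}"
            )

if __name__ == "__main__":
    check_environment()
```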
6 - Roll-Back Strategy
A roll-back strategy is required in order to return to a previous model version after the latest version has been deployed. For example, a model may show a 1% or 2% drop in performance after a new release. Without a functional roll-back plan, your team may face an existential crisis the first time something goes wrong with the model. A roll-back plan is like an insurance policy that provides a second chance in the production environment.
A successful roll-back strategy must cover all aspects of the data project: transformation code, data, software dependencies, and data schemas. The roll-back will need to be executable by users who may not be trained in predictive technologies, so it must be an accessible, easy-to-use procedure that an IT administrator can run. Roll-back procedures must be tested in a test environment and be available in both development and production environments.
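One minimal way to keep roll-backs that simple is to treat every release as an immutable, versioned directory and record which one is live in a small pointer file, so that rolling back is just rewriting the pointer. The sketch below assumes this layout; the `releases` directory and version names are hypothetical.

```python
# Simple pointer-based roll-back: model releases live in versioned directories,
# and a pointer file records which version is "live".
from pathlib import Path

RELEASES_DIR = Path("releases")          # e.g., releases/v12/, releases/v13/
POINTER_FILE = Path("current_release.txt")

def deploy(version: str) -> None:
    if not (RELEASES_DIR / version).is_dir():
        raise FileNotFoundError(f"No packaged release found for {version}")
    POINTER_FILE.write_text(version)

def rollback(previous_version: str) -> None:
    # Point production back at a previously validated release (code + data + schema).
    deploy(previous_version)

# Usage: rollback("v12")  # after v13 shows a drop in performance
```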
7 - Robust Data Flow
Preparing for the worst is part of intelligent strategizing; in the world of predictive analytics, this means having a robust failover strategy. After all, robust data science applications must rely on failover and validation procedures in order to maintain stability. A failover strategy’s job is to integrate all of the events in the production system, monitor the system when the cluster is struggling, and immediately alert IT if a job is not working. The question that needs to be asked is: how do you script a process so that it can rerun or recover in case of failure?
In traditional business intelligence systems, ETL (extract, transform, load) provides some failover mechanisms, but the failover and validation strategies depend on intrinsic knowledge of the application. In addition, ETL technologies do not connect well to Python, R, predictive models, or Hadoop. Some level of access is possible, but it is unlikely to provide the level of detail required for reliable scripting. For example, when running on Hadoop, it is common to leverage the underlying MD5 and hash facilities in order to check file consistency and store/manage the workflow state; this is typically not easy to do with ETL tools. Without a proper failover strategy, your data and analytics workflow will eventually fail, and the result will be a loss of credibility for the data science approach in your IT environment. It is important to dedicate time and attention to the creation of your failover strategy, as such strategies are notoriously difficult to perfect the first time around.
Formulating a failover strategy in a big data workflow presents some unique challenges, mostly due to the sheer volume of the data involved. It’s not feasible to take a “rebuild” approach, as there is just too much information to do this efficiently. Given this, a big data workflow must be “state-aware,” meaning that it must make decisions based on a previously calculated state, and ETL methods are typically not capable of encoding this kind of logic (a minimal sketch of such a state-aware check follows the list below). A few practices help here:
Be Parallel: Some processing can be parallelized at the workflow level, as opposed to the cluster, map-reduce, and Spark level. As the product evolves, it is likely that the number of branches will grow - using a parallel methodology helps to keep your system fast.
Intelligent Re-Execution: This simply means that data is automatically re-updated after a temporary interruption in data input, such as a late update or temporarily missing data. For example, your big data workflow may retrieve daily pricing data via FTP; your workflow combines this data with existing browser and order data in order to formulate a pricing strategy. If this 3rd party data is not updated, the pricing strategy can still be created using existing up-to-date data… but ideally, the data would be re-updated when the missing data becomes available.
User Interface: Graphically conveying a workflow enables users to more fully understand, and investigate, the overall progress of the workflow. At some point, a textual interface, or raw logs, reach their limit in terms of being able to describe the big picture. When this happens, an easy-to-use Web-based UI is the best option.
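As referenced above, here is a minimal sketch of a state-aware re-execution check: before re-running an expensive step, the workflow compares the MD5 checksum of the step's input against the checksum recorded at the last successful run, and skips the work if nothing has changed. The state-file format and file paths are assumptions for illustration.

```python
# State-aware step execution: skip a step if its input file is unchanged since
# the last successful run, as judged by its MD5 checksum.
import hashlib
import json
from pathlib import Path

STATE_FILE = Path("workflow_state.json")

def md5_of(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def run_if_changed(input_path: str, step) -> bool:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    checksum = md5_of(Path(input_path))
    if state.get(input_path) == checksum:
        return False  # input unchanged since last successful run; skip re-execution
    step(input_path)                       # re-run only the step whose input changed
    state[input_path] = checksum           # record the new state after success
    STATE_FILE.write_text(json.dumps(state))
    return True
```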
8 - Auditing
Auditing, in a data science environment, means being able to know which version of the code was used to create each version of an output. In regulated domains, such as healthcare and financial services, organizations must be able to trace everything that is related to a data science workflow. In this context, organizations must be able to do the following (a minimal audit-trail sketch appears after this list):
Trace any wrongdoing down to the specific person who modified a workflow for malicious purposes;
Prove that there is no illegal data usage, particularly personal data;
Trace the usage of sensitive data in order to avoid data leaks;
Demonstrate quality and the proper maintenance of the data flow.
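As a rough illustration of this kind of traceability, the sketch below appends one audit record per production run: who ran which workflow, the exact Git commit of the code, and a hash of the input data. The JSON-lines log file is an assumption; a real deployment would typically write to a tamper-resistant store.

```python
# Append an audit record for each production run so any output can be traced
# back to the code version, the input data, and the person who ran it.
import getpass
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

AUDIT_LOG = Path("audit_log.jsonl")

def log_run(workflow_name: str, input_path: str) -> None:
    commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": getpass.getuser(),
        "workflow": workflow_name,
        "code_version": commit,
        "input_md5": hashlib.md5(Path(input_path).read_bytes()).hexdigest(),
    }
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```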
Failure to comply with auditing requirements, particularly in highly regulated sectors, can have a profound impact on smooth business continuity. Regulation-sensitive organizations run the risk of heavy fines and/or the loss of highly coveted compliance status. Non-regulated companies must still meet auditing requirements in order to understand exactly what is going on with their data and workflows, especially if they are ever compromised. The ramifications of not implementing an auditing strategy are typically felt the most when a data science practice moves from the arena of experimentation to real-world production and critical use cases.
9 - Performance and Scalability
Performance and scalability go hand-in-hand: as scalability limits are tested (more data/customers/processes), performance needs to meet or exceed those limits. Strategically, the challenges lie in being able to create an elastic architecture; the kind of environment that can handle significant transitions (e.g., from 10 calls per hour to 100k per hour) without disruption. As you push your data science workflow into production, you need to consider appropriate increases in your production capability.
Volume Scalability: What happens when the volume of data you manage grows from a few gigabytes to dozens of terabytes?
Request Scalability: What happens when the number of customer requests is multiplied by 100?
Complexity Scalability: What happens when you increase the number of workflows, or processes, from 1 to 20?
Team Scalability: Can your team handle scalability-related changes? Can they cooperate, collaborate, and work concurrently?
Obviously, there is no silver bullet that solves all scalability problems at once. Some real-world examples, however, may help to illustrate the unique challenges of scalability and performance:
Overnight Data Overflow: Multiple dependent batch jobs that each last one or two hours tend to eventually overrun the expected timespan, effectively running throughout the night and into the next day. Without proper job management and careful monitoring, your resources could quickly be consumed by out-of-control processes.
Bottlenecks: Data bottlenecks can pose a significant problem in any architecture, no matter how many computing resources are used. Regular testing can help to alleviate this issue.
Logs and Bins: Data volume can grow quickly, but at the vanguard of data growth are logs and bins. This is particularly true when a Hadoop cluster or database is full — when searching for a culprit, always check the logs and bins first, as they’re typically full of garbage.
10 - Sustainable Model Lifecycle Management
We can simplify the journey from a prototyping analytics capability to robust productized analytics with the following steps: (1) deploying models and entire workflows to the production environment in a fast and effective manner; (2) monitoring and managing these models in terms of drift, and retraining them either regularly or according to a predefined trigger; and (3) ensuring that the models in production continue to serve their purpose as well as possible given changes in data and business needs. This last point is one that most organizations haven’t yet struggled with or even really encountered, but it’s vital to keep in mind now, because sustaining the lifecycle of models in production is the price of successfully deploying and managing them.
Model management is often concerned with the performance of models, and the key metrics are generally related to the accuracy of scored datasets. But the usefulness of a model is measured in terms of business metrics: if a model has excellent accuracy but no business impact, how could it be considered useful? An example is a churn prediction model that accurately predicts churn but provides no insight into how to reduce it. Even with measures of accuracy, sustainability becomes an issue: regular manual checks for drift, even if conducted monthly and in the most efficient manner, soon become unwieldy as the number of models that need to be checked multiplies. When you add monitoring for business metrics, the workload and complexity are even more daunting.
And finally, data is constantly shifting. Data sources are being changed, new ones are added, and new insights develop around this data. This means that models need to be constantly updated and refined in ways that simple retraining doesn’t address, and this is where the bulk of your team’s effort on sustainability will need to be focused.
To manage the lifecycle of models in a sustainable way, as well as to extend the lifecycle of these models, you need to be able to:
Manage all of your models from a central place, so that there is full visibility into model performance; measure and track the drift of models via an API from that same central location; and, to the fullest extent possible, provide for automated retraining and updating of these models (a minimal drift-tracking sketch follows this list);
Build web-apps and other tools to evaluate models against specific business metrics, so that everyone from the data scientists designing the models to end-users of analytics products is aligned on the goals of the models; and
Free up the time of data scientists and data engineers to focus on making models better and not only on addressing drift and lagging performance of existing models.
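To make the drift-tracking idea concrete, here is a hedged sketch that compares the score distribution observed in production against the distribution seen at training time using the Population Stability Index (PSI). The 0.10 and 0.25 thresholds are the commonly quoted rules of thumb, the scores are assumed to be probabilities in [0, 1], and the retraining trigger is left as a message rather than a real hook.

```python
# Track drift by comparing the production score distribution against the
# training-time score distribution with the Population Stability Index.
import numpy as np

def population_stability_index(expected, actual, bins: int = 10) -> float:
    # Scores are assumed to be probabilities in [0, 1]; bin both distributions
    # on a fixed grid and compare the bin proportions.
    edges = np.linspace(0.0, 1.0, bins + 1)
    expected_pct = np.histogram(expected, edges)[0] / len(expected)
    actual_pct = np.histogram(actual, edges)[0] / len(actual)
    # Avoid log(0) on empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

def check_drift(train_scores, prod_scores) -> str:
    psi = population_stability_index(np.asarray(train_scores), np.asarray(prod_scores))
    if psi < 0.10:
        return "stable"
    if psi < 0.25:
        return "moderate drift: investigate"
    return "significant drift: trigger retraining"
```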
Conclusion
The ultimate success of a data science project comes down to contributions from individual team members working together towards a common goal. As can be seen from the topics discussed, “effective contribution” goes beyond specialization in an individual skill-set. Team members must be aware of the bigger picture and embrace project-level requirements, from diligently packaging both code and data to creating Web-based dashboards for their project’s business owners. When all team members adopt a “big picture” approach, they are able to help each other complete tasks outside of their comfort zone.
Data science projects can be intimidating; after all, there are a lot of factors to consider. In today’s competitive environment, individual silos of knowledge will hinder your team’s effectiveness. Best practices, model management, communications, and risk management are all areas that need to be mastered when bringing a project to life. In order to do this, team members need to bring adaptability, a collaborative spirit, and flexibility to the table. With these ingredients, data science projects can successfully make the transition from the planning room to actual implementation in a business environment.