How To Move Data Science Into Production

The future of data science is automated processes for quality, speed, and optimized results.

I had the opportunity to hear Michael Berthold, CEO of KNIME, deliver the introductory keynote at the Virtual KNIME Spring Summit. Michael equated moving data science into production with cooking.


When cooking, you create a new recipe by trying new ingredients in different combinations, amounts, and cooking styles. The recipe becomes part of the cookbook. Millions of cooks try the recipe once a month and a new cookbook, with new recipes, is published every few years.


In data science, the data scientist creates a new analysis integrating new data sources and experimenting with different models and tuning parameters. The optimized model goes into production and business analysts use insights and models daily. There are daily updates to the production model with new requirements and technology.


Common issues with data science deployment include technology breaks, incomplete coverage, and long, inefficient feedback cycles. Data science development involves a mix of tools that may not integrate seamlessly, and deploying models and gathering results can involve even more tools. Deployment tools often lack preprocessing and the most up-to-date models. Lastly, testing is not built in, and there is no fully automated, complete deployment.


Creating and productionizing data science involves a lot of complex, manual tasks: training data blending, custom data preprocessing, model optimization and training, model application (prediction), copied data preprocessing, and copied model application (prediction). These tasks share several problems: ETL pieces are copied by hand, transporting models is nontrivial, and some preprocessing steps are themselves models. This manual process is inefficient and error-prone.
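One way to picture the "copied preprocessing" problem is to look at what avoiding it takes: the preprocessing that was fitted on the training data ships together with the model as a single unit, so nothing has to be re-implemented by hand at deployment time. The following is a minimal sketch in plain Python, not KNIME's mechanism; `Standardizer`, `ThresholdModel`, `Bundle`, and `train` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Standardizer:
    """Preprocessing step whose parameters are learned from training data."""
    mu: float = 0.0
    sigma: float = 1.0

    def fit(self, xs):
        self.mu = mean(xs)
        self.sigma = pstdev(xs) or 1.0  # fall back to 1.0 on constant data
        return self

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]


@dataclass
class ThresholdModel:
    """Toy classifier: predicts 1 when the standardized value exceeds a cutoff."""
    cutoff: float = 0.0

    def fit(self, zs, labels):
        pos = [z for z, y in zip(zs, labels) if y == 1]
        neg = [z for z, y in zip(zs, labels) if y == 0]
        self.cutoff = (mean(pos) + mean(neg)) / 2  # midpoint between class means
        return self

    def predict(self, zs):
        return [1 if z > self.cutoff else 0 for z in zs]


@dataclass
class Bundle:
    """Preprocessing and model travel together as one deployable unit."""
    prep: Standardizer
    model: ThresholdModel

    def predict(self, raw):
        # Production input goes through the same fitted preprocessing as training.
        return self.model.predict(self.prep.transform(raw))


def train(raw, labels):
    prep = Standardizer().fit(raw)
    model = ThresholdModel().fit(prep.transform(raw), labels)
    return Bundle(prep, model)


bundle = train([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
print(bundle.predict([1.5, 8.5]))  # prints [0, 1]
```

Because the standardizer learns its mean and deviation from the training data, it behaves like a small model of its own, which is why copying it by hand into a separate deployment tool is easy to get wrong.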


The process needs to be automated with an analytics platform where models are continuously recorded, monitored, and retrained. A workflow models the data flow by combining nodes, which perform individual tasks on the data, with components, which encapsulate complexity and expertise.
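KNIME workflows are assembled visually rather than written as code, but the idea of nodes and components can be sketched in a few lines. In this illustration, which makes no claim about KNIME's internals, a node is assumed to be a function from rows to rows, and a component is several nodes wrapped so that they behave like one. The names `component`, `drop_missing`, `to_float`, and `flag_large`, along with the `amount` field, are all hypothetical.

```python
from functools import reduce
from typing import Callable, List

# Assumption for this sketch: a node maps a list of rows to a list of rows.
Node = Callable[[List[dict]], List[dict]]


def component(*nodes: Node) -> Node:
    """Wrap several nodes so they can be reused as a single node."""
    def run(rows: List[dict]) -> List[dict]:
        # Feed the output of each node into the next one, in order.
        return reduce(lambda data, node: node(data), nodes, rows)
    return run


def drop_missing(rows):
    return [r for r in rows if r.get("amount") is not None]


def to_float(rows):
    return [{**r, "amount": float(r["amount"])} for r in rows]


def flag_large(rows):
    return [{**r, "large": r["amount"] > 100.0} for r in rows]


# A component hides the cleaning details behind one reusable step.
clean = component(drop_missing, to_float)

# The workflow is itself just a composition of nodes and components.
workflow = component(clean, flag_large)

rows = [{"amount": "250"}, {"amount": None}, {"amount": "40"}]
result = workflow(rows)
print(result)
```

Since a component has the same shape as a node, it can be dropped into any workflow without the caller knowing what it contains.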


KNIME provides more than 3,000 nodes for depth and functionality, including data access (e.g., MySQL, Oracle, SAS, SPSS), big data (e.g., Hive, HDFS, Teradata, Spark), transformation (e.g., time series, Java, Python, third party), analysis and data mining (e.g., ML, DL, R, Python), visualization (e.g., R, Python, JavaScript), and deployment (e.g., BIRT, JSON, XML).


Data scientists can also mix and match external data, ML/DL libraries, distributed/cloud execution, scripting languages, and reporting and visualization tools to optimize model performance.


The platform provides a transparent blueprint of the process that can be customized based on the problem you’re trying to solve. There are workflows for model training, monitoring, and updating. Adjustments including new models and updated training algorithms can be deployed instantly.
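The monitoring and updating workflows mentioned above can be pictured as a simple loop: score the production model on fresh labeled data and retrain it when quality drops below a threshold. This is only a sketch under stated assumptions, not how KNIME implements it; `accuracy`, `monitor_and_update`, `retrain_threshold`, and the 0.8 threshold are hypothetical choices made for illustration.

```python
from statistics import mean


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return mean(1.0 if p == y else 0.0 for p, y in zip(predictions, labels))


def retrain_threshold(batch, labels):
    """Toy retraining step: put the cutoff midway between the class means."""
    pos = [x for x, y in zip(batch, labels) if y == 1]
    neg = [x for x, y in zip(batch, labels) if y == 0]
    cutoff = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x > cutoff else 0


def monitor_and_update(model, retrain, batch, labels, min_accuracy=0.8):
    """Return (model, retrained): keep the model if it still scores well,
    otherwise replace it with one retrained on the new batch."""
    score = accuracy([model(x) for x in batch], labels)
    if score >= min_accuracy:
        return model, False
    return retrain(batch, labels), True


# The production model was trained when values above 5 meant class 1.
old_model = lambda x: 1 if x > 5 else 0

# The data has drifted: every value is now above 5, so the old cutoff fails.
batch, labels = [10, 12, 20, 22], [0, 0, 1, 1]

new_model, retrained = monitor_and_update(old_model, retrain_threshold, batch, labels)
print(retrained)                      # prints True
print([new_model(x) for x in batch])  # prints [0, 0, 1, 1]
```

In a real deployment the check would run on a schedule and the retrained model would pass through testing before replacing the old one.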


Takeaways


Automating the production of data science with data analytics, reporting, and integration means:

  • No technology breaks with data science development, deployment of models, and the capture of results.

  • Complete coverage with more than 3,000 nodes.

  • A highly efficient feedback cycle with built-in testing and automated deployment, which enables model ops, governance, and compliance.


© 2020 by Tom Smith | ctsmithiii@gmail.com | @ctsmithiii