A Friendly Introduction to Open Source Data Science for Business Leaders
Introduction
The field of data science has steadily gained a foothold in the corporate sector over the past decade and is now an integral part of business strategy for some of the world's most successful companies. And as the scope of enterprise data science has expanded, so too have the tools data scientists use to solve complex problems, from building models that identify and retain high-value customers to creating highly effective product recommendation engines.
Proprietary data science solutions, once the mainstay of enterprise data science, are being eclipsed by open source projects like R, Spark, and TensorFlow. While there are many reasons for this shift, a major one is that open source tools are available to anyone, and with that comes endless opportunity for collaboration and contribution. No longer are open source tools considered unreliable or limited; instead, they have been embraced by the data science community at large and built out to the point that they deliver measurable value, even in an enterprise setting.
The Origin of Open Source for Data Science
It could be argued that data science as we know it today would not exist without open-source software. Certainly, the emergence of the Apache Hadoop data processing framework and its ecosystem of complementary open-source projects changed the game, enabling companies to work with data that had previously been ignored because it was too costly and complex to store, process, and analyze in traditional proprietary data warehousing environments. Apache Hadoop, along with the Apache Spark in-memory data processing framework and their associated stream processing and machine learning projects, has been instrumental in lowering the cost of storing and processing the large volumes of data that fuel data science experiments and operational applications.
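To give a sense of how approachable these frameworks are in practice, below is a minimal PySpark sketch that reads a hypothetical file of customer events from a Hadoop-style file system and counts events per customer across a cluster; the path and column names are illustrative only, not part of any real deployment.

```python
# Minimal PySpark sketch: distributed aggregation over a (hypothetical) dataset.
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; on a cluster this coordinates the workers.
spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Hypothetical input file and columns, used purely for illustration.
events = spark.read.csv("hdfs:///data/customer_events.csv",
                        header=True, inferSchema=True)

# Count events per customer; the work is distributed across the cluster.
counts = events.groupBy("customer_id").count()
counts.show(10)

spark.stop()
```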
Open source has also been instrumental in driving DevOps, the confluence of application development and IT operations, spurring innovation in agile development and automation, including source code management and collaboration (e.g., Git) and container orchestration and management (e.g., Kubernetes). It is likewise a key enabler for enterprise data science platforms, which have the potential to deliver more extensive and, in some cases, automated analytics than is possible with business intelligence and other more traditional forms of analysis. Indeed, many enterprise data science platforms embrace a wide range of analytic approaches and techniques, including machine learning (supervised and unsupervised), deep learning, and transfer learning. Many of the platforms on the market today take advantage of open-source software, adding proprietary features on top of an open-source code base to deliver additional essential capabilities. They also draw on open-source software not only to carry out the analysis itself and the attendant data wrangling and cleansing, but to store and process data, using Hadoop and Spark.
Cultural and Technical Advantages of Open Source Data Science
1 - Open Culture
Open-source programming languages are loved by programmers and data scientists alike, as the ongoing enthusiasm for R and the increasing popularity of Python for data science attest. Open-source development has always fostered a culture of collaboration: software is developed and improved collectively, with each contributor's work building on the others', and that collaborative mindset carries over to the engineers and data scientists who use it. Indeed, open-source offerings have become central to bringing new data science capabilities to market, in large part because data scientists and developers already embrace open-source development and the spirit of community it engenders. Finally, open culture has roots in academic and research institutions, and it transfers into enterprises when academics and researchers pursue careers in commercial organizations.
2 - Open Innovation
The rapid pace of collaborative development means open-source data science is often first to deliver innovative capabilities, enabling enterprises to capitalize on cutting-edge technology and techniques. Because open-source software is freely available, it attracts large communities, which means more people are available to improve the software and to fix it when it breaks. Moreover, open-source general-purpose programming languages such as Python make for good data science tools because they are already widely used, and their broad capabilities mean they can be integrated more easily into the data pipelines and end-user applications an enterprise already runs.
3 - Free and Freely Available
Open-source offerings enable enterprises to get started on data science projects with little up-front investment, since there are no licensing fees or mandatory support contracts to pay for. Open-source data science therefore makes a foray into AI and machine learning economically attractive. Furthermore, it is often easier to recruit experienced people with an open-source background: programming languages commonly used for data science, such as R, are taught in academic institutions and widely adopted in research, so they have become the lingua franca for a generation of data scientists.
4 - Licensing and Support Considerations
Open-source software can be forked and customized, and open-source data science software is no exception. That places the onus on the individuals using it to follow proper version-control and maintenance practices, an increasingly onerous task when the software is used extensively by large teams in enterprises. The issue is particularly acute in open-source data science because of the sheer range of choice, and because it is not immediately obvious how well supported, or how widely adopted, a particular offering is. Additionally, the open-source reliance on individual contributors and their skill sets can create another kind of support issue: when individuals leave an enterprise, they take their skills with them, potentially leaving the company in the lurch if their replacements do not have the same expertise.
The Open Source Data Science Toolbox
There are numerous open-source tools and frameworks available. While new open source projects are appearing all the time, many are well established and widely adopted.
The Python and R open-source programming languages have been available for decades. Python was conceived in the late 1980s and first released in 1991, while R first appeared in 1993. The data science capabilities they contain, the add-ons built for them, and their loyal followings have made them strong foundations for enterprise data science.
Open-source foundations have played a critical role in nurturing and supporting open-source data science development. The Apache Software Foundation (ASF) is currently incubating MXNet, and also supports Mahout, one of the first machine learning libraries for the Apache Hadoop ecosystem, as well as MLlib, SINGA, and Zeppelin. Meanwhile, the Eclipse Foundation is home to Deeplearning4j (DL4J), a deep learning library for Java and the JVM.
Some of these open-source projects can be traced to individual technology vendors and service providers. TensorFlow, which originated at Google and is increasingly popular for deep learning, is a prime example, along with Microsoft's Cognitive Toolkit, Facebook's Caffe2, and Preferred Networks' Chainer. scikit-learn, meanwhile, began life as a Google Summer of Code project and is now maintained by a broad community of contributors.
Academic and research institutions have played their part in building out the open-source data science ecosystem. UC Berkeley created the Caffe deep learning framework, while the Apache Spark family of projects, including the MLlib machine learning library, was developed at UC Berkeley’s AMPLab. Similarly, XGBoost originated as a research project at the University of Washington.
Below is a list of key open-source data science projects that are popular among academic and industry practitioners:
Anaconda Distribution is an open-source environment for Python-based data science and machine learning development, testing, and training on a single Linux, Windows, or Mac OS X-based machine.
Apache Mahout primarily provides collaborative filtering, clustering, and classification for developer use cases. It works first and foremost with the Apache Spark open-source cluster computing framework, which is recommended as its back-end, although it does support others.
Apache MXNet is a deep learning library for developers. It is primarily designed to accelerate the development and deployment of large-scale neural networks.
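As a flavor of the developer experience, here is a minimal sketch using MXNet's Gluon API to define and run a small feed-forward network; the layer sizes and input shape are arbitrary choices made purely for illustration.

```python
# Minimal MXNet Gluon sketch: define a small network and run a forward pass.
from mxnet import nd
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(64, activation="relu"),  # hidden layer (illustrative size)
        nn.Dense(10))                     # output layer
net.initialize()

# A random batch of 4 examples with 20 features each, just to exercise the model.
x = nd.random.uniform(shape=(4, 20))
print(net(x).shape)  # (4, 10)
```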
Caffe is an open-source deep learning framework. It was designed with an expressive architecture in mind to encourage developers to innovate on it, while also providing speed for high-performance model building for research experimentation purposes, as well as commercial deployment.
Caffe2 has been developed atop the open-source Caffe code base and is primarily designed to make Caffe highly scalable. Caffe2 has Python and C++ APIs so developers can prototype deep learning models and optimize them later in other environments. Caffe2 is also integrated with various mobile application development environments.
Chainer is an open-source, Python-based neural network framework for developers to craft deep learning models. Chainer aims to be flexible by supporting a ‘define-by-run’ neural network definition, which is designed to enable dynamic changes in the neural network. In addition, Chainer enables deep learning model building using Recurrent Neural Networks and Variational Auto-Encoders, which are considered state-of-the-art.
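The 'define-by-run' idea is easiest to see in code. Below is a minimal sketch in which the network graph is built as the forward pass executes, so ordinary Python control flow can change the computation from one call to the next; the layer sizes and the extra-pass loop are purely illustrative.

```python
# Minimal Chainer sketch: a small chain whose forward pass is ordinary Python.
import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class SmallNet(chainer.Chain):
    def __init__(self):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, 32)  # input size inferred on first call
            self.l2 = L.Linear(32, 2)

    def forward(self, x, n_extra_passes=1):
        h = F.relu(self.l1(x))
        for _ in range(n_extra_passes):   # dynamic: can differ on every call
            h = F.relu(h)
        return self.l2(h)

net = SmallNet()
x = np.random.rand(4, 10).astype(np.float32)
print(net.forward(x, n_extra_passes=2).shape)  # (4, 2)
```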
H2O is an in-memory open-source machine learning platform. It is primarily used for exploring and analyzing data, and it supports fast serialization between nodes and clusters in order to handle large datasets. H2O has a GUI, although data scientists and developers who prefer to code can do so because it integrates with Python and R.
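For those who do prefer to code, a minimal Python sketch might look like the following: it starts (or connects to) a local H2O cluster, loads a dataset, and trains a gradient boosting model. The file name and the 'churned' target column are hypothetical.

```python
# Minimal H2O sketch: train a gradient boosting model from Python.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()  # starts a local single-node cluster if none is already running

frame = h2o.import_file("customers.csv")        # hypothetical dataset
frame["churned"] = frame["churned"].asfactor()  # treat the target as categorical
predictors = [c for c in frame.columns if c != "churned"]

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=predictors, y="churned", training_frame=frame)
print(model.auc())  # training AUC for the binary target
```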
Keras is an open-source high-level neural network API written in Python and capable of running on TensorFlow and Cognitive Toolkit. Keras aims to provide developers with fast and user-friendly experimentation when prototyping deep learning models mainly involving CNNs or RNNs (or both).
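The emphasis on fast, user-friendly experimentation shows in how little code a working model takes. Here is a minimal sketch of a small fully connected network for binary classification; the layer sizes and the random placeholder data are illustrative only.

```python
# Minimal Keras sketch: define, compile, and fit a tiny binary classifier.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(32, activation="relu", input_shape=(20,)),  # hidden layer
    Dense(1, activation="sigmoid"),                   # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random placeholder data, purely to show the workflow.
X = np.random.rand(100, 20)
y = np.random.randint(0, 2, size=(100,))
model.fit(X, y, epochs=3, batch_size=16)
```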
Cognitive Toolkit is an open-source deep-learning toolkit for the creation, training, and evaluation of neural networks. It is designed for developers, data scientists, and students interested in the latest algorithms and techniques.
RapidMiner Studio is an open-source visual workflow designer with data preparation and machine learning capabilities.
scikit-learn is a general-purpose framework for Python developers and data scientists who require out-of-the-box machine learning for classification, regression, clustering, model selection, feature extraction, and other data mining and analysis tasks. It is built on Python's NumPy, SciPy, and Matplotlib libraries.
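The 'out-of-the-box' character is visible in a minimal sketch like the one below, which fits a standard classifier on one of the datasets bundled with the library and reports held-out accuracy; the choice of model and parameters is illustrative, not a recommendation.

```python
# Minimal scikit-learn sketch: train and evaluate an off-the-shelf classifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```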
TensorFlow provides developers and data scientists with an open-source software library for high-performance numerical computation involving machine learning. It is frequently used for deep learning and designed to be deployed in a variety of environments including desktops, clusters of servers, mobile platforms, and edge devices.
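As a minimal sketch, and assuming TensorFlow 2.x (where the Keras API ships as tf.keras), the following defines, compiles, and trains a tiny model on random placeholder data purely to show the workflow.

```python
# Minimal TensorFlow sketch (assumes TensorFlow 2.x with the bundled Keras API).
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Random placeholder data, purely illustrative.
X = np.random.rand(64, 8).astype("float32")
y = np.random.rand(64, 1).astype("float32")
model.fit(X, y, epochs=2, batch_size=8)
```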
Torch is an open-source machine learning library, a scientific computing framework, and a scripting language based on the Lua programming language, and it is therefore designed for multiple use cases. At the heart of Torch is a foundational library for deep learning, which developers and data scientists can use to create deep learning models or to build wrapper libraries that simplify neural network creation.
XGBoost is a gradient boosting framework for C++, Java, Python, R, and Julia. It initially became popular among the Kaggle community where it has been used for a large number of competitions, although nowadays, it is used in commercial machine-learning-based data science too.
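A minimal sketch using XGBoost's scikit-learn-compatible Python interface is shown below; it trains a gradient-boosted classifier on a dataset bundled with scikit-learn, and the hyperparameters are illustrative rather than tuned.

```python
# Minimal XGBoost sketch: gradient-boosted classification via the sklearn-style API.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # mean accuracy on the held-out set
```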
Conclusions
Open source is a key enabler for enterprise data science, both in terms of the growing ecosystem of open-source tools and the expanding number of complementary enterprise data science platforms that incorporate and build on open-source languages and tools. The challenge is identifying which of those tools are relevant and valuable to your business. Assessing the maturity of these projects, grappling with any licensing issues, and making sure your team has the right skill set to use them are challenges many companies now face.
It is evident that there are a great many paths to open-source data science. Whichever route an enterprise takes, what is clear is that open source will continue to drive innovation in data science tools and platforms, and to accelerate enterprise adoption of data science, which in turn will reshape how enterprises use analytics and build intelligent applications.