
In this article, we present what the author rates as the top eight open source machine learning frameworks.
Learning may be defined as the process of improving one’s ability to
perform a task efficiently. Machine learning is a sub-field of
computer science that enables computers to learn without being
explicitly programmed. It evolved from the study of pattern recognition
and computational learning theory in artificial intelligence. Machine
learning explores the study and construction of algorithms that can
learn from data and make predictions on it. In recent times, machine
learning has been deployed in a wide range of computing tasks, where
designing efficient algorithms and programs becomes rather difficult,
such as email spam filtering, optical character recognition, search
engine improvement, digital image processing, data mining, etc.
Tom M. Mitchell, renowned computer scientist and professor at Carnegie
Mellon University, USA, defined machine learning as: “A computer program
is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.”
Machine learning tasks are broadly classified into three categories,
depending on the nature of the learning ‘signal’ or ‘feedback’ available
to a learning system.
- Supervised learning: This is the machine learning task of inferring a function from labelled training data. Each example is a pair consisting of an input object (typically a vector) and a desired output value (the supervisory signal).
- Unsupervised learning: This is the machine learning task of inferring a function to describe hidden structure from unlabelled data. It is closely related to the problem of density estimation in statistics.
- Reinforcement learning: This is the area of machine learning concerned with how software agents should take actions in an environment so as to maximise some notion of cumulative reward. It is studied in diverse disciplines like game theory, information theory, swarm intelligence, statistics and genetic algorithms. In machine learning, the environment is typically formulated as a Markov decision process (MDP), because many reinforcement learning algorithms rely on dynamic programming techniques.
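To make the link between Markov decision processes and dynamic programming concrete, here is a minimal sketch of value iteration on a toy problem. The three-state chain, the reward of 1 for reaching the last state and the discount factor of 0.9 are all assumptions chosen for illustration; they do not come from any of the frameworks discussed in this article.

```python
# Toy MDP (hypothetical): states 0 -> 1 -> 2 on a line, state 2 is terminal.
GAMMA = 0.9                      # discount factor (assumed)
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
TERMINAL = 2


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if action == "left":
        nxt = max(state - 1, 0)
    else:
        nxt = min(state + 1, TERMINAL)
    reward = 1.0 if nxt == TERMINAL and state != TERMINAL else 0.0
    return nxt, reward


def action_value(state, action, values):
    """Immediate reward plus discounted value of the next state."""
    nxt, reward = step(state, action)
    return reward + GAMMA * values[nxt]


def value_iteration(tolerance=1e-9):
    """Repeatedly apply the Bellman optimality update until values settle."""
    values = {s: 0.0 for s in STATES}
    while True:
        largest_change = 0.0
        for s in STATES:
            if s == TERMINAL:
                continue
            best = max(action_value(s, a, values) for a in ACTIONS)
            largest_change = max(largest_change, abs(best - values[s]))
            values[s] = best
        if largest_change < tolerance:
            return values


def greedy_policy(values):
    """Pick, in each non-terminal state, the action with the highest value."""
    return {
        s: max(ACTIONS, key=lambda a: action_value(s, a, values))
        for s in STATES
        if s != TERMINAL
    }


values = value_iteration()
policy = greedy_policy(values)
```

On this toy chain the agent learns that moving right is the best action in both non-terminal states, because that is the only way to collect the reward.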
The application of machine learning to diverse areas of computing is
gaining popularity rapidly, not only because of cheap and powerful
hardware, but also because of the increasing availability of free and
open source software, which enables machine learning to be implemented
easily. Machine learning practitioners and researchers, being a part of
the software engineering team, continuously build sophisticated
products, integrating intelligent algorithms with the final product to
make software work more reliably, quickly and without hassles.
There is a wide range of open source machine learning frameworks
available, which enable machine learning engineers to build, implement
and maintain machine learning systems, and to get new projects off the
ground quickly.
Let’s take a look at some of the top open source machine learning frameworks available.
Apache Singa
The Singa Project was initiated by the DB System Group at the National
University of Singapore in 2014, with a primary focus on distributed
deep learning by partitioning the model and data onto nodes in a cluster
and parallelising the training. Apache Singa provides a simple
programming model and works across a cluster of machines. It is
primarily used in natural language processing (NLP) and image
recognition. Singa was accepted into the Apache Incubator in March
2015. It provides a flexible architecture for scalable distributed
training and is extensible to run over a wide range of hardware.
Apache Singa was designed with an intuitive programming model based on
layer abstraction. A wide variety of popular deep learning models are
supported, such as feed-forward models like convolutional neural
networks (CNN), energy models like Restricted Boltzmann Machine (RBM),
and recurrent neural networks (RNN). Based on a flexible architecture,
Singa runs various synchronous, asynchronous and hybrid training
frameworks.
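The idea of partitioning the data across nodes and training in parallel can be sketched in a few lines of plain Python. This is not Singa's API; it is an illustrative simulation of synchronous data-parallel training, in which hypothetical workers each compute a gradient on their own shard and the averaged gradient updates a shared weight. The one-parameter linear model, the learning rate and the number of workers are assumptions.

```python
def make_shards(data, num_workers):
    """Split the data set into one shard per (simulated) worker."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(weight, shard):
    """Gradient of the mean squared error of y = weight * x on one shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


def train_synchronous(data, num_workers=4, learning_rate=0.01, steps=50):
    """Every step waits for all workers, then applies the averaged gradient."""
    shards = make_shards(data, num_workers)
    weight = 0.0
    for _ in range(steps):
        # In a real cluster these gradients would be computed in parallel.
        gradients = [local_gradient(weight, shard) for shard in shards]
        weight -= learning_rate * sum(gradients) / len(gradients)
    return weight


# Synthetic data generated from y = 3x, so the ideal weight is 3.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
learned_weight = train_synchronous(data)
```

In an asynchronous scheme, each worker would instead apply its gradient to the shared weight as soon as it is ready, without waiting for the others; a hybrid scheme mixes the two approaches.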
Singa’s software stack has three main components: Core, IO and Model.
The Core component is concerned with memory management and tensor
operations. IO contains classes for reading and writing data to the disk
and the network. Model includes data structures and algorithms for
machine learning models.
Its main features are:
- Includes a tensor abstraction for stronger support of more advanced machine learning models
- Supports device abstraction for running on varied hardware devices
- Makes use of CMake for compilation rather than GNU Autotools
- Provides improved Python bindings and more deep learning models like VGG and ResNet
- Includes enhanced IO classes for reading, writing, encoding and decoding files and data
The latest version is 1.0.
Website: http://singa.apache.org/en/index.html
Shogun
Shogun was initiated by Soeren Sonnenburg and Gunnar Raetsch in 1999 and
is currently under rapid development by a large team of programmers.
This free and open source toolbox written in C++ provides algorithms and
data structures for machine learning problems. Shogun can be used
through a unified interface from C++, Python, Octave, R, Java, Lua and
C#; and can run on Windows, Linux and even macOS.
Shogun is designed for unified large-scale learning for a broad range of
feature types and learning settings, like classification, regression,
dimensionality reduction, clustering, etc. It contains a number of
state-of-the-art algorithms, such as a wealth of efficient SVM
implementations, multiple kernel learning, kernel hypothesis testing,
Krylov methods, etc.
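Kernel methods such as SVMs work with a kernel function that measures the similarity between two samples, and multiple kernel learning combines several such kernels. The sketch below is plain Python rather than Shogun code: the function names are hypothetical, and the combination weights are fixed by hand here, whereas a multiple kernel learning algorithm would learn them from data.

```python
import math


def linear_kernel(a, b):
    """Dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def gaussian_kernel(a, b, width=1.0):
    """Gaussian (RBF) kernel: equals 1 for identical points, decays with distance."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * width ** 2))


def combined_kernel(a, b, weights=(0.5, 0.5)):
    """Weighted sum of two kernels; the weights are assumed, not learned."""
    return weights[0] * linear_kernel(a, b) + weights[1] * gaussian_kernel(a, b)


def gram_matrix(points, kernel):
    """Matrix of pairwise kernel values, the usual input to a kernel method."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
gram = gram_matrix(points, combined_kernel)
```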
Shogun supports bindings to other machine learning libraries like
LibSVM, LibLinear, SVMLight, LibOCAS, libqp, VowpalWabbit, Tapkee, SLEP,
GPML and many more.
Its features include one-class classification, multi-class
classification, regression, structured output learning, pre-processing,
built-in model selection strategies, visualisation and test frameworks;
and semi-supervised, multi-task and large scale learning.
The latest version is 4.1.0.
Website: http://www.shogun-toolbox.org/
Apache Mahout
Apache Mahout, being a free and open source project of the Apache
Software Foundation, has a goal to develop free distributed or scalable
machine learning algorithms for diverse areas like collaborative
filtering, clustering and classification. Mahout provides Java libraries
and Java collections for various kinds of mathematical operations.
Apache Mahout is implemented on top of Apache Hadoop using the MapReduce
paradigm. Once Big Data is stored on the Hadoop Distributed File System
(HDFS), Mahout provides the data science tools to automatically find
meaningful patterns in these Big Data sets, turning this into ‘big
information’ quickly and easily.
Its primary use cases are:
- Building a recommendation engine: Mahout provides tools for building a recommendation engine via the Taste library, a fast and flexible engine for collaborative filtering (CF).
- Clustering with Mahout: Several clustering algorithms are supported by Mahout, like Canopy, k-Means, Mean-Shift, Dirichlet, etc.
- Categorising content with Mahout: Mahout uses a simple MapReduce-enabled naïve Bayes classifier.
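To show why naïve Bayes fits the MapReduce paradigm so well, the following sketch simulates the map, shuffle and reduce phases in plain Python. It is not Mahout code (Mahout is a Java library running on Hadoop); the tiny data set and all function names are hypothetical. Training reduces to counting (label, word) pairs, which is exactly the kind of work that can be spread over many machines.

```python
import math
from collections import defaultdict

# Hypothetical labelled documents.
DOCS = [
    ("spam", "win money now"),
    ("spam", "win prize money"),
    ("ham", "meeting at noon"),
    ("ham", "lunch at noon now"),
]


def map_phase(label, text):
    """Emit one ((label, word), 1) pair per word, as a mapper would."""
    for word in text.split():
        yield (label, word), 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def train(docs):
    word_pairs = [kv for label, text in docs for kv in map_phase(label, text)]
    word_counts = reduce_phase(shuffle(word_pairs))
    label_counts = reduce_phase(shuffle((label, 1) for label, _ in docs))
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability (Laplace smoothing)."""
    vocabulary = {word for _, word in word_counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        words_in_label = sum(c for (l, _), c in word_counts.items() if l == label)
        score = math.log(doc_count / total_docs)
        for word in text.split():
            count = word_counts.get((label, word), 0)
            score += math.log((count + 1) / (words_in_label + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(DOCS)
prediction = classify("win money", word_counts, label_counts)
```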
The latest version is 0.12.2.
Website: https://mahout.apache.org/
Apache Spark MLlib
Apache Spark MLlib is a machine learning library, the primary objective
of which is to make practical machine learning scalable and easy. It
comprises common learning algorithms and utilities, including
classification, regression, clustering, collaborative filtering,
dimensionality reduction as well as lower-level optimisation primitives
and higher-level pipeline APIs.
Spark MLlib is regarded as a distributed machine learning framework on
top of the Spark Core which, mainly due to the distributed memory-based
Spark architecture, is almost nine times as fast as the disk-based
implementation used by Apache Mahout.
The various common machine learning and statistical algorithms that have been implemented and included with MLlib are:
- Summary statistics, correlations, hypothesis testing, random data generation
- Classification and regression: Support vector machines, logistic regression, linear regression, naïve Bayes classification
- Collaborative filtering techniques including Alternating Least Squares (ALS)
- Cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)
- Optimisation algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
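As an illustration of one of the algorithms listed above, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It does not use the MLlib API, and the sample points and the choice of the first k points as initial centroids are assumptions made to keep the example deterministic. MLlib's contribution is running this kind of computation over data distributed across a cluster.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iterations=100):
    """Alternate between assigning points and recomputing centroids."""
    centroids = [points[i] for i in range(k)]   # simple deterministic start
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:       # nothing moved: converged
            break
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                          # keep old centroid if empty
                centroids[c] = mean(members)
    return centroids, assignments


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, assignments = kmeans(points, k=2)
```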
The latest version is 2.0.1.
Website: http://spark.apache.org/mllib/
TensorFlow
TensorFlow is an open source software library for machine learning
developed by the Google Brain Team for various sorts of perceptual and
language understanding tasks, and to conduct sophisticated research on
machine learning and deep neural networks. It is Google Brain’s second
generation machine learning system and can run on multiple CPUs and
GPUs. TensorFlow is deployed in various products of Google like speech
recognition, Gmail, Google Photos and even Search.
TensorFlow performs numerical computations using data flow graphs. These
elaborate the mathematical computations with a directed graph of nodes
and edges. Nodes implement mathematical operations and can also
represent endpoints to feed in data, push out results or read/write
persistent variables. Edges describe the input/output relationships between nodes. Data edges carry dynamically-sized multi-dimensional data arrays or tensors.
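The data flow graph idea can be illustrated without TensorFlow itself. In the sketch below, which uses hypothetical class and function names rather than the TensorFlow API, each node holds an operation and its input edges, placeholder nodes act as endpoints that are fed with data, and evaluating the output node pulls values through the graph.

```python
import operator


class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""

    def __init__(self, op=None, inputs=(), name=""):
        self.op = op                  # None marks a placeholder (input endpoint)
        self.inputs = tuple(inputs)   # incoming edges
        self.name = name


def placeholder(name):
    return Node(name=name)


def evaluate(node, feed, cache=None):
    """Compute a node's value, evaluating each upstream node only once."""
    cache = {} if cache is None else cache
    if node in cache:
        return cache[node]
    if node.op is None:
        result = feed[node.name]
    else:
        result = node.op(*(evaluate(i, feed, cache) for i in node.inputs))
    cache[node] = result
    return result


# Build the graph for y = w * x + b; nothing is computed at this point.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
product = Node(operator.mul, [w, x], name="mul")
y = Node(operator.add, [product, b], name="add")

# Feed data into the endpoints and run the graph.
result = evaluate(y, {"x": 2.0, "w": 3.0, "b": 1.0})
```

Separating graph construction from execution is what allows such a system to place different nodes on different CPUs or GPUs.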
Its features are listed below.
- Highly flexible: TensorFlow enables users to write their own higher-level libraries on top of it by using C++ and Python, and express the neural network computation as a data flow graph.
- Portable: It can run on varied CPUs or GPUs, and even on mobile computing platforms. It also supports Docker and running via the cloud.
- Auto-differentiation: The user defines the computational architecture of a predictive model and combines it with an objective function; TensorFlow then automatically computes the derivatives needed for gradient-based learning.
- Diverse language options: It has an easy Python based interface and enables users to write code, and see visualisations and data flow graphs.
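Automatic differentiation is easier to appreciate with a small example. The following is a minimal sketch of reverse-mode differentiation for scalar values in plain Python; the Var class and the backward function are hypothetical names for illustration and are not part of TensorFlow. Each operation records its inputs together with the local derivative, and a backward pass applies the chain rule from the output to the inputs.

```python
class Var:
    """A scalar that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents    # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Var(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate gradients from the output to every input (chain rule)."""
    order, seen = [], set()

    def visit(var):
        if id(var) in seen:
            return
        seen.add(id(var))
        for parent, _ in var.parents:
            visit(parent)
        order.append(var)         # inputs are listed before their consumers

    visit(output)
    output.grad = 1.0
    for var in reversed(order):   # consumers are processed before their inputs
        for parent, local_derivative in var.parents:
            parent.grad += local_derivative * var.grad


# Squared error of a one-weight linear model: loss = (w * x + b - target)^2
w, x, b, target = Var(3.0), Var(2.0), Var(1.0), Var(5.0)
error = w * x + b - target
loss = error * error
backward(loss)
```

After the backward pass, w.grad and b.grad hold the derivatives of the loss with respect to the weight and the bias, which is what a gradient-based optimiser needs to update them.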
The latest version is 0.10.0.
Website: www.tensorflow.org
Oryx 2
Oryx 2 is a realisation of Lambda architecture built on Apache Spark and
Apache Kafka for real-time large scale machine learning. It is designed
for building applications and includes packaged, end-to-end
applications for collaborative filtering, classification, regression and
clustering.
Oryx 2 comprises the following three tiers.
- General Lambda architecture tier: Provides batch, speed and serving layers, which are not specific to machine learning.
- A specialisation on top of that tier, which provides machine learning abstractions for hyperparameter selection, etc.
- An end-to-end implementation of standard machine learning algorithms (ALS, random decision forests, k-means) as an application on top.
Oryx 2 consists of the following layers of Lambda architecture as well as connecting elements.
- Batch layer: Used for computing new results from historical data and previous results.
- Speed layer: Produces and publishes incremental model updates from a stream of new data.
- Serving layer: Receives models and updates, and implements a synchronous API, exposing query operations on results.
- Data transport layer: Moves data between layers and takes input from external sources.
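How these layers cooperate can be sketched with a deliberately simple 'model': the running average of a value per item. The classes below are hypothetical and are not Oryx 2 code; in Oryx 2 the layers run on Spark and exchange data through Kafka, whereas here they are plain Python objects that call each other directly.

```python
class BatchLayer:
    """Builds a complete model from all historical data."""

    def build_model(self, history):
        model = {}
        for key, value in history:
            total, count = model.get(key, (0.0, 0))
            model[key] = (total + value, count + 1)
        return model


class SpeedLayer:
    """Turns each new event into a small incremental model update."""

    def make_update(self, event):
        key, value = event
        return key, value, 1


class ServingLayer:
    """Holds the current model in memory and answers queries."""

    def __init__(self):
        self.model = {}

    def load_model(self, model):
        self.model = dict(model)

    def apply_update(self, update):
        key, total, count = update
        old_total, old_count = self.model.get(key, (0.0, 0))
        self.model[key] = (old_total + total, old_count + count)

    def query(self, key):
        total, count = self.model[key]
        return total / count


history = [("item-a", 4.0), ("item-a", 2.0), ("item-b", 5.0)]
batch, speed, serving = BatchLayer(), SpeedLayer(), ServingLayer()

serving.load_model(batch.build_model(history))       # periodic batch model
answer_before = serving.query("item-a")

serving.apply_update(speed.make_update(("item-a", 6.0)))  # streaming update
answer_after = serving.query("item-a")
```

The serving layer can answer queries with fresh results between batch runs, because the speed layer keeps patching the last batch model with updates from the stream.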
The latest version is 2.2.1.
Website: http://oryx.io/
Accord.NET
Accord.NET is a .NET open source machine learning framework for
scientific computing, and consists of multiple libraries for diverse
applications like statistical data processing, pattern recognition,
linear algebra, artificial neural networks, image and signal processing,
etc.
The framework is divided into libraries, which are available via the
installer, compressed archives and NuGet packages. These include Accord.Math,
Accord.Statistics, Accord.MachineLearning, Accord.Neuro, Accord.Imaging,
Accord.Audio, Accord.Vision, Accord.Controls, Accord.Controls.Imaging,
Accord.Controls.Audio, Accord.Controls.Vision, etc.
Its features are:
- Matrix library for an increase in code reusability, and gradual change of existing algorithms over standard .NET structures.
- Includes more than 40 statistical distributions, along with hidden Markov models and mixture models.
- Includes more than 30 hypothesis tests like ANOVA, two-sample, multiple-sample, etc.
- Includes more than 38 kernel functions for kernel methods such as kernel support vector machines (KSVM), kernel principal component analysis (KPCA) and kernel discriminant analysis (KDA).
The latest version is 3.1.0.
Website: www.accord-framework.net
Amazon Machine Learning (AML)
Amazon Machine Learning (AML) is a machine learning service for developers. It has many visualisation tools and wizards for creating high-end sophisticated and intelligent machine learning models without any need to learn complex ML algorithms and technologies. Via AML, predictions for applications can be obtained using simple APIs without using custom prediction generation code or complex infrastructure.
AML is based on the simple, scalable, dynamic and flexible ML technology
used by Amazon’s internal data scientist community to create
Amazon Cloud Services. AML connects to data stored in Amazon S3,
Redshift or RDS, and can run binary classification, multi-class
categorisation or regression on this data to create models.
The key concepts used in Amazon ML are listed below.
- Datasources: Contain metadata associated with data inputs to Amazon ML.
- ML models: Generate predictions using the patterns extracted from the input data.
- Evaluations: Measure the quality of ML models.
- Batch predictions: Asynchronously generate predictions for multiple input data observations.
- Real-time predictions: Synchronously generate predictions for individual data observations.
Its key features are:
- Supports multiple data sources within its system.
- Allows users to create a data source object from data residing in Amazon Redshift – the data warehouse Platform as a Service.
- Allows users to create a data source object from data stored in a MySQL database in Amazon RDS.
- Supports three types of models: binary classification, multi-class classification and regression.
Website: https://aws.amazon.com/machine-learning/
By Dr Anand Nayyar – January 17, 2017
https://opensourceforu.com/2017/01/best-open-source-machine-learning-frameworks/











