The Best Open Source Machine Learning Frameworks

In this article, we present what the author rates as the top eight open source machine learning frameworks.
Learning may be defined as the process of improving one’s ability to
perform a task efficiently. Machine learning is a sub-field of
computer science that enables computers to learn without being
explicitly programmed. It evolved from the study of pattern
recognition and computational learning theory in artificial
intelligence. Machine learning explores algorithms that can learn from
data and make predictions on it. In recent times, machine
learning has been deployed in a wide range of computing tasks where
designing and programming explicit algorithms is rather difficult,
such as email spam filtering, optical character recognition, search
engine improvement, digital image processing, data mining, etc.
Tom M. Mitchell, renowned computer scientist and professor at Carnegie
Mellon University, USA, defined machine learning as: “A computer program
is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.”
Machine learning tasks are broadly classified into three categories,
depending on the nature of the learning ‘signal’ or ‘feedback’ available
to a learning system.
- Supervised learning: This is the machine learning task of inferring a function from labelled training data. Each example is a pair consisting of an input object (typically a vector) and a desired output value (the supervisory signal).
- Unsupervised learning: This is the machine learning task of inferring a function to describe hidden structure in unlabelled data. It is closely related to the problem of density estimation in statistics.
- Reinforcement learning: This is an area of machine learning concerned with how software agents should take actions in an environment so as to maximise some notion of cumulative reward. It is also studied in diverse areas like game theory, information theory, swarm intelligence, statistics and genetic algorithms. In machine learning, the environment is typically formulated as a Markov decision process (MDP), because many reinforcement learning algorithms rely on dynamic programming techniques.
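To make the supervised case concrete, here is a minimal sketch in plain Python. It is an illustration only: the data values and the function names fit_line and predict are made up for this example, and ordinary least squares is just one simple way of inferring a function from labelled pairs.

```python
# Labelled training data: each example pairs an input x with a desired output y.
# The values are invented for illustration; they follow y = 2x + 1 exactly.
training_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def fit_line(pairs):
    """Infer slope and intercept from labelled pairs by ordinary least squares."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    variance = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(training_pairs)


def predict(x):
    """Apply the inferred function to an input that was not in the training data."""
    return slope * x + intercept


print(slope, intercept, predict(10.0))
```

The supervisory signal here is the y value attached to each x. An unsupervised method would see only the x values and would have to find structure in them without any such labels.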
The application of machine learning to diverse areas of computing is
gaining popularity rapidly, not only because of cheap and powerful
hardware, but also because of the increasing availability of free and
open source software, which enables machine learning to be implemented
easily. Machine learning practitioners and researchers, working as part of
software engineering teams, continuously build sophisticated
products, integrating intelligent algorithms into the final product to
make software work more reliably, quickly and without hassles.
There is a wide range of open source machine learning frameworks
available, which enable machine learning engineers to
build, implement and maintain machine learning systems, and to start new
projects that have real impact.
Let’s take a look at some of the top open source machine learning frameworks available.
Apache Singa
The Singa Project was initiated by the DB System Group at the National
University of Singapore in 2014, with a primary focus on distributed
deep learning by partitioning the model and data onto nodes in a cluster
and parallelising the training. Apache Singa provides a simple
programming model and works across a cluster of machines. It is
primarily used in natural language processing (NLP) and image
recognition. The Singa prototype, accepted by the Apache Incubator in March
2015, provides a flexible architecture for scalable distributed training
and is extensible to run over a wide range of hardware.
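The idea of partitioning the data across nodes and parallelising the training can be sketched in a few lines of plain Python. This is not Singa's API: the shards are ordinary lists standing in for cluster nodes, the model is a single weight w, and the names shard_gradient and learning_rate are assumptions made for the illustration. Each simulated worker computes a gradient on its own shard, and the averaged gradient drives one synchronous update.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error of the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


# Toy data following y = 3x, split into two equally sized shards ("nodes").
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
shards = [data[0:2], data[2:4]]

w = 0.0
learning_rate = 0.05
for step in range(200):
    # On a real cluster these gradients would be computed in parallel.
    gradients = [shard_gradient(w, shard) for shard in shards]
    # Synchronous update: wait for every worker, then average.
    w -= learning_rate * sum(gradients) / len(gradients)

print(w)
```

An asynchronous scheme would instead let each worker apply its gradient as soon as it is ready, trading some consistency for less waiting.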
Apache Singa was designed with an intuitive programming model based on
layer abstraction. A wide variety of popular deep learning models are
supported, such as feed-forward models like convolutional neural
networks (CNN), energy models like Restricted Boltzmann Machine (RBM),
and recurrent neural networks (RNN). Based on a flexible architecture,
Singa runs various synchronous, asynchronous and hybrid training
frameworks.
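A programming model based on layer abstraction can be pictured roughly as follows. The sketch is in plain Python and does not use Singa's actual classes; Layer, Dense, ReLU and Network are hypothetical names chosen for this illustration, and the weights are arbitrary.

```python
class Layer:
    """Base class: a layer maps a list of inputs to a list of outputs."""

    def forward(self, inputs):
        raise NotImplementedError


class Dense(Layer):
    """Fully connected layer with one row of weights per output unit."""

    def __init__(self, weights, biases):
        self.weights = weights
        self.biases = biases

    def forward(self, inputs):
        return [sum(w * x for w, x in zip(row, inputs)) + b
                for row, b in zip(self.weights, self.biases)]


class ReLU(Layer):
    """Element-wise rectifier: negative values become zero."""

    def forward(self, inputs):
        return [max(0.0, x) for x in inputs]


class Network:
    """A feed-forward model is an ordered stack of layers."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, inputs):
        for layer in self.layers:
            inputs = layer.forward(inputs)
        return inputs


net = Network([
    Dense([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ReLU(),
    Dense([[1.0, 2.0]], [0.1]),
])
output = net.forward([2.0, 1.0])
print(output)
```

Because every layer exposes the same forward method, new layer types can be added without changing the code that runs the network.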
Singa’s software stack has three main components: Core, IO and Model.
The Core component is concerned with memory management and tensor
operations. IO contains classes for reading and writing data to the disk
and the network. Model includes data structures and algorithms for
machine learning models.
Its main features are:
- Includes tensor abstraction, which gives stronger support for more advanced machine learning models
- Supports device abstraction for running on varied hardware devices
- Makes use of CMake for compilation rather than GNU Autotools
- Provides improved Python bindings and contains more deep learning models, like VGG and ResNet
- Includes enhanced IO classes for reading, writing, encoding and decoding files and data
The latest version is 1.0.
Website: http://singa.apache.org/en/index.html
Shogun
Shogun was initiated by Soeren Sonnenburg and Gunnar Raetsch in 1999 and
is currently under rapid development by a large team of programmers.
This free and open source toolbox written in C++ provides algorithms and
data structures for machine learning problems. The Shogun Toolbox can be
used through a unified interface from C++, Python, Octave,
R, Java, Lua and C#, and can run on Windows, Linux and even macOS.
Shogun is designed for unified large-scale learning for a broad range of
feature types and learning settings, like classification, regression,
dimensionality reduction, clustering, etc. It contains a number of
state-of-the-art algorithms, such as a wealth of efficient SVM
implementations, multiple kernel learning, kernel hypothesis testing,
Krylov methods, etc.
Shogun supports bindings to other machine learning libraries like
LibSVM, LibLinear, SVMLight, LibOCAS, libqp, VowpalWabbit, Tapkee, SLEP,
GPML and many more.
Its features include one-class classification, multi-class
classification, regression, structured output learning, pre-processing,
built-in model selection strategies, visualisation and test frameworks;
and semi-supervised, multi-task and large scale learning.
The latest version is 4.1.0.
Website: http://www.shogun-toolbox.org/
Apache Mahout
Apache Mahout is a free and open source project of the Apache
Software Foundation. Its goal is to develop free, distributed or scalable
machine learning algorithms for diverse areas like collaborative
filtering, clustering and classification. Mahout provides Java libraries
and Java collections for various kinds of mathematical operations.
Apache Mahout is implemented on top of Apache Hadoop using the MapReduce
paradigm. Once Big Data is stored on the Hadoop Distributed File System
(HDFS), Mahout provides the data science tools to automatically find
meaningful patterns in these Big Data sets, turning this into ‘big
information’ quickly and easily.
Its typical uses are:
- Building a recommendation engine: Mahout provides tools for building a recommendation engine via the Taste library, a fast and flexible engine for collaborative filtering (CF).
- Clustering with Mahout: Several clustering algorithms are supported by Mahout, like Canopy, k-Means, Mean-Shift, Dirichlet, etc.
- Categorising content with Mahout: Mahout uses the simple MapReduce-enabled naïve Bayes classifier.
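The collaborative filtering idea behind a recommendation engine can be sketched in plain Python. This is not Mahout's Java API and it does not run on Hadoop; the ratings, user names and the helper functions cosine_similarity and predict_rating are all invented for the illustration. It predicts a rating for an item a user has not seen by weighting other users' ratings by how similar their tastes are.

```python
import math

# Hypothetical user -> item -> rating data on a 1 to 5 scale.
ratings = {
    "alice": {"book_a": 5.0, "book_b": 3.0, "book_c": 4.0},
    "bob": {"book_a": 5.0, "book_b": 3.0, "book_d": 5.0},
    "carol": {"book_a": 1.0, "book_b": 5.0, "book_d": 1.0},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(a[item] ** 2 for item in shared))
    norm_b = math.sqrt(sum(b[item] ** 2 for item in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    weighted_sum = 0.0
    similarity_sum = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += similarity * other_ratings[item]
        similarity_sum += similarity
    return weighted_sum / similarity_sum if similarity_sum else None


prediction = predict_rating("alice", "book_d")
print(prediction)
```

Alice's ratings match Bob's more closely than Carol's, so Bob's high rating for book_d pulls the prediction up more than Carol's low rating pulls it down.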
The latest version is 0.12.2.
Website: https://mahout.apache.org/
Apache Spark MLlib
Apache Spark MLlib is a machine learning library, the primary objective
of which is to make practical machine learning scalable and easy. It
comprises common learning algorithms and utilities, including
classification, regression, clustering, collaborative filtering,
dimensionality reduction as well as lower-level optimisation primitives
and higher-level pipeline APIs.
Spark MLlib is a distributed machine learning framework on
top of the Spark Core which, mainly due to the distributed memory-based
Spark architecture, has been reported to be as much as nine times as
fast as the disk-based implementation used by Apache Mahout.
The various common machine learning and statistical algorithms that have been implemented and included with MLlib are:
- Summary statistics, correlations, hypothesis testing, random data generation
- Classification and regression: Support vector machines, logistic regression, linear regression, naïve Bayes classification
- Collaborative filtering techniques including Alternating Least Squares (ALS)
- Cluster analysis methods including k-means and Latent Dirichlet Allocation (LDA)
- Optimisation algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
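As an illustration of one algorithm from the list above, here is a minimal single-machine sketch of k-means (Lloyd's algorithm) in plain Python. It is not MLlib's distributed implementation; the function name k_means, the one-dimensional data and the starting centroids are assumptions made for the example.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(cluster) / len(cluster) if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = k_means(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

In a distributed setting the assignment step is the part that parallelises naturally, since each point can be assigned independently of the others.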
The latest version is 2.0.1.
Website: http://spark.apache.org/mllib/
TensorFlow
TensorFlow is an open source software library for machine learning
developed by the Google Brain Team for various sorts of perceptual and
language understanding tasks, and to conduct sophisticated research on
machine learning and deep neural networks. It is Google Brain’s second
generation machine learning system and can run on multiple CPUs and
GPUs. TensorFlow is deployed in various products of Google like speech
recognition, Gmail, Google Photos and even Search.
TensorFlow performs numerical computations using data flow graphs. A
data flow graph describes a mathematical computation as a directed graph
of nodes and edges. Nodes implement mathematical operations and can also
represent endpoints to feed in data, push out results or read/write
persistent variables. Edges describe the input/output relationships between nodes. Data edges carry dynamically-sized multi-dimensional data arrays, or tensors.
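The node-and-edge picture can be made concrete with a toy graph evaluator in plain Python. This is not TensorFlow's API; Placeholder, Operation and run are hypothetical names for this sketch, and plain floats stand in for tensors.

```python
class Placeholder:
    """Endpoint node through which data is fed into the graph."""

    def __init__(self, name):
        self.name = name


class Operation:
    """Node that applies a function to the values on its incoming edges."""

    def __init__(self, function, *inputs):
        self.function = function
        self.inputs = inputs


def run(node, feed):
    """Evaluate a node by first evaluating the nodes it depends on."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    arguments = [run(input_node, feed) for input_node in node.inputs]
    return node.function(*arguments)


# Build the graph for (x + y) * x. Nothing is computed at this point.
x = Placeholder("x")
y = Placeholder("y")
total = Operation(lambda a, b: a + b, x, y)
product = Operation(lambda a, b: a * b, total, x)

# Computation happens only when the graph is run with concrete data.
result = run(product, {"x": 3.0, "y": 4.0})
print(result)
```

Separating graph construction from execution is what lets a framework decide later where each node should run, for example on a CPU or a GPU.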
Its features are listed below.
- Highly flexible: TensorFlow enables users to write their own higher-level libraries on top of it by using C++ and Python, and express the neural network computation as a data flow graph.
- Portable: It can run on varied CPUs or GPUs, and even on mobile computing platforms. It also supports Docker and running via the cloud.
- Auto-differentiation: The user defines the computational architecture of a predictive model and combines it with an objective function; TensorFlow then computes the derivatives needed for gradient-based learning automatically.
- Language options: It has an easy Python based interface in which users can write code, and see visualisations and data flow graphs.
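Automatic differentiation can be illustrated with a very small reverse-mode sketch in plain Python. It is an assumption-laden toy, not how TensorFlow is implemented: the Value class supports only addition and multiplication on scalars, and all names are invented for this example.

```python
class Value:
    """Scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # derivative with respect to each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        """Apply the chain rule from this node back to every input."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


# Model: prediction = w * x + b. Objective: prediction squared.
w, x, b = Value(3.0), Value(2.0), Value(1.0)
prediction = w * x + b
loss = prediction * prediction
loss.backward()
print(loss.data, w.grad, b.grad)
```

The user writes only the forward computation; the gradients with respect to w and b fall out of the recorded graph, which is the convenience the auto-differentiation feature refers to.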
The latest version is 0.10.0.
Website: www.tensorflow.org
Oryx 2
Oryx 2 is a realisation of Lambda architecture built on Apache Spark and
Apache Kafka for real-time large scale machine learning. It is designed
for building applications and includes packaged, end-to-end
applications for collaborative filtering, classification, regression and
clustering.
Oryx 2 comprises the following three tiers.
- General Lambda architecture tier: Provides batch, speed and serving layers, which are not specific to machine learning.
- Machine learning specialisation tier: Built on top of the general tier, it provides machine learning abstractions for hyperparameter selection, etc.
- Application tier: Provides end-to-end implementations of standard machine learning algorithms (ALS, random decision forests, k-means) as applications, on top of the other two tiers.
Oryx 2 consists of the following layers of Lambda architecture as well as connecting elements.
- Batch layer: Used for computing new results from historical data and previous results.
- Speed layer: Produces and publishes incremental model updates from a stream of new data.
- Serving layer: Receives models and updates, and implements a synchronous API, exposing query operations on results.
- Data transport layer: Moves data between layers and takes input from external sources.
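How the layers above cooperate can be sketched in plain Python. This is not Oryx 2 code and it uses neither Spark nor Kafka: the "model" is just a running count and total, plain lists stand in for the data transport layer, and the class and method names are hypothetical.

```python
class BatchLayer:
    """Computes a fresh model from all historical data."""

    def build_model(self, history):
        return {"count": len(history), "total": sum(history)}


class SpeedLayer:
    """Produces an incremental model update from one new data point."""

    def update(self, model, new_value):
        return {"count": model["count"] + 1,
                "total": model["total"] + new_value}


class ServingLayer:
    """Holds the current model and answers queries against it."""

    def __init__(self, model):
        self.model = model

    def load(self, model):
        self.model = model

    def query_mean(self):
        return self.model["total"] / self.model["count"]


history = [10.0, 20.0, 30.0]   # stand-in for stored historical data
stream = [40.0]                # stand-in for a stream of new data

model = BatchLayer().build_model(history)
serving = ServingLayer(model)
mean_after_batch = serving.query_mean()

speed = SpeedLayer()
for value in stream:
    model = speed.update(model, value)
    serving.load(model)
mean_after_stream = serving.query_mean()

print(mean_after_batch, mean_after_stream)
```

The serving layer never waits for a full recomputation: it answers from the last batch model plus whatever incremental updates have arrived since.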
The latest version is 2.2.1.
Website: http://oryx.io/
Accord.NET
Accord.NET is a .NET open source machine learning framework for
scientific computing, and consists of multiple libraries for diverse
applications like statistical data processing, pattern recognition,
linear algebra, artificial neural networks, image and signal processing,
etc.
The framework is divided into libraries, which are available via an
installer, compressed archives and NuGet packages. They include
Accord.Math, Accord.Statistics, Accord.MachineLearning, Accord.Neuro,
Accord.Imaging, Accord.Audio, Accord.Vision, Accord.Controls,
Accord.Controls.Imaging, Accord.Controls.Audio, Accord.Controls.Vision, etc.
Its features are:
- A matrix library that increases code reusability and allows existing algorithms to be changed gradually, since it works over standard .NET structures.
- More than 40 different statistical distributions, including hidden Markov models and mixture models.
- More than 30 hypothesis tests like ANOVA, two-sample, multiple-sample, etc.
- More than 38 kernel functions for kernel methods such as support vector machines (SVM), kernel principal component analysis (KPCA) and kernel discriminant analysis (KDA).
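To show what a kernel function is, here is a minimal sketch of the Gaussian kernel and the kernel (Gram) matrix it produces. The sketch is in plain Python rather than C#, it does not use Accord.NET's classes, and the function names and sample points are assumptions made for the illustration.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Similarity that decays with the squared distance between two points."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2 * sigma ** 2))


def kernel_matrix(points, kernel):
    """Evaluate the kernel on every pair of points."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (3.0, 4.0)]
gram = kernel_matrix(points, gaussian_kernel)
for row in gram:
    print(row)
```

Kernel methods such as SVMs work from this matrix of pairwise similarities rather than from the raw coordinates, which is why a library of interchangeable kernel functions is useful.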
The latest version is 3.1.0.
Website: www.accord-framework.net
Amazon Machine Learning (AML)
Amazon Machine Learning (AML) is a machine learning service for developers. It has many visualisation tools and wizards for creating sophisticated machine learning models without any need to learn complex ML algorithms and technologies. Via AML, predictions for applications can be obtained using simple APIs, without writing custom prediction generation code or managing complex infrastructure.
AML is based on the simple, scalable, dynamic and flexible ML technology
used by Amazon’s internal community of data scientists to build
Amazon’s cloud services. AML connects to data stored in Amazon S3,
Redshift or RDS, and can run binary classification, multi-class
categorisation or regression on this data to create models.
The key concepts used in Amazon ML are listed below.
- Datasources: Contain metadata associated with data inputs to Amazon ML.
- ML models: Generate predictions using the patterns extracted from the input data.
- Evaluations: Measure the quality of ML models.
- Batch predictions: Asynchronously generate predictions for multiple input data observations.
- Real-time predictions: Synchronously generate predictions for individual data observations.
Its key features are:
- Supports multiple data sources within its system.
- Allows users to create a data source object from data residing in Amazon Redshift – the data warehouse Platform as a Service.
- Allows users to create a data source object from data stored in a MySQL database in Amazon RDS.
- Supports three types of models: binary classification, multi-class classification and regression.
Website: https://aws.amazon.com/machine-learning/
By Dr Anand Nayyar – January 17, 2017
https://opensourceforu.com/2017/01/best-open-source-machine-learning-frameworks/
GDPR and pushy sales people on holiday
I returned from holiday this summer and met a friend who asked where I had been. I told him I had visited Istanbul and had taken advantage of two other trips within the country, which were super deals: by comparison, each cost no more than a one-way, second-class ticket on a national rail train. One of these was a five-day tour of the south Mediterranean coast, including Antalya and Bodrum. The other was a three-day trip to Cappadocia. Both trips included inter-city flights, half-board hotels and tour guides who supervised the group professionally on designated coaches, which made the trips extremely convenient, including the transfers to and from the airport. The pretty much all-inclusive five-day trip, including return flights from Istanbul, cost just 139 pounds. The Cappadocia tour was even less.
My friend said he had always fancied Turkey as a holiday destination, but he was particularly averse to the overwhelming attitude of certain sales people trying to draw tourists into their shops to buy things that they didn’t really want. I understood exactly what he was talking about, as I had experienced that difficult scenario myself on former trips abroad. He was right: the last thing a tourist wants is to be pestered by salesmen with poor English trying to lure them off the street and into their shops to sell them souvenirs that they don’t want, or perhaps have already bought, when they are no longer in shopping mode but just sightseeing or trying to get to the places they had planned to see.
Thankfully, this tradition of pestering tourists, and in some cases playing on their emotions to get them to buy goods they had no interest in, has come to an end. In recent years the Turkish government has not only outlawed street beggars, but men standing outside independent stores drawing customers in are no longer a problem either.
Back in the UK, my combination boiler, which is over 15 years old, recently broke down. I had already exhausted all possible means of support to keep the boiler running, as the manufacturer with whom I had a maintenance contract had already written to me to say they would no longer be supporting this boiler, and our maintenance contract was terminated accordingly. All the third party service engineers who had visited to service the boiler since then also advised that I should simply give up on it, and said they would be happy to install a new one, at a cost that equates to almost 300 hours of work at minimum wage rates… In other words, for a lot of money!
I have one way or another managed to keep my boiler going for at least five years since the industry totally gave up on it, and recently came across another failure mode. I had turned off the Central Heating functions at the start of summer, as the weather had settled into summer temperatures and heating was not needed. At the end of summer temperatures dropped again, so I tried to turn on the Central Heating and it simply didn’t work, even though hot water was still consistent and working as well as expected.
I got out my installation manual for the boiler and followed the diagnostics flow charts. This helped me to narrow the fault down to a single component. Now all I had to do was find a spare part and replace it. It was at this point that I noticed a similarity to the pushy sales people I had met on holiday in the past, right here in the cradle of advanced technologies and high standards of living.
So, it turns out that primitive Artificial Intelligence (AI) is used in the modern search engines found across Europe, and in this instance I’m talking about eBay. I put in the part number as my search criterion and did not get many hits, since this is an old part number and no doubt the original manufacturer had sold off that part of the business to companies who had changed part numbers to suit their business, so the information I was looking for was buried and difficult to find. Obviously, eBay, like many other online retail platforms, is geared to sell ‘something’, even if it is not what I am looking for. So, in an effort to allow for changed part numbers and for other boilers where the same component might have been used, I began entering a description of the part instead of the original part number. For example, I tried a permutation of keywords including “Worcester Bosch boiler Central Heating Temperature sensor”, and the site relentlessly spewed out dozens of options, ranging from complete boilers to hundreds of parts that I am not at all interested in.
This is where the similarity to pushy sales people on holiday dawned on me. This is not the first instance where I have specified exactly what I am looking for to a search engine and, in return, it has suggested a thousand and one other things that I have no interest in buying. And of course, nowadays it is not uncommon for cookies to kick in, so that search engine service providers fire off our search behaviour online to thousands of ‘associates’, who all take advantage of our geolocation and the keywords we used, and throw useless information at us about what they think we should be looking for and ultimately buying. I wondered if this was any different to pushy sales people who approach us while we are trying to enjoy our holidays abroad, to sell us useless products that we have no interest in whatsoever.
It turns out that my friend is actively looking for a holiday destination, and I hear he has settled on Majorca, presumably because he will not be pressed to spend money on things he does not want there. Yet back at home, wherever that is, in our daily routines of sourcing items of interest, we are persistently bothered by virtual sales entities urging us to stray from what we are looking for, and this doesn’t seem to bother us. The mere thought of a human equivalent approaching us while we are on holiday, on the other hand, is enough to decide where we will not be going.
