
New open-source Machine Learning Framework written in Java


I am happy to announce that the Datumbox Machine Learning Framework is now open-sourced under GPL 3.0, and you can download its code from GitHub!

What is this Framework?

The Datumbox Machine Learning Framework is an open-source framework written in Java which enables the rapid development of Machine Learning models and Statistical applications. It is the code that currently powers the Datumbox API. The main focus of the framework is to include a large number of machine learning algorithms and statistical methods and to handle small to medium-sized datasets. Even though the framework aims to assist the development of models in various fields, it also provides tools that are particularly useful in Natural Language Processing and Text Analysis applications.

What types of models/algorithms are supported?

The framework is divided into several layers: Machine Learning, Statistics, Mathematics, Algorithms and Utilities. Each of them provides a series of classes that are used for training machine learning models. The two most important layers are the Statistics layer and the Machine Learning layer.

The Statistics layer provides classes for calculating descriptive statistics, performing various types of sampling, estimating CDFs and PDFs from commonly used probability distributions, and performing over 35 parametric and non-parametric tests. Classes of this kind are usually needed for exploratory data analysis, sampling and feature selection.
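To give a feel for the kind of calculation that falls under descriptive statistics, here is a minimal sketch in plain Java. It does not use the Datumbox API: the class name `DescriptiveStatsSketch` and its methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

// Plain-Java illustration only. The class and method names are hypothetical
// and are NOT part of the Datumbox Framework API.
final class DescriptiveStatsSketch {

    // Arithmetic mean of a non-empty sample.
    static double mean(double[] x) {
        if (x.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Unbiased sample variance (divides by n - 1); needs at least two values.
    static double sampleVariance(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two values");
        }
        double m = mean(x);
        double squaredDeviations = 0.0;
        for (double v : x) {
            squaredDeviations += (v - m) * (v - m);
        }
        return squaredDeviations / (x.length - 1);
    }

    // Median, computed on a sorted copy so the caller's array is left untouched.
    static double median(double[] x) {
        if (x.length == 0) {
            throw new IllegalArgumentException("empty sample");
        }
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

Calling `DescriptiveStatsSketch.mean(values)`, `sampleVariance(values)` and `median(values)` on a `double[]` gives the three summaries; a real statistics layer would add many more measures and handle missing values.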

The Machine Learning layer provides classes that can be used in a large number of problems including Classification, Regression, Cluster Analysis, Topic Modeling, Dimensionality Reduction, Feature Selection, Ensemble Learning and Recommender Systems. Here are some of the supported algorithms: LDA, Max Entropy, Naive Bayes, SVM, Bootstrap Aggregating, Adaboost, Kmeans, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and more.
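As an illustration of one algorithm from this list, below is a rough plain-Java sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a teaching example under simplifying assumptions (whitespace tokenization, everything in memory) and not the framework's implementation: the class `NaiveBayesSketch` and its `train` and `predict` methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only. Names are hypothetical and are NOT the
// Datumbox Framework API.
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Adds one labelled document to the per-label token counts.
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Returns the label with the highest log-posterior for the given text.
    String predict(String text) {
        if (totalDocs == 0) {
            throw new IllegalStateException("no training data");
        }
        String[] tokens = tokenize(text);
        String bestLabel = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior: share of training documents carrying this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(label);
            int labelTokens = tokensPerLabel.getOrDefault(label, 0);
            for (String token : tokens) {
                int count = counts.getOrDefault(token, 0);
                // Add-one smoothing keeps unseen tokens from zeroing the score.
                score += Math.log((count + 1.0) / (labelTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                bestLabel = label;
            }
        }
        return bestLabel;
    }

    // Lower-cases the text and splits it on runs of whitespace.
    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }
}
```

After a few `train(label, text)` calls, `predict(text)` returns the most likely label. Working with log probabilities avoids numeric underflow on longer texts, which is the usual reason to sum logs rather than multiply raw probabilities.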

Datumbox Framework VS Mahout VS Scikit-Learn

Both Mahout and Scikit-Learn are great projects, and they have completely different goals. Mahout supports only a very limited number of algorithms, namely those that can be parallelized and can therefore use Hadoop’s Map-Reduce framework to handle Big Data. Scikit-Learn, on the other hand, supports a large number of algorithms but can’t handle huge amounts of data. Moreover, it is developed in Python, which is a great language for prototyping and Scientific Computing but not my personal favourite for software development.

The Datumbox Framework sits in the middle of the two solutions. It tries to support a large number of algorithms and it is written in Java. This means that it can be incorporated more easily into production code, it can be tweaked more easily to reduce memory consumption, and it can be used in real-time systems. Finally, even though the Datumbox Framework is currently capable of handling medium-sized datasets, I plan to expand it to handle large datasets.

How stable is it?

The early versions of the framework (up to 0.3.x) were developed in August and September of 2013 and they were written in PHP (yep!). During May and June 2014 (versions 0.4.x), the framework was rewritten in Java and enhanced with additional features. Both branches were heavily tested in commercial applications, including the Datumbox API. The current version is 0.5.0 and it seems mature enough to be released as the first public alpha version of the framework. Having said that, it is important to note that some functionalities of the framework are tested more thoroughly than others. Moreover, since this version is alpha, you should expect drastic changes in future releases.

Why did I write it and why am I open-sourcing it?

My involvement with Machine Learning and NLP dates back to 2009, when I co-founded WebSEOAnalytics.com. Since then I have been developing implementations of various machine learning algorithms for various projects and applications. Unfortunately, most of the original implementations were very problem-specific and could hardly be used for any other problem. In August 2013 I decided to start Datumbox as a personal project and develop a framework that provides the tools for developing machine learning models, focusing on NLP and Text Classification. My goal was to build a framework that could be reused in the future for quickly developing machine learning models, incorporated into projects that require machine learning components, or offered as a service (Machine Learning as a Service).

And here I am now, several lines of code later, open-sourcing the project. Why? The honest answer is that, at this point, I do not plan to go through a “let’s build a new start-up” journey. At the same time, I felt that keeping the code on my hard disk in case I need it in the future does not make sense. So the only logical thing to do was to open-source it.