Hi, I'm new to machine learning and am looking for a text classification solution. Could someone recommend a good framework written in Java? I was thinking of using WEKA, but I have also heard about MALLET. Which is better, and what are the main differences?
My goal is to classify unlabeled text. To that end I have prepared about 18 topics and roughly 100 texts per topic for training.
What would you recommend? I would also appreciate a small example or a hint on how to proceed.
You have a very small text data set, so you could use any library; it wouldn't really matter. More advanced options would require more data than you have to be meaningful, so they are not worth considering. The standard way to handle text classification problems is a bag-of-words model with a linear classifier. Both Weka and MALLET support this.
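To make the bag-of-words-plus-linear-classifier route concrete, here is a minimal Weka sketch. It is hedged: the file names train.arff and unlabeled.arff are assumptions, and each ARFF file is expected to contain one string attribute holding the text plus a nominal class attribute holding the topic.

    import weka.classifiers.functions.SMO;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class BowTopicClassifier {
        public static void main(String[] args) throws Exception {
            // Labeled training data: one string attribute (the text), class = topic.
            Instances train = new DataSource("train.arff").getDataSet();
            train.setClassIndex(train.numAttributes() - 1);

            // Bag-of-words features + a linear SVM (SMO is linear by default),
            // wrapped in a FilteredClassifier so the same filtering is applied
            // again at prediction time.
            StringToWordVector bow = new StringToWordVector();
            bow.setLowerCaseTokens(true);
            FilteredClassifier model = new FilteredClassifier();
            model.setFilter(bow);
            model.setClassifier(new SMO());
            model.buildClassifier(train);

            // Unlabeled texts with the same attribute structure (class value missing).
            Instances unlabeled = new DataSource("unlabeled.arff").getDataSet();
            unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
            for (int i = 0; i < unlabeled.numInstances(); i++) {
                double pred = model.classifyInstance(unlabeled.instance(i));
                System.out.println(unlabeled.classAttribute().value((int) pred));
            }
        }
    }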
Personally, I find Weka to be a pain and MALLET to be poorly documented (and out of date where documentation exists), so I use JSAT. There is an example of spam classification here.
(bias warning, I'm the author of JSAT).
Since your task is fairly simple and, as you mentioned, you're new to ML, I'd recommend using Weka, as it is easy to use and has a large user community.
Otherwise, here are some general-purpose machine learning frameworks in Java that you can have a look at:
Datumbox - Machine Learning framework for rapid development of Machine Learning and Statistical applications
ELKI - Java toolkit for data mining. (unsupervised: clustering, outlier detection etc.)
H2O - ML engine that supports distributed learning on data stored in HDFS.
htm.java - General Machine Learning library using Numenta’s Cortical Learning Algorithm
java-deeplearning - Distributed Deep Learning Platform for Java, Clojure, Scala
JAVA-ML - A general ML library with a common interface for all algorithms in Java
JSAT - Numerous Machine Learning algorithms for classification, regression, and clustering.
Mahout - Distributed machine learning
Meka - An open source implementation of methods for multi-label classification and evaluation (extension to Weka).
MLlib in Apache Spark - Distributed machine learning library in Spark
Neuroph - Neuroph is a lightweight Java neural network framework
ORYX - Simple real-time large-scale machine learning infrastructure.
RankLib - RankLib is a library of learning to rank algorithms
RapidMiner - RapidMiner integration into Java code
Stanford Classifier - A classifier is a machine learning tool that will take data items and place them into one of k classes.
WalnutiQ - object oriented model of the human brain
Weka - Weka is a collection of machine learning algorithms for data mining tasks
Source: Awesome Machine Learning
I have started using Julia. I read that it is faster than C.
So far I have seen some libraries like Knet and Flux, but both are for deep learning.
There is also the PyCall package to use Python inside Julia.
But I am interested in machine learning too, so I would like to use SVM, random forests, kNN, XGBoost, etc., but in Julia.
Is there a native machine learning library written in Julia?
Thank you
A lot of algorithms are simply available through dedicated packages, like BayesNets.jl.
For "classical machine learning" there is MLJ.jl, a pure-Julia machine learning framework written at the Alan Turing Institute and under very active development.
For neural networks, Flux.jl is the way to go in Julia. It is also very active, GPU-ready, and allows all the exotic combinations that exist in the Julia ecosystem, like DiffEqFlux.jl, a package that combines Flux.jl and DifferentialEquations.jl.
Also keep an eye on Zygote.jl, a source-to-source automatic differentiation package that will serve as a backend for Flux.jl.
Of course, if you're more comfortable with Python ML tools you still have TensorFlow.jl and ScikitLearn.jl, but the OP asked for pure Julia packages and those are just Julia wrappers around Python packages.
Have a look at this kNN implementation and this one for XGBoost.
There are SVM implementations, but they are outdated and unmaintained (search for SVM.jl). But really, consider other algorithms for much better prediction quality and model-construction performance: have a look at the OLS (orthogonal least squares) and OFR (orthogonal forward regression) family of algorithms. You will easily find detailed algorithm descriptions that are easy to code in any suitable language. However, there is currently no Julia implementation I am aware of; I found only Matlab implementations and wrote my own Java implementation some years ago. I have plans to port it to Julia, but that is currently not a priority and may take a while. Meanwhile, why not code it yourself (a small sketch of the basic selection loop follows below)? You won't find another language that makes it easier to turn a prototype into a highly efficient production algorithm running heavy loads on a CUDA-enabled GPU.
To start with, I recommend this fairly recent publication: Nonlinear identification using orthogonal forward regression with nested optimal regularization.
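Since the answer above suggests coding OFR yourself, here is a hedged plain-Java sketch of the classic orthogonal least squares / forward-regression selection loop. It is the unregularized textbook version, not the nested-regularization method from the cited paper, and the class and parameter names are illustrative.

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Minimal sketch of orthogonal forward regression (OFR) term selection:
     * greedily pick candidate regressors by their error reduction ratio (ERR),
     * orthogonalizing each candidate against the terms already chosen.
     */
    public class OfrSketch {

        static double dot(double[] a, double[] b) {
            double s = 0.0;
            for (int i = 0; i < a.length; i++) s += a[i] * b[i];
            return s;
        }

        /** candidates: n samples x m candidate regressors; returns the chosen column indices. */
        public static List<Integer> select(double[][] candidates, double[] y,
                                           int maxTerms, double errTarget) {
            int n = y.length, m = candidates[0].length;
            double yEnergy = dot(y, y);
            List<Integer> chosen = new ArrayList<>();
            List<double[]> orthoBasis = new ArrayList<>(); // orthogonalized selected regressors
            double errSum = 0.0;

            while (chosen.size() < maxTerms && errSum < errTarget) {
                int bestIdx = -1;
                double bestErr = 0.0;
                double[] bestW = null;

                for (int j = 0; j < m; j++) {
                    if (chosen.contains(j)) continue;
                    // Copy candidate column j and orthogonalize it against the selected basis.
                    double[] w = new double[n];
                    for (int i = 0; i < n; i++) w[i] = candidates[i][j];
                    for (double[] basis : orthoBasis) {
                        double proj = dot(basis, w) / dot(basis, basis);
                        for (int i = 0; i < n; i++) w[i] -= proj * basis[i];
                    }
                    double wEnergy = dot(w, w);
                    if (wEnergy < 1e-12) continue; // numerically dependent on chosen terms
                    double g = dot(w, y) / wEnergy;
                    double err = g * g * wEnergy / yEnergy; // error reduction ratio
                    if (err > bestErr) { bestErr = err; bestIdx = j; bestW = w; }
                }
                if (bestIdx < 0) break; // nothing useful left
                chosen.add(bestIdx);
                orthoBasis.add(bestW);
                errSum += bestErr;
            }
            return chosen;
        }
    }

Selection stops once maxTerms regressors are chosen or the accumulated ERR reaches errTarget (e.g. 0.99 of the output energy).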
Machine Learning - what a hoot!
I have a small project in which I would like to identify anomalies in unlabeled data, i.e., unsupervised clustering.
However, the sequence of the data is also important, as a single record may not be of interest, but the sequence of records that precede it may make it anomalous.
So I am thinking of building a Recurrent SOM to add the temporal context.
I have trained a few simple machine learning models using Python GraphLab Create, Azure Machine Learning, and the Encog ML framework, but Azure does not seem to provide unsupervised clustering, so I am leaning towards Encog.
I have looked at recurrent neural networks in Encog, as well as SOMs, but I have no idea how to combine the two. Most of the articles online about feedback/recurrent SOMs are academic.
Are there any good references for doing this with Encog?
A Google search turned up only one good reference for an RSOM in Encog: https://github.com/leadtune/encog-java/blob/master/encog-core/src/org/encog/neural/pattern/RSOMPattern.java
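That pattern class is sparsely documented, so a pragmatic alternative is worth mentioning: instead of a recurrent SOM, you can give an ordinary SOM (or any clusterer) temporal context by embedding a sliding window of the preceding records into each input vector. A minimal plain-Java sketch of that encoding, with no Encog-specific types and illustrative names:

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Sketch: turn a sequence of fixed-length records into windowed vectors so
     * that a standard SOM (or any clustering algorithm) sees temporal context.
     * Each output vector is the concatenation of `window` consecutive records,
     * ending at the current one.
     */
    public class SlidingWindowEncoder {

        public static List<double[]> encode(List<double[]> records, int window) {
            List<double[]> out = new ArrayList<>();
            for (int t = window - 1; t < records.size(); t++) {
                int recordLen = records.get(t).length;
                double[] v = new double[window * recordLen];
                for (int k = 0; k < window; k++) {
                    // oldest record first, current record last
                    System.arraycopy(records.get(t - window + 1 + k), 0,
                                     v, k * recordLen, recordLen);
                }
                out.add(v);
            }
            return out;
        }
    }

The windowed vectors can then be fed to whatever SOM or clustering implementation you settle on; the window length controls how much of the preceding sequence influences each data point.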
Why is there so much interest in deep learning in the NLP and ML communities?
Why do they need approaches to learn complex non-linear relationships?
I guess the most interesting thing about deep learning is its ability to learn high-level features in an unsupervised way.
Deep neural networks have recently shown very strong improvements on computer vision and NLP tasks compared to other machine learning methods that have been popular for longer.
At least in acoustic modelling for speech recognition, deep learning helps us get better features when compared to MFCCs.
For an in-depth look at deep learning and why it is important and interesting, take a look at my article here: http://simonwinder.com/2015/01/what-is-deep-learning/ I have been working on this stuff since the '90s, and it's bizarre to see it take off so suddenly.
For a novice to machine learning, what are the prerequisites for using Apache Mahout efficiently?
I know that a committer to Mahout would need calculus, linear algebra, probability and machine learning before they can contribute anything useful. But does a "User" of Apache Mahout need all of this?
I'm asking this because learning/revising all of the above would take me ages.
Mahout In Action provides a good overview of what you need to know to use Mahout.
Typically, scalable machine learning does not require advanced mathematics for use. It may require serious math to develop, but not necessarily to use.
The primary requirement is that you really understand your data and its origins and what you want to do with it. That understanding doesn't have to come all at once and can be developed over time.
Try to Google the topics below:
Programming Collective Intelligence
Similarity calculation with vectors (see the sketch after this list)
The difference between clustering and classification
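For the "similarity calculation with vectors" item, here is a minimal plain-Java sketch of cosine similarity, one of the measures recommender and clustering code commonly relies on (the example vectors are made up):

    public class CosineSimilarity {

        /** Cosine similarity: dot(a, b) / (||a|| * ||b||), in [-1, 1]. */
        static double cosine(double[] a, double[] b) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                normA += a[i] * a[i];
                normB += b[i] * b[i];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            double[] userA = { 5, 3, 0, 1 };   // e.g. ratings or term counts
            double[] userB = { 4, 0, 0, 1 };
            System.out.println(cosine(userA, userB));  // ~0.86 here: fairly similar
        }
    }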
Does anyone have sentiment analysis experience with the liblinear algorithm? Has anyone used the liblinear-ruby-swig gem?
Please suggest something to start with.
I have used liblinear a lot for other classification tasks, but not for sentiment analysis.
Are you interested in using liblinear specifically, or in doing sentiment analysis? (For the former, there is a small sketch at the end of this answer.)
For simple sentiment analysis look at
https://chrismaclellan.com/blog/sentiment-analysis-of-tweets-using-ruby
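On the liblinear side, here is a hedged sketch of the basic train/predict workflow using the Java port (the de.bwaldvogel liblinear-java library); liblinear-ruby-swig exposes essentially the same concepts, so treat this as an illustration of the workflow rather than the exact Ruby API. The toy documents and feature indices are made up.

    import de.bwaldvogel.liblinear.Feature;
    import de.bwaldvogel.liblinear.FeatureNode;
    import de.bwaldvogel.liblinear.Linear;
    import de.bwaldvogel.liblinear.Model;
    import de.bwaldvogel.liblinear.Parameter;
    import de.bwaldvogel.liblinear.Problem;
    import de.bwaldvogel.liblinear.SolverType;

    public class LiblinearSentimentSketch {
        public static void main(String[] args) {
            // Two toy documents as sparse bag-of-words vectors (1-based feature
            // indices, sorted ascending, as liblinear requires).
            // Labels: 1 = positive, 0 = negative.
            Feature[][] x = {
                { new FeatureNode(1, 2.0), new FeatureNode(3, 1.0) },   // "great great movie"
                { new FeatureNode(2, 1.0), new FeatureNode(4, 1.0) },   // "terrible plot"
            };
            double[] y = { 1, 0 };

            Problem problem = new Problem();
            problem.l = x.length;   // number of training examples
            problem.n = 4;          // number of features
            problem.x = x;
            problem.y = y;

            // L2-regularized logistic regression; C = 1.0, stopping tolerance = 0.01.
            Parameter param = new Parameter(SolverType.L2R_LR, 1.0, 0.01);
            Model model = Linear.train(problem, param);

            Feature[] test = { new FeatureNode(1, 1.0) };  // a document containing "great"
            System.out.println("predicted label: " + Linear.predict(model, test));
        }
    }

The main knobs are the solver type and the regularization constant C; the documents themselves just need to be turned into sparse bag-of-words vectors before training.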
The sad_panda gem (https://rubygems.org/gems/sad_panda) is similar to an R library I have used in the past. It has tools for both polarity and emotion classification of text (as "sadness", "anger", "joy", and a few others).
There is not much work in Ruby for sentiment analysis, or machine learning in general. One of the best machine learning libraries is Weka, so you could consider using it with JRuby.
That said, I have created an entry-level gem, and I am planning to enhance it by porting some of the Weka algorithms to Ruby.