Which data mining tool to use? [closed] - comparison

Can somebody explain the main pros and cons of the best-known open-source data mining tools?
Everywhere I read that RapidMiner, Weka, Orange, and KNIME are the best ones.
look at this blog post
Can somebody give a quick technical comparison in a short bullet list?
My needs are the following:
It should support classification algorithms (Naive Bayes, SVM, C4.5, kNN).
It should be easy to implement in Java.
It should have understandable documentation.
It should have reference production projects or use cases that it is being used in.
Some additional benchmark comparison would be nice, if possible.
Thanks!

Firstly, there are pros and cons for each tool on your list. However, out of your list I would suggest Weka: from my personal experience it is incredibly simple to embed in your own Java application using the weka jar file, and it has its own self-contained tools for data mining.
RapidMiner seems to be a commercial offering with an end-to-end solution; however, the most notable examples of external implementations for RapidMiner are usually in Python or R script, not Java.
Orange offers tools that seem to be targeted primarily at people with less need for custom integration into their own software but a far easier time with user interaction. It is written in Python, its source is available, and user add-ons are supported.
KNIME is another commercial platform offering end-to-end solutions for data mining and analysis, providing all the required tools. It has various good reviews around the internet, but I haven't used it enough to advise you or anyone on its pros and cons.
See here for KNIME vs Weka:
Best data mining tools
As I said, Weka is my personal favourite as a software developer, but I'm sure other people have varying reasons and opinions for choosing one over the other. I hope you find the right solution for you.
Also, per your requirements, Weka supports the following:
Naive Bayes (NaiveBayes)
SVM (SMO)
C4.5 (J48)
kNN (IBk)
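As a rough illustration of how simple the Java embedding is, here is a minimal sketch (the ARFF file path is hypothetical, and the last attribute is assumed to be the class label):

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

import java.util.Random;

public class WekaEmbeddingExample {
    public static void main(String[] args) throws Exception {
        // Load a dataset in Weka's ARFF format (hypothetical path).
        Instances data = new DataSource("my-dataset.arff").getDataSet();
        // Tell Weka which attribute is the class label (here: the last one).
        data.setClassIndex(data.numAttributes() - 1);

        // Any of the classifiers above can be swapped in here:
        // new NaiveBayes(), new SMO(), new J48(), new IBk().
        Classifier classifier = new NaiveBayes();
        classifier.buildClassifier(data);

        // Estimate accuracy with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}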

I have tried Orange and Weka with a 15K-record database and found problems with the memory management in Weka: it needed more than 16 GB of RAM, while Orange could manage the database without using that much. Once Weka reaches the maximum amount of memory it crashes, even if you raise the heap limit in the ini file that tells the Java virtual machine to use more memory.

I recently evaluated many open source projects, comparing and contrasting them with regard to the decision tree machine learning algorithm. Weka and KNIME were included in that evaluation. I covered the differences in algorithms, UX, accuracy, and model inspection. You might choose one or the other depending on which features you value most.

I have had positive experience with RapidMiner:
a large set of machine learning algorithms
machine learning tools - feature selection, parameter grid search, data partitioning, cross validation, metrics
a large set of data manipulation algorithms - input, transformation, output
applicable to many domains - finance, web crawling and scraping, NLP, images (very basic)
extensible - one can send data to and receive data from other technologies: R, Python, Groovy, shell
portable - can be run as a Java process
developer friendly (to some extent, could use some improvements) - logging, debugging, breakpoints, macros
I would have liked to see something like RapidMiner in terms of user experience, but with the underlying engine based on Python technologies: pandas, scikit-learn, spaCy, etc. Preferably, something that would allow moving back and forth between GUI and code.

Related

What are the special features of machine learning source code that differ from other kinds of programs?

I just want to know about this question in detail, from a software engineering point of view. What are the special features of machine learning source code that differ from other kinds of programs? I only know that ML programs have to manipulate many complex data sets and perform many mathematical computations, so such programs are often difficult to debug or to analyze during execution.
I want to find some ideas from software verification that apply to ML programs, so I want to know their special features.

Machine Learning & Big Data [closed]

In the beginning, I would like to describe my current position and the goal that I would like to achieve.
I am a researcher dealing with machine learning. So far I have gone through several theoretical courses covering machine learning algorithms and social network analysis, and have therefore gained some theoretical concepts useful for implementing machine learning algorithms and feeding in real data.
On simple examples the algorithms work well and the running time is acceptable, whereas big data represent a problem when trying to run the algorithms on my PC. Regarding software, I have enough experience to implement any algorithm from articles or design my own using whatever language or IDE (so far I have used Matlab, Java with Eclipse, .NET...), but so far I haven't had much experience with setting up infrastructure. I have started to learn about Hadoop, NoSQL databases, etc., but I am not sure which strategy would be best given my learning-time constraints.
The final goal is to be able to set up a working platform for analyzing big data, focusing on implementing my own machine learning algorithms and putting it all together into production, ready for solving useful questions by processing big data.
As the main focus is on implementing machine learning algorithms, I would like to ask whether there is any existing running platform offering enough CPU resources to feed in large data, upload my own algorithms, and simply process the data without having to think about distributed processing.
Nevertheless, whether such a platform exists or not, I would like to gain a picture big enough to be able to work in a team that could put into production a whole system tailored to specific customer demands. For example, a retailer would like to analyze daily purchases, so all the daily records have to be uploaded to some infrastructure capable of processing the data using custom machine learning algorithms.
To put all of the above into a simple question: how do you design a custom data mining solution for real-life problems, with the main focus on machine learning algorithms, and put it into production, if possible using existing infrastructure and, if not, by designing a distributed system (using Hadoop or whatever framework)?
I would be very thankful for any advice or suggestions about books or other helpful resources.
First of all, your question needs to define more clearly what you mean by Big Data.
Indeed, Big Data is a buzzword that may refer to various sizes of problems. I tend to define Big Data as the category of problems where the data size or the computation time is big enough for "the hardware abstractions to become broken", which means that a single commodity machine cannot perform the computations without careful handling of computation and memory.
The scale threshold beyond which data become Big Data is therefore unclear and is sensitive to your implementation. Is your algorithm bounded by hard-drive bandwidth? Does it have to fit into memory? Did you try to avoid unnecessary quadratic costs? Did you make any effort to improve cache efficiency, etc.?
From several years of experience in running medium to large-scale machine learning challenges (on up to 250 commodity machines), I strongly believe that many problems that seem to require a distributed infrastructure can actually be run on a single commodity machine if the problem is expressed correctly. For example, you mention large-scale data for retailers. I have been working on this exact subject for several years, and I often managed to make all the computations run on a single machine, given a bit of optimisation. My company has been working on a simple custom data format that allows one year of all the data from a very large retailer to be stored within 50 GB, which means a single commodity hard drive can hold 20 years of history. You can have a look, for example, at: https://github.com/Lokad/lokad-receiptstream
From my experience, it is worth spending time trying to optimize the algorithm and memory use so that you can avoid resorting to a distributed architecture. Indeed, distributed architectures come with a triple cost. First of all, the strong knowledge requirements. Secondly, a large complexity overhead in the code. Finally, a significant latency overhead (with the exception of local multi-threaded distribution).
From a practitioner's point of view, being able to perform a given data mining or machine learning algorithm in 30 seconds is one of the key factors to efficiency. I have noticed that when some computations, whether sequential or distributed, take 10 minutes, my focus and efficiency tend to drop quickly, as it becomes much more complicated to iterate and test new ideas quickly. The latency overhead introduced by many of the distributed frameworks is such that you will inevitably be in this low-efficiency scenario.
If the scale of the problem is such that even with strong effort you cannot perform it on a single machine, then I strongly suggest resorting to off-the-shelf distributed frameworks instead of building your own. One of the most well-known frameworks is the MapReduce abstraction, available through Apache Hadoop. Hadoop can be run on clusters of ten thousand nodes, probably much more than you will ever need. If you do not own the hardware, you can "rent" the use of a Hadoop cluster, for example through Amazon Elastic MapReduce.
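To make the abstraction concrete, here is the canonical Hadoop word-count job in Java, kept close to the example shipped with Hadoop (input and output paths are passed on the command line): the map step processes each input chunk independently, and the reduce step aggregates the values emitted for each key.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in the input chunk, independently of other chunks.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum all counts that were emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}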
Unfortunately, the MapReduce abstraction is not suited to all Machine Learning computations.
As far as Machine Learning is concerned, MapReduce is a rigid framework and numerous cases have proved to be difficult or inefficient to adapt to this framework:
– The MapReduce framework is in itself related to functional programming. The Map procedure is applied to each data chunk independently. Therefore, the MapReduce framework is not suited to algorithms where the application of the Map procedure to some data chunks needs the results of the same procedure on other data chunks as a prerequisite. In other words, the MapReduce framework is not suited when the computations between the different pieces of data are not independent and impose a specific chronology.
– MapReduce is designed to provide a single execution of the map and of the reduce steps and does not directly provide iterative calls. It is therefore not directly suited to the numerous machine-learning problems implying iterative processing (Expectation-Maximisation (EM), Belief Propagation, etc.). Implementing these algorithms in a MapReduce framework means the user has to engineer a solution that organizes result retrieval and the scheduling of the multiple iterations, so that each map iteration is launched after the reduce phase of the previous iteration is completed and is fed with the results that phase produced (see the driver sketch after this list).
– Most MapReduce implementations have been designed to address production needs and robustness. As a result, the primary concern of the framework is to handle hardware failures and to guarantee the computation results. MapReduce efficiency is therefore partly lowered by these reliability constraints. For example, the serialization of computation results to hard disks turns out to be rather costly in some cases.
– MapReduce is not suited to asynchronous algorithms.
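To illustrate the iterative-processing point, here is a hypothetical driver sketch: the scheduling lives entirely outside MapReduce, chaining one Hadoop job per iteration (the paths, the iteration count, and the per-iteration mapper/reducer are placeholders, not a real EM implementation).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int maxIterations = 10;                      // placeholder stopping criterion
    Path input = new Path("data/model-iter-0");  // placeholder initial model

    for (int i = 1; i <= maxIterations; i++) {
      Path output = new Path("data/model-iter-" + i);
      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      // job.setMapperClass(...); job.setReducerClass(...);  // algorithm-specific step
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      // Block until the reduce phase of this iteration has completed...
      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }
      // ...and feed its results to the map phase of the next iteration.
      input = output;
    }
  }
}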
The questioning of the MapReduce framework has led to richer distributed frameworks where more control and freedom are left to the framework user, at the price of more complexity for that user. Among these frameworks, GraphLab and Dryad (both based on Directed Acyclic Graphs of computations) are well known.
As a consequence, there is no "one size fits all" framework, just as there is no "one size fits all" data storage solution.
To start with Hadoop, you can have a look at the book Hadoop: The Definitive Guide by Tom White
If you are interested in how large-scale frameworks fit Machine Learning requirements, you may be interested in the second chapter (in English) of my PhD thesis, available here: http://tel.archives-ouvertes.fr/docs/00/74/47/68/ANNEX/texfiles/PhD%20Main/PhD.pdf
If you provide more insight about the specific challenge you want to deal with (type of algorithm, size of the data, time and money constraints, etc.), we probably could provide you a more specific answer.
Edit: another reference that could prove to be of interest: Scaling-up Machine Learning.
I had to implement a couple of data mining algorithms to work with Big Data too, and I ended up using Hadoop.
I don't know if you are familiar with Mahout (http://mahout.apache.org/), which already has several algorithms ready to use with Hadoop.
Nevertheless, if you want to implement your own Algorithm, you can still adapt it to Hadoop's MapReduce paradigm and get good results. This is an excellent book on how to adapt Artificial Intelligence algorithms to MapReduce:
Mining of Massive Datasets - http://infolab.stanford.edu/~ullman/mmds.html
This seems to be an old question. However, given your use case, the main frameworks focusing on machine learning in the Big Data domain are Mahout, Spark (MLlib), H2O, etc. To run machine learning algorithms on Big Data you have to convert them to parallel programs based on the MapReduce paradigm. This is a nice article giving a brief introduction to the major (not all) Big Data frameworks:
http://www.codophile.com/big-data-frameworks-every-programmer-should-know/
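As a rough illustration of how little boilerplate the Spark (MLlib) route needs compared with hand-written MapReduce, here is a minimal sketch; the input path and the model parameters are hypothetical, and the data is assumed to be in LibSVM format:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkMLlibSketch {
  public static void main(String[] args) {
    // local[*] runs on all local cores; on a cluster, spark-submit sets the master.
    SparkSession spark = SparkSession.builder()
        .appName("mllib-sketch")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical path; MLlib reads LibSVM-formatted data out of the box.
    Dataset<Row> training = spark.read().format("libsvm").load("data/training.libsvm");

    // The same code runs locally or distributed; Spark handles the parallelism.
    LogisticRegressionModel model = new LogisticRegression()
        .setMaxIter(10)
        .setRegParam(0.01)
        .fit(training);

    System.out.println("Coefficients: " + model.coefficients());
    spark.stop();
  }
}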
I hope this will help.

Publicly Available Spam Filter Training Set [closed]

I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).
I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.
Here is what I was looking for: http://untroubled.org/spam/
This archive has around a gigabyte of compressed spam messages accumulated from 1998 to 2011. Now I just need to get non-spam email, so I'll query my own Gmail for that using the getmail program and the tutorial at mattcutts.com.
Sure, there's Spambase, which, as far as I'm aware, is the most widely cited spam data set in the machine learning literature.
I have used this data set many times; each time I am impressed by how much effort has been put into the formatting and documentation of this data set.
A few characteristics of the Spambase set:
4601 data points, all complete
each comprised of 58 features (attributes)
each data point is labelled 'spam' or 'no spam'
approx. 40% are labeled spam
all of the features are continuous (vs. discrete)
a representative feature: average continuous sequence of capital letters
Spambase is archived in the UCI Machine Learning Repository; in addition, it's also available on the website for the excellent ML/statistical computation treatise Elements of Statistical Learning by Hastie et al.
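If you end up writing the filter in Java, a minimal sketch for training and cross-validating a naive Bayes classifier on Spambase with Weka could look like the following; the local path to the UCI spambase.data file is hypothetical, and attribute names are auto-generated because the file has no header row:

import java.io.File;
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class SpambaseNaiveBayes {
    public static void main(String[] args) throws Exception {
        // Load the UCI spambase.data file: 4601 rows, 57 continuous features
        // plus a 0/1 spam label, comma-separated, no header row.
        CSVLoader loader = new CSVLoader();
        loader.setNoHeaderRowPresent(true);
        loader.setSource(new File("spambase.data"));   // hypothetical local path
        Instances data = loader.getDataSet();

        // The 0/1 label is read as numeric; classification needs a nominal class.
        NumericToNominal toNominal = new NumericToNominal();
        toNominal.setAttributeIndices("last");
        toNominal.setInputFormat(data);
        data = Filter.useFilter(data, toNominal);
        data.setClassIndex(data.numAttributes() - 1);

        // Train and evaluate a naive Bayes spam filter with 10-fold cross-validation.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}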
SpamAssassin has a public corpus of both spam and non-spam messages, although it hasn't been updated in a few years. Read the readme.html file to learn what's there.
You might consider taking a look at the TREC spam/ham corpus (which I think is the collection of emails from Enron that was made public from the court case). TREC generally runs a bunch of competitive text processing tasks, so it might give you some references for comparison.
The downside is that they're stored in raw mbox format, though there are parsers available in many languages (Apache Tika is a good example).
The webpage isn't TREC, but this seems to be a good overview of the task with links to the data: http://plg.uwaterloo.ca/~gvcormac/spam/
A more modern spam training set can be found at Kaggle. Moreover, you can test the accuracy of your classifier on their website by uploading your results.
I also have an answer: here you can find a daily refreshed Bayesian database for initial training, and also a daily created archive containing captured spam. You will find instructions on how to use it on the site.

Machine learning/information retrieval project

I'm reading towards an M.Sc. in Computer Science and have just completed the first year of the course (it is a two-year course). Soon I have to submit a proposal for the M.Sc. project. I have selected the following topic.
“Suitability of machine learning for document ranking in information retrieval system”. Researchers have been using various machine learning algorithms for ranking documents. So as the first phase of the project I will be doing a complete literature survey and finding out advantages/disadvantages of current approaches. In the second phase of the project I will be proposing a new (modified) algorithm in order to overcome the limitations of current approaches.
Actually, my question is whether this type of project is suitable as an M.Sc. project. Moreover, if somebody has an interesting idea in the information retrieval field, would it be possible to share those ideas with me?
Thanks
Ranking is always the hardest part of any information retrieval system. I think it is a very good topic, but you have to take care to define the scope of the work as soon as possible. You will probably not be able to develop a new IR engine, but rather build a prototype based on, e.g., Apache Lucene.
Currently there are a lot of datasets, including the Stack Overflow data dump, which provide all the information you need to define a rich feature vector (number of points, time, topics mined from previous questions, popularity of a tag, etc.) for your machine learning ranking algorithm. In this part of the work you could, e.g., classify types of features (user-specific features vs. semantic features such as a software name in the title) and perform a series of experiments to learn which features are most important and which are not for a given dataset.
The second direction of such a project could be how to perform the learning efficiently. The reason is the quantity of data within the web or community forums and the rate of change in the forum (this would be important if you use community-specific features), e.g., changes in technologies, new software releases, etc.
There are many other topics related to search and machine learning. The best idea is to search on scholar.google.com for recent survey papers on ranking, machine learning, and search to learn what the state of the art is. The very next step would be to talk with your M.Sc. supervisor.
Good luck!
Everything you said is good and should be done, but you forgot the most important part:
Prove that your algorithm is better and/or faster than other algorithms, with good experiments and maybe some statistics (p-value, confidence interval).
If you do that and convince people that your algorithm is useful you surely will not fail :)
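For the statistics part, here is a minimal sketch of a paired significance test over per-query scores using Apache Commons Math; the two score arrays are made-up illustration data, not real results:

import org.apache.commons.math3.stat.inference.TTest;

public class RankingSignificanceTest {
    public static void main(String[] args) {
        // Hypothetical per-query evaluation scores (e.g., average precision)
        // for a baseline ranker and for the proposed ranker on the same queries.
        double[] baseline = {0.41, 0.35, 0.52, 0.48, 0.30, 0.44, 0.39, 0.55, 0.47, 0.36};
        double[] proposed = {0.45, 0.38, 0.55, 0.50, 0.33, 0.47, 0.42, 0.58, 0.49, 0.40};

        // Paired t-test: are the per-query differences significantly different from zero?
        double pValue = new TTest().pairedTTest(baseline, proposed);
        System.out.printf("paired t-test p-value = %.4f%n", pValue);
    }
}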

General frameworks for preparing training data? [closed]

As a student of computational linguistics, I frequently do machine learning experiments where I have to prepare training data from all kinds of different resources like raw or annotated text corpora or syntactic tree banks. For every new task and every new experiment I write programs (normally in Python and sometimes Java) to extract the features and values I need and transform the data from one format to the other. This usually results in a very large number of very large files and a very large number of small programs which process them in order to get the input for some machine learning framework (like the arff files for Weka).
One needs to be extremely well organised to deal with that, and to program with great care so as not to miss any important peculiarities, exceptions or errors in the tons of data. Many principles of good software design like design patterns or refactoring paradigms are of no big use for these tasks, because things like security, maintainability or sustainability are of no real importance - once the program has successfully processed the data, one doesn't need it any longer. This has gone so far that I even stopped bothering about using classes or functions at all in my Python code and program in a simple procedural way. The next experiment will require different data sets with unique characteristics and in a different format, so their preparation will likely have to be programmed from scratch anyway. My experience so far is that it's not unusual to spend 80-90% of a project's time on the task of preparing training data. Hours and days go by just thinking about how to get from one data format to another. At times, this can become quite frustrating.
Well, you probably guessed that I'm exaggerating a bit, on purpose even, but I'm positive you understand what I'm trying to say. My question, actually, is this:
Are there any general frameworks, architectures, best practices for approaching these tasks? How much of the code I write can I expect to be reusable given optimal design?
I find myself mostly using the textutils from GNU coreutils and flex for corpus preparation, chaining things together in simple scripts, at least when the preparations I need to make are simple enough for regular expressions and trivial filtering etc.
It is still possible to make things reusable; the general rules also apply here. If you program with no regard to best practices and the like and just program procedurally, it is IMHO really no wonder that you have to do everything from scratch when starting a new project.
Even though the format requirements will vary a lot, there are still many common tasks, i.e. tag-stripping, tag-translation, selection, tabulation, and some trivial data harvesting such as the number of tokens, sentences and the like. Programming these tasks aiming for high reusability will pay off, even though it takes longer at first.
I am not aware of any such frameworks--that doesn't mean they aren't out there. I prefer to use my own, which is just a collection of code snippets I've refined/tweaked/borrowed over time and that I can chain together in various configurations depending on the problem. If you already know Python, then I strongly recommend handling all of your data prep in NumPy--as you know, ML data sets tend to be large--thousands of row vectors packed with floats. NumPy is brilliant for that sort of thing. Additionally, I might suggest that for preparing training data for ML, there are a couple of tasks that arise in nearly every such effort and that don't vary a whole lot from one problem to the next. I've given you snippets for these below.
Normalization (scaling & mean-centering your data to avoid overweighting). As I'm sure you know, you can scale -1 to 1 or 0 to 1. I usually choose the latter so that I can take advantage of sparsity patterns. In Python, using the NumPy library:
import numpy as NP

# toy data: 4 rows (data points) x 3 columns (features)
data = NP.linspace(1, 12, 12).reshape(4, 3)
# rescale each column (axis 0) to the [0, 1] range: (x - min) / (max - min)
data_norm = NP.apply_along_axis(
    lambda x: (x - x.min()) / (x.max() - x.min()), 0, data)
Cross-validation (here I've set the default argument to 5, so the test set is 5% and the training set 95%--putting this in a function makes k-fold much simpler):
def divide_data(data, testset_size=5):
    # hold out testset_size percent of the rows as the test set (TE),
    # keep the rest as the training set (TR)
    n_test = max(1, int(round(data.shape[0] * testset_size / 100.)))
    ndx = NP.random.choice(data.shape[0], size=n_test, replace=False)
    TE = data[ndx]
    TR = NP.delete(data, ndx, axis=0)
    return TR, TE
Lastly, here's an excellent case study (IMHO), both clear and complete, showing literally the entire process from collection of the raw data through to the input to the ML algorithm (an MLP in this case). They also provide their code.

Resources