Does anyone know how the C5.0 data mining algorithm is calculated? A reference URL would also be welcome.
No; the full version of the algorithm is a commercial secret of RuleQuest.
C5.0 is a commercial, improved version of C4.5. Some of these improvements are published and some are not; you can read about this in the Weka book. J4.8 is named that way because it is Java-based and its capability sits between C4.5 and C5.0.
You can see a comparison of C4.5 and C5.0 here, and of C4.5, J4.8 and C5.0 here.
I found that a single-threaded version is available under the GPL.
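For intuition about what this family of algorithms computes at each split, here is a minimal pure-Python sketch of the gain-ratio criterion that C4.5 uses (C5.0's exact refinements are, as noted above, not fully published). This is a toy illustration, not RuleQuest's code:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """C4.5 split criterion: information gain divided by split info."""
    n = len(labels)
    # Partition the labels by the value of the candidate feature.
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)  # entropy of the partition itself
    return gain / split_info if split_info > 0 else 0.0

# Toy data: an 'outlook' feature vs. play/don't-play labels.
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(round(gain_ratio(outlook, play), 3))  # 0.421
```

At each node, C4.5 evaluates this ratio for every candidate feature and splits on the one with the highest value; the gain-ratio normalization penalizes features with many distinct values.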
Hello Stack Overflow community,
I am working on an academic project using the Neo4j database, and I need help from members who have worked with Neo4j GDS before in order to find a solution to my problem.
I want to apply a community detection algorithm called "Newman-Girvan", but no algorithm with this name exists in the Neo4j GDS library. I did find an algorithm called "Modularity Optimization"; is it the Newman-Girvan algorithm under a different name, or is it a different algorithm?
Thanks in advance.
I've not used the Newman-Girvan algorithm, but the fact that it's a hierarchical algorithm with a dendrogram output suggests you can use comparable GDS algorithms, specifically Louvain or the newest one, Leiden. Leiden has the advantage of enforcing the generation of intermediate communities. I've used both algorithms with multigraphs; I believe this capability was only introduced in GDS v2.x.
The documentation on the algorithms is at:
https://neo4j.com/docs/graph-data-science/current/
https://neo4j.com/docs/graph-data-science/current/algorithms/alpha/leiden/
Multigraphs:
https://neo4j.com/docs/graph-data-science/current/graph-project-cypher-aggregation/
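On the naming question: Modularity Optimization is not Newman-Girvan renamed. Newman-Girvan repeatedly removes high-betweenness edges and uses modularity only to pick the best cut, while Modularity Optimization (like Louvain and Leiden) maximizes modularity directly. Both, however, are built around the same quantity. A small pure-Python sketch of Newman's modularity Q for an undirected, unweighted toy graph (an illustration, not a GDS call):

```python
def modularity(edges, community):
    """Newman's modularity Q for an undirected, unweighted graph.
    edges: list of (u, v) pairs; community: dict node -> community id."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fraction of edges that fall within a community (the A_ij term).
    q = sum(1.0 for u, v in edges if community[u] == community[v]) / m
    # Subtract the expected within-community fraction under the
    # configuration model: sum over ordered node pairs of k_i*k_j/(2m)^2.
    for u in degree:
        for v in degree:
            if community[u] == community[v]:
                q -= degree[u] * degree[v] / (4.0 * m * m)
    return q

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(round(modularity(edges, community), 3))  # 0.357
```

Algorithms like Modularity Optimization search for the community assignment that maximizes this Q, which is why their output often resembles the best cut of a Newman-Girvan dendrogram even though the procedure is different.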
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Can somebody explain the main pros and cons of the best-known open-source data mining tools?
Everywhere I read that RapidMiner, Weka, Orange, and KNIME are the best ones.
Can somebody give a quick technical comparison as a short bullet list?
My needs are the following:
It should support classification algorithms (Naive Bayes, SVM, C4.5, kNN).
It should be easy to implement in Java.
It should have understandable documentation.
It should have reference production projects or use cases built on it.
Some additional benchmark comparisons, if possible.
Thanks!
First off, there are pros and cons for each tool on your list. From personal experience, however, I would suggest Weka: it is incredibly simple to embed in your own Java application using the Weka JAR file, and it has its own self-contained tools for data mining.
RapidMiner seems to be a commercial product offering an end-to-end solution; however, the most notable examples of external integrations built for RapidMiner are usually in Python and R script, not Java.
Orange offers tools that seem to be targeted primarily at people with less need for custom integration into their own software but a far easier time with user interaction. It is written in Python, the source is available, and user add-ons are supported.
KNIME is another commercial platform offering end-to-end solutions for data mining and analysis, providing all the tools required. It has various good reviews around the internet, but I haven't used it enough to advise you or anyone on its pros and cons.
See here for knime vs weka
Best data mining tools
As I said, Weka is my personal favorite as a software developer, but I'm sure other people have varying reasons and opinions on why to choose one over the other. I hope you find the right solution for you.
Also per your requirements weka supports the following:
Naive Bayes
SVM
C4.5
kNN
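As an aside, the last of those, kNN, is simple enough to sketch in a few lines. This pure-Python toy (an illustration of the algorithm, not Weka code) shows the idea of classifying by majority vote among nearest neighbors:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest training points.
    train: list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda xy: dist(xy[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D training data with two classes.
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((4.0, 4.2), "b"), ((4.1, 3.9), "b")]
print(knn_predict(train, (1.1, 1.0)))  # "a": all three nearest points are class a
```

In Weka the same idea is available out of the box (the IBk classifier), with distance weighting and other refinements layered on top.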
I have tried Orange and Weka with a 15K-record database and found problems with memory management in Weka: it needed more than 16 GB of RAM, while Orange managed the database without using anywhere near that much. Once Weka reaches the maximum amount of memory it crashes, even if you set more memory in the ini file telling the Java virtual machine to use more.
I recently evaluated many open source projects, comparing and contrasting them with regard to the decision tree machine learning algorithm. Weka and KNIME were included in that evaluation. I covered the differences in algorithm, UX, accuracy, and model inspection. You might choose one or the other depending on which features you value most.
I have had positive experience with RapidMiner:
a large set of machine learning algorithms
machine learning tools - feature selection, parameter grid search, data partitioning, cross validation, metrics
a large set of data manipulation algorithms - input, transformation, output
applicable to many domains - finance, web crawling and scraping, nlp, images (very basic)
extensible - one can send data to and receive data from other technologies: R, Python, Groovy, shell
portable - can be run as a java process
developer friendly (to some extent, could use some improvements) - logging, debugging, breakpoints, macros
I would have liked to see something like RapidMiner in terms of user experience, but with the underlying engine based on python technologies: pandas, scikit-learn, spacy etc. Preferably, something that would allow moving back and forth from GUI to code.
The P-scalability measure is a method for measuring the scalability of distributed systems. I have tried hard to find its formula on Google, but could not find it clearly stated anywhere.
If anyone knows it, please explain or provide the formula details, or point me to resources where I can understand it.
Thanks in anticipation.
Firstly, did you find an answer in the meantime?
If not, here is what I found so far:
I found a formula and a brief explanation of P-scalability in the paper Evaluating the Scalability of Distributed Systems.
Reference 11 in that paper is:
P. P. Jogalekar and C. M. Woodside, "A Scalability Metric for Distributed Computing Applications in Telecommunications," Proc. 15th Int'l Teletraffic Congress --- Teletraffic Contributions to the Information Age, pp. 101-110, 1997.
I did not manage to find a freely available version of it online, but it is written by the same authors as Evaluating the Scalability of Distributed Systems. With a university account you could retrieve a copy.
If I find further details, I will edit the answer.
I also reference a related question I asked.
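For what it's worth, the metric in that Jogalekar & Woodside line of work is usually stated as a productivity ratio: productivity F(k) = λ(k)·f(k)/C(k), where λ(k) is throughput at scale factor k, f(k) is the value of each response (typically a function of response time), and C(k) is the cost at that scale; scalability from k1 to k2 is then ψ = F(k2)/F(k1), with ψ near or above 1 meaning the system scales well. A sketch with entirely hypothetical numbers (check the paper itself for the exact definitions):

```python
def productivity(throughput, value_per_response, cost):
    """F(k) = lambda(k) * f(k) / C(k): value delivered per unit cost."""
    return throughput * value_per_response / cost

def scalability(f_k1, f_k2):
    """psi(k1, k2) = F(k2) / F(k1): near or above 1 means the system scales."""
    return f_k2 / f_k1

# Hypothetical measurements at scale factors k=1 and k=2.
f1 = productivity(throughput=100.0, value_per_response=1.0, cost=10.0)  # F(1) = 10.0
f2 = productivity(throughput=190.0, value_per_response=1.0, cost=20.0)  # F(2) = 9.5
print(scalability(f1, f2))  # 0.95 -> mildly sub-linear scaling
```

The point of the value function f is that throughput alone is not enough: a system that doubles throughput while letting response times (and thus the value of each response) collapse is not considered scalable by this metric.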
I have an fMRI dataset for the classification of normal controls and Alzheimer's disease patients. As a newbie, I'm unable to extract features from my dataset. I want to extract activation patterns, GM/WM/CSF volumetric measures, and hemodynamics in numerical form. Please guide me on how and where to start, and please suggest some easy and efficient software for my work. I'll be obliged.
Take a look at the software packages called FSL (FMRIB Software Library) and SPM (Statistical Parametric Mapping).
Each of them can do the kind of analyses you're asking about. However, be warned that none of these analyses are trivial. You should probably read up a bit on the subject, first. The Handbook of Functional MRI Data Analysis is a great place to start for beginners.
As @WeirdAlchemy says, these are many analyses you want to carry out, and all of them non-trivial. You typically learn to do these over weeks at a relevant intensive course, or over months during a neuroscience Master's programme. To answer your question very explicitly:
GM, WM & CSF volumetric measures: you can do this with FSL SIENA, SPM VBM, or AFNI 3Dclust, among others.
"Extract activation patterns" is too vague. In all probability, you likely have task-related BOLD fMRI data and want to perform a general linear model (GLM) analysis. FSL FEAT, SPM fMRI, AFNI and others support this. However, without knowing the experimental design, the nature of the data, and what you want to learn from it, it's hard to be more specific about which tool is appropriate.
"Haemodynamics in numerical form" This can mean a number of things, but if you are thinking about the amount of haemodynamic signal modulation (e.g. Condition led to a 2% change in BOLD signal), you get that out of the GLM analysis mentioned above.
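At its core, the GLM step mentioned above models each voxel's time series as a weighted sum of regressors. A stripped-down, stdlib-only sketch with a single task regressor plus intercept, using made-up numbers (real packages like FSL FEAT or SPM add HRF convolution, filtering, and proper statistics on top of this):

```python
def fit_glm_single_regressor(y, x):
    """Ordinary least squares for y ~ beta * x + intercept (one voxel)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))
    intercept = my - beta * mx
    return beta, intercept

# Hypothetical voxel time series: baseline ~100, rising ~2 units when task is "on".
task = [0, 0, 1, 1, 0, 0, 1, 1]
bold = [100.1, 99.9, 102.0, 102.2, 100.0, 99.8, 101.9, 102.1]
beta, intercept = fit_glm_single_regressor(bold, task)
print(round(100 * beta / intercept, 2), "% signal change")  # roughly a 2% change
```

The fitted beta (signal modulation relative to baseline) is exactly the kind of "haemodynamics in numerical form" the GLM gives you, one value per voxel per condition.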
I'm reading towards an M.Sc. in Computer Science and have just completed the first year of the course (it is a two-year course). Soon I have to submit a proposal for the M.Sc. project, and I have selected the following topic:
"Suitability of machine learning for document ranking in information retrieval systems." Researchers have been using various machine learning algorithms for ranking documents, so as the first phase of the project I will do a complete literature survey and identify the advantages and disadvantages of current approaches. In the second phase of the project I will propose a new (modified) algorithm to overcome the limitations of current approaches.
My question is whether this type of project is suitable as an M.Sc. project. Moreover, if somebody has an interesting idea in the information retrieval field, would you be willing to share it with me?
Thanks
Ranking is always the hardest part of any information retrieval system. I think it is a very good topic, but you have to take care to define the scope of the work as soon as possible. You will probably not be able to develop a new IR engine, but rather build a prototype based on, e.g., Apache Lucene.
Currently there are a lot of datasets, including the Stack Overflow data dump, which provide all the information you need to define a rich feature vector (number of points, time, topics mined from previous questions, popularity of a tag) for your machine learning ranking algorithm. In this part of the work you could, e.g., classify types of features (user-specific features, semantic features such as a software name in the title) and perform a series of experiments to learn which features are most important for a given dataset and which are not.
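To make the feature-vector idea concrete, here is a minimal pointwise ranking sketch. The feature names and weights are entirely hypothetical; in a real system the weights would be learned from relevance judgments rather than hand-set:

```python
def score(features, weights):
    """Linear pointwise ranking score: dot product of features and weights."""
    return sum(f * w for f, w in zip(features, weights))

def rank(candidates, weights):
    """Sort candidates by descending ranking score."""
    return sorted(candidates, key=lambda c: score(c["features"], weights), reverse=True)

# Hypothetical features: [answer points, author reputation (log-scaled), tag popularity]
candidates = [
    {"id": "ans1", "features": [12.0, 3.5, 0.8]},
    {"id": "ans2", "features": [30.0, 2.0, 0.4]},
    {"id": "ans3", "features": [5.0, 4.0, 0.9]},
]
weights = [0.1, 0.5, 1.0]  # would be fit by a learning-to-rank method in practice
print([c["id"] for c in rank(candidates, weights)])  # ['ans2', 'ans1', 'ans3']
```

The interesting research questions then live in how the weights are learned (pointwise, pairwise, or listwise objectives) and which features actually matter, which is exactly the experimental study suggested above.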
A second direction for such a project could be how to perform the learning efficiently. The reason is the sheer quantity of data on the web and in community forums, and the way forums change over time (important if you use community-specific features), e.g., changes in technologies, new software releases, etc.
There are many other topics related to search and machine learning. The best idea is to search on scholar.google.com for recent survey papers on ranking, machine learning, and search to learn the state of the art. The very next step would be to talk with your M.Sc. supervisor.
Good luck!
Everything you said is good and should be done, but you forgot the most important part:
Prove that your algorithm is better and/or faster than other algorithms, with good experiments and maybe some statistics (p-values, confidence intervals).
If you do that and convince people that your algorithm is useful, you surely will not fail :)
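Even a simple paired comparison goes a long way here: run both algorithms on the same cross-validation folds and put a confidence interval on the per-fold differences. A stdlib-only sketch using the normal approximation, with made-up accuracy numbers (for a handful of folds a t-distribution critical value would be more appropriate than 1.96):

```python
from math import sqrt
from statistics import mean, stdev

def paired_diff_ci(scores_a, scores_b, z=1.96):
    """Approximate 95% CI for the mean difference (A - B) over paired runs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    m = mean(diffs)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    return m - half_width, m + half_width

# Hypothetical accuracy per cross-validation fold for two algorithms.
a = [0.81, 0.83, 0.80, 0.84, 0.82]
b = [0.78, 0.80, 0.79, 0.80, 0.79]
lo, hi = paired_diff_ci(a, b)
print(lo > 0)  # True: the interval excludes 0, so A looks genuinely better
```

If the interval contains 0, the observed improvement could plausibly be noise, which is exactly the claim a thesis examiner will probe.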