OpenNLP vs CoreNLP: market reach, popularity, parsing

I am doing a comparative study of open-source NLP tools and have an overview of the features/services of the OpenNLP and CoreNLP engines. Recently I have seen no contributions to the OpenNLP forum, whereas the CoreNLP forum is still active. So I wanted to understand: has Stanford CoreNLP become more popular, and is it widely used in commercial applications? Does anyone have insight into this?

Apache OpenNLP is actively developed. Take a look at the commit history [1]: there are commits almost every day by different contributors, and they cut four releases this year (1.7.0, 1.7.1, 1.7.2, and just recently 1.8.0).
OpenNLP is licensed under the business-friendly Apache License 2.0, whereas CoreNLP is licensed under the GPL, which is difficult to use in commercial software (e.g., distributed software that includes it must be released under the GPL as well), although Stanford sells commercial licenses.
OpenNLP is developed mostly by companies that run it in their production systems, whereas CoreNLP is developed by researchers at Stanford.
CoreNLP has quite a few dependencies that get pulled into your project, whereas OpenNLP has zero dependencies.
OpenNLP can support you with the following tasks:
Sentence Detection
Tokenization
Chunking
Named Entity Recognition
POS Tagging
Parsing
Stemming
Language Modeling
Lemmatization
Document classification
OpenNLP is highly customizable, easy to train on user data, has support for training on many publicly available corpora and features built-in evaluation to measure performance of every component.
CoreNLP supports these tasks:
Sentence Detection
Tokenization
Named Entity Recognition
POS Tagging
Parsing (also dependency parsing)
Sentiment
Coreference
Lemmatization
Relation Extraction
[1] https://github.com/apache/opennlp/commits/master
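The "built-in evaluation" mentioned for OpenNLP boils down to reporting precision, recall, and F1 of each component against reference annotations. As a rough plain-Java sketch of that metric (the span encoding here is made up for illustration and is not OpenNLP's API):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EvalSketch {
    // F1 over predicted vs. reference annotations - the metric that
    // evaluators for components like an NER tagger typically report.
    static double f1(Set<String> predicted, Set<String> reference) {
        Set<String> correct = new HashSet<>(predicted);
        correct.retainAll(reference);                  // true positives
        double precision = predicted.isEmpty() ? 0 : (double) correct.size() / predicted.size();
        double recall    = reference.isEmpty() ? 0 : (double) correct.size() / reference.size();
        return (precision + recall == 0) ? 0 : 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // One of two predicted entity spans matches the gold annotations.
        Set<String> predicted = new HashSet<>(List.of("0-5:PER", "10-16:LOC"));
        Set<String> reference = new HashSet<>(List.of("0-5:PER", "20-27:ORG"));
        System.out.println(f1(predicted, reference)); // precision 0.5, recall 0.5
    }
}
```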

Related

Is there a native library written in Julia for Machine Learning?

I have started using Julia. I read that it is faster than C.
So far I have seen some libraries like KNET and Flux, but both are for Deep Learning.
There is also a package, PyCall, to use Python inside Julia.
But I am interested in Machine Learning too. So I would like to use SVM, Random Forest, KNN, XGBoost, etc but in Julia.
Is there a native library written in Julia for Machine Learning?
Thank you
A lot of algorithms are plainly available via dedicated packages, like BayesNets.jl.
For "classical machine learning" there is MLJ.jl, a pure-Julia machine learning framework written by the Alan Turing Institute, with very active development.
For neural networks, Flux.jl is the way to go in Julia. It is also very active, GPU-ready, and allows all the exotic combinations that exist in the Julia ecosystem, like DiffEqFlux.jl, a package that combines Flux.jl and DifferentialEquations.jl.
Also keep an eye on Zygote.jl, a source-to-source automatic differentiation package that will serve as a backend for Flux.jl.
Of course, if you are more comfortable with Python ML tools you still have TensorFlow.jl and ScikitLearn.jl, but the OP asked for pure Julia packages and those are just Julia wrappers around Python packages.
Have a look at this kNN implementation and this one for XGBoost.
There are SVM implementations, but they are outdated and unmaintained (search for SVM.jl). But, really, consider other algorithms for much better prediction quality and model-construction performance. Have a look at the OLS (orthogonal least squares) and OFR (orthogonal forward regression) algorithm family. You will easily find detailed algorithm descriptions that are easy to code in any suitable language. However, there is currently no Julia implementation I am aware of. I found only Matlab implementations and made my own Java implementation some years ago. I have plans to port it to Julia, but that currently has no priority and may take some years. Meanwhile, why not code it yourself? You won't find any other language that makes it easier to code a prototype and turn it into a highly efficient production algorithm running heavy loads on a CUDA-enabled GPGPU.
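Since the answer above suggests coding algorithms yourself, here is how small the core of kNN really is. This is a plain-Java sketch (not Julia, and not any package's API; the logic ports almost line-for-line to Julia):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KnnSketch {
    // Classify a query point by majority vote among its k nearest
    // training points under Euclidean distance. Purely illustrative.
    static int classify(double[][] xs, int[] ys, double[] q, int k) {
        Integer[] idx = new Integer[xs.length];
        for (int i = 0; i < xs.length; i++) idx[i] = i;
        // Sort training indices by distance to the query point.
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(xs[i], q)));
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(ys[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] xs = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        int[] ys = {0, 0, 1, 1};
        // Query near the second cluster: two of the three nearest
        // neighbours carry label 1, so the vote returns 1.
        System.out.println(classify(xs, ys, new double[]{5.5, 5.0}, 3));
    }
}
```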
To start with, I recommend this fairly new publication: Nonlinear identification using orthogonal forward regression with nested optimal regularization.

Infer.NET vs ML.NET (Microsoft Machine Learning Frameworks)

Since Microsoft has become more open-source friendly, I have started to see more and more emphasis on machine learning, such as ML.NET and Infer.NET.
I want to know the difference between the two, since both come from Microsoft. What are the pros and cons of each framework?
From MSDN:
https://blogs.msdn.microsoft.com/dotnet/2018/10/08/announcing-ml-net-0-6-machine-learning-net/
Infer.NET is now open-source and becoming part of the ML.NET family
On October 5th 2018, Microsoft Research announced the open-sourcing of
Infer.NET – a cross-platform framework for model-based machine
learning.
Infer.NET differs from traditional machine learning frameworks in that
it requires users to specify a statistical model of their problem.
This allows for high interpretability, incorporating domain knowledge,
doing unsupervised/semi-supervised learning, as well as online
inference – the ability to learn as new data arrives. The approach and
many of its applications are described in our free online book for
beginners.
Places where Infer.NET is used at Microsoft include TrueSkill – a
skill rating system for matchmaking in Halo and Gears of War, Matchbox
– a recommender system in Azure Machine Learning, and Alexandria –
automatic knowledge base construction for Satori, to name a few.
We’re working with the Infer.NET team to make it part of the ML.NET
family. Steps already taken in this direction include releasing under
the .NET Foundation and changing the package name and namespaces to
Microsoft.ML.Probabilistic.
In short: ML.NET is a general-purpose machine learning framework aimed at .NET developers, while Infer.NET focuses on model-based (probabilistic) machine learning; both are open source, and Infer.NET is becoming part of the ML.NET family.
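To make "model-based machine learning" and online inference concrete, here is a deliberately tiny sketch in plain Java (not Infer.NET's actual API): a Beta-Bernoulli coin model whose posterior is updated in closed form as each observation arrives, which is the "learn as new data arrives" idea from the announcement, in miniature.

```java
public class OnlineBetaBernoulli {
    // Beta(alpha, beta) prior over a coin's bias. Each observation
    // triggers a conjugate posterior update - online inference in
    // its simplest possible form.
    double alpha, beta;

    OnlineBetaBernoulli(double alpha, double beta) {
        this.alpha = alpha;
        this.beta = beta;
    }

    void observe(boolean heads) {      // conjugate Beta-Bernoulli update
        if (heads) alpha += 1; else beta += 1;
    }

    double posteriorMean() {           // E[bias | data seen so far]
        return alpha / (alpha + beta);
    }

    public static void main(String[] args) {
        OnlineBetaBernoulli model = new OnlineBetaBernoulli(1, 1); // uniform prior
        boolean[] data = {true, true, false, true};
        for (boolean d : data) model.observe(d); // posterior sharpens with each point
        System.out.println(model.posteriorMean());
    }
}
```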

Which datamining tool to use? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Can somebody explain to me the main pros and cons of the best-known open-source data mining tools?
Everywhere I read that RapidMiner, Weka, Orange, KNIME are the best ones.
Can somebody give a quick technical comparison in a short bullet list?
My needs are the following:
It should support classification algorithms (Naive Bayes, SVM, C4.5, kNN).
It should be easy to implement in Java.
It should have understandable documentation.
It should have reference production projects or use cases built on it.
Some additional benchmark comparisons, if possible.
Thanks!
First, there are pros and cons for each tool on your list. However, out of that list I would suggest Weka: from my personal experience it is incredibly simple to embed in your own Java application using the weka jar file, and it has its own self-contained tools for data mining.
RapidMiner seems to be a commercial product offering an end-to-end solution; however, the most notable examples of external implementations built on RapidMiner are usually in Python or R script, not Java.
Orange offers tools that seem to be targeted primarily at people with less need for custom integration into their own software but who want an easier time with user interaction. It is written in Python, the source is available, and user add-ons are supported.
KNIME is another commercial platform offering end-to-end solutions for data mining and analysis, providing all the tools required. It has various good reviews around the internet, but I haven't used it enough to advise you or anyone on its pros and cons.
See here for KNIME vs Weka
Best data mining tools
As I said, Weka is my personal favorite as a software developer, but I'm sure other people have varying reasons and opinions on why to choose one over the other. I hope you find the right solution for you.
Also, per your requirements, Weka supports the following:
Naive Bayes
SVM
C4.5
KNN
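To give a feel for the first algorithm on that list, here is a miniature Bernoulli Naive Bayes with Laplace smoothing in plain Java. It is a stripped-down illustration of the idea behind a classifier like Weka's NaiveBayes, not Weka's actual implementation or API:

```java
public class NaiveBayesSketch {
    // Bernoulli Naive Bayes over binary features with two classes (0/1).
    // For each class: log(prior) + sum of smoothed log-likelihoods,
    // then pick the class with the highest score.
    static int classify(int[][] x, int[] y, int[] q) {
        int n = x.length, d = q.length;
        double best = Double.NEGATIVE_INFINITY;
        int bestC = -1;
        for (int c = 0; c <= 1; c++) {
            int nc = 0;
            for (int yi : y) if (yi == c) nc++;
            double logp = Math.log((double) nc / n);        // class prior
            for (int j = 0; j < d; j++) {
                int cnt = 0;
                for (int i = 0; i < n; i++)
                    if (y[i] == c && x[i][j] == q[j]) cnt++;
                logp += Math.log((cnt + 1.0) / (nc + 2.0)); // Laplace-smoothed
            }
            if (logp > best) { best = logp; bestC = c; }
        }
        return bestC;
    }

    public static void main(String[] args) {
        // Tiny training set where the class follows the first feature.
        int[][] x = {{1, 1}, {1, 0}, {0, 1}, {0, 0}};
        int[] y  = { 1,      1,      0,      0 };
        System.out.println(classify(x, y, new int[]{1, 1}));
    }
}
```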
I tried Orange and Weka with a database of 15K records and ran into memory-management problems with Weka: it needed more than 16 GB of RAM, while Orange managed the same database without using anywhere near that much. Once Weka reaches the maximum amount of memory it crashes, even if you set a higher limit in the ini file (which raises the Java virtual machine's -Xmx heap limit).
I recently evaluated many open-source projects, comparing and contrasting them with regard to the decision tree machine learning algorithm. Weka and KNIME were included in that evaluation. I covered the differences in algorithm, UX, accuracy, and model inspection. You might choose one or the other depending on which features you value most.
I have had positive experience with RapidMiner:
a large set of machine learning algorithms
machine learning tools - feature selection, parameter grid search, data partitioning, cross validation, metrics
a large set of data manipulation algorithms - input, transformation, output
applicable to many domains - finance, web crawling and scraping, NLP, images (very basic)
extensible - one can send and receive data to and from other technologies: R, Python, Groovy, shell
portable - can be run as a java process
developer friendly (to some extent, could use some improvements) - logging, debugging, breakpoints, macros
I would have liked to see something like RapidMiner in terms of user experience, but with the underlying engine based on Python technologies: pandas, scikit-learn, spaCy, etc. Preferably something that would allow moving back and forth between GUI and code.

The intersection of Machine Learning and Programming Languages fields

While my research area is in Machine Learning (ML), I am required to take a project in Programming Languages (PL). Therefore, I'm looking to find a project that is inclined towards ML.
One intersection I know of between the two fields is Natural Language Processing (NLP), but I couldn't find concrete papers in that topic that are related to PL; perhaps due to my poor choice of keywords in the search query.
The main topics in the PL course are: Syntax & Semantics, Static Program Analysis, Functional Programming, and Concurrency and Logic Programming.
If you could suggest papers or keywords that are Machine Learning enthusiast friendly, that would be highly appreciated!
Another very important intersection in these fields is probabilistic programming languages, which provide probabilistic inference over models specified as actual computer programs. It's a growing research field, including a recently started DARPA program on this topic.
If you are interested in NLP, then I would focus on two aspects of listed PL disciplines:
Syntax & Semantics - this is very closely related to the NLP field, where in most cases understanding is based on various language grammars. Searching for papers on language modeling, information extraction, or deep parsing will yield dozens of great research topics that are heavily related to syntax/semantics problems.
Logic programming - "in the good old years" people believed that this was the future of AI. Even though that is not (currently) true, it is still quite widely used for reasoning in some fields. In particular, Prolog is a good example of a language that can be used to reason (for example, spatial-temporal reasoning) or even to parse language (thanks to its grammar-like productions).
If you wish to tackle a more ML-related problem rather than NLP, you could focus on concurrency (parallelism), as it is a very hot topic: making ML models more scalable, more efficient, "bigger, faster, stronger" ;) Just look up keywords like GPU machine learning, large-scale machine learning, scalable machine learning, etc.
I also happen to know that there's a project at the University of Edinburgh on using machine learning to analyse source code. Here's the first publication that came out of it.

Reference the url address to learn data mining algorithm C5.0

Does anyone know how the C5.0 data mining algorithm is computed? A reference URL would be fine.
No, the full version of the algorithm is a commercial secret of RuleQuest.
C5.0 is a commercial, improved version of C4.5. Some of these improvements are published, some of them are not. You can read about this in the Weka book. J4.8 is named so because it is Java-based and its capability lies between C4.5 and C5.0.
You can see a comparison of C4.5 and C5.0 here, and of C4.5, J4.8, and C5.0 here.
I found that a single-threaded version is available under the GPL.