While my research area is in Machine Learning (ML), I am required to take a project in Programming Languages (PL). Therefore, I'm looking to find a project that is inclined towards ML.
One intersection I know of between the two fields is Natural Language Processing (NLP), but I couldn't find concrete papers in that topic that are related to PL; perhaps due to my poor choice of keywords in the search query.
The main topics in the PL course are : Syntax & Symantics, Static Program Analysis, Functional Programming, and Concurrency and Logic programming
If you could suggest papers or keywords that are Machine Learning enthusiast friendly, that would be highly appreciated!
Another very important intersection in these fields is probabilistic programming languages, which provide probabilistic inference over models specified as actual computer programs. It's a growing research field, including a recently started DARPA program on this topic.
If you are interested in NLP, then I would focus on two aspects of listed PL disciplines:
Syntax & Semantics - as this is incredibly closely realted to the NLP field, where in most cases the understanding is based on the various language grammars. Searching for papers regarding language modeling, information extraction, deep parsing would yield dozens of great research topics which are heavil related to the sytax/semantics problems.
logic programming -"in good old years" people believed that this is a future of AI, even though it is not (currently) true, it is still quite widely used forreasoning in some fields. In particular, prolog is a good example of language that can be used to reson (for example spatial-temporal reasoning) or even parse language (due to its "grammar like" productions).
If you wish to tackle some more ML related problem rather then NLP then you could focus on concurrency (parallelism) as it is very hot topic - making ML models more scalable, more efficient, "bigger, faster, stronger" ;) Just lookup keywords like GPU Machine Learning, large scale machine learning, scalable machine learning etc.
I also happen to know that there's a project at the University of Edinburgh on using machine learning to analyse source code. Here's the first publication that came out of it
Related
Are there any known approaches of making a machine learn calculus?
I've learnt that it is quite simple to teach calculating derivatives because it is possible to implement an algorithm.
Meanwhile, an implementation of integration is possible but is rarely or never fully implemented due to the algorithmic complexity.
I am curious whether there are any academic successes in the field of using machine learning science to evaluate and calculate integrals.
Edit
I am interested in teaching a computer to integrate using neural networks or similar methods.
My personal opinion it is not possible to feed into NN enough rules for integrating. Why? Because NN are good for linear regression ( AKA approximation ) or logical regression ( AKA classification ). Integration is neither of them. It is calculation task according to some strict algorithms. So from this prospective it's good idea to use some mathematical ways to integrate.
Update on 2020-10-23
Right now I'm in position of being ashamed by new developments according to news. Facebook recently announced that they developed some kind of AI, which is good in solving integrations.
There quite a few number of maths software that will compute derivatives and integral calculus for you. Some of the popular software include MATLAB, Maple, Mathematica, etc. These software will help you learn quite easily.
As for you making a machine learn calculus ...
You can read up on the following on wikipedia or other books,
Newton's Method - Solve the roots of a function numerically
Monte Carlo Integration - uses RNG to compute numeric integration
Runge Kutta Method - Solves ODE's iteratively
There are many more. These are just the ones I was taught in undergraduate school. They are also fairly simple to understand, depending on your level of academia. But in general, people have been try to numerically compute solutions to models since Newton. Computers have just made everything a lot easier.
Dear all I am working on a project in which I have to categories research papers into their appropriate fields using titles of papers. For example if a phrase "computer network" occurs somewhere in then title then this paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers. So I want to know how I should start. I have tried to use tf-idf but could not get actual results. Does someone know about a library to do this task easily? Kindly suggest one. I shall be thankful.
If you don't know categories in advance, than it's not classification, but instead clustering. Basically, you need to do following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has enormous number of implementations, even in libraries not specialized in ML. Another popular choice is Expectation-Maximization (EM) algorithm. Both of them, however, require initial guess about number of classes. If you can't predict number of classes even approximately, other algorithms - such as hierarchical clustering or DBSCAN - may work for you better (see discussion here).
As for features, words themselves normally work fine for clustering by topic. Just tokenize your text, normalize and vectorize words (see this if you don't know what it all means).
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of tasks, but if you have another language of preference, you most probably will be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. For starting out, you can maybe try a simple word frequency model (bag of words) and later on move to more complex feature extraction methods (string kernels). You can start by using SVM's (Support Vector Machines) to classify the data using LibSVM (the best SVM package).
The fact, that you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover the clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you, to give it a try.
I want to teach myself enough machine learning so that I can, to begin with, understand enough to put to use available open source ML frameworks that will allow me to do things like:
Go through the HTML source of pages
from a certain site and "understand"
which sections form the content,
which the advertisements and which
form the metadata ( neither the
content, nor the ads - for eg. -
TOC, author bio etc )
Go through the HTML source of pages
from disparate sites and "classify"
whether the site belongs to a
predefined category or not ( list of
categories will be supplied
beforhand )1.
... similar classification tasks on
text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, taking the neural net approach will take a lot of training and maintainance than putting SVMs to use?
I understand that SVMs are well suited to ( binary ) classification tasks like mine, and open source framworks like libSVM are fairly mature?
In that case, what subjects and topics
does a computer science graduate need
to learn right now, so that the above
requirements can be solved, putting
these frameworks to use?
I would like to stay away from Java, is possible, and I have no language preferences otherwise. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations from you on learning specific portions of statistics and probability theory is nothing unexpected from my side, so say that if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learn is the equivalent of having a model. The model can be for example a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods work best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will like to preprocess the data first, this step is also called "feature extraction". That is, you extract information out of the page you're looking at. This is a difficult step, because it requires domain knowledge and you'll have to extract useful information, or otherwise your classifiers will not be able to make good distinctions. Feature extraction will give you a dataset (a matrix with features for each row) from which you'll be able to create your model.
Generally in machine learning it is advised to also keep a "test set" that you do not train your models with, but that you will use at the end to decide on what is the best method. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint on the "generalization error" that your model is making. Any model with enough complexity and learning time tends to learn exactly the information that you train it with. Machine learners say that the model "overfits" the training data. Such overfitted models seem to appear good, but this is just memorization.
While software support for preprocessing data is very sparse and highly domain dependent, as adam mentioned Weka is a good free tool for applying different methods once you have your dataset. I would recommend reading several books. Vladimir Vapnik wrote "The Nature of Statistical Learning Theory", he is the inventor of SVMs. You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of the terminology might be helpful to you in finding your way around.
Seems like a pretty complicated task to me; step 2, classification, is "easy" but step 1 seems like a structure learning task. You might want to simplify it to classification on parts of HTML trees, maybe preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there aren't also a lot of tools available for it as well.
For text-based classification right now Naive Bayes, Decision Trees (J48 in particular I think), and SVM approaches are giving the best results. However they are each more suited for slightly different applications. Off the top of my head I'm not sure which would suit you the best. With a tool like WEKA you could try all three approaches with some example data without writing a line of code and see for yourself.
I tend to shy away from Neural Networks simply because they can get very very complicated quickly. Then again, I haven't tried a large project with them mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are now entering machine learning field, so I'd really like to suggest to have a look at this book: not only it provides a deep and vast overview on the most common machine learning approaches and algorithms (and their variations) but it also provides a very good set of exercises and scientific paper links. All of this is wrapped in an insightful language starred with a minimal and yet useful compendium about statistics and probability
I'm mainly just looking for a discussion of approaches on how to go from decentralized, non-normalized, completely open user-submitted tags, to start making sense of all of it through combining them into those semantic groups they called "clusters".
Does it take actual people to figure out what people actually mean by the tags used, or can it be done simply by automatically analyzing how often the tags go together?
That kind of stuff. Feel free to elaborate wildly :) (Also, if this has been discussed elsewhere, I'd love to hear about it).
Read this article: Automated Tag Clustering. It provides a good overview of the existing approaches and describes the algorithms for tag clustering.
Algorithms of the Intelligent Web (Manning) (esp. Chapter 4) and a book with a similar title from O'Reilly cover clustering algorithms. The Manning book starts with naive SQL approaches and moves to K-means, ROCK, and DBSCAN. It's more generalized than just focusing on tags, but easy to apply in that context. Code is presented in Java but is easily adapted to Ruby (sometimes more easily than adapting the Java code to your problem).
Chapter 5 covers classifications, which is about building topologies, and discusses Bayesian algorithms.
I am asking this question because I know there are a lot of well-read CS types on here who can give a clear answer.
I am wondering if such an AI exists (or is being researched/developed) that it writes programs by generating and compiling code all on it's own and then progresses by learning from former iterations. I am talking about working to make us, programmers, obsolete. I'm imagining something that learns what works and what doesn't in a programming languages by trial and error.
I know this sounds pie-in-the-sky so I'm asking to find out what's been done, if anything.
Of course even a human programmer needs inputs and specifications, so such an experiment has to have carefully defined parameters. Like if the AI was going to explore different timing functions, that aspect has to be clearly defined.
But with a sophisticated learning AI I'd be curious to see what it might generate.
I know there are a lot of human qualities computers can't replicate like our judgement, tastes and prejudices. But my imagination likes the idea of a program that spits out a web site after a day of thinking and lets me see what it came up with, and even still I would often expect it to be garbage; but maybe once a day I maybe give it feedback and help it learn.
Another avenue of this thought is it would be nice to give a high-level description like "menued website" or "image tools" and it generates code with enough depth that would be useful as a code completion module for me to then code in the details. But I suppose that could be envisioned as a non-intelligent static hierarchical code completion scheme.
How about it?
Such tools exist. They are the subject of a discipline called Genetic Programming. How you evaluate their success depends on the scope of their application.
They have been extremely successful (orders of magnitude more efficient than humans) to design optimal programs for the management of industrial process, automated medical diagnosis, or integrated circuit design. Those processes are well constrained, with an explicit and immutable success measure, and a great amount of "universe knowledge", that is a large set of rules on what is a valid, working, program and what is not.
They have been totally useless in trying to build mainstream programs, that require user interaction, because the main item a system that learns needs is an explicit "fitness function", or evaluation of the quality of the current solution it has come up with.
Another domain that can be seen in dealing with "program learning" is Inductive Logic Programming, although it is more used to provide automatic demonstration or language / taxonomy learning.
Disclaimer: I am not a native English speaker nor an expert in the field, I am an amateur - expect imprecisions and/or errors in what follow. So, in the spirit of stackoverflow, don't be afraid to correct and improve my prose and/or my content. Note also that this is not a complete survey of automatic programming techniques (code generation (CG) from Model-Driven Architectures (MDAs) merits at least a passing mention).
I want to add more to what Varkhan answered (which is essentially correct).
The Genetic Programming (GP) approach to Automatic Programming conflates, with its fitness functions, two different problems ("self-compilation" is conceptually a no-brainer):
self-improvement/adaptation - of the synthesized program and, if so desired, of the synthesizer itself; and
program synthesis.
w.r.t. self-improvement/adaptation refer to Jürgen Schmidhuber's Goedel machines: self-referential universal problem solvers making provably optimal self-improvements. (As a side note: interesting is his work on artificial curiosity.) Also relevant for this discussion are Autonomic Systems.
w.r.t. program synthesis, I think is possible to classify 3 main branches: stochastic (probabilistic - like above mentioned GP), inductive and deductive.
GP is essentially stochastic because it produces the space of likely programs with heuristics such as crossover, random mutation, gene duplication, gene deletion, etc... (than it tests programs with the fitness function and let the fittest survive and reproduce).
Inductive program synthesis is usually known as Inductive Programming (IP), of which Inductive Logic Programming (ILP) is a sub-field. That is, in general the technique is not limited to logic program synthesis or to synthesizers written in a logic programming language (nor both are limited to "..automatic demonstration or language/taxonomy learning").
IP is often deterministic (but there are exceptions): starts from an incomplete specification (such as example input/output pairs) and use that to constraint the search space of likely programs satisfying such specification and then to test it (generate-and-test approach) or to directly synthesize a program detecting recurrences in the given examples, which are then generalized (data-driven or analytical approach). The process as a whole is essentially statistical induction/inference - i.e. considering what to include into the incomplete specification is akin to random sampling.
Generate-and-test and data-driven/analytical§ approaches can be quite fast, so both are promising (even if only little synthesized programs are demonstrated in public until now), but generate-and-test (like GP) is embarrassingly parallel and then notable improvements (scaling to realistic program sizes) can be expected. But note that Incremental Inductive Programming (IIP)§, which is inherently sequential, has demonstrated to be orders of magnitude more effective of non-incremental approaches.
§ These links are directly to PDF files: sorry, I am unable to find an abstract.
Programming by Demonstration (PbD) and Programming by Example (PbE) are end-user development techniques known to leverage inductive program synthesis practically.
Deductive program synthesis start with a (presumed) complete (formal) specification (logic conditions) instead. One of the techniques leverage automated theorem provers: to synthesize a program, it constructs a proof of the existence of an object meeting the specification; hence, via Curry-Howard-de Bruijn isomorphism (proofs-as-programs correspondence and formulae-as-types correspondence), it extracts a program from the proof. Other variants include the use of constraint solving and deductive composition of subroutine libraries.
In my opinion inductive and deductive synthesis in practice are attacking the same problem by two somewhat different angles, because what constitute a complete specification is debatable (besides, a complete specification today can become incomplete tomorrow - the world is not static).
When (if) these techniques (self-improvement/adaptation and program synthesis) will mature, they promise to rise the amount of automation provided by declarative programming (that such setting is to be considered "programming" is sometimes debated): we will concentrate more on Domain Engineering and Requirements Analysis and Engineering than on software manual design and development, manual debugging, manual system performance tuning and so on (possibly with less accidental complexity compared to that introduced with current manual, not self-improving/adapting techniques). This will also promote a level of agility yet to be demonstrated by current techniques.