I am working on a way to represent C/C++ program code in order to create a dataset and do some machine learning on it afterwards.
Thinking about code as text and doing some text mining doesn't seem right to me, because I'm more interested in the semantics and precision of the computations.
So what would be a good vector representation of programs?
Thanks.
I take it that you don't want to represent your programs as sequences of tokens.
Remember that you don't have to represent code as words. If you're interested in semantic relationships, you can use higher-level descriptions - for example, parse trees of expressions rather than tokens.
You can also take this grammatical approach further and represent the whole program as a parse tree in some grammar, rather than as a sequence of tokens. There are recurrent networks that can handle tree-structured data.
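If it helps, here is a rough sketch of how you could get such a tree for C code; this is just my suggestion, assuming the pycparser package (any C/C++ front end would do):

# Minimal sketch: obtain an AST for a C snippet with pycparser (assumed installed).
from pycparser import c_parser

code = """
int add(int a, int b) {
    return a + b;
}
"""

parser = c_parser.CParser()
ast = parser.parse(code)   # abstract syntax tree of the translation unit
ast.show()                 # prints the tree node by node

From such a tree you can derive structured features (node types, production rules, root-to-leaf paths) instead of raw tokens.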
I am trying to wrap my head around the concept of probabilistic programming but the more I read, the more I feel confused.
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into a programming language for the creation of automated inference models?
I have some background in machine learning, and I remember that some machine learning models also output probabilities; then I came across the term probabilistic machine learning...
Is there a difference between the two? Or are they something similar?
I'd appreciate anyone who could help clarify.
I guess there is some vagueness between the two terms, but my take on them is the following:
Probabilistic programming is expressing probabilistic models as computer programs that generate data (i.e. simulators).
Probabilistic Models + Programming = Probabilistic Programming
Nothing is said about what comprises a probabilistic model (it may well be a neural network of some sort). Therefore, I view this term as:
More generic
More frequently used in an applied context (in relation to programming)
Probabilistic machine learning is another flavour of ML which deals with the probabilistic aspects of predictions; e.g., the model does not treat input/output values as certain and/or point values, but instead treats them (or some of them) as random variables. A prominent example of such an approach is the Gaussian process.
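To make the "computer programs that generate data" part concrete, here is a tiny simulator in plain Python/NumPy (my own illustration; the model and numbers are arbitrary):

# A probabilistic model written as a program that generates data (a simulator).
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=100):
    """Generative model: x ~ Uniform(0, 1), y = 2x + 1 + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=n)
    return x, y

x, y = simulate()

A probabilistic programming system additionally lets you condition on observed data (say, the y values) and infer the unobserved quantities.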
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into programming language for creation of automated inference models?
That is correct. A probabilistic program can be viewed as equivalent to a Bayesian network, but is expressed in a much richer language. Probabilistic programming as a field proposes such representations, as well as algorithms that take advantage of those representations, because sometimes a richer representation makes the problem easier.
Consider for example a probabilistic program that models a disease that is more likely to afflict men:
N = 1000000;
for i = 1:N {
    male[i] ~ Bernoulli(0.5);
    disease[i] ~ if male[i] then Bernoulli(0.8) else Bernoulli(0.3);
}
This probabilistic program is equivalent to the following Bayesian network accompanied by the appropriate conditional probability tables:
For highly repetitive networks such as this, authors often use plate notation to make their depiction more succinct:
However, plate notation is a device for human-readable publications, not a formal language in the same sense that a programming language is. Also, for more complex models, plate notation could become harder to understand and maintain. Finally, a programming language brings other benefits, such as primitive operations that make it easier to express conditional probabilities.
So, is it just a matter of having a convenient representation? No, because a more abstract representation contains more high-level information that can be used to improve inference performance.
Suppose that we want to compute the probability distribution on the number of people among N individuals with the disease. A straightforward and generic Bayesian network algorithm would have to consider the large number of 2^N combinations of assignments to the disease variables in order to compute that answer.
The probabilistic program representation, however, explicitly indicates that the conditional probabilities for disease[i] and male[i] are identical for all i. An inference algorithm can exploit that to compute the marginal probability of disease[i], which is identical for all i, use the fact that the number of diseased people will therefore follow a binomial distribution B(N, P(disease[i])) (here P(disease[i]) = 0.5 × 0.8 + 0.5 × 0.3 = 0.55), and provide that as the desired answer, in time constant in N. It would also be able to provide an explanation of this conclusion that would be more understandable and insightful to a user.
One could argue that this comparison is unfair because a knowledgeable user would not pose the query as defined for the explicit O(N)-sized Bayesian network, but instead simplify the problem in advance by exploiting its simple structure. However, the user may not be knowledgeable enough for such a simplification, particularly for more complex cases, or may not have the time to do it, or may make a mistake, or may not know in advance what the model will be so she cannot simplify it manually like that. Probabilistic programming offers the possibility that such simplification be made automatically.
To be fair, most current probabilistic programming tools (such as JAGS and Stan) do not perform this more sophisticated mathematical reasoning (often called lifted probabilistic inference) and instead simply perform Markov Chain Monte Carlo (MCMC) sampling over the Bayesian network equivalent to the probabilistic program (but often without having to build the entire network in advance, which is also another possible gain). In any case, this convenience already more than justifies their use.
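For concreteness, here is roughly how the same model could be written in a current tool; this is my own sketch using PyMC (one option alongside JAGS and Stan), with N kept small so the generic MCMC stays cheap:

# Sketch of the disease model in PyMC (assumed installed); illustrative only.
import pymc as pm

N = 1000  # reduced from 1,000,000 for the sketch
with pm.Model() as model:
    male = pm.Bernoulli("male", p=0.5, shape=N)
    p_disease = pm.math.switch(male, 0.8, 0.3)   # P(disease | male) vs P(disease | not male)
    disease = pm.Bernoulli("disease", p=p_disease, shape=N)
    n_diseased = pm.Deterministic("n_diseased", disease.sum())
    idata = pm.sample()   # generic MCMC over the equivalent Bayesian network

The lifted reasoning described above would instead recognise directly that n_diseased ~ Binomial(N, 0.55), without any sampling.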
I have recently had the need to create an ANTLR language grammar for the purpose of a transpiler (converting one scripting language to another). It occurs to me that Google Translate does a pretty good job translating natural language. We have all manner of recurrent neural network models, LSTMs, and GPT-2 generating grammatically correct text.
Question: Is there a model sufficient to train on grammar/code example combinations for the purpose of then outputting a new grammar file given an arbitrary example source-code?
I doubt any such model exists.
The main issue is that languages are generated from grammars, and it is next to impossible to go back the other way because of the infinite number of parse trees (combinations) possible for different source codes.
So in your case, say you train on Python code (1,000 sample programs): the target grammar is the same for all of them, so the model will always generate the same grammar irrespective of the example source code.
If you use training samples from a number of languages, the model still can't generate the grammar, as there is an infinite number of possibilities.
Your example of Google Translate works for real-life translation because small errors are acceptable, but such models don't rely on generating the root grammar for each language. There are some tools that can translate between programming languages, for example, but they don't generate the grammar; they work based on a given grammar.
Update
How to learn grammar from code.
After comparing this to some NLP concepts, I have a list of issues that may arise and ways to counter them.
Dealing with variable names, coding structures and tokens.
For understanding the grammar, we'll have to break the code down to its bare minimum form. This means understanding what each and every term in the code means. Have a look at this example:
The already simple expression is reduced to a parse tree. We can see that the tree breaks down the expression and tags each number as a factor. This is really important to get rid of the human element of the code (such as variable names) and dive into the actual grammar. In NLP this concept is known as part-of-speech tagging. You'll have to develop your own method to do the tagging, but it's easy given that you know the grammar for the language.
Understanding the relations
For this, you can tokenize the reduced code and train a model based on the output you are looking for. If you want to generate code, make use of an n-gram model or an LSTM, like this example. The model will learn the grammar, but extracting it is not a simple task. You'll have to run separate code to try to extract all the possible relations learned by the model.
Example
Code snippet
// Sample code
int a = 1 + 2;
cout << a;
Tags
// Sample tags and tokens
int a = 1 + 2 ;
[int] [variable] [operator] [factor] [expr] [factor] [end]
Leaving the operator, expr and keyword tokens as they are shouldn't matter if there is enough data present; they will simply become part of the grammar.
This is a sample to help you understand my idea. You can improve on this by having a deeper look at the theory of computation, understanding how automata work, and the different types of grammars.
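As a rough illustration of the tagging step, here is a toy Python tagger of my own (not part of the original idea); the tag names follow the example above:

# Toy tagger: reduce a C-like statement to grammar-level tags, mirroring the example above.
import re

KEYWORDS = {"int", "float", "char"}   # assumed set of type keywords

def tag(code):
    tokens = re.findall(r"<<|[A-Za-z_]\w*|\d+|[^\s\w]", code)
    tags = []
    for tok in tokens:
        if tok in KEYWORDS:
            tags.append("[" + tok + "]")
        elif tok == "=":
            tags.append("[operator]")
        elif tok in {"+", "-", "*", "/"}:
            tags.append("[expr]")
        elif tok.isdigit():
            tags.append("[factor]")
        elif tok == ";":
            tags.append("[end]")
        else:
            tags.append("[variable]")
    return list(zip(tokens, tags))

print(tag("int a = 1 + 2;"))
# [('int', '[int]'), ('a', '[variable]'), ('=', '[operator]'),
#  ('1', '[factor]'), ('+', '[expr]'), ('2', '[factor]'), (';', '[end]')]

A real tagger would of course be driven by the language's grammar rather than hand-written rules like these.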
What you're describing is 'just' learning the structure of context-free grammars.
I'm not sure if this approach will actually work for your case, but it's a long-standing problem in NLP: grammar induction for Context-Free Grammars. An example introduction how to tackle this problem using statistical learning methods can be found in Charniak's Statistical Language Learning.
Note that what I described is about CFGs in general, but you might want to check induction for LL grammars, because parser generators mostly use these types of grammars.
I know nothing about ANTLR, but there are pretty good examples of translating natural language e.g. into valid SQL requests: http://nlpprogress.com/english/semantic_parsing.html#sql-parsing.
I checked several document clustering algorithms, such as LSA, pLSA, LDA, etc. It seems they all require representing the documents to be clustered as a document-word matrix, where rows stand for documents and columns stand for the words appearing in them. And this matrix is often very sparse.
I am wondering: are there any other options for representing documents besides the document-word matrix? I believe the way we express a problem has a significant influence on how well we can solve it.
As @ffriend pointed out, you cannot really avoid using the term-document-matrix (TDM) paradigm. Clustering methods operate on points in a vector space, and this is exactly what the TDM encodes. However, within that conceptual framework there are many things you can do to improve the quality of the TDM:
feature selection and re-weighting attempt to remove or down-weight features (words) that do not contribute useful information (in the sense that your chosen algorithm does just as well or better without these features, or with their counts down-weighted). You might want to read more about mutual information (and its many variants) and TF-IDF.
dimensionality reduction is about encoding the information as accurately as possible in the TDM using fewer columns. Singular Value Decomposition (the basis of LSA) and Non-Negative Tensor Factorisation are popular in the NLP community. A desirable side effect is that the TDM becomes considerably less sparse (see the sketch at the end of this answer).
feature engineering attempts to build a TDM where the choice of columns is motivated by linguistic knowledge. For instance, you may want to use bigrams instead of words, or only use nouns (requires a part-of-speech tagger), or only use nouns with their associated adjectival modifier (e.g. big cat, requires a dependency parser). This is a very empirical line of work and involves a lot of experimentation, but often yield improved results.
the distributional hypothesis makes it possible to get a vector representing the meaning of each word in a document. There has been work on trying to build up a representation of an entire document from the representations of the words it contains (composition). Here is a shameless link to my own post describing the idea.
There is a massive body of work on formal and logical semantics that I am not intimately familiar with. A document can be encoded as a set of predicates instead of a set of words, i.e. the columns of the TDM can be predicates. In that framework you can do inference and composition, but lexical semantics (the meaning of individual words) is hard to deal with.
For a really detailed overview, I recommend Turney and Pantel's "From Frequency to Meaning: Vector Space Models of Semantics".
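A minimal scikit-learn sketch of the re-weighting and dimensionality-reduction points above (my own illustration; the corpus and parameters are arbitrary):

# TF-IDF weighting of the term-document matrix, then truncated SVD (as in LSA).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell sharply"]

tdm = TfidfVectorizer().fit_transform(docs)             # sparse TDM with TF-IDF weights
lsa = TruncatedSVD(n_components=2).fit_transform(tdm)   # dense low-dimensional representation
print(lsa.shape)   # (3, 2)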
Your question says you want document clustering, not term clustering or dimensionality reduction. Therefore I'd suggest you steer clear of the LSA family of methods, since they're a preprocessing step.
Define a feature-based representation of your documents (which can be, or include, term counts but needn't be), and then apply a standard clustering method. I'd suggest starting with k-means, as it's extremely simple and there are many, many implementations of it.
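For instance, with scikit-learn (my own sketch; any feature matrix X with one row per document would do):

# k-means on an arbitrary feature-based document representation.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 20)   # placeholder: 100 documents, 20 features each
labels = KMeans(n_clusters=5, n_init=10).fit_predict(X)
print(labels[:10])            # cluster assignments for the first 10 documents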
OK, this is quite a general question, and many answers are possible, none of them definitive, because it's an ongoing research area. So far, the answers I have read mainly concern so-called "vector-space models", and your question is phrased in a way that suggests such "statistical" approaches. Yet, if you want to avoid manipulating explicit term-document matrices, you might want to have a closer look at the Bayesian paradigm, which relies on the same distributional hypothesis but exploits a different theoretical framework: you no longer manipulate raw distances, but rather probability distributions and, most importantly, you can do inference based on them.
You mentioned LDA; I guess you mean Latent Dirichlet Allocation, which is the most well-known Bayesian model of this kind for document clustering. It is an alternative paradigm to vector space models, and a winning one: it has been proven to give very good results, which justifies its current success. Of course, one can argue that you still use a kind of term-document matrix through the multinomial parameters, but it's clearly not the most important aspect, and Bayesian researchers rarely (if ever) use this term.
Because of its success, there are many software packages that implement LDA available on the net. Here is one, but there are many others:
http://jgibblda.sourceforge.net/
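scikit-learn also ships an implementation; a minimal sketch of my own, with an arbitrary toy corpus:

# Fit LDA topics on raw word counts, then read off a topic distribution per document.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat", "dogs chase cats", "stock markets fell sharply"]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # one topic distribution per document
print(doc_topics.round(2))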
I'm working on a project that is trying to use context-free grammars for parsing images. We are trying to construct trees of image segments, then use machine learning to parse images using these visual grammars.
I have found SVM-CFG which looks ideal, the trouble is that it is designed for string parsing, where each terminal in the string has at most two neighbors (the words before and after). In our visual grammar, each segment can be next to an arbitrary number of other segments.
What is the best way to parse these visual grammars? Specifically, can I encode my data to use SVM-CFG? Or am I going to have to write my own Kernel/parsing library?
SVM-CFG is a specific implementation of the cutting plane optimization algorithm used in SVM-struct (described here http://www.cs.cornell.edu/People/tj/publications/tsochantaridis_etal_04a.pdf, Section 4).
At each step, the cutting plane algorithm calls a function to find the highest scoring structured output assignment (in SVM-CFG this is the highest scoring parse).
For one-dimensional strings, SVM-CFG runs a dynamic programming algorithm to find the highest scoring parse in polynomial time.
You could extend SVM-struct to return the highest scoring parse for an image, but no polynomial-time algorithm exists to do this!
Here is a reference for a state-of-the-art technique that parses images: http://www.socher.org/uploads/Main/SocherLinNgManning_ICML2011.pdf. They run into the same problem for finding the highest scoring parse of an image segmentation, so they use a greedy algorithm to find an approximate solution (see section 4.2). You might be able to incorporate a similar greedy algorithm into SVM-struct.
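As a rough illustration of such a greedy strategy, here is a toy sketch of my own (in Socher et al. the scoring function is learned; here it is a placeholder):

# Toy greedy "parsing" of image segments: repeatedly merge the highest-scoring
# adjacent pair until a single region (the root of the parse tree) remains.
def greedy_parse(segments, adjacency, score):
    """segments: {id: feature}; adjacency: set of frozensets {id_a, id_b} over a connected region graph."""
    tree = {i: i for i in segments}
    next_id = max(segments) + 1
    while len(segments) > 1:
        a, b = max(adjacency, key=lambda pair: score(*(segments[i] for i in pair)))
        segments[next_id] = (segments[a] + segments[b]) / 2.0   # toy merged feature: the average
        tree[next_id] = (tree[a], tree[b])
        adjacency = {frozenset(next_id if i in (a, b) else i for i in pair)
                     for pair in adjacency if pair != frozenset((a, b))}
        del segments[a], segments[b]
        next_id += 1
    return tree[next(iter(segments))]

segs = {0: 0.10, 1: 0.15, 2: 0.90}               # per-segment features (e.g. mean intensity)
adj = {frozenset((0, 1)), frozenset((1, 2))}     # which segments are neighbours
similar = lambda x, y: -abs(x - y)               # placeholder for a learned scoring function
print(greedy_parse(segs, adj, similar))          # e.g. ((0, 1), 2)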
Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
If we were to binarize the sentence "The cat ate the dog", we could start by assigning every word an ID (for example cat-1, ate-2, the-3, dog-4) and then simply replace each word by its ID, giving the vector <3,1,2,3,4>.
Given these IDs we could also create a binary vector by giving each word four possible slots and setting the slot corresponding to a specific word to one, giving the vector <0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1>. The latter method is, as far as I know, commonly referred to as the bag-of-words method.
Now for my question: what is the best binarization method when it comes to describing features for natural language processing in general, and transition-based dependency parsing (with Nivre's algorithm) in particular?
In this context, we do not want to encode the whole sentence, but rather the current state of the parse, for example the top word on the stack and the first word in the input queue. Since order is highly relevant, this rules out the bag-of-words method.
By best, I am referring to the method that makes the data the most intelligible for the classifier, without using up unnecessary memory. For example, I don't want a word bigram to use 400 million features for 20,000 unique words if only 2% of the bigrams actually exist.
Since the answer also depends on the particular classifier, I am mostly interested in maximum entropy models (liblinear), support vector machines (libsvm) and perceptrons, but answers that apply to other models are also welcome.
This is actually a really complex question. The first decision you have to make is whether to lemmatize your input tokens (your words). If you do this, you dramatically decrease your type count, and your syntax parsing gets a lot less complicated. However, it takes a lot of work to lemmatize a token. Now, in a computer language, this task gets greatly reduced, as most languages separate keywords or variable names with a well defined set of symbols, like whitespace or a period or whatnot.
The second crucial decision is what you're going to do with the data post-facto. The "bag-of-words" method, in the binary form you've presented, ignores word order, which is completely fine if you're doing summarization of a text or maybe a Google-style search where you don't care where the words appear, as long as they appear. If, on the other hand, you're building something like a compiler or parser, order is very much important. You can use the token-vector approach (as in your second paragraph), or you can extend the bag-of-words approach such that each non-zero entry in the bag-of-words vector contains the linear index position of the token in the phrase.
Finally, if you're going to be building parse trees, there are obvious reasons why you'd want to go with the token-vector approach, as it's a big hassle to maintain sub-phrase ids for every word in the bag-of-words vector, but very easy to make "sub-vectors" in a token-vector. In fact, Eric Brill used a token-id sequence for his part-of-speech tagger, which is really neat.
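Regarding the memory concern from the question (only a small fraction of bigrams ever occur), one option I'd add is a sparse encoding that only creates slots for feature values actually observed, e.g. scikit-learn's DictVectorizer; a minimal sketch with hypothetical parser-state features:

# Sparse one-hot encoding: only feature=value pairs seen in the data get a column.
from sklearn.feature_extraction import DictVectorizer

states = [
    {"stack0": "saw", "queue0": "the", "bigram": "saw_the"},
    {"stack0": "the", "queue0": "dog", "bigram": "the_dog"},
]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(states)          # scipy sparse matrix, one column per observed feature=value
print(vec.get_feature_names_out())     # e.g. ['bigram=saw_the', 'bigram=the_dog', ...]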
Do you mind if I ask what specific task you're working on?
Binarization is the act of transforming colorful features of an entity into vectors of numbers, most often binary vectors, to make good examples for classifier algorithms.
I have mostly come across numeric features that take values between 0 and 1 (not binary, as you describe), representing the relevance of the particular feature in the vector (between 0% and 100%, where 1 represents 100%). A common example of this is tf-idf vectors: in the vector representing a document (or sentence), you have a value for each term in the entire vocabulary that indicates the relevance of that term for the represented document.
As Mike already said in his reply, this is a complex problem in a wide field. In addition to his pointers, you might find it useful to look into some information retrieval techniques like the vector space model, vector space classification and latent semantic indexing as starting points. Also, the field of word sense disambiguation deals a lot with feature representation issues in NLP.
[Not a direct answer] It all depends on what you are trying to parse and then process, but for general short human-phrase processing (e.g. IVT) another method is to use neural networks to learn the patterns. This can be very accurate for smallish vocabularies.