simplify equations/expressions using Javacc/jjtree - parsing

I have created a grammar to read a file of equations then created AST nodes for each rule.My question is how can I do simplification or substitute vales on the equations that the parser is able to read correctly. in which stage? before creating AST nodes or after?
Please provide me with ideas or tutorials to follow.
Thank you.

I'm assuming you equations are something like simple polynomials over real-value variables, like X^2+3*Y^2
You ask for two different solutions to two different problems that start with having an AST for at least one equation:
How to "substitute values" into the equation and compute the resulting value, e.g, for X==3 and Y=2, substitute into the AST for the formula above and compute 3^2+3*2^2 --> 21
How to do simplification: I assume you mean algebraic simplification.
The first problem of substituting values is fairly easy if yuo already have the AST. (If not, parse the equation to produce the AST first!) Then all you have to do is walk the AST, replacing every leaf node containing a variable name with the corresponding value, and then doing arithmetic on any parent nodes whose children now happen to be numbers; you repeat this until no more nodes can be arithmetically evaluated. Basically you wire simple arithmetic into a tree evaluation scheme.
Sometimes your evaluation will reduce the tree to a single value as in the example, and you can print the numeric result My SO answer shows how do that in detail. You can easily implement this yourself in a small project, even using JavaCC/JJTree appropriately adapted.
Sometimes the formula will end up in a state where no further arithmetic on it is possible, e.g., 1+x+y with x==0 and nothing known about y; then the result of such a subsitution/arithmetic evaluation process will be 1+y. Unfortunately, you will only have this as an AST... now you need to print out the resulting AST in order for the user to see the result. This is harder; see my SO answer on how to prettyprint a tree. This is considerably more work; if you restrict your tree to just polynomials over expressions, you can still do this in small project. JavaCC will help you with parsing, but provides zero help with prettyprinting.
The second problem is much harder, because you must not only accomplish variable substitution and arithmetic evaluation as above, but you have to somehow encode knowledge of algebraic laws, and how to match those laws to complex trees. You might hardwire one or two algebraic laws (e.g., x+0 -> x; y-y -> 0) but hardwiring many laws this way will produce an impossible mess because of how they interact.
JavaCC might form part of such an answer, but only a small part; the rest of the solution is hard enough so you are better off looking for an alternative rather than trying to build it all on top of JavaCC.
You need a more organized approach for this: a Program Transformation System (PTS). A typical PTS will allow you specify
a grammar for an arbitrary language (in your case, simply polynomials),
automatically parses instance to ASTs and can regenerate valid text from the AST. A good PTS will let you write source-to-source transformation rules that the PTS will apply automatically the instance AST; in your case you'd write down the algebraic laws as source-to-source rules and then the PTS does all the work.
An example is too long to provide here. But here I describe how to define formulas suitable for early calculus classes, and how to define algebraic rules that simply such formulas including applying some class calculus derivative laws.
With sufficient/significant effort, you can build your own PTS on top of JavaCC/JJTree. This is likely to take a few man-years. Easier to get a PTS rather than repeat all that work.

Related

Searching/predicting next terminal/non-terminal by CFG/Tree?

I'm looking for algorithm to help me predict next token given a string/prefix and Context free grammar.
First question is what is the exact structure representing CFG. It seems it is a tree, but what type of tree ? I'm asking because the leaves are always ordered , is there a ordered-tree ?
May be if i know the correct structure I can find algorithm for bottom-up search !
If it is not exactly a Search problem, then the next closest thing it looks like Parsing the prefix-string and then Generating the next-token ? How do I do that ?
any ideas
my current generated grammar is simple it has no OR rules (except when i decide to reuse the grammar for new sequences, i will be). It is generated by Sequitur algo and is so called SLG(single line grammar) .. but if I generate it using many seq's the TOP rule will be Ex:>
S : S1 z S3 | u S2 .. S5 S1 | S4 S2 .. |... | Sn
S1 : a b
S2 : h u y
...
..i.e. top-heavy SLG, except the top rule all others do not have OR |
As a side note I'm thinking of a ways to convert it to Prolog and/or DCG program, where may be there is easier way to do what I want easily ?! what do you think ?
TL;DR: In abstract, this is a hard problem. But it can be pretty simple for given grammars. Everything depends on the nature of the grammar.
The basic algorithm indeed starts by using some parsing algorithm on the prefix. A rough prediction can then be made by attempting to continue the parse with each possible token, retaining only those which do not produce immediate errors.
That will certainly give you a list which includes all of the possible continuations. But the list may also include tokens which cannot appear in a correct input. Indeed, it is possible that the correct list is empty (because the given prefix is not the prefix of any correct input); this will happen if the parsing algorithm is unable to correctly verify whether a token sequence is a possible prefix.
In part, this will depend on the grammar itself. If the grammar is LR(1), for example, then the LR(1) parsing algorithm can precisely identify the continuation set. If the grammar is LR(k) for some k>1, then it is theoretically possible to produce an LR(1) grammar for the same language, but the resulting grammar might be impractically large. Otherwise, you might have to settle for "false positives". That might be acceptable if your goal is to provide tab-completion, but in other circumstances it might not be so useful.
The precise datastructure used to perform the internal parse and exploration of alternatives will depend on the parsing algorithm used. Many parsing algorithms, including the standard LR parsing algorithm whose internal data structure is a simple stack, feature a mutable internal state which is not really suitable for the exploration step; you could adapt such an algorithm by making a copy of the entire internal data structure (that is, the stack) before proceeding with each trial token. Alternatively, you could implement a copy-on-write stack. But the parser stack is not usually very big, so copying it each time is generally feasible. (That's what Bison does to produce expanded error messages with an "expected token" list, and it doesn't seem to trigger unacceptable runtime overhead in practice.)
Alternatively, you could use some variant of CYK chart parsing (or a GLR algorithm like the Earley algorithm), whose internal data structures can be implemented in a way which doesn't involve destructive modification. Such algorithms are generally used for grammars which are not LR(1), since they can cope with any CFG although highly ambiguous grammars can take a long time to parse (proportional to the cube of the input length). As mentioned above, though, you will get false positives from such algorithms.
If false positives are unacceptable, then you could use some kind of heuristic search to attempt to find an input sequence which completes the trial prefix. This can in theory take quite a long time, but for many grammars a breadth-first search can find a completion within a reasonable time, so you could terminate the search after a given maximum time. This will not produce false positives, but the time limit might prevent it from finding the complete set of possible continuations.

Source code logic evaluation

I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given source code implements a particular sorting algorithm the correct way (using bubble method, for instance)?
I can enforce a user to give a legitimate function by analyzing function signature: making sure the the argument and return value is an array of integers. But I have no idea how to determine that algorithm logic is being done the right way. The input code could sort values correctly, but not in an aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.
In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has number of parts:
a to-be-sorted datatype S with a partial order,
a container data type C with single instance variable A ("the array")
that holds the to-be-sorted data
a key type K ("array index") used to access the container that has a partial order
such that container[K] is type S
a comparison of two members of container, using key A and key B
such that A < B according to the key partial order, that determines
if container[B]>container of A
a swap operation on container[A], container[B] and some variable T of type S, that is conditionaly dependent on the comparison
a loop wrapped around the container that enumerates keys in according the partial order on K
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
To do this concretely, you need standard program analysis machinery:
to parse the source code and build an abstract syntax tree
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you check that various recognized bits occur in appropriate ordering
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc code procedural code to climb over the AST, ST, CFG, DFG, to "recognize" each of the individual parts. This is likely to be pretty messy as each recognizer will be checking these structures for evidence of its bit. But, you can do it.
This is messy enough, and interesting enough, so there are tools which can do much of this.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a Dataflow pattern matching language, inspired by Rich and Water's 1980's "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
" \T = \v1;
\v1 = \v2;
\v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
" if (\v1 > \v2)
\swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
= " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
= " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
" \k1 = \smallestK\(\);
while (\k1<\size\(container\)) {
\k2 = \next\(k1);
while (\k2 <= \size\(container\) {
\conditionalswap\(\container_access\(\container\,\k1\),
\container_access\(\container\,\k2\) \)
}
}
";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions, and ask to "match bubble_sort", DMS will visit the DFG (tied to CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the patter takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots] and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement quite complete factory control model extraction tools from real industrial plant controllers for Dow Chemical on their peculiar Dowtran language (meant building parsers, etc. as above for Dowtran). We have version of this prototyped for C; the data flow analysis is harder.

Can you "teach" computers to do algebra using variable expressions (eg aX+bX=(a+b)X)

Let's say in the example lower case is constant and upper case is variable.
I'd like to have programs that can "intelligently" do specified tasks like algebra, but teaching the program new methods should be easy using symbols understood by humans. For example if the program told these facts:
aX+bX=(a+b)X
if a=bX then X=a/b
Then it should be able to perform these operations:
2a+3a=5a
3x+3x=6x
3x=1 therefore x=1/3
4x+2x=1 -> 6x=1 therefore x= 1/6
I was trying to do similar things with Prolog as it can easily "understand" variables, but then I had too many complications, mainly because two describing a relationship both ways results in a crash. (not easy to sort out)
To summarise: I want to know if a program which can be taught algebra by using mathematic symbols only. I'd like to know if other people have tried this and how complicated it is expected to be. The purpose of this is to make programming easier (runtime is not so important)
It depends on what do you want machine to do and how intelligent it should be.
Your question is mostly about AI but not ML. AI deals with formalization of "human" tasks while ML (though being a subset of AI) is about building models from data.
Described program may be implemented like this:
Each fact form a pattern. Program given with an expression and some patterns can try to apply some of them to expression and see what happens. If you want your program to be able to, for example, solve quadratic equations given rule like ax² + bx + c = 0 → x = (-b ± sqrt(b²-4ac))/(2a) then it'd be designed as follows:
Somebody gives a set of rules. Rule consists of a pattern and an outcome (solution or equivalent form). Think about the pattern as kind of a regular expression.
Then the program is asked to show some intelligence and prove its knowledge via doing something with a given expression. Here comes the major part:
you build a graph of expressions by applying possible rules (if a pattern is applicable to an expression you add new vertex with the corresponding outcome).
Then you run some path-search algorithm (A*, for example) to find sequence of transformations leading to the form like x = ...
I think this is an interesting question, although it off topic in SO (tool recommendation)
But nevertheless, because it captured my imagination, I wrote couple of function using R that can solve stuff like that quite easily
First, you'll have to install R, after words you'll need to download package called stringr
So in R console run
install.packages("stringr")
library(stringr)
And then you can define the following functions that I wrote
FirstFunc <- function(temp){
paste0(eval(parse(text = gsub("[A-Z]", "", temp))), unique(str_extract_all(temp, "[A-Z]")[[1]]))
}
SecondFunc <- function(temp){
eval(parse(text = strsplit(temp, "=")[[1]][2])) / eval(parse(text = gsub("[[:alpha:]]", "", strsplit(temp, "=")[[1]][1])))
}
Now, the first function will solve equations like
aX+bX=(a+b)X
While the second will solve equations like
4x+2x=1
For example
FirstFunc("3X+6X-2X-3X")
will return
"4X"
Now this functions is pretty primitive (mostly for the propose of illustration) and will solve equation that contain only one variable type, something like FirstFunc("3X-2X-2Y") won't give the correct result (but the function could be easily modified)
The second function will solve stuff like
SecondFunc("4x-2x=1")
will return
0.5
or
SecondFunc("4x+2x*3x=1")
will return
0.1
Note that this function also works only for one unknown variable (x) but could be easily modified too

Refactoring of decision trees using automatic learning

The problem is the following:
I developed an expression evaluation engine that provides a XPath-like language to the user so he can build the expressions. These expressions are then parsed and stored as an expression tree. There are many kinds of expressions, including logic (and/or/not), relational (=, !=, >, <, >=, <=), arithmetic (+, -, *, /) and if/then/else expressions.
Besides these operations, the expression can have constants (numbers, strings, dates, etc) and also access to external information by using a syntax similar to XPath to navigate in a tree of Java objects.
Given the above, we can build expressions like:
/some/value and /some/other/value
/some/value or /some/other/value
if (<evaluate some expression>) then
<evaluate some other expression>
else
<do something else>
Since the then-part and the else-part of the if-then-else expressions are expressions themselves, and everything is considered to be an expression, then anything can appear there, including other if-then-else's, allowing the user to build large decision trees by nesting if-then-else's.
As these expressions are built manually and prone to human error, I decided to build an automatic learning process capable of optimizing these expression trees based on the analysis of common external data. For example: in the first expression above (/some/value and /some/other/value), if the result of /some/other/value is false most of the times, we can rearrange the tree so this branch will be the left branch to take advantage of short-circuit evaluation (the right side of the AND is not evaluated since the left side already determined the result).
Another possible optimization is to rearrange nested if-then-else expressions (decision trees) so the most frequent path taken, based on the most common external data used, will be executed sooner in the future, avoiding unnecessary evaluation of some branches most of the times.
Do you have any ideas on what would be the best or recommended approach/algorithm to use to perform this automatic refactoring of these expression trees?
I think what you are describing is compiler optimizations which is a huge subject with everything from
inline expansion
deadcode elimination
constant propagation
loop transformation
Basically you have a lot of rewrite rules that are guaranteed to preserve the functionality of the code/xpath.
In the question on rearranging of the nested if-else I don't think you need to resort to machine-learning.
One (I think optimal) approach would be to use Huffman coding of your links huffman_coding
Take each path as a letter and we then encode them with Huffman coding and get a so called Huffman tree. This tree will have the least evaluations running on a (large enough) sample with the same distribution you made the Huffman tree from.
If you have restrictions on ``evaluate some expression''-expresssion or that they have different computational cost etc. You probably need another approach.
And remember, as always when it comes to optimization you should be careful and only do things that really matter.

use variables in Lemon parser?

I want to allow mathematical variables in my Lemon parser-driven app. For example, if the user enters x^2+y, I want to then be able to evaluate this for 100000 different pairs of values of x and y, hopefully without having to reparse each time. The only way I can think of to do that is to have the parser generate a tree of objects, which then evaluates the expression when given the input. Is there a better/simpler/faster way?
Performance may be an issue here. But I also care about ease of coding and code upkeep.
If you want the most maintainable code, evaluate the expression as you parse. Don't build a tree.
If you want to re-execute the expression a lot, and the expression is complicated, you'll need to avoid reparsing (in order of most to least maintainable): build tree and evaluate, generate threaded code and evaluate, generate native code and evaluate.
If the expressions are generally as simple as your example, a recursive descent hand-coded parser that evaluates on the fly will likely be very fast, and work pretty well, even for 100,000 iterations. Such parsers will likely take much less time to execute than Lemon.
That is indeed how you would typically do it, unless you want to generate actual (real or virtual) code. x and y would just be variables in your case so you would fill in the actual values and then call your Evaluate function to evaluate the expression. The tree node would then contain pointers to the variables x and y, and so on. No need to parse it for each pair of test values.

Resources