Alternative to Rhino for expression parsing in Java

In my Java program I use the Rhino expression evaluator to parse an expression from a table into keys and values, which are matched against a given Java object. The performance of this evaluator is not good: on average it takes about 50ms per evaluation. Is there a better/faster version of Rhino, or any other JS interpreter, that I can use?
I can't use natively compiled JS interpreters like SpiderMonkey, as the whole codebase is Java.

Google's V8 engine is relatively speedy (and comes with some trimmings/utilities when pulled in with Node.js), but without knowing a little more about your expressions and comparisons, it's hard to say whether you'll get a significant speed boost by using another JavaScript interpreter.

Related

How is Coq's parser implemented?

I was entirely amazed by how Coq's parser is implemented, e.g.
https://softwarefoundations.cis.upenn.edu/lf-current/Imp.html#lab347
It seems the parser is happy to accept arbitrary lexemes introduced by a Notation command, and the subsequent parser can then parse any such expression as-is. This means the grammar must be context-sensitive. It is so flexible that it goes beyond my comprehension.
Any pointers on how this kind of parser is theoretically feasible? How should it work? Any materials or knowledge would help; I am just trying to learn about this type of parser in general. Thanks.
Please do not ask me to read Coq's source myself. I want to understand the idea in general, not a specific implementation.
Indeed, this notation system is very powerful, and it was probably one of the reasons for Coq's success. In practice, it is a source of much complication in the source code. I think that @ejgallego should be able to tell you more about it, but here is a quick explanation:
At the beginning, Coq's documents were evaluated sentence by sentence (sentences are separated by dots) by coqtop. Some commands can define notations and these modify the parsing rules when they are evaluated. Thus, later sentences are evaluated with a slightly different parser.
Since version 8.5, there is also a mechanism (the STM) to evaluate a document fully (many sentences in parallel) but there is some special mechanism for handling these notation commands (basically you have to wait for these to be evaluated before you can continue parsing and evaluating the rest of the document).
Thus, contrary to a normal programming language, where the compiler will take a document, pass it through the lexer, then the parser (parse the full document in one go), and then have an AST to give to the typer or other later stages, in Coq each command is parsed and evaluated separately. Thus, there is no need to resort to complex contextual grammars...
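A toy illustration of the key point, that commands evaluated earlier can mutate the table the parser uses for later sentences, might look like this (a sketch in Python; this mimics only the shape of the Notation mechanism, nothing like Coq's real internals):

```python
# Toy sketch: sentences are evaluated one by one, and an "Infix"
# command mutates the operator table, changing how later sentences
# are parsed and evaluated.
import operator

ops = {"+": operator.add}                    # the mutable "grammar"

def eval_sentence(sentence):
    if sentence.startswith("Infix "):        # e.g. "Infix ** max"
        _, symbol, fname = sentence.split()
        ops[symbol] = {"max": max, "min": min}[fname]
        return None
    lhs, symbol, rhs = sentence.split()      # parsed with the current rules
    return ops[symbol](int(lhs), int(rhs))

print(eval_sentence("1 + 2"))        # 3
eval_sentence("Infix ** max")        # extends the parser...
print(eval_sentence("4 ** 9"))       # ...so this now parses: prints 9
```

The sentence `4 ** 9` is only meaningful because a previously evaluated command changed the table; that is the essential reason evaluation must proceed sentence by sentence.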
I'll drop my two cents to complement @Zimmi48's excellent answer.
Coq indeed features an extensible parser, which TTBOMK is mainly the work of Hugo Herbelin, built on the CAMLP4/CAMLP5 extensible parsing system by Daniel de Rauglaudre. Both are the canonical sources of information about the parser; I'll try to summarize what I know, but note that my experience with the system is short.
The CAMLPX system basically supports any LL(1) grammar. Coq exposes the whole set of grammar rules to the user, allowing the user to redefine them. This is the base mechanism on which extensible grammars are built. Notations are compiled into parsing rules in the Metasyntax module, and unfolded in a later post-processing phase. And that is really it, AFAICT.
The system itself hasn't changed much over the whole 8.x series; @Zimmi48's comments relate more to the internal processing of commands after parsing. I recently learned that Coq v7 had an even more powerful system for modifying the parser.
In the words of Hugo Herbelin, "the art of extensible parsing is a delicate one", and indeed it is, but Coq has achieved a pretty great implementation of it.

Z3: Is a custom theory extension appropriate for my application?

I have precise and validated descriptions of the behaviors of many X86 instructions in terms amenable to encoding in QF_ABV and solving directly with the standard solver (using no special solving strategies). I wrote an SMT-LIB script whose interface matches my ultimate goal perfectly:
X86State, a record sort describing x86 machine state (registers and flags as bitvectors, and memory as an array).
X86Instr, a record sort describing x86 instructions (enumerated mnemonics, operands as an ML-like discriminated union describing registers, memory expressions, etc.)
A function x86-translate taking an X86State and an X86Instr, and returning a new X86State. It decodes the X86Instr and produces a new X86State in terms of the symbolic effects of the given X86Instr on the input X86State.
It's great for prototyping: the user can write x86 easily and directly. After simplifying a formula built using the library, all functions and extraneous data types are eliminated, leaving a QF_ABV expression. I hoped that users could simply (set-logic QF_ABV) and #include my script (alas, neither the SMT-LIB standard nor Z3 support #include).
Unfortunately, by defining functions and types, the script requires theories such as uninterpreted functions, thus requiring a logic other than QF_ABV (or even QF_AUFBV, due to the types). My experience with SMT solvers dictates that the lowest acceptable logic should be specified for the best solving time. Also, it is unclear whether I can reuse my SMT-LIB script in a programmatic context (e.g. OCaml, Python, C) as I desire. Finally, the script is a bit verbose given the lack of higher-order functions, and my lack of access to par leads to code duplication.
Thus, despite having accomplished my technical goals, I think that SMT-LIB might be the wrong approach. Is there a more natural avenue for interacting with Z3 to implement my x86 instruction description / QF_ABV translation scheme? Is the SMT-LIB script re-usable at all in these avenues? For example, you can build "custom OCaml top-levels", i.e. interpreters with scripts "burned into them". Something like that could be nice. Or do I have to re-implement the functionality in another language, in a program that interacts with Z3 via a theory extension (C DLL)? What's the best option here?
Well, I don't think that people write .smt2 files by hand. These are usually generated automatically by some program.
I find the Z3 Python interface quite nice, so I guess you could give it a try. But you can always write a simple .smt2 dumper from any language.
BTW, do you plan to release the specification you wrote for X86? I would be really interested!
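To illustrate the ".smt2 dumper" suggestion: generating the SMT-LIB text from a host language lets you keep a low logic like QF_BV/QF_ABV while avoiding hand-written scripts. A hedged sketch (the "inc" encoding below is purely illustrative, not your X86Instr design):

```python
# Sketch of a tiny SMT-LIB dumper: emit the script programmatically
# instead of writing .smt2 by hand. Models "inc reg" as reg1 = reg0 + 1.
def dump_inc(reg, width=8):
    return "\n".join([
        "(set-logic QF_BV)",
        f"(declare-const {reg}0 (_ BitVec {width}))",
        f"(declare-const {reg}1 (_ BitVec {width}))",
        f"(assert (= {reg}1 (bvadd {reg}0 (_ bv1 {width}))))",
        "(check-sat)",
    ])

print(dump_inc("eax"))
```

The generated text can be fed to any SMT-LIB-compliant solver, which also answers the reuse question: the same dumper works from OCaml, Python, or C.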

Lexical Analysis of a Scripting Language

I am trying to create a simple script language for a resource API. The resource API mainly creates game resources in a structured manner. What I want is to deal with this API without writing C++ programs each time I want a resource. So we (me and my instructor from uni) decided to create a simple script language for creating/editing resource files without compiling every time. There are also some other factors, not relevant here, which mean I need a command-line interface rather than a GUI program.
Anyway, here is a script sample:
<path>.<command> -<options>
/Graphics[3].add "blabla.png"
I didn't design this script language; the owner of the API did. The part before '.' is, as you can guess, the path, and the part after '.' is the actual command plus some options, flags, etc. As a first step, I tried to create a grammar for the left part, because I thought I could use it while searching for information about lexical analyzers and parsers. The problem is that I am inexperienced when it comes to parsing and programming languages, and I am not sure whether it is correct. Here are some more examples and the grammar of the left side.
dir -> '/' | '/' path
path -> object '/' path | object
object -> number | string '[' number ']'
The notation of this grammar may be a mess, I don't know. There are 5 different possibilities:
String
"String"
Number
String[Number]
"String"[Number]
It has to start with the '/' symbol, and if that is the only symbol, I will accept it as Root.
Now my problem is: how can I lexically analyze this script? Is there a special method? What should my lexical analyzer do and not do (I read that some lexical analyzers also do syntactic analysis up to a point)? Do you think the grammar, etc. is technically appropriate? What kind of parsing method should I use (recursive descent, LL, etc.)? I am trying to make this a technically appropriate piece of work. It's not commercial, so I have time, and thus I can learn lexical analysis and parsing properly. I don't want to use a parser library.
What should my lexical analyzer do and not do?
It should:
recognize tokens
ignore ignorable whitespace and comments (if there are such things)
optionally, keep track of source location in order to produce meaningful error messages.
It should not attempt to parse the input, although that will be very tempting with such a simple language.
From what I can see, you have the following tokens:
punctuation: /, ., [, ], linear-white-space, new-line
numbers
unquoted strings (often called "atoms" or "ids")
quoted strings (possibly the same token type as unquoted strings)
I'm not sure what the syntax for -options is, but that might include more possibilities.
Choosing to return linear-white-space (that is, a sequence consisting only of tabs and spaces) as a token is somewhat questionable; it complicates the grammar considerably, particularly since there are probably places where white-space is ignorable, such as the beginning and end of a line. But I have the intuition that you do not want to allow whitespace inside of a path and that you plan to require it between the command name and its arguments. That is, you want to prohibit:
/left /right[3] .whimper "hello, world"
/left/right[3].whimper"hello, world"
But maybe I'm wrong. Maybe you're happy to accept both. That would be simpler, because if you accept both, then you can just ignore linear-whitespace altogether.
By the way, experience has shown that using new-line to separate commands can be awkward; sooner or later you will need to break a command into two lines in order to avoid having to buy an extra monitor to see the entire line. The convention (used by bash and the C preprocessor, amongst others) of putting a \ as the last character on a line to be continued is possible, but can lead to annoying bugs (like having an invisible space following the \ and thus preventing it from really continuing the line).
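For concreteness, the token list above can be turned into a regex-driven lexer in a few lines. This sketch (in Python; the token names, and the choice to discard linear whitespace, are my assumptions) recognizes tokens without doing any parsing:

```python
import re

# Sketch of a lexer for the path/command language above. Note that
# finditer silently skips unmatched characters; a real lexer should
# detect and report them as errors.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>   \d+ )
  | (?P<QSTRING>  "[^"\n]*" )
  | (?P<ATOM>     [A-Za-z_][A-Za-z0-9_]* )
  | (?P<PUNCT>    [/.\[\]\-] )
  | (?P<NEWLINE>  \n )
  | (?P<SKIP>     [ \t]+ )
""", re.VERBOSE)

def tokenize(text):
    for m in TOKEN_RE.finditer(text):
        if m.lastgroup != "SKIP":       # discard linear whitespace
            yield (m.lastgroup, m.group())

print(list(tokenize('/Graphics[3].add "blabla.png"')))
```

The parser then consumes the resulting (type, text) pairs instead of raw characters, which keeps the two stages cleanly separated.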
From here down is 100% personal opinion, offered for free. So take it for what it's worth.
I am trying to make it a technically appropriate piece of work. It's not commercial, so I have time, thus I can learn lexical analysis and parsing better. I don't want to use a parser library.
There is a contradiction here, in my opinion. Or perhaps two contradictions.
A technically appropriate piece of work would use standard tools; at least a lexical generator and probably a parser generator. It would do that because, properly used, the lexical and grammatical descriptions provided to the tools document precisely the actual language, and the tools guarantee that the desired language is what is actually recognized. Writing ad hoc code, even simple lexical recognizers and recursive descent parsers, for all that it can be elegant, is less self-documenting, less maintainable, and provides fewer guarantees of correctness. Consequently, best practice is "use standard tools".
Secondly, I disagree with your instructor (if I understand their proposal correctly, based on your comments) that writing ad hoc lexers and parsers aids in understanding lexical and parsing theory. In fact, it may be counterproductive. Bottom-up parsing, which is incredibly elegant both theoretically and practically, is almost impossible to write by hand and totally impossible to read. Consequently, many programmers prefer to use recursive-descent or Pratt parsers, because they understand the code. However, such parsers are not as powerful as a bottom-up parser (particularly GLR or Earley parsers, which are fully general), and their use leads to unnecessary grammatical compromises.
You don't need to write a regular expression library to understand regular expressions. The libraries abstract away the awkward implementation details (and there are lots of them, and they really are awkward) and let you concentrate on the essence of creating and using regular expressions.
In the same way, you do not need to write a compiler in order to understand how to program in C. After you have a good basis in C, you can improve your understanding (maybe) by seeing how it translates into machine code, but unless you plan a career in compiler writing, knowing the details of obscure optimization algorithms is not going to make you a better programmer. Or, at least, it's not first on your agenda.
Similarly, once you really understand regular expressions, you might find writing a library interesting. Or not -- you might find it incredibly frustrating and give up after a couple of months of hard work. Either way, you will appreciate existing libraries more. But learn to use the existing libraries first.
And the same with parser generators. If you want to learn how to translate an idea for a programming language into something precise and implementable, learn how to use a parser generator. Only after you have mastered the theory of parsing should you even think of focusing on low-level implementations.

Why parser-generators instead of just configurable-parsers?

The title sums it up. Presumably anything that can be done with source-code-generating parser-generators (which essentially hard-code the grammar-to-be-parsed into the program) can be done with a configurable parser (which would maintain the grammar-to-be-parsed soft-coded as a data structure).
I suppose the hard-coded code-generated-parser will have a performance bonus with one less level of indirection, but the messiness of having to compile and run it (or to exec() it in dynamic languages) and the overall clunkiness of code-generation seems quite a big downside. Are there any other benefits of code-generating your parsers that I'm not aware of?
Most of the places I see code generation used is to work around limitations in the meta-programming ability of the languages (i.e. web frameworks, AOP, interfacing with databases), but the whole lex-parse thing seems pretty straightforward and static, not needing any of the extra metaprogramming dynamism that you get from code-generation. What gives? Is the performance benefit that great?
If all you want is a parser that you can configure by handing it grammar rules, that can be accomplished. An Earley parser will parse any context-free language given just a set of rules. The price is significant execution time: O(N^3), where N is the length of the input. If N is large (as it is for many parseable entities), you can end up with Very Slow parsing.
And this is the reason for a parser generator (PG). If you parse a lot of documents, Slow Parsing is bad news. Compilers are one place where people parse a lot of documents, and no programmer (or his manager) wants the programmer waiting for the compiler. There are lots of other things to parse: SQL queries, JSON documents, ... all of which have this "Nobody is willing to wait" property.
What PGs do is take many decisions that would otherwise have to occur at runtime (e.g., for an Earley parser) and precompute those results at parser-generation time. So an LALR(1) PG (e.g., Bison) will produce parsers that run in O(N) time, and that's obviously a lot faster in practical circumstances. (ANTLR does something similar for LL(k) parsers.) If you want full context-free parsing that is usually linear, you can use a variant of LR parsing called GLR parsing; this buys you the convenience of a "configurable" (Earley-style) parser with much better typical performance.
This idea of precomputing in advance is generally known as partial evaluation: given a function F(x, y) and the knowledge that x is always a certain constant x0, compute a new function F'(y) = F(x0, y) in which decisions and computations depending solely on the value of x are precomputed. F' usually runs a lot faster than F. In our case, F is something like generic parsing (e.g., an Earley parser), x is a grammar argument with x0 being a specific grammar, and F' is some parser infrastructure P plus additional code/tables computed by the PG, such that F' = PG(x0) + P.
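The partial-evaluation idea can be made concrete with a deliberately tiny example (a sketch, not a real parser generator): F takes the "grammar" and the input on every call, while PG hoists the grammar-dependent work out so that F' only touches the input.

```python
# F(x, y): a "configurable" recognizer that redoes the grammar-dependent
# work (building the keyword set) on every single call.
def interpret(keywords, text):
    return [w for w in text.split() if w in set(keywords)]

# PG(x): precompute everything that depends only on the grammar, and
# return F'(y), which does just the input-dependent work.
def generate(keywords):
    kwset = frozenset(keywords)          # hoisted out, computed once
    def parse(text):                     # F'(y) = F(x0, y)
        return [w for w in text.split() if w in kwset]
    return parse

parse = generate(["if", "then", "else"])
print(parse("if x then y else z"))       # ['if', 'then', 'else']
```

A real PG hoists far more than a set construction (state machines, lookup tables), but the shape of the transformation is exactly this.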
In the comments to your question, there seems to be some interest in why one doesn't just run the parser generator, in effect, at runtime. The simple answer is that doing so pays, at runtime, a significant part of the overhead cost you want to get rid of.

Parsing math rules (with some perks) the same way Soulver (a Mac App) does

Soulver is a great scratch pad for math that allows you to write expressions in a very natural form, which makes it versatile and fun to use on many occasions. There's a short video on the site that displays a lot of its functionality.
I'd like to tackle writing a parser that behaves much like that app's. For instance, if you go shopping, you can write a big list like
2 * 1.99 soap + 2.99 cereal + 39.59 organic magic beans
and see, as you type, the sum of what's in the line (46.56).
You can also create variables, such as
March = 2 * 1.99 soap + 2.99 cereal + 39.59 organic magic beans
and reference them in later operations. Other operators, such as 'off' (40% off $200), also exist.
Considering it has some level of sophistication, and that it should distinguish meaningful terms while ignoring some of the input, what sort of grammar should I be using to represent this little language? I could probably cobble some spaghetti regex together, but I'd honestly like to do something a little better, even if it requires a lot of study on my part. What would you recommend?
A regexp by itself is likely not expressive enough for the job if you want to model real mathematics, e.g., anything with nested parentheses.
Context-free grammars are remarkably expressive. You should learn about Backus Normal Form (BNF), a means for writing down the description of languages as context-free grammars.
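For example, a sketch of such a context-free grammar for the shopping-list lines might look like this (the nonterminal names are invented; this is not Soulver's actual grammar):

```
line       ::= expr
expr       ::= expr "+" term | term
term       ::= term "*" factor | factor
factor     ::= NUMBER annotation* | "(" expr ")"
annotation ::= WORD          ; ignorable labels like "soap" or "cereal"
```

Attaching the ignorable words to `factor` as trailing annotations is one way to "distinguish meaningful terms while ignoring some of the input" inside the grammar itself, rather than in the lexer.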
You can choose from among many parser generator tools, to convert that grammar into a real parser.
Which specific grammar you write depends on what you want the expressions to mean, and which atoms in the expression really get ignored.
As a practical matter, the way you write the BNF varies from tool to tool, so choosing your parser generator tool first will save you the trouble of rewriting your BNF later.
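Before committing to a full grammar, a throwaway sketch shows how little is needed to reproduce the shopping-list behavior: keep numbers and operators, drop bare words. (A sketch in Python; eval is acceptable here only because the input is trusted and filtered down to digits and operators.)

```python
import re

# Sketch: evaluate a Soulver-style line by keeping numbers/operators
# and discarding bare words. A real implementation should use a proper
# grammar, as recommended above.
def evaluate(line):
    tokens = re.findall(r"\d+(?:\.\d+)?|[+\-*/()]", line)
    return round(eval(" ".join(tokens), {"__builtins__": {}}), 2)

print(evaluate("2 * 1.99 soap + 2.99 cereal + 39.59 organic magic beans"))  # 46.56
```

This already fails on constructs like "40% off $200", which is exactly where a real grammar with an `off` operator earns its keep.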
