Why is using an AST faster than not using one? - parsing

I'm making an interpreter for my own language as a hobby project. Currently my interpreter just executes the code as it sees it. I've heard you should make the parser generate an AST from the source code. So I was wondering, how does an AST actually make things faster than just executing the code linearly, as the parser sees it?

Because then you would have to do the parsing all the time. If you have a loop for instance, you'd have to parse the commands in the loop body over and over again.
Also, I would argue that it's cleaner since you break down the problem in two distinct tasks: Deal with syntax, then deal with semantics.

It isn't specifically the "AST" that makes it faster.
It is using any data structure (AST, symbol tables, control flow graph, triples, p-codes, machine code) that caches the analysis of source code to extract its intended meaning, and as much of precomputation of the answer ("optimization"), as possible. In effect, anything that partially compiles the code, should produce programs that run faster than an interpreter of the pure text.
In interesting tradeoff: if the amount of program being executed before the execution stops isn't very big, it may actually be cheaper to execute the text, than to do any compiler-style analysis.
Given the speed of machines these days, one can sloppily compile a pretty big program in 100 milliseconds, which is about as fast as a human can react. Various versions of TurboPascal back in the 80s and 90s were pretty famous for this.

Related

Is there a sandboxable compiled programming language simililar to lua

I'm working on a crowd simulator. The idea is people walking around a city in 2D. Think gray rectangles for the buildings and colored dots for the people. Now I want these people to be programmable by other people, without giving them access to the core back end.
I also don't want them to be able to use anything other than the methods I provide for them. Meaning no file access, internet access, RNG, nothing.
They will receive get events like "You have just been instructed to go to X" or "You have arrived at P" and such.
The script should then allow them to do things like move_forward or how_many_people_are_in_front_of me and such.
Now I have found out that Lua and python are both thousands of times slower than compiled languages (I figured it would be in order of magnitude of 10s times slower), which is way to slow for my simulation.
So heres my question: Is there a programming language that is FOSS, allows me to restrict system access (sandboxing) the entire language to limit the amount of information the script has by only allowing it to use my provided functions, that is reasonably fast, something like <10x slower than Java, where I can send events to objects inside that language with which I can load in new Classes/Objects on the fly.
Don't you think that if there was a scripting language faster than lua and python, then it'd be talked about at least as much as they are?
The speed of a scripting language is rather vague term. Scripting languages essentially are converted to a series of calls to functions written in fast compiled languages. But the functions are usually written to be general with lots of checks and fail-safes, rather than to be fast. For some problems, not a lot of redundant actions stacks up and the script translation results in essentially same machine code as the compiled program would have. For other problems, a person, knowledgeable about the language, might coerce it to translate to essentially same machine code. For other problems the price of convenience stay forever with the script.
If you look at the timings of benchmark tasks, you'll find that there's no consistent winner across them. For one task the language is fastest, for the other it is way behind.
It would make sense to gauge language speed at your task by looking at similar tasks in benchmarks. So, which of those problem maps the closest to yours? My guess would be: none.
Now, onto the question of user programs inside your program.
That's how script languages came to existence in the first place. You can read up on why such a language may be slow for example in SICP.
If you evaluate what you expect people to write in their programs, you might decide, that you don't need to give them whole programming language. Then you may give them a simple set of instructions they can use to describe a few branching decisions and value lookups. Then your own very performant program will construct an object that encompasses the described logic. This tric is described here and there.
However if you keep adding more and more complex commands for users to invoke, you'll just end up inventing your own language. At that point you'll likely wish you'd went with Lua from the very beginning.
That being said, I don't think the snippet below will run significantly different in compiled code, your own interpreter object, or any embedded scripting language:
if event = "You have just been instructed to go to X":
set_front_of_me(X) # call your function
n = how_many_people_are_in_front_of_me() #call to your function
if n > 3:
move_to_side() #call to function provided by you
else:
move_forward() #call to function provided by you
Now, if the users would need to do complex computer-sciency stuff, solve np-problems, do machine learning or other matrix multiplications, then yes, that would be slow, provided someone would actually trouble themselves with implementing that.
If you get to that point, it seem that there are at least some possibilities to sandbox the compiled dlls (at least in some languages). Or you could do compilation of users' code yourself to control the functionality they invoke and then plug it in as a library.

Language lexing: better performance to lex a string all at once or individually?

I'm attempting to build my first C-like programming language, likely an interpreter and I've just made the first step aka the lexer.
I've thought about taking the lazy route by simply lexing the entire source code stream all at one then then have the parser process the data.
I've noticed that many other compilers and interpreters only lex during parsing when the parser module asks for another token.
Is it quicker in terms of code performance for a program to lex source code all at once then parse the resulting tokens or lex and parse tokens individually?
"faster" is a bit of a fuzzy word. There are different kinds of speed (latency, absolute start-to-finish duration, compile speed, execution speed), and depending on how you implement your language's front-end and backend, either approach can be faster.
Also, faster is not always better. If your parser is technically faster, but uses too much memory, it could crash or at the least end up swapping, which would slow it down again. If your parser is lightning-fast but produces inefficient code, your users will pay for your faster development speed. You'll have to write actual code and run it in a profiler to be able to tell what is really better, and come up with which criteria are important to you.
Tokenizing/Lexing everything at once at the start means you might be able to optimize memory allocation and thus take less time resizing your token list etc., but it also means the entire file has to be lexed before it can even be partially parsed.
OTOH if you parse as-needed, you may have to append to your arrays in small steps more often, so you'll pay a memory penalty, but in case of e.g. an interpreted language like JavaScript, you may only have to parse the parts that are actually used for this run-through.
So a lot of it depends on the details of your language, and the hardware you expect to be running on. In embedded systems with little memory and no swap, you may have no choice but to progressively lex, as the whole program source code might not fit in memory. If your language's syntax needs a lot of lookahead, you may not see any benefit of progressively lexing because you're reading it all anyway...

How to parse input text file concurrently using YACC parser?

and also I want to compare the result of concurrent parsing with serial parsing .
How to do this? Which one is the simplest approach?
As others have said, YACC (based on LR parsing) by itself isn't "concurrent".
(It is my understanding that YACC isn't even re-entrant, which will make it pretty hard to use it in multithreaded context no matter what you do. The non-reentrancy can presumably be overcome with mere sweat, so this is an annoyance, not a show stopper.)
One idea is to construct a pipeline, allowing a lexer to generate lexemes into a steam as fast as it can, and letting the actual parser read from the stream. This can get you at best a factor of 2. You might be able to do this with YACC relatively easily, modulo setting up communicating threads.
McKeeman et al have implemented parallel LR parsing by dividing the file into N chunks for N threads and claim to have gotten good results. The approach isn't simple, because dividing parsing of a single file into parallel chunks of about the same size, and stitching those chunks together, isn't easy. I doubt that one could easily hack up YACC to do this.
A screwy idea is parse a file from both ends toward the middle.
Its easy enough to define the backward parser grammar from the "natural" forward one: just reverse all the grammar rule content. Nothing is easy; this might introduce ambiguities not present in the forward parser. This paper combines McKeeman's idea of breaking the file into chunks with bidirectional parsing of each chunk, enabling one to find a lot of parallelism on a big file.
Easier to do is to parallelize the parsing of individual files, using whatever parsing technology you have. This does parallelize relatively well, although the parsing time for individual files may not be even, so this is best done with some kind of worklist and a team parser threads taking work from this worklist. You could probably organize to do this with YACC.
Our DMS Software Reengineering Toolkit (DMS) uses a generalization of this when reading large systems of Java sources and classes. Parsing by itself isnt very useful; you almost always need to build symbol tables, too. DMS reading Java thus parallelizes both parsing, AST building and symbol table construction. Starting from a base set of filenames, it launches parses in parallel grains (multiplexed by DMS on top of OS threads); when a parser completes, the parsed tree is handed to name resolver, which splits into a parallel grain per parallel scope encountered. Nested scopes cause a tree of grains to be generated. Further work is gated by treating resolution of a scope as a future (event); while the scope is being resolved, more Java files parse/name resolution activities may be launched; when a scope is resolved, an event is signalled and grains waiting for scope completion can then inspect the scope content to resolve their own names. The tangle of (potential) parallelism in the middle of this is almost frightening :-} but is managed by DMS's underlying parallel programming language, PARLANSE, which uses work-stealing to balance the load across the threads.
Experience shows that doing this in production with 8 cores leads to a 5x speedup over sequential for a few thousand (typical/Java implies small) source files. We think we know where the bottleneck is; there are parts of the name resolver that are more expensive than it should be in terms of excess synchronization in an attribute grammar. I suspect we can get this closer to 8. So, parallel parsing and name resolution can be pretty effective.
We don't do quite as well with C++14, because of all the dependencies of individual files on #includes that it reads, often in different preprocessor configurations.
If the input allows splitting into chunks, like, e.g. log file lines, which can be parsed in parallel, then you can parse a certain number such chunks in parallel using a producer/consumer queue and then join the parse results.

F# Workflows/Development Process

(I'm using the word "workflow" - not in the sense of async workflows - but rather in the "git workflow" sense, that is, how you use it as part of your development)
Having played around with F# for a while, I've started developing my first F# app. I'm from c#/vb. Having watched various demos/talks - rightly or wrongly- I've started off using fsi as the main development "engine" and work up stuff within that area. If I hit a problem which I need to debug, I tend to break out the problematic function into smaller bits and check those work to try and debug the problem.
However, In order to keep the amount of code manageable in fsi, once I am happy with what I have done, I the move it into a .fs and #load the .fs back into fsi. As the app gets bigger, this can begin to feel a bit clunky since when I need to refactor, I end up having to bring back in content from the fs file change it run stuff to get something working again, before pushing the code back out into the .fs file. Further this style isn't really a test first approach and so I am not getting the benefit of building a set of tests. (I can also miss the ability to set breakpoints/step the code which, istm in certain situations e.g. recursion, can be quicker for diagnosing errors than breaking out parts of a function - though maybe this is available in VS11 and I'm not setup right) .. so I think I'm perhaps not doing things optimally or else not thinking about things in the right way.
I was wondering if others could offer how they develop apps. Do you primarily use fsi or do you start with tdd. Should the tdd approach be the primary dev vehicle and FSI used more selectively to aid in the, say, implementation of more complex algorithms, data exploration etc etc
I have looked at this question and obviously it helpfully points to various tdd frameworks for F#, but I'd still be interested to find out the workflow of seasoned F# developers.
Many thx
S
I think you're on the right track.
Development process is a matter of taste. I'll share my approach anyway.
Start by a few fs files. Each file represents a module, which consists of a group of functions closely related to each other. It doesn't have to be precise from beginning; you often move stuffs between modules.
Create a few fsx files for quick testing once skeleton of the modules is ready.
Create a test project and set up NuGet packages. I often use NUnit and FsUnit together.
Whenever fsx scripting gives correct results, move them to test cases. Do this repeatedly.
Include a Program.fs into the main project and compile to executable in order to debug if needed.
In general, F# REPL is the main development engine. It gives me instant feedbacks and allows incremental changes, which are very helpful in prototyping. In F#, TDD is less critical since bug rate is much lower than in other languages. And I don't test everything, just focus on main functionalities and ensure a high test coverage. Using testdriven.net add-in or Visual Studio 2012 Premium and Ultimate can give you useful statistics on test coverage.
Using F# REPL and TDD, I almost never have to use debugging. Whenever there is a wrong behaviour, I stop and think. Since your codes don't have side effects, you can reason on them easily. In many times reasoning and a few printing commands can give me the right answer.
You can use TDD in F# REPL with Unquote and FsCheck. The former offers testing via quotations, which is quite impressive. The latter uses random testing approach which is attractive in handling corner cases of your codes. I find it really useful when your programs have to satisfy certain properties. However, it may take time to learn to use these frameworks properly.
pad gave a great answer that is very practical and useful for a person new to F#. I will give a different means so that others don't think there is only one way F#'ers do it.
Note: If you are very new to programming, then stick with pad's answer, it is much better for a new programmer.
In the Object Oriented world one thinks with objects and in such languages I would start with writing objects down on paper and working with various diagrams such as use-case, state transition, sequence diagram, etc., until I felt I had enough to start creating objects in C# cs files, fleshing out the objects with methods, properties, events, etc.
In the functional world I typically start with abstract concepts and convert them into discriminated unions (DU) in an F# fs file, skipping the use of a REPL, i.e. F# Interactive, and then start adding a few functions. After I have a few functions I set up a test project using NUnit and FsUnit via NuGet. Since the DU are abstract, the test cases are typically harder to write, so I create printers for the DU and then insert them into the test case where I capture result output from the printer in the NUnit tool, for cut and paste back into the test case making changes as necessary. See these for examples of why I don't write them from scratch by hand.
Once I have the abstract DU done, I then can move onto the code to convert the human/concrete form into the abstract DU and convert the abstract DU into human/concrete form. In some cases these would be parsers and pretty printers.
The main point I am trying to make is that I don't focus on the tools I use but on the abstract concept of the problem and bring the tools in when needed.
I will note that I also program in PROLOG and there I do start with the REPL and move the code to a store once the logic works. So I am not opposed to using a REPL, it's just a different way of approaching the problem.
EDIT
Per a request by Ken for an example.
See: Discriminated Unions (F#) and look for the section
Using Discriminated Unions Instead of Object Hierarchies
So instead of a base type of shape with inherited types of Circle, EquilateralTriangle, Square and Rectangle one would create a discriminated Union as noted:
type Shape =
| Circle of float
| EquilateralTriangle of double
| Square of double
| Rectangle of double * double
As your question would make for a much better independent question and get answers with much better detail than I can give, I would suggest you ask it.
Also if you search for info on this also search with the following substitutions for discriminated union (DU):
Algebraic data type
Generalized algebraic data type (GADT)
Tagged union
Variant
variant record
disjoint union
sum type

What's the fastest way to read a large file in Ruby?

I've seen answers to this question but I couldn't figure out which of the answers would perform the fastest. These are the answers I've seen- which is best?
Read one line at a time using each or each_line
Read one line at a time using gets
Save it all into an array of lines using readlines and then use each
Use grep (not sure what exactly to do with grep...)
Use sed (not sure what exactly to do with sed...)
Something else?
Also, would it be better to just use another language or should Ruby be fine?
EDIT:
More details: Each line contains something like "id1 attr1_1 attr2_1 id2 attr1_2 attr2_2... idn attr1_n attr2_n" (n is very big) and I need to insert those into a database. For that example line, I would need to insert n rows into the database.
Ruby will likely be using the same or very similar low-level code (written in C) to do the actual reading from disk for the first three options, so they should perform similarly. Given that, you should choose whichever is most convenient for you; the ability to do that is what makes languages like Ruby so useful! You will be reading a lot of data from disk, so I would suggest using each_line and processing each line as you read it.
I would not recommend bringing grep, sed, or any other such external utilities into the picture unless you have a very good reason, as they will make your code less portable and expose you to failures that may be difficult to diagnose.
If you're using Ruby then there's no need to worry about performance. The language is such that it suits an iterative approach to reading a file, line by line, and works very nicely. So long as you're using the language the way it's designed you can let the interpreter people worry about performance. Job done.
If one particular readLargeFileFast method is needed then it should be because it's really hindering the program somehow. Now, you write a C program to do it and popen it as a separate process within your ruby code. You could call it read_large.c and (perhaps) use command line arguments to tell it how to behave.
This is championing the idea that a scripting language is used for a fast development rather than a fast run time. As such a developer can be very productive by swiftly 'prototyping' a program in something like Ruby and only later rewriting the components warrant some low level code. Often, however, once it's working in script, it's not necessary to do anything else at all.
The Ruby Docs describe launching a separate process and treating it as a file. It's easy-peasy! A good start is The Art of Linux Programming's introductory paragraph on program modularity. This book also makes a great example of using linux's standard stream editor, called sed, which you could probably use from Ruby right now.
If you need to parse or edit a lot of text then many interpreters or editors have been written around sed's functionality. Further, it may save you a lot of effort writing something super efficient if you don't know C. Good is the Introduction to SED by Bruce Barnett.

Resources