reading/parsing common lisp files from lisp without all packages available or loading everything - parsing

I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.

I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in 2.3. So you can't easily intervene in this.
Secondly there's not portably enough information in any kind of condition the reader might signal to be able to handle them.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of intofmration.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.

I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by #Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc with ones which just make a list, which sounds less heroic than the suggestions from #tfb, but I don't have time for that shit.

Related

What's the fastest way to read a large file in Ruby?

I've seen answers to this question but I couldn't figure out which of the answers would perform the fastest. These are the answers I've seen- which is best?
Read one line at a time using each or each_line
Read one line at a time using gets
Save it all into an array of lines using readlines and then use each
Use grep (not sure what exactly to do with grep...)
Use sed (not sure what exactly to do with sed...)
Something else?
Also, would it be better to just use another language or should Ruby be fine?
EDIT:
More details: Each line contains something like "id1 attr1_1 attr2_1 id2 attr1_2 attr2_2... idn attr1_n attr2_n" (n is very big) and I need to insert those into a database. For that example line, I would need to insert n rows into the database.
Ruby will likely be using the same or very similar low-level code (written in C) to do the actual reading from disk for the first three options, so they should perform similarly. Given that, you should choose whichever is most convenient for you; the ability to do that is what makes languages like Ruby so useful! You will be reading a lot of data from disk, so I would suggest using each_line and processing each line as you read it.
I would not recommend bringing grep, sed, or any other such external utilities into the picture unless you have a very good reason, as they will make your code less portable and expose you to failures that may be difficult to diagnose.
If you're using Ruby then there's no need to worry about performance. The language is such that it suits an iterative approach to reading a file, line by line, and works very nicely. So long as you're using the language the way it's designed you can let the interpreter people worry about performance. Job done.
If one particular readLargeFileFast method is needed then it should be because it's really hindering the program somehow. Now, you write a C program to do it and popen it as a separate process within your ruby code. You could call it read_large.c and (perhaps) use command line arguments to tell it how to behave.
This is championing the idea that a scripting language is used for a fast development rather than a fast run time. As such a developer can be very productive by swiftly 'prototyping' a program in something like Ruby and only later rewriting the components warrant some low level code. Often, however, once it's working in script, it's not necessary to do anything else at all.
The Ruby Docs describe launching a separate process and treating it as a file. It's easy-peasy! A good start is The Art of Linux Programming's introductory paragraph on program modularity. This book also makes a great example of using linux's standard stream editor, called sed, which you could probably use from Ruby right now.
If you need to parse or edit a lot of text then many interpreters or editors have been written around sed's functionality. Further, it may save you a lot of effort writing something super efficient if you don't know C. Good is the Introduction to SED by Bruce Barnett.

File Path Name or URL analysis

I am looking for information on tools, methods, techniques for analysis of file path names. I am not talking file size, read/write times, or file types, but analysis of the path or URL it self.
I am only aware of basic word frequency text tools or methods, but I am wondering if there is something more advanced that people use/apply to this to try and mine extra information out of them.
Thanks!
UPDATE:
Here is the most narrow example of what I would want. OK, so I have some full path names as strings like this:
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File1.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File2.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File3.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File4.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File5.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File1.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File2.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File3.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File4.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File5.doc
What I want to know is that the folder MapShedMaps appears "uniquely" 2 times. If I do frequency on the strings I would get 10 appearances. The issues is that I don’t know what level in the directory this is important, so I would like a unique count at each level of the directory based on what I am describing.
This is an extremely broad question so it is difficult for me to give you a per say "Answer" but I will give you my first thoughts on this.
First,
the Regular expression class of .NET is extremely useful for parsing large amounts of information. It is so powerful that it will easily confuse the impatient, however once mastered it can be used across text editors, .NET and pretty much any other respectable language I believe. This would allow you to search strings and separate it into directories. This could be overkill depending on how you use it, but its a thought. Here is a favorite link of mine to try out some regular expressions.
Second,
You will need a database, I prefer to use SQL. Look into how to connect to databases and create databases. With this database you can store all the fields abstracted from your original path entered. Such as a parent directory, child directory, common file types accessed. Just have a field for each one of these and through queries you can form a hypothesis as to redundancy.
Third,
I don't know if its easily accessible but you might look into whether windows stores accessed file history. It seems to have some inkling as to which files have been opened in the past. So there may be a resource in windows which already stores much of the information you would be storing in your database. If you could find a way to access this information. Parse it with regular expressions and resubmit it to the database of your application. You could control the WORLD! j/k... You could get a pretty good prediction as to user access patterns though.
Fourth,
I always try to stick with what I have available. If .NET is sitting in front of you, hammer away at what your trying to do. If you reach a wall. At least your making forward progress. In today's motion towards object orientated programming, you can usually change data collected by one program into an acceptable format for another. You just gotta dig a little.
Oh and btw, Coursera.com is actually doing a free class on machine learning and algorithms. You might want to check it out or reference it for prediction formulas.
Good Luck.
I wanted to post this as a comment but SO kept editing the double \ to \ and it is important there are two because \ is a key character, without another \ to escape it, regex will interpret it as a command.
Hey I just wanted to let you know I've been playing with some regex... I know a pretty easy way to code this up in VB.net and I'll post that as my second answer but I wanted you to check out back-references. If the part between parenthesis matches it captures that text and moves on to the second query for instance....
F:\\(directory1)?(directory2)?(directory3)?
You could use these matches to find out how many directories each parent directory has under it. Are you following me? Here is a reference.

Relationship between parsing, highlighting and completion

For some time now I've been thinking about designing a small toy language from scratch, nothing that will "Rule The World", but mostly as an exercise. I realize there is a lot to learn in order to accomplish this.
This question is about three different concepts (parsing, code highlighting and completion) that strike me as extremely similar. Of course, parsing and ASTgen is part of the compilation, while code highlighting and completion is more of a feature of the IDE, yet I wonder what are the similarities and differences.
I need some hints from someone more experienced in this topic. What code can be shared between these concepts and what are the architecture considerations that could help in this sense?
What you want is a syntax-directed structure editor. This is one that combines parsing with AST building and uses the parser to predict what you can type next (either syntax completion), or has a tie to the compiler's last run, so that it can interpret the edit point to see what valid identifiers might come next by inspecting the compiler's symbol table that was last relevant at that point in the code.
The most difficult part is offering the user a seamless experience; she pretty much has to believe she is editing text or (experience with structure editors shows) she will reject it as awkward.
This is a lot of machinery to coordinate and quite a big effort. The good news is that you need a parser anyway for the compiler; if editing also parses, the AST needed by the compiler is essentially available. (Of course you have to worry about batch compiling, too). The compiler has to build a symbol table; so you can use that in the editing completion process. The more difficult news is that the parsers are a lot harder to build; they can't just declare a user-visible syntax error and quit; rather they have to be tolerant of a number of errors extant at the same moment, hold partial ASTs for the pieces, and stitch them together as the errors are removed by the user.
The Berkeley Harmonia people are doing good work in this area. It is well worth your trouble to read some of their papers to get a detailed sense of the problems and one approach to handling them.
THe other major approach people (notably Intentional Programming and XText) seem to be trying are object-oriented editors, where you attach editing actions to each AST node, and associate every point on the screen with an AST node. Then editing actions invoke AST-node specific actions (insert-character, go right, go up, ...) and it can decide how to act and how to modify the screen. Arguably you can make these editors do anything; its a little harder in practice. I've used these editors; they don't feel like text editors. There are some enthusiastic users, but YMMV.
I think you probably ought to choose between trying to build such an editor, vs. trying to define a new langauge. Doing both at once is likely to overwhelm you with troubles.

Parsing commands from user input

This question is intended to be a discussion of people's personal opinions in handling user input.
This portion of the project that I am working on handles user input in a manner similar to an IRC chat. For instance, there are set commands and whatnot, for chatting, executing actions, etc.
Now, I have several options to choose from for parsing this input. I could go with regular expressions, I could parse it directly (ie a large switch statement with all supported commands, simply checking the first x number of characters in the user input), or could even go crazy and add in a parser similar to Flex/Bison implementations. One other option I was considering was defining all commands in an XML file to separate them from the code implementation.
So, what are the thoughts of the community?
I'd go with a nice mixed bag of all.
Obviously you'll have to sanitize the input. Make sure there's no nasty stuff there, depending on where the input is going to prevent SQL injection, XSS, CSRF etc...
But when your input is clean, you could go with a regexp that catches the ones intended as command and gets all necessary submatches (command parameters etc.) and then have some sort of dispatcher-switch statement or similar.
There really is no cover-all best practice here, apart from always always and quadruple-always making sure user input is sanitized. Apart from that, go with what seems to fit best for your case.
Obviously there are those that say if you've got a problem and you're thinking of using reg exps to solve said problem, you've got two problems, but used cautiously, they're the best thing ever. Just remember that regexp-monsters can read to really poor readability really quick.

Suggestions on how to make a configurable parser

I want to build a parser for a C like language. The interesting aspect about it is that I want to build it in such a way that someone who has access to the source can easily modified it to extend the language (a new expression type of instance) with the extensions being runtime configurable (they can be turned on and off).
My current intent is to build a recursive decent parser as an object. Each production will be a method of an object. The method of extension will be to derive classes from this base replacing methods (and production definitions) as needed. I'm still trying to figure out how to mix and match extensions. One idea is to play games with the v-tbl. Objects would be constructed with a v-tbl that is a copy of the base but with methods replaced from derived classes.
Aside from the bit-twiddling nature of the solution the only issues I have with it is
a reasonable way to do the v-tbl mixup
what to do when 2 extensions alter the same productions (as most replacements will end up calling the original having one replacement call the other would work but the mechanics of setting this up are the issue)
how to allow the extension of extensions (this might end up looking like a standard MI system, but I've never got how they work)
Another solution (a slightly more mundane version of the same same approach) would be to use static member variables to store function-pointers and call them for the same effect.
Edit: I have already built a system that lets me build productions from BNF definitions. I can alter it to support whatever I decide on.
These are some of the challenges the Perl 6 design effort has faced. You may find it worthwhile looking into some of the solutions they came up with. Or you may find that to be gross overkill.
I made a configurable parser I uploadei it some time ago at
http://code.google.com/p/compparser/
The project there is not up-to-date but is working fine.
If I recall my university courses correctly, recursive descent parsers have some limitations that might bite you, especially since you're allowing extensions - somebody elses language extension could cause issues.
A proper compiler toolkit - such as the open source ANTLR - might make things easier, and might also provide some different approaches for you.
another option is to express the parsing rules in XML or something, instead of in code; less efficient, but far more dynamically configurable; each language or variant can just use its own (XML) file, and even include/reference other files as 'base' files...
Frankly, I am not even sure I understood everything you wrote... :-)
But when I see parser and flexibility, I think about LPeg - Parsing Expression Grammars For Lua. It might not fit your needs but it is well worth a look... ;-)

Resources