Parsing commands from user input - parsing

This question is intended to be a discussion of people's personal opinions in handling user input.
This portion of the project that I am working on handles user input in a manner similar to an IRC chat. For instance, there are set commands and whatnot, for chatting, executing actions, etc.
Now, I have several options to choose from for parsing this input. I could go with regular expressions, I could parse it directly (ie a large switch statement with all supported commands, simply checking the first x number of characters in the user input), or could even go crazy and add in a parser similar to Flex/Bison implementations. One other option I was considering was defining all commands in an XML file to separate them from the code implementation.
So, what are the thoughts of the community?

I'd go with a nice mixed bag of all.
Obviously you'll have to sanitize the input. Make sure there's no nasty stuff there, depending on where the input is going to prevent SQL injection, XSS, CSRF etc...
But when your input is clean, you could go with a regexp that catches the ones intended as command and gets all necessary submatches (command parameters etc.) and then have some sort of dispatcher-switch statement or similar.
There really is no cover-all best practice here, apart from always always and quadruple-always making sure user input is sanitized. Apart from that, go with what seems to fit best for your case.
Obviously there are those that say if you've got a problem and you're thinking of using reg exps to solve said problem, you've got two problems, but used cautiously, they're the best thing ever. Just remember that regexp-monsters can read to really poor readability really quick.

Related

reading/parsing common lisp files from lisp without all packages available or loading everything

I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in 2.3. So you can't easily intervene in this.
Secondly there's not portably enough information in any kind of condition the reader might signal to be able to handle them.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of intofmration.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by #Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc with ones which just make a list, which sounds less heroic than the suggestions from #tfb, but I don't have time for that shit.

How to check for parameter presence: params.require(:data) or params[:data].present? and why

What is the best way of checking for parameter presence inside of a controller action:
params.require(:data) # will raise an ActionController::ParameterMissing error if the parameter is not found in the request
params[:data].present? # check if something is in the parameter
Which should be the preferred way, and why?
The first way will throw an error, which will need to be handled. The second way can be used in a conditional. Both ways could be used to enable the controller action to return an error message that the parameter is missing.
It seems to me that you may have a slight misunderstanding in the purpose of params.require.
If your aim is to simply check for the existance of a certain parameter to organise the flow of code, then yes always use .present?
The core aim of .require and Strong Params is not to check for existance of a parameter or not, but to allow for the creation of a whitelist for your parameters, and raise an exception when any parameter not specifically whitelisted attempts to pass, or an essential parameter is missing. a very different purpose altogether. I suggest a look at StrongParameters api.
The general aim for StrongParams and .require / .permit is to protect your database, they are used to protect your data when mass-assigning to a model. As I am sure you are aware with an in production project, your (or your customers) data is everything, and NOT using a whitelist for parameters is an extremely irresponsible thing to be doing.
EDIT
It seems my point may have been taken the wrong way due to the bulk of it talking about StrongParams, so let me explain a little more.
What i was hoping you would understand through the explanation of the purpose of strong params, is that it is not a viable choice for use in a mere presence check. The other answers have also stated that they recommend the use of .present?, however the reasoning is not a solid foundation for why this is better.
Code readability is of course important, but code readability has nothing to do with handling errors, you can very easily write code that is readable and handles exceptions. therefore code readability shouldn't even come into the equation, because that is just based on how well you can write your exception handling logic.
And as for not wanting errors on a production app, that is a completely counter intuitive point, because the whole purpose of throwing and handling exceptions is exactly for that reason, to stop errors occurring on a production app, avoiding silent logic that may have small unexpected bugs and could cause problems down the line.
The core principle as to why .present? is the obvious choice is in the question itself:
How to check for parameter presence
All code is written for a purpose, to solve a given problem, or to provide functionality suitable for a given need. So what should be considered is what is the purpose of .present? and what is the purpose .require. The purpose of .present? is fairly simple, checking for presence and returning a Boolean result in either case (meaning you can put it together with other tests using ||, not possible with .require), it fulfills the needs of the question.
Now lets take a look at .require, which is a feature from ActionController::StrongParameters module. A quick look of the API states the purpose of the StrongParameters module:
It provides an interface for protecting attributes from end-user assignment.
Is this the purpose that you are seeking, now of course the question you asked is quite a broad one, and its difficult to read the intent from it, but from the bare question "I want to check parameter presence", we can conclude that .require or any StrongParameters features is not purposed for this.
Whew! that was long, sorry for the mini Essay there, but I want to bring focus to the fact that, while the other answers bring some topics are valid and should be considered, (not just in this question but in all code). I hope to bring to your attention that there is WAY more to consider than simple code readability or not wanting to handle errors, which can both be solved very easily in a variety of solutions. Even small simple decisions like this should be considered with a perspective that has much more core principles of programming in mind as well.
Long story short: I can hammer a nail in with the handle of a screwdriver, but that is not its purpose, and if I keep doing it i'm eventually gona have a bad time.
I believe you answered your own question. If you prefer to handle exceptions, use require. If you like to use conditions instead, use present?.
For me the second way is preferred cause it reduces the chance that my app will crash on production (if I failed to handle exceptions properly).
It depends on whether you want to raise an error in regards to if a parameter is present or not.
I would personally say to go for .present in the sense that you could always use it in a conditional, and allows for better code readability, vs. alternative of the messy, and complicated nature of handling errors.
Even under the hypothetical assumption that the parameter was not present, you could still render another page should you choose. IMO, in the specific case where you have a simply looking for the presence of an error, really don't see a reason why you'd want the application to throw an error vs. false. Even the worse-case hypothetical situation of a parameter not being present when using .present? can be appropriately handled by a simple if and else statement:
if (some value).present?
render("some_page")
else
render("error_page")
end

File Path Name or URL analysis

I am looking for information on tools, methods, techniques for analysis of file path names. I am not talking file size, read/write times, or file types, but analysis of the path or URL it self.
I am only aware of basic word frequency text tools or methods, but I am wondering if there is something more advanced that people use/apply to this to try and mine extra information out of them.
Thanks!
UPDATE:
Here is the most narrow example of what I would want. OK, so I have some full path names as strings like this:
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File1.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File2.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File3.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File4.doc
F:\Task_Order_Projects\TO_01_NYS\Models\MapShedMaps\Random_File5.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File1.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File2.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File3.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File4.doc
F:\Task_Order_Projects\TO_02_NYS\Models\MapShedMaps\Random_File5.doc
What I want to know is that the folder MapShedMaps appears "uniquely" 2 times. If I do frequency on the strings I would get 10 appearances. The issues is that I don’t know what level in the directory this is important, so I would like a unique count at each level of the directory based on what I am describing.
This is an extremely broad question so it is difficult for me to give you a per say "Answer" but I will give you my first thoughts on this.
First,
the Regular expression class of .NET is extremely useful for parsing large amounts of information. It is so powerful that it will easily confuse the impatient, however once mastered it can be used across text editors, .NET and pretty much any other respectable language I believe. This would allow you to search strings and separate it into directories. This could be overkill depending on how you use it, but its a thought. Here is a favorite link of mine to try out some regular expressions.
Second,
You will need a database, I prefer to use SQL. Look into how to connect to databases and create databases. With this database you can store all the fields abstracted from your original path entered. Such as a parent directory, child directory, common file types accessed. Just have a field for each one of these and through queries you can form a hypothesis as to redundancy.
Third,
I don't know if its easily accessible but you might look into whether windows stores accessed file history. It seems to have some inkling as to which files have been opened in the past. So there may be a resource in windows which already stores much of the information you would be storing in your database. If you could find a way to access this information. Parse it with regular expressions and resubmit it to the database of your application. You could control the WORLD! j/k... You could get a pretty good prediction as to user access patterns though.
Fourth,
I always try to stick with what I have available. If .NET is sitting in front of you, hammer away at what your trying to do. If you reach a wall. At least your making forward progress. In today's motion towards object orientated programming, you can usually change data collected by one program into an acceptable format for another. You just gotta dig a little.
Oh and btw, Coursera.com is actually doing a free class on machine learning and algorithms. You might want to check it out or reference it for prediction formulas.
Good Luck.
I wanted to post this as a comment but SO kept editing the double \ to \ and it is important there are two because \ is a key character, without another \ to escape it, regex will interpret it as a command.
Hey I just wanted to let you know I've been playing with some regex... I know a pretty easy way to code this up in VB.net and I'll post that as my second answer but I wanted you to check out back-references. If the part between parenthesis matches it captures that text and moves on to the second query for instance....
F:\\(directory1)?(directory2)?(directory3)?
You could use these matches to find out how many directories each parent directory has under it. Are you following me? Here is a reference.

Relationship between parsing, highlighting and completion

For some time now I've been thinking about designing a small toy language from scratch, nothing that will "Rule The World", but mostly as an exercise. I realize there is a lot to learn in order to accomplish this.
This question is about three different concepts (parsing, code highlighting and completion) that strike me as extremely similar. Of course, parsing and ASTgen is part of the compilation, while code highlighting and completion is more of a feature of the IDE, yet I wonder what are the similarities and differences.
I need some hints from someone more experienced in this topic. What code can be shared between these concepts and what are the architecture considerations that could help in this sense?
What you want is a syntax-directed structure editor. This is one that combines parsing with AST building and uses the parser to predict what you can type next (either syntax completion), or has a tie to the compiler's last run, so that it can interpret the edit point to see what valid identifiers might come next by inspecting the compiler's symbol table that was last relevant at that point in the code.
The most difficult part is offering the user a seamless experience; she pretty much has to believe she is editing text or (experience with structure editors shows) she will reject it as awkward.
This is a lot of machinery to coordinate and quite a big effort. The good news is that you need a parser anyway for the compiler; if editing also parses, the AST needed by the compiler is essentially available. (Of course you have to worry about batch compiling, too). The compiler has to build a symbol table; so you can use that in the editing completion process. The more difficult news is that the parsers are a lot harder to build; they can't just declare a user-visible syntax error and quit; rather they have to be tolerant of a number of errors extant at the same moment, hold partial ASTs for the pieces, and stitch them together as the errors are removed by the user.
The Berkeley Harmonia people are doing good work in this area. It is well worth your trouble to read some of their papers to get a detailed sense of the problems and one approach to handling them.
THe other major approach people (notably Intentional Programming and XText) seem to be trying are object-oriented editors, where you attach editing actions to each AST node, and associate every point on the screen with an AST node. Then editing actions invoke AST-node specific actions (insert-character, go right, go up, ...) and it can decide how to act and how to modify the screen. Arguably you can make these editors do anything; its a little harder in practice. I've used these editors; they don't feel like text editors. There are some enthusiastic users, but YMMV.
I think you probably ought to choose between trying to build such an editor, vs. trying to define a new langauge. Doing both at once is likely to overwhelm you with troubles.

TFS Check in rules

Hi I'm out of setting up an TFS server and I want to set some check-in rules.
I for example want to be able to set rules about method lenght, complexity and so on, I found NDepend very convenient can I somehow use NDepend to run some rules on the files trying to check in.
I also want to be able to bypass the rules sometimes.
Are there any blogs or discussions around this, if it wont work with NDepend are there any other tools or ways I can use?
I would be very careful about this. I worked at a place once that had strict method length rules. If Calculate(a,b,c) ended up 1.5 times the limit length, the devs would just move the last third of the function into Calculate2() and call it from Calculate(). All the active locals would become parameters, of course - sometimes there would be a dozen of them. The resulting mess passed the automated tests for method length but were definitely not better or more maintainable than the long methods would have been.
Would it have been nice if the devs had spotted something refactorable in the middle of the method, pulled it out and given it a good name? Yes it would. But systems are all game-able, and the sorts of "dammit I just want to check in and go home" changes that are made to comply with method length rules (among others) make the code worse. A lot worse.
Also to bypass the rules there's a way on the checkin to say you are bypassing and why.

Resources