I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in 2.3. So you can't easily intervene in this.
Secondly there's not portably enough information in any kind of condition the reader might signal to be able to handle them.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of information.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
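As a rough illustration of the second option (a primitive token reader that only wraps strings), here is a minimal sketch written in Python rather than as a Lisp readtable. The names `Token` and `read_all` are made up for this example. It handles only parentheses, double-quoted strings and `;` line comments; block comments, `|...|` symbols, character literals and reader macros are deliberately left out, so text such as `#+sbcl` or `'x` simply comes back as an opaque token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """An uninterpreted token: no package lookup, no number parsing."""
    text: str


def read_all(src: str) -> list:
    """Read every top-level form in src into nested lists of Token and str."""
    stack = [[]]  # stack[-1] is the list currently being filled
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch == ";":  # line comment: skip to end of line
            while i < n and src[i] != "\n":
                i += 1
        elif ch == "(":
            stack.append([])
            i += 1
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError(f"unbalanced ')' at offset {i}")
            finished = stack.pop()
            stack[-1].append(finished)
            i += 1
        elif ch == '"':  # string literal with backslash escapes
            j, chars = i + 1, []
            while j < n and src[j] != '"':
                if src[j] == "\\":
                    j += 1  # take the escaped character literally
                if j < n:
                    chars.append(src[j])
                j += 1
            if j >= n:
                raise ValueError("unterminated string")
            stack[-1].append("".join(chars))
            i = j + 1
        else:  # anything else: grab the raw token text, colons and all
            j = i
            while j < n and not src[j].isspace() and src[j] not in '();"':
                j += 1
            stack[-1].append(Token(src[i:j]))
            i = j
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]
```

Because nothing is interned, `woo.ev.tcp:start-listening-socket` is just `Token("woo.ev.tcp:start-listening-socket")`, and the missing-package problem never comes up. The price is the one named above: numbers, keywords and symbols are all indistinguishable until something downstream interprets them.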
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
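For anyone wanting to try the same thing, a substitution along these lines might be a starting point. This is a sketch, not the exact expression used above: the pattern, the function name `hide_package_markers` and the choice of `====` for a double colon are all assumptions. It does not skip strings, comments or character literals, so a URL inside a string would be rewritten too.

```python
import re

# A package-qualified symbol looks like package:name or package::name.
QUALIFIED = re.compile(
    r"(?<![\w.\-+*/:#])"      # do not start in the middle of another token
    r"([A-Za-z][\w.\-+*/]*)"  # package name, e.g. woo.ev.tcp
    r"(::?)"                  # one or two package markers
    r"(?=[^\s:()])"           # a symbol name must follow
)


def hide_package_markers(source: str) -> str:
    """Turn pkg:sym into pkg===sym (and pkg::sym into pkg====sym).

    Keywords such as :port are left alone because the pattern requires
    at least one package-name character before the colon.
    """
    def replace(match: re.Match) -> str:
        marker = "====" if match.group(2) == "::" else "==="
        return match.group(1) + marker

    return QUALIFIED.sub(replace, source)
```

After the rewrite the reader sees a single unqualified symbol, so no package needs to exist; the original spelling can be restored later by reversing the substitution on the dumped output.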
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by @Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc. with ones which just make a list, which sounds less heroic than the suggestions from @tfb, but I don't have time for that shit.
We have a huge code base, and we are generating issues that would have been caught at compile time in typed languages such as Java, but we are not catching them until runtime in Ruby. This is bad, since most of the bugs we generate are typos or refactorings that leave some invalid code behind.
Example:
def mysuperfunc
  # some code goes here
  # this was a valid call, but not anymore since the enforcesecurity
  # signature changed
  @system.enforcesecurity
end
I mean, IDEs can do it, but some guys use Atom or Sublime, so we need something to "compile" and report those kinds of issues so they don't reach deployment. What have you been using?
This is generating a small percentage of our bug reports, but since we are forced to produce at a ridiculous pace we don't have 100% code coverage. If there is no tool to help, I'll just make sure everybody uses an IDE and run the reports with tools such as RubyMine.
Our stack includes RSpec, minitest, and SimpleCov. We enforce code reviews and multistack deployments (dev, qa, pre-prod, sandbox, prod). And still some issues reach the higher levels and make us programmers look bad. I'm not looking for magic, just a little automation that might help a bit.
Unfortunately, the Halting Problem, Rice's Theorem, and all the other Undecidability and Uncomputability Results tell us that it is simply impossible in the general case to statically determine any "interesting" property about the runtime behavior of a program. We cannot even statically determine something as simple as "will it halt", so how are we going to determine "is bug-free"?
There are certain things that can be statically determined, and there are certain restricted programs for which some interesting properties can be statically determined, but largely, this is not possible. And even to the small extent that it is possible, it generally requires the language to be specifically designed to be easy to statically analyze (which Ruby isn't).
That being said, there are certain tools that contain certain heuristics to point out code that may have problems. There are certain coding standards that may help avoid bugs, and there are tools to enforce those coding standards. Keywords to search for are "code quality tools", "linter", "static analyzer", etc. You have already been given examples in the other answers and comments, and given those examples and these keywords, you'll likely find more.
However, I also wanted to discuss something you wrote:
we are forced to produce at a ridiculous pace we don't have 100% code coverage
That's a problem, which has to be approached from two sides:
Practice, practice, practice. You need to practice testing and writing high-quality code until it is so natural to you that not doing it actually ends up being harder and slower. It should become second nature to you, such that under pressure when your mind goes blank, the only thing you know is to write tests and write well-designed, well-factored, high-quality code. Note: I'm talking about deliberate practice, which means setting time aside to really practice … and practice is practice, it's not work, it's not fun, it's not a hobby; if you don't delete the code you wrote immediately after you have written it, you are not practicing, you are working.
Sustainable Pace. You should never develop faster than the pace you could sustain indefinitely while still producing well-tested, well-designed, well-factored, high-quality code, having a fulfilling social life, no stress, plenty of free time, etc. This is something that has to be backed and supported and understood by management.
I'm unaware of anything exactly like what you want. However, there are a few gems that will analyze code and warn you about some errors and/or bad practices. Try these:
https://github.com/bbatsov/rubocop
https://github.com/railsbp/rails_best_practices
FLAY
https://rubygems.org/gems/flay
Via the repo https://github.com/seattlerb/flay:
DESCRIPTION:
Flay analyzes code for structural similarities. Differences in literal
values, variable, class, method names, whitespace, programming style,
braces vs do/end, etc are all ignored. Making this totally rad.
[FEATURES:]
Reports differences at any level of code.
Adds a score multiplier to identical nodes.
Differences in literal values, variable, class, and method names are ignored.
Differences in whitespace, programming style, braces vs do/end, etc are ignored.
Works across files.
Add the flay-persistent plugin to work across large/many projects.
Run --diff to see an N-way diff of the code.
Provides conservative (default) and --liberal pruning options.
Provides --fuzzy duplication detection.
Language independent: Plugin system allows other languages to be flayed.
Ships with .rb and .erb. javascript and others will be available separately.
Includes FlayTask for Rakefiles.
Uses path_expander, so you can use:
dir_arg -- expand a directory automatically
@file_of_args -- persist arguments in a file
-path_to_subtract -- ignore intersecting subsets of files/directories
Skips files matched via patterns in .flayignore (subset format of .gitignore).
Totally rad.
FLOG
https://rubygems.org/gems/flog
Via the repo https://github.com/seattlerb/flog:
DESCRIPTION:
Flog reports the most tortured code in an easy to read pain report.
The higher the score, the more pain the code is in.
[FEATURES:]
Easy to read reporting of complexity/pain.
Uses path_expander, so you can use:
dir_arg -- expand a directory automatically
@file_of_args -- persist arguments in a file
-path_to_subtract -- ignore intersecting subsets of files/directories
SYNOPSIS:
% ./bin/flog -g lib
Total Flog = 1097.2 (17.4 flog / method)
323.8: Flog total
85.3: Flog#output_details
61.9: Flog#process_iter
53.7: Flog#parse_options
...
There is a Ruby gem called Guard that does automated testing. You can set your own custom rules.
For example, you can set it up so that any time you modify certain files, the test framework runs automatically.
Here is the link for guard
So, I've been working on a new project at work, and today a coworker brought up the idea that my exceptions and even returned error messages should be completely localized. I thought maybe that was a good idea, but he said that I should only return error codes. I personally don't like the error code idea much, as it tends to make other programmers either:
Reuse error codes where they don't fit, because they don't want to add another one
Use the wrong error codes, as there can get to be so many defined.
So my question is: what does everyone else do to handle this situation? I'm open to all sorts of suggestions, including from those who think error codes are the way to go.
There may be cultural differences, depending on your programming language.
In Java, for example, numerical error codes are not used much.
Concerning exceptions, I believe they are just a technical tool.
What is important is whether your message is targeted at a user or a developer.
For a user, localizing messages is important if several languages are involved, or if you need to change the messages without recompiling (to customize between clients, to adapt to changing user needs, and so on).
In my projects, our culture is to use (java) enums to handle all collections of fixed values.
Errors are no different.
Enums for errors could provide :
strong typing (you can't pass something else to a method that expects an error code)
simple localisation (a simple utility method can automatically find the message corresponding to each one, using for example a "SimpleClassName"."INSTANCE_NAME" pattern; you could also expose a getMessage() method on each enum that delegates the implementation to your utility method)
verification of your localized files (your unit tests could loop over the code and the files for each language, and find all unmatched values)
error level functionality (we use the same levels as for logging: fatal, error, warn; the logging decisions are then very easy to implement).
easy discovery of the appropriate error by other developers: we use several enums (possibly in the same package), classifying the errors according to their technical or functional domain.
To address your two concerns:
Adding an error only requires adding an instance to an enum and a message to the localisation file (and the tests can catch the latter if it is forgotten).
With the classification into several enums, and possibly the javadoc, developers are guided to the correct error.
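The enum approach described above is Java-specific, but the shape carries over to other languages. Here is a minimal sketch in Python, with hypothetical names throughout (`StorageError`, `Level`, `MESSAGES`, `missing_translations`) and an in-memory dictionary standing in for real localisation files. The numeric code in each member exists only to keep the enum values distinct.

```python
from enum import Enum


class Level(Enum):
    FATAL = "fatal"
    ERROR = "error"
    WARN = "warn"


# Stand-in for per-language message files, keyed "<EnumClassName>.<MEMBER_NAME>".
MESSAGES = {
    "en": {
        "StorageError.DISK_FULL": "The disk is full.",
        "StorageError.FILE_MISSING": "The file could not be found.",
    },
    "fr": {
        "StorageError.DISK_FULL": "Le disque est plein.",
        # FILE_MISSING deliberately left untranslated to show the check below.
    },
}


class StorageError(Enum):
    """One enum per technical or functional domain."""
    DISK_FULL = (1, Level.FATAL)
    FILE_MISSING = (2, Level.ERROR)

    def __init__(self, code, level):
        self.code = code
        self.level = level

    @property
    def key(self):
        return f"{type(self).__name__}.{self.name}"

    def message(self, lang="en"):
        return MESSAGES[lang][self.key]


def missing_translations(enum_cls, lang):
    """Return the message keys of enum_cls that have no entry for lang."""
    catalog = MESSAGES.get(lang, {})
    return [member.key for member in enum_cls if member.key not in catalog]
```

A unit test that asserts `missing_translations(StorageError, lang) == []` for every supported language plays the role of the verification step mentioned in the list.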
I wouldn't be using error codes for localization. There may be good reasons to use error codes (e.g. to be able to test which specific kind of error occurred), but localization is not one of those reasons. Instead, use the same framework that you use for the rest of the message localization also for exceptions. E.g. if you use gettext everywhere else, also use it in exceptions. That will make life easier for the translator.
You can include an error code in an exception, thereby getting the best of both.
One frequent cause of error with old-style function-return error codes was failure to check the error code before continuing with subsequent code. An exception cannot be implicitly ignored. Eliminating a source of error is a good thing.
An error code allows:
Calling code to distinguish between different kinds of errors.
Messages to be constructed by UI components when errors occur in non-UI, non-localized code.
A list of errors in the code that may be useful when writing user documentation or troubleshooting guides.
A few guidelines I have found useful:
Only UI architectural layers construct and localize messages to the user.
When non-UI layers communicate errors via exceptions, those may optionally carry error codes and additional fields useful to the UI when constructing a message.
In Java, when error codes are defined in a UI layer Enum, the error message format strings can be accessed through the enumerators themselves. They may contain symbols for additional data carried in the error.
In Java, include argument index in format specifiers in the original language's format strings, so that translators can move dynamic data around in localized messages.
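To make the layering concrete, here is a small sketch in Python (the guidelines above are phrased for Java, where the indexed specifier would be written `%1$s`). All names are hypothetical: `AppError`, `FORMATS`, `render` and the `QUOTA_EXCEEDED` code. The non-UI layer raises an exception carrying only a code and raw data; the UI layer picks the format string for the user's language and fills it in. Indexed placeholders let the translation reorder the arguments.

```python
class AppError(Exception):
    """Raised by non-UI code: carries an error code and data, no user-facing text."""

    def __init__(self, code, *args):
        super().__init__(code, *args)
        self.code = code
        self.data = args


# UI-layer format strings. Placeholders carry an argument index so that
# translators can move the dynamic data around.
FORMATS = {
    "en": {"QUOTA_EXCEEDED": "{0} has used {1} of {2} MB."},
    "de": {"QUOTA_EXCEEDED": "Von {2} MB hat {0} bereits {1} belegt."},
}


def render(error, lang):
    """UI-layer helper: build the localized message for an AppError."""
    return FORMATS[lang][error.code].format(*error.data)
```

Calling code can still branch on `error.code` to tell error kinds apart, and the keys of `FORMATS` double as the list of errors for documentation.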
I have a requirement to parse a huge text file and send parts of this file to be added as separate rows in Content Manager. What is the best way of parsing the file and then updating the DB?
I would also need to identify certain tokens within this text file.
Please suggest what language should I use to code this requirement.
Thanks
All widely used programming languages can do that, though scripting languages (especially Perl) may be better suited to the task than others. However, your personal experience is a bigger factor: using the language you're most familiar with would probably be best, unless you have specific reasons not to use it, or to use a different language.
A classic problem when working with large files is just reading them in the first place. A lot of standard libraries tend to want to read the entire file into memory / array. However for really large files this is usually not practical.
For whatever language you end up choosing, look over the file I/O libraries carefully and select a method that will allow you to read in the file in chunks. Then run your parsing logic over the chunks and, when you get to the end of a chunk, read in the next. Be careful with the parsing logic: it can be tricky to handle a chunk that ends in a place your parsing is not expecting.
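A minimal sketch of that chunked approach in Python, assuming the records in the file are separated by a known delimiter (a newline here). The function name `iter_records` and the default chunk size are arbitrary choices for the example. The tricky case described above, a chunk ending in the middle of a record, is handled by carrying the incomplete tail over into the next chunk.

```python
import io
from typing import BinaryIO, Iterator


def iter_records(
    stream: BinaryIO, chunk_size: int = 64 * 1024, sep: bytes = b"\n"
) -> Iterator[bytes]:
    """Yield sep-delimited records while holding only one chunk in memory."""
    leftover = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pieces = (leftover + chunk).split(sep)
        # The last piece may be an incomplete record: keep it for the next round.
        leftover = pieces.pop()
        yield from pieces
    if leftover:
        yield leftover  # final record with no trailing separator


if __name__ == "__main__":
    # Tiny in-memory demonstration; a real call would pass open(path, "rb").
    demo = io.BytesIO(b"first\nsecond\nthird")
    for record in iter_records(demo, chunk_size=4):
        print(record)
```

Each yielded record can then be tokenized and written to the database, ideally in batches rather than one row per round trip.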
Additionally a double buffer system sometimes works well. Process one chunk and when you get near the end, you fill the other buffer with the next chunk. If your parsing is CPU intensive, you might even look at filling a buffer on another thread to overlap the file I/O with the parsing. However, I wouldn't do this first. Start with just getting the logic working before any performance optimizations.
Without more detailed requirements it's difficult to suggest a particular language. Certainly no language is going to magically solve the problem of parsing such a big file. Depending on the format of the file, there might be a parsing library particularly suited to the job, which might guide your choice of language.
If by "Content Manager" you mean Microsoft Content Manager Server I guess one of the Microsoft languages such as C# or VB.Net might be a better choice.
So my answer would be to pick one of the languages you already know, probably the one you know best.
I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. i.e. various programs that produce it use slightly different formats, it may have been printed out and OCR'd back in (yeah, I know) with errors, etc. so I need to use heuristics that guess how it was produced and apply different quirks modes, etc. It's frustrating, because I'm somewhat familiar with the theory and practice of parsing if things are well behaved, and there are nice parsing frameworks etc. out there, but the unreliability of the data has led me to write some very sloppy ad-hoc code. It's OK at the moment but I'm worried that as I expand it to process more variations and more complex data, things will get out of hand. So my question is:
Since there are a fair number of existing commercial products that do related things ("quirks modes" in web browsers, error interpretation in compilers, even natural language processing and data mining, etc.) I'm sure some smart people have put thought into this, and tried to develop a theory, so what are the best sources for background reading on parsing unprincipled data in as principled a manner as possible?
I realize this is somewhat open-ended, but my problem is that I think I need more background to even know what the right questions to ask are.
Given the choice between what you've proposed and fighting a hungry crocodile while covered in raw-beef-flavored marmalade and both hands tied behind my back, I'd choose the ...
Well, OK, on a more serious note: if you have data that doesn't abide by any "sane" structure, you have to study the data, find the frequencies of quirks in it, and correlate them with the given context (i.e. how the data was generated).
Print to OCR to get the data in is almost always going to lead to heartbreak. The company I work for employs a veritable army of people who manually read such documents and hand "code" (i.e. enter by hand) the data for known problematic OCR scenarios, or for documents our customers detect the original OCR failed on.
As for leveraging "Parsing Frameworks" these tend to expect data that will always follow the grammar rules you've laid out. The data you've described has no such guarantees. If you go that route be prepared for unexpected - though not always obvious - failures.
By all means if there is any way possible to get the original data files, do so. Or if you can demand that those providing the data make their data come in a single well defined format, even better. (It might not be "YOUR" format, but at least it's a regular and predictable format you can convert from)