TL;DR: I built a multipurpose parser by hand, with different code for each format. Will it work better in the long run as one chunk of parser code with an ANTLR, PyParsing, or similar grammar to specify each format?
Context:
My job involves lots of benchmark log files from ~50 different benchmarks. There are a few in XML, a few in HTML, a few in CSV, and lots of proprietary formats with no documented spec. To save me and my coworkers the time of entering this data by hand, I wrote a parsing tool that handles all of the formats we deal with regularly behind a uniform interface. The design, though, is not so clean.
I wrote this thing in Python and created a Parser class. Each file format is handled as an implementation that provides its own code for the Parser's read() method. I like the idea of having only one definition of Parser that uses grammars to understand each format, but I've never done it before.
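Roughly, the current structure looks like this (a trimmed-down sketch; the class and format names are illustrative, not my real code):

import csv

class Parser:
    """Uniform interface: every format-specific parser returns rows of results."""
    def read(self, path):
        raise NotImplementedError

class CsvResultParser(Parser):
    # one subclass per format, each with its own hand-written read()
    def read(self, path):
        with open(path, newline="") as f:
            return list(csv.reader(f))

class AcmeBenchParser(Parser):
    # proprietary format with no spec: pick out the lines we care about
    def read(self, path):
        rows = []
        with open(path) as f:
            for line in f:
                if line.startswith("RESULT"):
                    rows.append(line.split()[1:])
        return rows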
Is it worth my time, and will it be easier for other newbies to work with in the future once I finish refactoring?
I can't answer your question with 100% certainty, but I can give you an opinion.
I find the choice to use a proper grammar vs. hand-rolled regex "parsers" often comes down to how uniform the input is.
If the input is very uniform and you already know a language that deals with strings well, like Python or Perl, then I'd keep your existing code.
On the other hand I find parser generators, like Antlr, really shine when the input can have errors and inconsistencies in it. The reason is that the formal grammar allows you to focus on what should be matched in a certain context without having to worry about walking the input stream manually.
Furthermore, if the input stream has an error, I find it's often easier to deal with using Antlr than with regexes. The reason is that if a couple of options are available, Antlr has built-in functionality for choosing the correct path, including rollback via predicates.
Having said all that, there is a lot to be said for working code. If I want to rewrite something, I try to make a good case for how the rewrite will benefit the user of the product.
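Just to give a flavour of the grammar route, here is a tiny PyParsing sketch; the "score key = value" format is invented for illustration, so treat it as a sketch rather than a recommendation:

from pyparsing import Word, alphas, alphanums, nums, Group, OneOrMore, Suppress

# grammar for made-up report lines like:  score dhrystone = 1234
key    = Word(alphas, alphanums + "_")
value  = Word(nums)
entry  = Group(Suppress("score") + key + Suppress("=") + value)
report = OneOrMore(entry)

results = report.parseString("score dhrystone = 1234 score whetstone = 888")
print(results.asList())   # [['dhrystone', '1234'], ['whetstone', '888']]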
My company's proprietary software generates a log file that is much easier to use if it is parsed. The log parser we all use was written by another employee as a side project, and it has horrible performance.
These log files can grow to 10s of megabytes very quickly, and the parser we currently use has issues if a log file is bigger than 1 megabyte.
So, I want to write a program that can parse this massive amount of text in the shortest amount of time possible. We use Windows exclusively, so running on Windows is a must. Our current implementation runs on a local web server, and I'm convinced that running it as an application would have to be faster.
All suggestions will be helpful. Thanks.
EDIT: My ultimate goal is to parse the text and display it in a much more user friendly manner with colors and such. Can you do this with Perl and Python? I know you can do this with Java and C++. So, it will function like Notepad where you open a log file, but on the screen you display the user-friendly format instead of the raw file.
EDIT: So, I can't choose a single best answer. The advice was to choose a language that can best display what I'm going for, and then write the parser in that. Also, using ANTLR will probably make this process much easier. I changed the original question, since I guess I didn't ask what I was really looking for. Thanks everyone!
Hmmm, "go with what you know" was a good answer. Perl was designed for this sort of thing (but imo is well suited for simple parsing, but I'd personally avoid it for complex projects).
If it gets even a little complex, why not use a proper syntax and grammar set-up?
Lex & Yacc (or Flex & Bison) spring to mind, but personally I would always reach for Antlr
Define various "words" in terms of patterns (syntax), and rules to combine those words (grammar) and Antlr will spit out a program to parse your input (you can have the program in Java, C, C++ and more (you are worried about parse time, so choose a compiled language, of course)).
I personally find it tedious to hand-craft parsers, and even more tedious to debug them, but AntlrWorks is a lovely IDE which really makes it a piece of cake ...
In AntlrWorks you define your grammar rules right in the editor, and if you mess up your grammar rules you will be informed. This is not the case with hand-crafted parsers, where you just scratch your head and wonder about the "strange results"...
Check it out. Even if you think your project is trivial now, it may well grow. And if you have any interest in parsing you do owe it to yourself to at least be familiar with lex/yacc, but especially Antlr(Works)
You should use the language that YOU know... Unless you have so much time available to complete the project that you can also spend the time learning a new language.
I would suggest using Python or Perl. Parsing large text files with regular expressions is really fast.
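For example, something like this streams a file line by line with a precompiled regex (the pattern and field names are made up, since I don't know your log format):

import re

LINE = re.compile(r"^(?P<time>\S+)\s+(?P<level>\w+)\s+(?P<message>.*)$")

def parse(path):
    # read lazily so memory use stays flat even for very large logs
    with open(path, errors="replace") as f:
        for line in f:
            m = LINE.match(line)
            if m:
                yield m.group("time"), m.group("level"), m.group("message")

for time, level, message in parse("example.log"):
    pass  # process each record as it streams in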
Whatever language your coworker used.
(I could tell you that any macro assembler will let you write code that would rip through your data, but seriously, are you going to spend months writing assembly just to save a few seconds of CPU time? Rewriting a program is fun but it's not practical.)
Whip out your profiler, point it at your horribly performing log parser, and fix the performance problems. If it's a common language, there will be people here who can help.
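If the existing parser happens to be Python (a guess on my part, since the question doesn't say), finding the hot spots is a few lines with the standard library; parse_log here is just a stand-in for the real entry point:

import cProfile
import pstats

def parse_log(path):
    # stand-in for the real (hypothetical) parser entry point
    with open(path, errors="replace") as f:
        return sum(1 for _ in f)

cProfile.run("parse_log('big.log')", "parse.prof")
pstats.Stats("parse.prof").sort_stats("cumulative").print_stats(10)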
"Parse this massive amount of text in the shortest time possible."
Consider the PADS Project from AT&T. It's a special-purpose language, compatible with C, that's designed exactly for high-speed parsing of log files and other ad hoc data formats. There's even a feature where it can try to learn your log format from examples, although I don't know if that has hit production yet. The people behind the project are really smart, and it's had a big impact within the phone company. PADS gives very high performance on data streams that produce gigabytes. Joe Bob says check it out.
If "massive text in the shortest time possible", Perl and Python are not the answer. But if you need to whip up something not too slow, and it's OK to take longer, Perl and Python could be OK. Tems of megabytes is not actually that big.
I've used both Python and Perl. Perl is a more natural fit for this but can be hard to maintain. Python will do it just as well and is easier to read. Go for Python.
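On the display side (the "Notepad with colours" idea from the question), Python's standard tkinter Text widget is enough for a prototype; the ERROR/WARN highlighting rules below are placeholders, since I don't know the real log format:

import tkinter as tk

root = tk.Tk()
text = tk.Text(root, width=100, height=40)
text.pack(fill="both", expand=True)
text.tag_configure("error", foreground="red")
text.tag_configure("warn", foreground="orange")

with open("example.log", errors="replace") as f:
    for line in f:
        if "ERROR" in line:
            text.insert("end", line, "error")   # the tag decides the colour
        elif "WARN" in line:
            text.insert("end", line, "warn")
        else:
            text.insert("end", line)

root.mainloop()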
I believe Perl is considered a good choice for parsing text.
A finished product such as the MS LogParser may do what you need, and it's free.
Perl is good for text processing.
A number of very good text processing programs have been written in Perl. Ack (a grep replacement) is one.
Sounds like a job for Perl, much as I don't particularly care for it as a language myself. ActivePerl is a reasonable distribution of Perl for Windows.
I'd suggest Perl. It was practically built for parsing log files. As for output I agree with ghostdog74, HTML is the way to go. Perl has dozens of modules that allow you to build and/or template HTML.
I'd parse out the data using regular expressions, then use Template::Toolkit (on CPAN) to create nice pages using HTML and CSS templates.
C/C++ or Java...
For C/C++ I have a snippet that might help you:
FILE *f = fopen(file, "rb");
if (f == NULL) {
    return DBDEMON_OPEN_ERROR; /* open failed */
}
/* keep reading while fscanf matches all four fields; checking the return
   value avoids the classic feof() off-by-one loop */
for (int i = 0;
     fscanf(f, "%d %s %s %c", &db[i].id, db[i].name, db[i].uid, &db[i].priviledge) == 4;
     i++)
{
    db_size++;
}
fclose(f);
This reads a file with the following format:
int string string char
1 SOMETHING ANYTHING Z
into a struct defined as follows:
typedef struct {
    unsigned int id;
    char name[DBDEMON_NAME_MAXSIZE];
    char uid[DBDEMON_UID_MAXSIZE];
    char priviledge;
} DATABASE;
Use fscanf with care: argument types and buffer sizes are not checked against the format string, so it can easily result in errors.
But I think this is pretty efficient.
Most of the posts that I read pertaining to these utilities usually suggest some other method to obtain the same effect. For example, questions mentioning these tools usually have at least one answer containing some of the following:
Use the boost library (insert appropriate boost library here)
Don't create a DSL; use (insert favorite scripting language here)
Antlr is better
Assuming the developer ...
... is comfortable with the C language
... does know at least one scripting language (e.g., Python, Perl, etc.)
... must write some parsing code in almost every project worked on
So my questions are:
What are appropriate situations which are well suited for these utilities?
Are there any (reasonable) situations where there is not a better alternative to a problem than yacc and lex (or derivatives)?
How often in actual parsing problems can one expect to run into any shortcomings in yacc and lex which are better addressed by more recent solutions?
For a developer who is not already familiar with these tools, is it worth it for them to invest time in learning their syntax/idioms? How do these compare with other solutions?
The reasons why lex/yacc and derivatives seem so ubiquitous today are that they have been around for much longer than other tools, that they have far more coverage in the literature and that they traditionally came with Unix operating systems. It has very little to do with how they compare to other lexer and parser generator tools.
No matter which tool you pick, there is always going to be a significant learning curve. So once you have used a given tool a few times and become relatively comfortable in its use, you are unlikely to want to incur the extra effort of learning another tool. That's only natural.
Also, in the early-to-mid 1970s when lex and yacc were created, hardware limitations posed a serious challenge to parsing. The table-driven LR parsing method used by yacc was the most suitable at the time because it could be implemented with a small memory footprint, using a relatively small general program logic and keeping state in files on tape or disk. Code-driven parsing methods such as LL had a larger minimum memory footprint because the parser program's code itself represents the grammar, so it needs to fit entirely into RAM to execute, and it keeps state on the stack in RAM.
When memory became more plentiful a lot more research went into different parsing methods such as LL and PEG and how to build tools using those methods. This means that many of the alternative tools that have been created after the lex/yacc family use different types of grammars. However, switching grammar types also incurs a significant learning curve. Once you are familiar with one type of grammar, for example LR or LALR grammars, you are less likely to want to switch to a tool that uses a different type of grammar, for example LL grammars.
Overall, the lex/yacc family of tools is generally more rudimentary than more recent arrivals which often have sophisticated user interfaces to graphically visualise grammars and grammar conflicts or even resolve conflicts through automatic refactoring.
So, if you have no prior experience with any parser tools and you have to learn a new tool anyway, then you should probably look at other factors such as graphical visualisation of grammars and conflicts, auto-refactoring, availability of good documentation, the languages in which the generated lexers/parsers can be output, etc. Don't pick any tool simply because "this is what everybody else seems to be using".
Here are some reasons I could think of for using lex/yacc or flex/bison:
the developer is already familiar with lex/yacc or flex/bison
the developer is most familiar and comfortable with LR/LALR grammars
the developer has plenty of books covering lex/yacc but no books covering others
the developer has a prospective job offer coming up and has been told that lex/yacc skills would increase his chances to get hired
the developer could not get buy-in from project members/stake holders for the use of other tools
the environment has lex/yacc installed and for some reason it is not feasible to install other tools
Whether it's worth learning these tools will depend heavily (almost entirely) on how much parsing code you write, or how interested you are in writing more code along those lines. I've used them quite a bit, and find them extremely useful.
The tool you use doesn't really make as much difference as many would have you believe. For about 95% of the inputs I've had to deal with, there's little enough difference between one and another that the best choice is simply the one with which I'm most familiar and comfortable.
Of course, lex and yacc produce (and demand that you write your actions in) C (or C++). If you're not comfortable with those, a tool that uses and produces a language you prefer (e.g. Python or Java) will undoubtedly be a much better choice. I, for one, would not advise trying to use a tool like this with a language with which you're unfamiliar or uncomfortable. In particular, if you write code in an action that produces a compiler error, you'll probably get considerably less help from the compiler than usual in tracking down the problem, so you really need to be familiar enough with the language to recognize the problem with only a minimal hint about where the compiler noticed something wrong.
In a previous project, I needed a way to be able to generate queries on arbitrary data in a way that was easy for a relatively non-technical person to be able to use. The data was CRM-type stuff (so First Name, Last Name, Email Address, etc) but it was meant to work against a number of different databases, all with different schemas.
So I developed a little DSL for specifying the queries (e.g. [FirstName]='Joe' AND [LastName]='Bloggs' would select everybody called "Joe Bloggs"). It had some more complicated options, for example there was the "optedout(medium)" syntax which would select all people who had opted-out of receiving messages on a particular medium (email, sms, etc). There was "ingroup(xyz)" which would select everybody in a particular group, etc.
Basically, it allowed us to specify queries like "ingroup('GroupA') and not ingroup('GroupB')" which would be translated to an SQL query like this:
SELECT
    *
FROM
    Users
WHERE
    Users.UserID IN (SELECT UserID FROM GroupMemberships WHERE GroupID=2) AND
    Users.UserID NOT IN (SELECT UserID FROM GroupMemberships WHERE GroupID=3)
(As you can see, the queries aren't as efficient as possible, but that's what you get with machine generation, I guess.)
I didn't use flex/bison for it, but I did use a parser generator (the name of which has escaped me at the moment...)
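For what it's worth, here is a minimal, hypothetical sketch of the same translation idea in Python (not the original implementation; the table and column names are assumptions based on the example above):

import re

TOKEN = re.compile(r"\s*(ingroup\('([^']*)'\)|and|or|not)\s*", re.IGNORECASE)

def to_sql(dsl, group_ids):
    # walk the expression token by token, mapping each piece onto SQL
    pos, parts = 0, []
    while pos < len(dsl):
        m = TOKEN.match(dsl, pos)
        if not m:
            raise ValueError("unexpected input at: " + dsl[pos:])
        word = m.group(1)
        if word.lower() in ("and", "or", "not"):
            parts.append(word.upper())
        else:
            gid = group_ids[m.group(2)]   # map the group name to its numeric ID
            parts.append("Users.UserID IN (SELECT UserID FROM GroupMemberships "
                         "WHERE GroupID=%d)" % gid)
        pos = m.end()
    return "SELECT * FROM Users WHERE " + " ".join(parts)

print(to_sql("ingroup('GroupA') and not ingroup('GroupB')",
             {"GroupA": 2, "GroupB": 3}))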
I think it's pretty good advice to eschew the creation of new languages just to support a Domain specific language. It's going to be a better use of your time to take an existing language and extend it with domain functionality.
If you are trying to create a new language for some other reason, perhaps for research into language design, then these tools are a bit outdated. Newer generators such as antlr, or even newer implementation languages like ML, make language design a much easier affair.
If there's a good reason to use these tools, it's probably because of their legacy. You might already have a skeleton of a language you need to enhance, which is already implemented in one of these tools. You might also benefit from the huge volumes of tutorial information written about these old tools, for which there is not so great a corpus written for newer and slicker ways of implementing languages.
We have a whole programming language implemented in my office, and we use these tools for that. I think they're meant to be a quick and easy way to write interpreters for things. You could conceivably write almost any sort of text parser using them, but a lot of the time it's either A) easier to write it yourself quickly or B) you need more flexibility than they provide.
I have only a little knowledge of LaTeX: basic formatting, basic math formulae, etc. I have found that LaTeX is hard to configure to my own flavor. Recently, I've heard about DocBook, which is also a typesetting mechanism, but much easier since it uses XML. So, if my main job in LaTeX/DocBook is writing a simple document (not a class book) with some mathematics, and I want easy configuration and a highly customizable application, which one is better, and is there any good book on DocBook?
DocBook isn't "a typesetting mechanism". DocBook is all about separating presentation from content. DocBook only deals with content; it's used to create an abstract representation of a book, article, etc. There are numerous tools out there which lay out DocBook according to predefined templates. Some of these tools use LaTeX. AFAIK, O'Reilly uses a slightly modified version of the DocBook language to author their content, then they feed this XML into custom scripts that integrate with Adobe FrameMaker to lay out their books.
LaTeX is essentially an attempt to separate presentation from content within TeX, but it doesn't quite achieve that goal IMO. Presentation is still mixed with the content in most cases. I think LaTeX is currently the best open source tool for laying out paginated documents. However, proprietary tools like InDesign have many features (like good OpenType support) that TeX doesn't have (XeTeX kind of adds OpenType support). Either way, if you're writing a book, I highly recommend using DocBook to author your content rather than LaTeX.
That said, it sounds like you're writing short, one-off documents with a bit of math. I think LaTeX is probably your best choice. If you need lots of customizability, you might need to use Plain TeX as opposed to LaTeX, but it's going to require quite a bit of work on your part.
Well, I haven't used DocBook, but from a quick look on wikipedia and google:
DocBook does not have elements to describe mathematics.
DocBook is XML, as you say. To me, that makes it a horrible thing to write by hand (or, rather, with a basic text editor). Maybe you enjoy writing XML, or have a good IDE.
DocBook's Wikipedia page lists a couple of books on it which you may want to look at, though I obviously can't say whether they are "good" books.
I would suggest going with LaTeX. Get someone to give you a basic template, then writing LaTeX is as simple as:
\section{Introduction}
This is my introduction.
\section{Stuff}
Here is some stuff.
\subsection{Particular stuff}
A particular type of stuff. With maths:
$\int_{x=1}^n 3x^2$
% etc.
Google is your friend for finding basic templates that you can start from.
To go from source code to a document, you'll need a working install of LaTeX (which is beyond the scope of this answer, but is pretty easy if you're on Linux). Ideally your LaTeX install will include pdflatex. Then you just run:
pdflatex source.tex
(there's a bit more work if you have a bibliography – but that's a topic for a different question)
The great thing about DocBook is that it is XML based - so a chapter is a full subtree, a section is a full subtree, etc. In LaTeX, separation is only determined by the structure of the document during a linear scan.
The worst thing about DocBook is that it is XML based - lower-level stuff is extremely dirty and annoying to code manually.
I'm not really familiar with DocBook, though I have used LaTeX fairly extensively. The idea of LaTeX is not to produce a customized document, it's to produce a readable, attractive document. It's a set of libraries, templates, macros, and so forth around TeX, set up by people who know what they are doing when it comes to document design. Of course, you have special needs that they can't anticipate, so you're going to have to do some tweaking, too. It is a very high-level, declarative language that is meant to reflect the content and structure of a document, rather than what it should look like, the idea being that your ideas and how they are organized is what you should concern yourself with, not the layout of your text on the page. If you need more control, there exists a HUGE library of additional styles and macros and so forth (CTAN), and some of them (memoir comes to mind) give you back a lot of that control.
If you are shoving a lot of complicated formatting stuff into the body of your LaTeX document, you're doing it wrong. What you need to do is get your content in there, and your document structured into chapters and sections and subsections semantically, then go back in and worry about formatting. You shouldn't have to go into the body of your document much at this point; it should all be general stuff that applies to the whole document, preferably in a reusable way. This ensures consistency.
Yes, LaTeX is kind of difficult to configure to produce exactly the kind of layout you want. I suggest you take a look at the manual of the LaTeX class memoir to see what kinds of layouts it enables you to produce.
There is a book on DocBook available online. Take a look at that too, to see what kind of layouts you can produce and if you can easily format the math content you want with DocBook.
My suggestion is to go with LaTeX if you have to write any nontrivial math, but of course it depends on which format you find it easier to work with.
About two years ago, I tried to like and use DocBook; however, I returned to LaTeX because, at least at the time, LaTeX produced better quality output (PDFs). I never managed to get the DocBook to LaTeX to PDF translation working. My problems were likely "operator error", but I suggest trying DocBook (and LaTeX) for a few simple documents before choosing one.
Here are a few points that led me to choose LaTeX:
BibTeX for bibliographies, with JabRef as a GUI
Excellent quality PDF output
Lots of examples on the Internet, including several similar to my preferred format
Good books, like "A Guide to LaTeX"
If you like GUIs, take a look at LyX.
The real reasons to use DocBook center on having your document marked up meaningfully, being able to validate it, and transform it for many purposes, not only publishing. LaTeX and other macro sets add a layer of semantic markup, but you're always free to introduce TeX code, and add macros from other sources. Fundamentally, a TeX document is a computer program that can only be parsed by a TeX processor.
For maths and DocBook: DocBook being XML it allows you to use other XML technologies as appropriate; in this case MathML. The XMLmind XMLEditor already mentioned provides a GUI maths editor, and includes stylesheets to format them for web and print along with the DocBook contents.
There are also tools available that enable translation of XML documents into other languages (xml2po is a simple one, http://heartsome.net/EN/home.html is a whole suite).
I don't want to go down the "easier" or "better" route, as I regard this as a matter of taste and of what you're used to. I see DocBook being XML as an advantage, because it can therefore be morphed into almost anything you like by using XSLT. Combined with its self-containedness, it feels more like structuring content than LaTeX does. DocBook is really widely used, especially for documenting open source software. You can easily grab the templates and stylesheets of e.g. Hibernate and/or Spring and tweak them to your needs.
Another aspect I'd like to highlight is integration into build systems. For Maven there is a plugin called docbkx available that just spits out PDF, HTML and whatever you like based on the contents and an appropriate XSLT. No further installations needed. The only way I have seen to get this done with LaTeX is installing a few packages on the build OS and building your own script around them. IMHO that's not a feasible way to go, especially if you build cross-platform.
Regarding editors, I can recommend XMLmind XMLEditor, which takes away a lot of the pain and provides quite a nice WYSIWYG approach to DocBook.
If you rely on mathematical expressions, I would rather choose LaTeX, as there is nothing of the same power available in DocBook.
FWIW I use docbook via xmlmind (http://xmlmind.com/) to produce html and .chm files. I've also set fop up to produce pdfs, but they aren't pretty.
Having got the DocBook source done, I cook it with xsltproc and the docbook.xsl stylesheets. This is protracted and painful to set up, but once it's working it's sweet.
Another approach would be to use pandoc (an extended markdown type tool) to get from markdown to DocBook. This would cut the xml editor out, but you still have to do the transformation(s) to your output format.
Anyone who has had to create a professional, scientific document (research paper, book, technical guide, etc.) will know why TeX is a better choice.
For those who are not aware of some facts, here is a perfect example: at good colleges a student's work may be completely refused if he or she did not properly reference other people's works. There are, I believe, hundreds of "official" ways of citing and referencing: Harvard has its own, the ACM their own, and among computer scientists the numeric (Vancouver) notation is the most common. Many professional organisations have their own styles, and they stick to them. As far as I know, TeX is the only typesetting system that is aware of this, and with the help of BibTeX it becomes an extremely powerful tool for authors. It can save hours, if not days, of work.
If I were a novel writer, or the author of some non-technical document, I might choose DocBook.
Have you looked at ConTeXt? It is more flexible and much easier to configure than LaTeX.
Arbortext supports native LaTeX. You can send the publishing engine or print composer LaTeX and it'll pass it through. It also supports a lot of other composition languages as well.
I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.
Is there a standard format for describing either concrete or abstract syntax trees?
Do any widely-used tool chains support outputting to these formats?
I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, JavaScript, or C/C++.
I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.
Thanks!
The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another.
See http://www.gupro.de/GXL
More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees.
See http://www.omg.org/spec/ASTM/1.0/Beta1/
This problem seems to get solved over and over again. There have been half a dozen "tool bus" proposals made over the years that all solved it, with no one ever overtaking the industry. The problem is that a) it is easy to represent ASTs using any kind of nestable notation [parentheses as in LISP, nested tags as in XML, ...] so people roll their own solution easily, and b) for one tool to exchange an AST with another, they both have to agree essentially on what the AST nodes mean; but most ASTs are rather accidentally derived from the particular grammar/parsing technology used by each tool, and there's almost always disagreement about that between tools. So, I've seen very few tools that exchange ASTs meaningfully.
If you're doing a hobby thing, I'd stick with a LISP-like encoding of trees, where each node has the following format:
( nodename child1 child2 ... )
It's easy to generate, and easy to read.
I work on a professional tool to manipulate programs. If we have to print out the AST, we do the above. Mostly, individual ASTs are far too complicated to look at in practice, so we hardly ever print out the entire AST, at best only a node and a few children deep. Our tool doesn't exchange ASTs with anybody (see above reasons :) but does just fine building the AST in memory, doing whizzy things with it for analysis or transformation reasons, and then either just deleting it (no need to send it anywhere) or regenerating the original language text from the tree. [The latter means you need anti-parsing or "prettyprinting" technology.]
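If you want to play with that LISP-like encoding without building a parser first, Python's own stdlib ast module makes a convenient test bed (my illustration, not the answerer's tool):

import ast
import sys

def to_sexpr(node):
    # each node becomes "( NodeTypeName child1 child2 ... )"
    children = [to_sexpr(child) for child in ast.iter_child_nodes(node)]
    return "(" + " ".join([type(node).__name__] + children) + ")"

with open(sys.argv[1]) as f:
    tree = ast.parse(f.read())
print(to_sexpr(tree))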
In our project we defined the AST metamodel in UML and use ANTLR (Java) to populate the model. We also maintain the token information from ANTLR after parsing, but we have not yet tried to update the underlying text-file with modifications made on the model.
This has a hideous overhead (in infrastructure, such as Eclipse UML2/EMF), but our goal is to use high-level tools for Model-based/driven Development (MDD, MDA) anyway, so we decided to use it on each level.
I think one of our students once played with OpenArchitectureWare and managed to get changes from the Eclipse-based, generated editor back into the syntax tree (not related to the UML model above) automatically, but I don't know the details about this.
You might also want to look at ANTLR's tree grammars.
Specific standards are one option, but more general-purpose standards may also be appropriate. Ira Baxter already mentioned GXL, and RDF may be added too, though it would require an appropriate ontology and is more oriented toward semantics than syntax. Still, it may be an option to investigate.
For specific standards, Ira Baxter already mentioned ASTM. Another one, although it rather targets a specific kind of programming language (logic languages), is the standard for semantic/conceptual graphs known as ISO/IEC 24707:2007.
Not a standard on its own, but a paper about that matter: Towards Portable Source Code Representations Using XML
I don't know of any standard that is effectively used (in this area everyone seems to roll their own); I'm just interested in this topic too.