Resources for character and text processing (encoding, regular expressions, NLP) - parsing

I'd like to learn the foundations of encodings, characters, and text. Understanding these is important for dealing with large sets of text, whether that's log files or text sources for building collective-intelligence algorithms. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."
I'm not saying I need to learn advanced topics right away. But I need to know:
Bit- and byte-level knowledge of encodings (see the sketch below).
Characters and alphabets not used in English.
Multi-byte encodings. (I understand some Chinese and Japanese, and parsing them is important.)
Regular expressions.
Algorithms for text processing.
Parsing natural languages.
I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time) requires processing, parsing, and analyzing large amounts of text.
I'm looking for some resources (maybe books?) that will get me started on some of the bullets above. (I've found many helpful discussions on regular expressions here on Stack Overflow, so you don't need to suggest resources on that topic.)
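To make the byte-level and multi-byte points above concrete, here is a tiny Python sketch (my own illustration, not taken from any particular resource) of what UTF-8 looks like at the byte level:

# The same "one character" can take one to three bytes in UTF-8.
for ch in ("A", "é", "中", "日"):
    encoded = ch.encode("utf-8")
    print(ch, "->", [hex(b) for b in encoded], "(%d byte(s))" % len(encoded))

# Decoding those bytes with the wrong encoding garbles multi-byte text.
print(repr("中".encode("utf-8").decode("latin-1")))  # mojibake: 'ä¸\xad'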

In addition to Wikipedia, Joel Spolsky's article on encoding is really good too.
This free character map is a nice resource for all Unicode characters.
This regular expression tutorial can be helpful.
Specifically on NLP and Japanese, you could take a look at this Japanese NLP project.
On text processing, this Open Source project can be useful.

As is usual for most general "I want to learn about X topic" questions, Wikipedia is a good place to start:
http://en.wikipedia.org/wiki/Character_encoding
http://en.wikipedia.org/wiki/Natural_language_processing

Related

Software to identify patterns in text files

I work on some software that parses large text files and inserts data into a database. Every time we get a new client, we have to write new parsing code for their text files.
I'm looking for some software to help simplify analyzing the text files. It would be nice to have some software that could identify patterns in the file.
I'm also open to any general purpose parsing libraries (.NET) that may simplify the job. Or any other relevant software.
Thanks.
More Specific
Ideally, I'd open a text file with some magic software that shows me repeating patterns it has identified. Really, I'm just looking for any tools that developers have used to help them parse files. If something has helped you do this, please tell me about it.
Well, it's likely not exactly what you are looking for, but clone detection might be the right kind of idea.
There are a variety of such detectors. Some work only on raw lines of text, and that might apply directly to you.
Some work only on the words ("tokens") that make up the text, for some definition of "token". You'd have to define what you mean by tokens to such tools.
But you seem to want something that discovers the structure of the text and then looks for repeating blocks with some parametric variation. I think this is really hard to do unless you know roughly what that structure is in advance.
Our CloneDR does this for programming language source code, where the "known structure" is that of the programming language itself, as described specifically by the BNF grammar rules.
You probably don't want Java-biased duplicate detection on semi-structured text. But if you do know something about the structure of the documents, you could write that down as a grammar, and our CloneDR tool would then pick it up.
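As a toy illustration of the line-based flavor of clone detection mentioned above (my own sketch, not how CloneDR or any particular product works): hash every window of N consecutive lines and report any window that occurs more than once.

import sys
from collections import defaultdict

def find_repeated_blocks(lines, window=3):
    # block text -> list of 1-based line numbers where it starts
    seen = defaultdict(list)
    for i in range(len(lines) - window + 1):
        block = "\n".join(line.strip() for line in lines[i:i + window])
        seen[block].append(i + 1)
    return {block: starts for block, starts in seen.items() if len(starts) > 1}

with open(sys.argv[1]) as f:
    for block, starts in find_repeated_blocks(f.readlines()).items():
        print("Block repeated at lines %s:\n%s\n" % (starts, block))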

LaTeX vs DocBook [closed]

I have only a little knowledge of LaTeX: basic formatting, basic math formulae, etc. I've found that LaTeX is hard to configure to my own flavor. Recently, I've heard about DocBook, which is also a typesetting mechanism, but much easier since it uses XML. So, if my main job using LaTeX/DocBook is writing a simple document (not a class book) with some mathematics, and I want easy configuration and a highly customizable application, which one is better? And is there any good book on DocBook?
DocBook isn't "a typesetting mechanism". DocBook is all about separating presentation from content. DocBook only deals with content; it's used to create an abstract representation of a book, article, etc. There are numerous tools out there which lay out DocBook according to predefined templates. Some of these tools use LaTeX. AFAIK, O'Reilly uses a slightly modified version of the DocBook language to author their content, then they feed this XML into custom scripts that integrate with Adobe FrameMaker to lay out their books.
LaTeX is essentially an attempt to separate presentation from content within TeX, but it doesn't quite achieve that goal IMO. Presentation is still mixed with the content in most cases. I think LaTeX is currently the best open source tool for laying out paginated documents. However, proprietary tools like InDesign have many features (like good OpenType support) that TeX doesn't have (XeTeX kind of adds OpenType support). Either way, if you're writing a book, I highly recommend using DocBook to author your content rather than LaTeX.
That said, it sounds like you're writing short, one-off documents with a bit of math. I think LaTeX is probably your best choice. If you need lots of customizability, you might need to use Plain TeX as opposed to LaTeX, but it's going to require quite a bit of work on your part.
Well, I haven't used DocBook, but from a quick look at Wikipedia and Google:
DocBook does not have elements to describe mathematics.
DocBook is XML, as you say. To me, that makes it a horrible thing to write by hand (or, rather, with a basic text editor). Maybe you enjoy writing XML, or have a good IDE. I guess you could look at this question.
DocBook's Wikipedia page lists a couple of books on it which you may want to look at, though I obviously can't say whether they are "good" books.
I would suggest going with LaTeX. Get someone to give you a basic template, then writing LaTeX is as simple as:
\documentclass{article} % minimal preamble so the example compiles
\begin{document}
\section{Introduction}
This is my introduction.
\section{Stuff}
Here is some stuff.
\subsection{Particular stuff}
A particular type of stuff. With maths:
$\int_{x=1}^n 3x^2$
% etc.
\end{document}
Google is your friend for finding basic templates that you can start from:
One
Two
Three
To go from source code to a document, you'll need a working install of LaTeX (which is beyond the scope of this answer, but is pretty easy if you're on Linux). Ideally your LaTeX install will include pdflatex. Then you just run:
pdflatex source.tex
(there's a bit more work if you have a bibliography – but that's a topic for a different question)
The great thing about DocBook is that it is XML based - so a chapter is a full subtree, a section is a full subtree, etc. In LaTeX, separation is only determined by the structure of the document during a linear scan.
The worst thing about DocBook is that it is XML based - lower-level stuff is extremely dirty and annoying to code manually.
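To illustrate the subtree point, here is a small Python sketch using a made-up, stripped-down DocBook-like document (hypothetical markup, not a complete valid DocBook file):

import xml.etree.ElementTree as ET

doc = """<book>
  <chapter><title>Intro</title><para>Hello.</para></chapter>
  <chapter><title>Stuff</title><para>More stuff.</para></chapter>
</book>"""

root = ET.fromstring(doc)
# Each chapter is a self-contained subtree you can address directly,
# with no linear scan of the whole document.
for chapter in root.findall("chapter"):
    print(chapter.find("title").text, "-", chapter.find("para").text)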
I'm not really familiar with DocBook, though I have used LaTeX fairly extensively. The idea of LaTeX is not to produce a customized document; it's to produce a readable, attractive document. It's a set of libraries, templates, macros, and so forth around TeX, set up by people who know what they are doing when it comes to document design. Of course, you have special needs that they can't anticipate, so you're going to have to do some tweaking too. It is a very high-level, declarative language that is meant to reflect the content and structure of a document rather than what it should look like, the idea being that your ideas and how they are organized are what you should concern yourself with, not the layout of your text on the page. If you need more control, there exists a HUGE library of additional styles and macros and so forth (CTAN), and some of them (memoir comes to mind) give you back a lot of that control.
If you are shoving a lot of complicated formatting stuff into the body of your LaTeX document, you're doing it wrong. What you need to do is get your content in there, and your document structured into chapters and sections and subsections semantically, then go back in and worry about formatting. You shouldn't have to go into the body of your document much at this point; it should all be general stuff that applies to the whole document, preferably in a reusable way. This ensures consistency.
Yes, LaTeX is kind of difficult to configure to produce exactly the kind of layout you want. I suggest you take a look at the manual of the LaTeX class memoir to see what kinds of layouts it enables you to produce.
There is a book on DocBook available online. Take a look at that too, to see what kind of layouts you can produce and if you can easily format the math content you want with DocBook.
My suggestion is to go with LaTeX if you have to write any nontrivial math, but of course it depends on which format you find easier to work with.
About two years ago, I tried to like and use DocBook; however, I returned to LaTeX because, at least at the time, LaTeX produced better quality output (PDFs). I never managed to get the DocBook to LaTeX to PDF translation working. My problems were likely "operator error", but I suggest trying DocBook (and LaTeX) for a few simple documents before choosing one.
Here are a few points that led me to choose LaTeX:
BibTeX for bibliographies, with JabRef as a GUI
Excellent quality PDF output
Lots of examples on the Internet, including several similar to my preferred format
Good books, like "A Guide to LaTeX"
If you like GUIs, take a look at LyX.
The real reasons to use DocBook center on having your document marked up meaningfully, being able to validate it, and transform it for many purposes, not only publishing. LaTeX and other macro sets add a layer of semantic markup, but you're always free to introduce TeX code, and add macros from other sources. Fundamentally, a TeX document is a computer program that can only be parsed by a TeX processor.
For maths and DocBook: DocBook being XML, it allows you to use other XML technologies as appropriate; in this case, MathML. The XMLmind XML Editor already mentioned provides a GUI maths editor, and includes stylesheets to format maths for web and print along with the DocBook contents.
There are also tools available that enable translation of XML documents into other languages (xml2po is a simple one, http://heartsome.net/EN/home.html is a whole suite).
I don't want to go down the "easier" or "better" route, as I regard that as a matter of taste and of what you're used to. I see DocBook being XML as an advantage, since it can therefore be morphed into almost anything you like using XSLT. Combined with its self-containedness, it feels more like structuring content than LaTeX does. DocBook is very widely used for documenting open source software in particular; you can easily grab the templates and stylesheets of e.g. Hibernate and/or Spring and tweak them to your needs.
Another aspect I'd like to spotlight is integration into build systems. For Maven there is a plugin called docbkx that just spits out PDF, HTML, and whatever else you like, based on the contents and an appropriate XSLT. No further installation needed. The only way I have seen to get this done with LaTeX is installing a few packages on the build OS and writing your own script around them. IMHO that's not a feasible way to go, especially if you build cross-platform.
Regarding editors, I can recommend XMLmind XML Editor, which takes away a lot of the pain and provides quite a nice WYSIWYG approach to DocBook.
If you rely on mathematical expressions, I would rather choose LaTeX, as nothing of comparable power is available in DocBook.
FWIW, I use DocBook via XMLmind (http://xmlmind.com/) to produce HTML and .chm files. I've also set FOP up to produce PDFs, but they aren't pretty.
Having got the DocBook source done, I cook it with xsltproc and the docbook.xsl files. This is protracted and painful to set up, but once it's working it's sweet.
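For reference, a typical invocation looks something like this (the stylesheet path is a placeholder; it varies by installation):
xsltproc -o book.html /path/to/docbook-xsl/html/docbook.xsl book.xml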
Another approach would be to use pandoc (an extended Markdown-type tool) to get from Markdown to DocBook. This would cut the XML editor out, but you still have to do the transformation(s) to your output format.
Whoever has had to create a professional, scientific document (research paper, book, technical guide, etc.) will know why TeX is the better choice.
For those who are not aware of some facts, here is a perfect example: at good colleges a student's work may be rejected outright if he or she did not properly reference other people's work. There are, I believe, hundreds of "official" ways of citing and referencing: Harvard has its own style, the ACM has its own, and among computer scientists the numeric (Vancouver) notation is the most common. Many professional organisations have their own styles, and they stick to them. As far as I know, TeX is the only typesetting system that is aware of this, and with the help of BibTeX it becomes an extremely powerful tool for authors. It can save hours, if not days, of work.
If I were a novelist, or the author of some non-technical document, I might choose DocBook.
Have you looked at ConTeXt? It is more flexible and much easier to configure than LaTeX.
Arbortext supports native LaTeX. You can send the publishing engine or print composer LaTeX and it will pass it through. It supports a lot of other composition languages as well.

Is ANTLR an appropriate tool to serialize/deserialize a binary data format?

I need to read and write octet streams to send over various networks to communicate with smart electric meters. There is an ANSI standard, ANSI C12.19, that describes the binary data format. While the data format is not overly complex, the standard is very large (500+ pages) in that it describes many distinct types. The standard is fully described by an EBNF grammar. I am considering using ANTLR to read the EBNF grammar, or a modified version of it, and generate C# classes that can read and write the octet stream.
Is this a good use of ANTLR?
If so, what do I need to do to be able to utilize ANTLR 3.1? From searching the newsgroup archives it seems like I need to implement a new stream that can read bytes instead of characters. Is that all or would I have to implement a Lexer derivative as well?
If ANTLR can help me read/parse the stream can it also help me write the stream?
Thanks.
dan finucane
You might take a look at Ragel. It is a state machine compiler/lexer that is useful for implementing on-the-wire protocols. I have read reports that it generates very fast code. If you don't need a parser and template engine, ragel has less overhead than ANTLR. If you need a full-blown parser, AST, and nice template engine support, ANTLR might be a better choice.
This subject comes up from time to time on the ANTLR mailing list. The answer is usually no, because binary file formats are very regular and it's just not worth the overhead.
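To make "not worth the overhead" concrete, here is the sort of direct, hand-rolled decoding usually suggested instead. This is a Python sketch with a made-up record layout (2-byte table ID, 1-byte count, then that many 4-byte big-endian readings), not the real C12.19 tables; the equivalent C# with BinaryReader would look much the same.

import struct

def decode_record(data):
    # Header: 2-byte unsigned table ID, 1-byte reading count (big-endian).
    table_id, count = struct.unpack_from(">HB", data, 0)
    # Payload: `count` 4-byte unsigned readings starting at offset 3.
    readings = struct.unpack_from(">%dI" % count, data, 3)
    return {"table": table_id, "readings": list(readings)}

def encode_record(table_id, readings):
    return (struct.pack(">HB", table_id, len(readings))
            + struct.pack(">%dI" % len(readings), *readings))

packet = encode_record(7, [100, 200])
print(decode_record(packet))  # {'table': 7, 'readings': [100, 200]}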
It seems to me that having a grammar gives you a tremendous leg up.
ANTLR 3.1 has StringTemplate and code generation features that are separate from the parsing/lexing, so you can decompose the problem that way.
Seems like a winner to me, worth trying.

Writing a parser - In the need of guides and research papers

My knowledge about implementing a parser is a bit rusty.
I have no idea about the current state of research in the area, and could use some links regarding recent advances and their impact on performance.
General resources about writing a parser (tutorials, guides, etc.) are also welcome, since much of what I learned at college I have already forgotten :)
I have the Dragon book, but that's about it.
And does anyone have input on parser generators like ANTLR and their performance? (ie. comparison with other generators)
Edit: My main target is RDF/OWL/SKOS in N3 notation.
Mentioning the Dragon Book and ANTLR means you've answered your own question.
If you're looking for other parser generators you could also check out boost::spirit (http://spirit.sourceforge.net/).
Depending on what you're trying to achieve, you might also want to consider a DSL, which you can either parse yourself or write in a scripting language like Boo, Ruby, Python, etc.
Hmm … your request is a bit unspecific. While there are many recent developments in this general area, they're all quite specialized (naturally, since the field has matured). The original parsing approaches haven't really changed, though. You might want to read up on changes in parser creation tools (ANTLR and the GOLD Parser, to name but a few).
You might also want to take a look at SableCC, another parser generator "which generates fully featured object-oriented frameworks for building compilers".
There is some documentation about basic uses here and here. Since you asked about research papers, SableCC's main developer's master's thesis (1998) is available and explains a little more about SableCC's advantages.
Although the current stable version is 3.2, the development branch v4 is a complete rewrite and should implement features new to parser generators.
If you want to build custom analyzers for complex languages, consider our DMS Software Reengineering Toolkit. See http://www.semanticdesigns.com/Products/DMS/DMSToolkit.html
This provides very strong parsing technology, making it "easy" to define your language (especially in comparison with most parser generators). Conventional parser generators may help with parsing, but they provide zero help in the hard part of the process, which happens after you can parse the code. DMS provides a vast amount of machinery to support analyzing and transforming the code once you have parsed it.

Parsing, where can I learn about it

I've been given the job of 'translating' one language into another. The source is too flexible (complex) for a simple line-by-line approach with regex. Where can I go to learn more about lexical analysis and parsers?
If you want to get "emotional" about the subject, pick up a copy of "The Dragon Book." It is usually the text in a compiler design course. It will definitely meet your need "learn more about lexical analysis and parsers" as well as a bunch of other fun stuff!
IMH(umble)O, save yourself an arm and/or leg and buy an older edition - it will fill your information desires.
Try ANTLR:

ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.

There's a book for it also.
Niklaus Wirth's book "Compiler Construction" (available as a free PDF)
http://www.google.com/search?q=wirth+compiler+construction
I've recently been working with PLY, which is an implementation of lex and yacc in Python. It's quite easy to get started with, and there are some simple examples in the documentation.
Parsing can quickly become a very technical topic and you'll find that you probably won't need to know all the details of the parsing algorithm if you're using a parser builder like PLY.
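As a taste of PLY, here is a minimal self-contained sketch in the spirit of the calculator example from PLY's documentation (a toy grammar of my own that lexes and evaluates sums like "1 + 2 + 3"):

import ply.lex as lex
import ply.yacc as yacc

tokens = ("NUMBER", "PLUS")

t_PLUS = r"\+"
t_ignore = " \t"

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    print("Illegal character %r" % t.value[0])
    t.lexer.skip(1)

precedence = (("left", "PLUS"),)  # resolves the ambiguity in "expr PLUS expr"

def p_expr_plus(p):
    "expr : expr PLUS expr"
    p[0] = p[1] + p[3]

def p_expr_number(p):
    "expr : NUMBER"
    p[0] = p[1]

def p_error(p):
    print("Syntax error")

lexer = lex.lex()
parser = yacc.yacc(debug=False, write_tables=False)
print(parser.parse("1 + 2 + 3", lexer=lexer))  # prints 6

Note how the token regexes and grammar rules live in docstrings; PLY introspects the module to build the lexer and parser, which is what makes it so quick to get started with.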
Lots of people have recommended books. For many these are much more useful in a structured environment with assignments and due dates and so forth. Even if not, having the material presented in a different way can help greatly.
(a) Have you considered going to a school with a decent CS curriculum?
(b) There are lots of online lectures, such as MIT's OpenCourseWare. Their EE/CS section has many courses that touch on parsing, though I can't see any on parsing per se. It's typically introduced in one of the first theory courses, as language classification and automata are at the heart of much of CS theory.
If you prefer Java-based tools, the Java Compiler Compiler, JavaCC, is a nice parser/scanner. It's config-file driven and will generate Java code that you can include in your program. I haven't used it in a couple of years, though, so I'm not sure how the current version is. You can find out more here: https://javacc.dev.java.net/
Lexing/parsing + type checking + code generation is a great CS exercise; I would recommend it to anyone wanting a solid basis, so I'm all for the Dragon Book.
I found this site helpful:
Lex and YACC primer/HOWTO
The first time I used lex/yacc was for a relatively simple project. This tutorial was all I really needed. When I approached more complex projects later, the familiarity I had from this tutorial and a simple project allowed me to build something fancier.
After taking (quite) a few compilers classes, I've used both The Dragon Book and C&T. I think C&T does a far better job of making compiler construction digestible. Not to take anything away from The Dragon Book, but I think C&T is a far more practical book.
Also, if you like writing in Java, I recommend using JFlex and BYACC/J for your lexing and parsing needs.
Yet another textbook to consider is Programming Language Pragmatics. I prefer it over the Dragon book, but YMMV.
If you're using Perl, yet another tool to consider is Parse::RecDescent.
If you just need to do this translation once and don't know anything about compiler technology, I would suggest that you get as far as you can with some fairly simplistic translations and then fix it up by hand. Yes, it is a lot of work. But it is less work than learning a complex subject and coding up the right solution for one job. That said, you should still learn the subject, but don't let not knowing it be a roadblock to finishing your current project.
Parsing Techniques - A Practical Guide
By Dick Grune and Ceriel J.H. Jacobs
This book (freely available as PDF) gives an extensive overview of different parsing techniques/algorithms. If you really want to understand the different parsing algorithms, this IMO is a better reference than the Dragon Book (as Parsing Techniques focuses entirely on parsing, while the Dragon Book covers parsing only as one - although important - part of the compiler construction process).
flex and bison are the new lex and yacc though. The syntax for BNF is often derided for being a bit obtuse. Some have moved to ANTLR and Ragel for this reason.
If you're not doing much translation, you may want to do a one-off using multiline regexes with Perl or Ruby, as sketched below. Writing a compatible BNF grammar for an existing language is not a task to be taken lightly.
On the other hand, it is entirely possible to leverage any given language's .l and .y files if they are available as open source. Then, you could construct new code from an existing parse tree.
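For the one-off multiline-regex route mentioned above, the idea looks like this (a Python sketch with a made-up source syntax, standing in for the Perl/Ruby version):

import re

source = """
let width = 10;
let height = 20;
"""

# Rewrite the hypothetical "let x = 1;" form into "x := 1".
# Fine for very regular sources; anything with nesting needs a real parser.
translated = re.sub(r"^let\s+(\w+)\s*=\s*(.+);$", r"\1 := \2",
                    source, flags=re.MULTILINE)
print(translated)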
