Lexers / parsers for (un)structured text documents [closed]

There are lots of parsers and lexers for scripts (i.e. structured computer languages). But I'm looking for one that can break an (almost) unstructured text document into larger sections, e.g. chapters, paragraphs, etc.
It's relatively easy for a person to identify these sections: where the table of contents or the acknowledgements are, or where the main body starts. And it is possible to build rule-based systems to identify some of them (such as paragraphs).
I don't expect it to be perfect, but does anyone know of such a broad, 'block-based' lexer/parser? Or could you point me in the direction of literature that may help?

Many lightweight markup languages like Markdown (which, incidentally, SO uses), reStructuredText, and (arguably) POD are similar to what you're talking about. They have minimal syntax and break input down into parseable syntactic pieces. You might be able to get some information by reading about their implementations.
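The core idea in most of those implementations is a block-level pass that splits the input on blank lines and then classifies each block. A minimal sketch of that idea in Python (the heading heuristics here are illustrative, not taken from any of those implementations):

    import re

    def split_blocks(text):
        """Yield (kind, block) pairs for blank-line-separated blocks."""
        for block in re.split(r"\n\s*\n", text.strip()):
            # Toy heuristic: blocks starting with "Chapter"/"Section" or a
            # numbered prefix are headings; everything else is prose.
            if re.match(r"(chapter|section)\b|\d+(\.\d+)*[.\s]", block, re.IGNORECASE):
                yield ("heading", block)
            else:
                yield ("paragraph", block)

    doc = "Chapter 1\n\nIt was a dark and stormy night.\n\nThe rain fell in sheets."
    for kind, block in split_blocks(doc):
        print(kind, "->", block)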

1. Define an annotation standard that indicates how you would like to break things up.
2. Go on Amazon Mechanical Turk and ask people to label 10K documents using your annotation standard.
3. Train a CRF (a conditional random field, which is like an HMM, but better) on this training data (a minimal sketch follows below).
If you actually want to go this route, I can elaborate on the details. But this will be a lot of work.
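For step 3, a hedged sketch of what the training code could look like, assuming the sklearn-crfsuite Python library (my choice for illustration; the feature set, labels, and toy data are all invented):

    import sklearn_crfsuite

    def line_features(line):
        # Per-line features; a real system would use many more.
        return {
            "is_upper": line.isupper(),
            "starts_digit": line[:1].isdigit(),
            "n_words": len(line.split()),
        }

    # Toy training data: one labeled document, one label per line.
    train_docs = ["CHAPTER 1\nOnce upon a time, there was a parser."]
    train_labels = [["HEADING", "BODY"]]

    X_train = [[line_features(l) for l in doc.splitlines()] for doc in train_docs]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X_train, train_labels)

    new_doc = "CHAPTER 2\nThe sequel was even better."
    print(crf.predict([[line_features(l) for l in new_doc.splitlines()]]))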

Most of the lex/yacc kind of programs work with a well-defined grammar. If you can define your grammar in a BNF-like format (most of these parsers accept a similar syntax), then you can use any of them. That may be stating the obvious. However, you can still be a little fuzzy around the 'blocks' (tokens) of text that would be part of your grammar; after all, you define the rules for your tokens.
I have used the Parse::RecDescent Perl module in the past, with varying levels of success, for similar projects.
Sorry, this may not be a good answer; it's more me sharing my experiences on similar projects.
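To illustrate the "you define the rules for your tokens" point: the token level is where you can absorb the fuzziness, e.g. with regex-based rules that classify each line before any grammar sees it. A Python sketch with invented rules:

    import re

    TOKEN_RULES = [
        ("TOC_ENTRY", re.compile(r".+\.{3,}\s*\d+$")),  # "Introduction ...... 3"
        ("HEADING",   re.compile(r"(chapter|appendix)\s+\w+", re.IGNORECASE)),
        ("BODY",      re.compile(r".+")),               # fallback rule
    ]

    def classify(line):
        for name, rule in TOKEN_RULES:
            if rule.match(line):
                return name

    for line in ["Chapter 2 Methods", "Introduction ........ 3", "Plain prose."]:
        print(classify(line), "<-", line)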

Try Pygments, GeSHi, or prettify.
They can handle just about anything you throw at them and are very forgiving of errors in your grammar as well as your documents.
References:
Gitorious uses prettify
GitHub uses Pygments
Rosetta Code uses GeSHi

Related

Keyword/keyphrase extraction from text [closed]

I am working on a project where I need to extract "technology-related keywords/keyphrases" from text. For example, my text is:
"ABC Inc has been working on a project related to machine learning which makes use of the existing libraries for finding information from big data."
The extracted keywords/keyphrase should be: {machine learning, big data}.
My text documents are stored as BSON documents in MongoDB.
What are the best NLP libraries (with sufficient documentation and examples) out there to perform this task, and how?
Thanks!
It looks like you need to narrow things down beyond keywords/keyphrases and find the subject and object of each sentence.
For subject/object recognition, I recommend the Stanford Parser or the Google Cloud Natural Language API, where you send a string and get a dependency-tree response.
You can test the Google API first to see if it works well with your corpus: https://cloud.google.com/natural-language/
The outcome here is a subject-predicate-object (SPO) triplet, where the predicate describes the relationship. You'll need to traverse the dependency graph and write a script to parse out the triplet.
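As a rough sketch of that traversal (using spaCy rather than the parsers named above, and with a simplified notion of "object"), something like this pulls SPO-style triplets out of a dependency parse:

    # Requires: python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("ABC Inc has been working on a project related to machine learning.")

    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
            # Objects often hang off a preposition (e.g. "working ON a PROJECT").
            for prep in (c for c in token.children if c.dep_ == "prep"):
                objects += [c for c in prep.children if c.dep_ == "pobj"]
            if subjects and objects:
                print(subjects, token.lemma_, objects)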
Other Packages:
I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, or messy (incorrect spelling or grammar), I'll use NLTK and spend more time customizing my NLP text-processing pipeline with scrubbing, lemmatizing, etc. If you decide to go with one of these packages, you may want to add your own custom dictionary of technology-related keywords and keyphrases so that your parser can catch them (see the sketch after the links below).
NLTK Tutorial: http://www.nltk.org/book/
spaCy Quickstart: https://spacy.io/usage/
TextBlob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html
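For the custom-dictionary idea, here's a minimal sketch using spaCy's PhraseMatcher (the term list is just the example from the question; a real dictionary would be much larger):

    # Requires: python -m spacy download en_core_web_sm
    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.load("en_core_web_sm")
    matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    terms = ["machine learning", "big data"]  # your custom tech dictionary
    matcher.add("TECH", [nlp.make_doc(t) for t in terms])

    doc = nlp("ABC Inc has been working on a project related to machine learning "
              "which makes use of the existing libraries for finding information "
              "from big data.")
    print({doc[start:end].text for _, start, end in matcher(doc)})
    # {'machine learning', 'big data'}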

Parsing math equations [closed]

Just for kicks, I'm trying to create an application that can simplify, factor, and expand algebraic equations. Programming the rules seems as if it will be straightforward if I can get the equations into a good, workable format. Parsing the equations is proving to be a hassle. I'm currently working with Python, but I'm not against having to learn something new.
Are there any libraries (for any language) that would make this project pretty simple, or is that a pipe dream?
[Tagging this with Haskell because I have a feeling that's where the 'simple' is]
Yes, Haskell has many libraries that make writing parsers reasonably easy. Parsec is a good start, and it even has clones in other languages, including Python (that article also links to pyparsing, which looks like it might also work).
This answer of mine is an example (note: it's probably not top-notch Parsec or Haskell). It's indicative of the power of Haskell's parsing libraries: precisely four lines of code implement the whole parser.
You could also browse old questions and answers to get a feel for the various libraries and techniques, e.g. parsec, parsing+haskell and parsing+python.
The best way to work out your line of attack for the larger project is to start small and just try things until you're comfortable with your tools: choose a library and try to implement a relatively simple parser, like one for expressions with just numbers, + and *, or even just numbers and + with bracketing. Keep it small, but not too small; those two examples each have non-trivialities (the first has operator precedence, the second has recursive nesting). If you don't like the library much, try a different one.
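Since the asker mentioned Python, here is what that first "small" exercise can look like without any library at all: a hand-rolled recursive-descent parser (really an evaluator) for integers, + and *, with * binding tighter than +:

    import re

    def tokenize(s):
        return re.findall(r"\d+|[+*]", s)

    def parse_expr(tokens):
        # expr := term ('+' term)*
        value = parse_term(tokens)
        while tokens and tokens[0] == "+":
            tokens.pop(0)
            value += parse_term(tokens)
        return value

    def parse_term(tokens):
        # term := number ('*' number)*
        value = int(tokens.pop(0))
        while tokens and tokens[0] == "*":
            tokens.pop(0)
            value *= int(tokens.pop(0))
        return value

    print(parse_expr(tokenize("2+3*4")))  # 14, because '*' binds tighter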
It's been done in just about every language.
Python has a library for parsing algebraic equations and symbolic mathematics all ready to go:
http://code.google.com/p/sympy/
I'd recommend reusing, unless your purpose is to learn how to write such a thing.
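For reference, SymPy handles all three operations the asker wants, plus parsing an expression from a string; a quick sketch:

    from sympy import expand, factor, simplify, symbols, sympify

    x = symbols("x")
    print(expand((x + 1)**2))              # x**2 + 2*x + 1
    print(factor(x**2 + 2*x + 1))          # (x + 1)**2
    print(simplify((x**2 - 1) / (x - 1)))  # x + 1

    expr = sympify("(x + 2)*(x - 3)")      # parse an expression from a string
    print(expand(expr))                    # x**2 - x - 6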
Python or MATLAB would be my suggestions. Are you planning on storing the whole equation in a string and then splitting it up to factor and simplify?
Give us some more information; it's kind of a cool project.
This is an old question, but I'd like to suggest MathParseKit.
It is a C++ library that, given a string like "2*3/4", gives you a tree of functions/variables/constants that represents the expression.
You can evaluate the tree, but you can also modify it and serialize it back to string format.
You can find it at:
https://github.com/B3rn475/MathParseKit

Modeling software for network serialization protocol design [closed]

I am currently designing a low-level network serialization protocol (in fact, a refinement of an existing protocol).
As the work progresses, pen-and-paper documents are starting to show their limits: I have tons of papers, new and outdated merged together, etc. And I can't show anything to anyone, since I describe the protocol using my own notation (a mix of flowcharts and C structures).
I need software that will help me design a network protocol. I should be able to create structures, fields, their sizes, their layouts, etc., and the software would generate some nice UML-ish diagrams.
Sorry to say, everything I've seen so far (various serial protocols for embedded devices/networks) has used Word documents, with plain old tables showing allocations of fields to the bytes in the message. Alternatively, I've seen it done in Excel documents! It works, and people can read it.
Unfortunately, that's not helpful for automatic code generation, unless you have a very strict format in e.g. an Excel doc that you can then parse with a tool to generate some code. It would be good to have a notation that can be easily machine parsed, as well as human readable.
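As a sketch of what "easily machine parsed" can mean in practice: keep the field allocation in a small data structure and generate the packing/unpacking code from it. The field names and sizes below are invented for illustration, with Python's struct module doing the byte layout:

    import struct

    # Single source of truth for the wire layout: (field name, struct code).
    FIELDS = [("msg_type", "B"), ("length", "H"), ("session_id", "I")]
    FMT = ">" + "".join(code for _, code in FIELDS)  # big-endian wire order

    def pack_message(**values):
        return struct.pack(FMT, *(values[name] for name, _ in FIELDS))

    def unpack_message(data):
        return dict(zip((name for name, _ in FIELDS), struct.unpack(FMT, data)))

    wire = pack_message(msg_type=1, length=7, session_id=42)
    print(unpack_message(wire))  # {'msg_type': 1, 'length': 7, 'session_id': 42}

The same table could also feed a documentation generator, which covers the human-readable half.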
For showing message handshaking and sequences, a UML sequence diagram is good of course. There are lots of tools readily available to help you with that part of it.

A good F# codebase to learn from [closed]

I've been teaching myself F# for a while now. I've read Programming F# by Chris Smith (great book) and I've written a few small scripts for getting the job done here and there.
But IMO the best way to learn a new programming language—and more importantly, the idioms that come with it—is to read a good open source codebase written in that language. Naturally, writing code in that language is crucial, but in the beginning, you're basically struggling with your own ignorance about how things should be done. You could perform certain tasks one way or the other, but it takes experience to realize the flaws and virtues of each. Even after you've gotten a firm grasp of how things work, reading the code of people who have an even firmer one helps a great deal.
Most would agree that the most insightful parts of any learn-a-programming-language book are the code examples, and reading a well-written open source codebase is the next level of that.
So are there any out there for F#?
See this question.
IMO, F# PowerPack is the best code base there.
Here are a few additional links that you may find interesting:
If you download F# for Visual Studio 2008, it also comes with the source of the entire F# library. The code is sometimes a bit difficult and uses some internal tricks in a few places, but it is a very good resource for learning (for example, the Map type is a great example of a tree data structure).
There are some official F# Samples and some F# Community Samples (which includes my 3D fractal, example of working with quotations and a few shorter examples).
You can also look at the source code of the samples from my Real-World Functional Programming book. The later chapters especially contain relatively complex sample applications (a parallel simulation of animals, a rectangle-drawing application, etc.). The code has quite a lot of comments, so it may be useful for learning F#.
I would say that the WPF F# control codebase at http://wpffsharp.codeplex.com/ is a good place to start. One of the least trivial aspects of F# is UI, and this is an excellent start on UI in F#. Also, since the codebase was written by someone who was also learning F#, you get the benefit of seeing the iterations they went through.

What languages have strong string parsing capability like Perl's? [closed]

I am familiar with Perl's strong parsing abilities using regular expressions.
Is it efficient?
What other languages have strong parsing ability and perform efficiently?
You can have a look at this benchmark which shows how different programming languages perform with regards to memory consumption and speed.
SNOBOL and Icon are two other languages devoted to manipulating strings. The first is rather old, while the second is not used much.
Anyway, I would start from your problem. Depending on what you are trying to achieve (and your constraints), you might discover that even AWK, sed, or gema would be a perfect match for your needs. Or not...
I would dare to say that if parsing is so prominent in your task, you might benefit from using a parser generator (lex/yacc, ANTLR, lemon, ...).
Pretty much all modern languages have regular expressions that are relatively efficient: Java, C#, PHP, Python, even JavaScript (among others).
I would say Python.
EDIT: I came across pystring, in case you're working in C++ but seek the flexibility of Python strings.
PowerBASIC is well worth checking out. They have two versions; the Console Compiler would be ideal if you do not need a GUI.
It is not in the benchmark linked above, but it is extremely fast. I use it extensively for writing utilities that do specialized tasks.
Most languages these days have fast regex libraries that you can use for your purposes. Perl's strength is that regular expressions are integrated into the language itself, so you can do a lot of string processing with just the language core (as opposed to, say, Python, where they live in a separate module).
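To make the Python comparison concrete, the "separate module" is just one import away; a trivial example (the log format here is made up):

    import re  # Python's regex support lives in the standard library

    log_line = "2024-01-15 ERROR disk full"
    m = re.match(r"(\d{4}-\d{2}-\d{2})\s+(\w+)\s+(.*)", log_line)
    if m:
        date, level, message = m.groups()
        print(level, message)  # ERROR disk full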
