{"extractorData":{"url":"http://mobcrush.com","resourceId":"VALUE","data":[{"group":[{"Userpart value":[{"text":"Galadon"}]},{"Userpart value":[{"text":"ShinKaigan"}]},{"Userpart value":[{"text":"Minecon2016"}]},{"Userpart value":[{"text":"Asater"}]},{"Userpart value":[{"text":"PixieMethod"}]},{"Userpart value":[{"text":"MrSilent"}]},{"Userpart value":[{"text":"MadeMoiselle"}]},{"Userpart value":[{"text":"RohanLive"}]},{"Userpart value":[{"text":"TheRealMcSlushie"}]},{"Userpart value":[{"text":"gibbs"}]},{"Userpart value":[{"text":"karlminer"}]},{"Userpart value":[{"text":"etowah5"}]},{"Userpart value":[{"text":"Suha"}]},{"Userpart value":[{"text":"esl_hearthstone"}]},{"Userpart value":[{"text":"Feller_Rus"}]},{"Userpart value":[{"text":"『Bel』"}]},{"Userpart value":[{"text":"Tenebray"}]},{"Userpart value":[{"text":"T3x05"}]},{"Userpart value":[{"text":"rikkrollins"}]},{"Userpart value":[{"text":"xwarpewpew"}]}]}]},"pageData":{"resourceId":"VALUE","statusCode":200,"timestamp":1474736137294},"url":"http://mobcrush.com","runtimeConfigId":"VALUE","timestamp":1474736451447,"sequenceNumber":-1}
1) Identify the type of data this is [showing us an example only helps us eliminate what it is not]. Is it JSON?
2) Get a parser for that kind of data, or build one. For standard data-exchange formats like JSON, parser libraries are typically already available for the major languages. If not, how to build parsers is well understood and you can write one yourself.
[See my SO article on how to build recursive descent parsers by hand.]
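For step 2 with the sample above: it is JSON, so an existing JSON library does the parsing for you. Below is a minimal sketch (my own illustration, not from the original post) using Haskell's aeson to walk extractorData.data[].group[]."Userpart value"[].text and collect the names; the file name sample.json is assumed.

    {-# LANGUAGE OverloadedStrings #-}
    import           Data.Aeson
    import           Data.Aeson.Types     (Parser, parseEither)
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Text            as T

    -- Collect every "text" field under extractorData.data[].group[]."Userpart value"[].
    userNames :: Value -> Parser [T.Text]
    userNames = withObject "root" $ \root -> do
      extractor <- root .: "extractorData"
      groups    <- extractor .: "data"
      fmap concat . flip mapM groups $ \g -> do
        entries <- g .: "group"
        flip mapM entries $ \e -> do
          (v:_) <- e .: "Userpart value"   -- each entry is a one-element array
          v .: "text"

    main :: IO ()
    main = do
      raw <- BL.readFile "sample.json"     -- the blob shown above, saved to a file
      case eitherDecode raw >>= parseEither userNames of
        Left err    -> putStrLn ("parse error: " ++ err)
        Right names -> mapM_ (putStrLn . T.unpack) names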
I've seen two approaches to parsing:
Use a parser generator like happy. This allows you to specify your language in BNF, and not worry about the intricacies of parsing. However, since it's a preprocessor you have to write your whole parse tree textually.
Use a parser combinator library directly, like megaparsec. With this approach you have direct access to your code, so you can generate your parser programmatically, but you lose the convenience of happy's simple BNF specification with precedence annotations etc. It also seems non-trivial to print out a BNF tree for documentation from your parsing code unless this is considered during its construction.
What I'd like to do is something like this:
Generate a data structure programmatically that represents BNF.
Feed this through to a "happy like" parser generator to generate a parser.
Feed this through a pretty printer to generate actual BNF documentation.
The reason I want to do this is that the grammar I'm working on has grown quite large and has a lot of repetition, as many of its constructs are similar to others but slightly different. It would reduce the maintenance effort if it could be generated programmatically instead of modifying the happy BNF spec directly, but I'd rather not have to develop my own parser from scratch.
Any ideas about a good approach here? It would be great if I could just generate a data structure and force it into happy (as it presumably generates its own internal structure after parsing the BNF fed to it), but happy doesn't seem to have a library interface.
I guess I could generate annotated BNF text and feed that through to happy, but it seems like a messy process of converting back and forth. A cleaner approach would be better. Perhaps even a BNF-style extension to parsec or megaparsec?
The simplest thing to do would be to make some data type representing the relevant grammar, and then convert it to a parser using some parser combinators as a (run-time) "compile" step. Unfortunately, most parser combinators are less efficient and/or less flexible (in some ways) than the parser generators, so this would be a bit of a lowest-common-denominator approach. That said, the grammar-combinators library may be useful, though it doesn't appear to be maintained.
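To make that concrete, here is a minimal sketch of the grammar-as-data idea, assuming megaparsec as the combinator backend; the Rule type and all names are invented for illustration, and a real version would have to handle result values, precedence and left recursion:

    {-# LANGUAGE LambdaCase #-}
    import           Control.Applicative  (many)
    import qualified Data.Map             as M
    import           Data.Void            (Void)
    import           Text.Megaparsec      (Parsec, choice, try)
    import           Text.Megaparsec.Char (string)

    type Parser = Parsec Void String

    -- A tiny BNF-ish representation that can be built programmatically
    -- (and pretty-printed for documentation).
    data Rule
      = Term String       -- literal terminal
      | NonTerm String    -- reference to another rule
      | Seq [Rule]        -- concatenation
      | Alt [Rule]        -- alternation
      | Many Rule         -- zero or more

    type Grammar = M.Map String Rule

    -- "Compile" the grammar to a parser at run time (result type kept trivial here).
    compile :: Grammar -> Rule -> Parser ()
    compile g = go
      where
        go = \case
          Term s    -> () <$ string s
          NonTerm n -> maybe (fail ("unknown rule: " ++ n)) go (M.lookup n g)
          Seq rs    -> mapM_ go rs
          Alt rs    -> choice (map (try . go) rs)
          Many r    -> () <$ many (go r)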
There are libraries that can generate parsers at run-time. One I found just now is Grempa, which doesn't appear to be maintained, but that may not be a problem. Another option (by the same person who made Grempa, but maintained) is Earley; due to the way Earley parsers are built, it makes sense there to have an explicit grammar that gets processed into a parser. Earley parsing is certainly flexible, but may be overpowered for you (or maybe not).
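For a flavour of the Earley package's interface, a rough sketch from memory (check the Text.Earley documentation for the exact signatures): the grammar is an ordinary value built in the Grammar monad and only then turned into a parser.

    {-# LANGUAGE RecursiveDo #-}
    import Control.Applicative ((<|>))
    import Text.Earley

    -- A tiny expression grammar over Char tokens, built as an explicit value.
    expr :: Grammar r (Prod r () Char Int)
    expr = mdo
      let digit = (\c -> fromEnum c - fromEnum '0') <$> satisfy (`elem` ['0'..'9'])
      sumP  <- rule $ (+) <$> sumP  <* token '+' <*> prodP  <|>  prodP
      prodP <- rule $ (*) <$> prodP <* token '*' <*> digit  <|>  digit
      return sumP

    main :: IO ()
    main = print (fst (fullParses (parser expr) "1+2*3"))   -- expect [7]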
I was entirely amazed by how Coq's parser is implemented. e.g.
https://softwarefoundations.cis.upenn.edu/lf-current/Imp.html#lab347
It seems almost crazy that the parser happily accepts any lexeme introduced by a Notation command, and the subsequent parser is then able to parse any expression using it as-is. That would mean the grammar must be context sensitive. But this is so flexible that it absolutely goes beyond my comprehension.
Any pointers on how this kind of parser is theoretically feasible? How does it work? Any materials or background knowledge would help; I'm just trying to learn about this type of parser in general. Thanks.
Please do not ask me to read Coq's source myself. I want to check the idea in general but not a specific implementation.
Indeed, this notation system is very powerful and it was probably one of the reasons for Coq's success. In practice, it is a source of much complication in the source code. I think that #ejgallego should be able to tell you more about it, but here is a quick explanation:
At the beginning, Coq's documents were evaluated sentence by sentence (sentences are separated by dots) by coqtop. Some commands can define notations and these modify the parsing rules when they are evaluated. Thus, later sentences are evaluated with a slightly different parser.
Since version 8.5, there is also a mechanism (the STM) to evaluate a document fully (many sentences in parallel) but there is some special mechanism for handling these notation commands (basically you have to wait for these to be evaluated before you can continue parsing and evaluating the rest of the document).
Thus, contrary to a normal programming language, where the compiler will take a document, pass it through the lexer, then the parser (parse the full document in one go), and then have an AST to give to the typer or other later stages, in Coq each command is parsed and evaluated separately. Thus, there is no need to resort to complex contextual grammars...
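A conceptual sketch of that sentence-by-sentence loop, in Haskell with invented placeholder types (this is only an illustration of the idea, nothing to do with Coq's real code): every sentence is parsed with the grammar as it exists at that point, and a notation command extends the grammar seen by later sentences.

    type Rule    = String                 -- stand-in for a real parsing rule
    type Grammar = [Rule]

    data Command = DefineNotation Rule    -- a sentence like: Notation "x + y" := ...
                 | Ordinary String        -- any other sentence
                 deriving Show

    -- Parse one sentence with the current grammar (placeholder logic only).
    parseSentence :: Grammar -> String -> Command
    parseSentence _ s
      | take 8 s == "Notation" = DefineNotation (drop 9 s)
      | otherwise              = Ordinary s

    -- Evaluate a document: later sentences are parsed with an extended grammar.
    processDoc :: Grammar -> [String] -> [Command]
    processDoc _ []              = []
    processDoc g (sentence:rest) =
      case parseSentence g sentence of
        cmd@(DefineNotation r) -> cmd : processDoc (r : g) rest
        cmd                    -> cmd : processDoc g rest

    main :: IO ()
    main = mapM_ print (processDoc []
      ["Definition one := 1", "Notation \"x + y\" := (plus x y)", "Check (1 + 1)"])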
I'll drop my two cents to complement #Zimmi48's excellent answer.
Coq indeed features an extensible parser, which TTBOMK is mainly the work of Hugo Herbelin, built on the CAMLP4/CAMLP5 extensible parsing system by Daniel de Rauglaudre. Both are the canonical sources for information about the parser; I'll try to summarize what I know, but note that my experience with the system is short.
The CAMLPX system basically supports any LL(1) grammar. Coq exposes the whole set of grammar rules to the user, allowing the user to redefine them. This is the base mechanism on which extensible grammars are built. Notations are compiled into parsing rules in the Metasyntax module, and unfolded in a later post-processing phase. And that's really it, AFAICT.
The system itself hasn't changed much in the whole 8.x series; #Zimmi48's comments are more related to the internal processing of commands after parsing. I recently learned that Coq v7 had an even more powerful system for modifying the parser.
In the words of Hugo Herbelin, "the art of extensible parsing is a delicate one", and indeed it is, but Coq has achieved a pretty great implementation of it.
I wish to use FParsec for a python-like language, indentation-based.
I understand that this must be done in the lexing phase, but FParsec doesn't have a lexing phase. Is it possible to use FParsec, or how can I feed it after lexing?
P.S.: I'm new to F#, but experienced in other languages.
Yes, it's possible.
Here is a relevant article by the FParsec author. If you want to go deeper into the subject, this paper might be worth a read. The paper points out that there are multiple packages for indentation-aware parsing that are based on Parsec, the parser combinator library that inspired FParsec.
FParsec doesn't have a separate lexing phase; instead it fuses lexing and parsing into a single phase. IMO indentation-aware parsing is better done with parser combinators (FParsec) than with parser generators (fslex/fsyacc), because you need to manually track the current indentation and report good error messages based on context.
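To make the idea concrete (in Haskell/Parsec terms; the question is about FParsec, but the technique is the same): remember the column where a construct starts and only accept children that begin further to the right. The mini-grammar below, a word followed by deeper-indented children, is invented for illustration.

    import Text.Parsec
    import Text.Parsec.String (Parser)

    data Item = Item String [Item] deriving Show

    -- An item is a word, optionally followed by children indented deeper than it.
    item :: Parser Item
    item = do
      col  <- sourceColumn <$> getPosition   -- remember where this item starts
      name <- many1 letter <* spaces
      kids <- many (try (childAt col))       -- children must start right of `col`
      return (Item name kids)

    childAt :: Int -> Parser Item
    childAt parentCol = do
      col <- sourceColumn <$> getPosition
      if col > parentCol then item else fail "not indented"

    main :: IO ()
    main = parseTest item "a\n  b\n  c\n    d"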
I am going to write a parser for the Verilog (or VHDL) language and will do a lot of manipulations (a sort of transformation) of the parsed data. I intend to parse really big files (full Verilog designs, as big as 10K lines) and I will ultimately support most of Verilog. I don't mind typing, but I don't want to rewrite any part of the code whenever I add support for some other rule.
In Haskell, which library would you recommend? I know Haskell and have used Happy before (to play). I feel that there are possibilities in using Parsec for transforming the parsed string in the code (which is a great plus). I have no experience with uu-parsinglib.
So to parse the full grammar of Verilog/VHDL, which of them is recommended? My main concern is the ease and 'correctness' with which I can manipulate the parsed data at my whim. Speed is not a primary concern.
I personally prefer Parsec with the help of Alex for lexing.
I prefer Parsec over Happy because 1) Parsec is a library, while Happy is a program: you write the grammar in a separate language and then compile it with Happy. 2) Parsec gives you context-sensitive parsing abilities thanks to its monadic interface. You can carry extra state for context-sensitive parsing and then inspect that state to decide what to do next, or just look at a previously parsed value and choose the next parser based on it (like a <- parseSomething; if test a then ... else ...). And when you don't need any context-sensitive information, you can simply use the applicative style and get an implementation like one generated by YACC or a similar tool.
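A tiny illustration of that monadic, context-sensitive style (my example, not the answerer's): first parse a number, then use it to decide how many items to parse next, something a plain context-free specification can't express directly.

    import Text.Parsec
    import Text.Parsec.String (Parser)

    -- Parse a count, then exactly that many words: the parsed value
    -- decides what the rest of the parser does.
    countedWords :: Parser [String]
    countedWords = do
      n <- read <$> many1 digit
      spaces
      count n (many1 letter <* spaces)

    main :: IO ()
    main = parseTest countedWords "3 foo bar baz"   -- ["foo","bar","baz"]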
As a downside of Parsec, you'll never know statically whether your Parsec parser contains left recursion, and such a parser will get stuck at runtime (because Parsec is basically a top-down recursive-descent parser). You have to find left recursions and eliminate them yourself. YACC-style parsers can give you some static guarantees and information (like shift/reduce conflicts, unused terminals etc.) that you can't get with Parsec.
Alex is highly recommended for lexing in both cases (I think you have to use Alex if you decide to go with Happy). Even if you use Parsec, it really simplifies your parser implementation and catches a great deal of bugs too (for example, parsing a keyword as an identifier was a common bug I made while using Parsec without Alex).
You can have a look at my Lua parser implemented in Alex+Parsec, and here's the code to use Alex-generated tokens in Parsec.
EDIT: Thanks to John L for the corrections. Apparently you can do context-sensitive parsing with Happy too. Also, Alex is not required for lexing with Happy, though it's recommended.
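For a rough idea of what the "Alex-generated tokens in Parsec" glue mentioned above involves, here is my own sketch with an invented Tok type (not the linked code): Parsec can run over any token list via tokenPrim.

    import Text.Parsec

    -- A stand-in for the token type an Alex-generated lexer would produce.
    data Tok = TIdent String | TKeyword String | TInt Integer
      deriving (Show, Eq)

    type TokParser = Parsec [Tok] ()

    -- Accept a token that the matcher maps to Just.
    satisfyTok :: (Tok -> Maybe a) -> TokParser a
    satisfyTok match =
      tokenPrim show (\pos _ _ -> incSourceColumn pos 1) match
      -- a real lexer would carry source positions inside Tok and use them here

    identifier :: TokParser String
    identifier = satisfyTok $ \t -> case t of
      TIdent s -> Just s
      _        -> Nothing

    keyword :: String -> TokParser ()
    keyword k = satisfyTok $ \t -> case t of
      TKeyword s | s == k -> Just ()
      _                   -> Nothing

    main :: IO ()
    main = parseTest (keyword "module" *> identifier)
                     [TKeyword "module", TIdent "adder"]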
There are many open-source parser implementations available to us in Haskell. Parsec seems to be the standard for text parsing and attoparsec seems to be a popular choice for binary parsing, but I don't know much beyond that. Is there a particular decision tree that you follow for choosing a parser implementation? Have you learned anything interesting about the strengths or weaknesses of the libraries?
You have several good options.
For lightweight parsing of String types:
parsec
polyparse
For packed bytestring parsing, e.g. of HTTP headers:
attoparsec
For actual binary data most people use either:
binary -- for lazy binary parsing
cereal -- for strict binary parsing
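As a quick illustration of the binary option just listed, here is a toy format of my own (a tag byte plus a big-endian length prefix), not anything from the answer:

    import           Data.Binary.Get      (Get, getLazyByteString, getWord32be,
                                           getWord8, runGet)
    import qualified Data.ByteString.Lazy as BL
    import           Data.Word            (Word8)

    data Packet = Packet { packetTag :: Word8, packetBody :: BL.ByteString }
      deriving Show

    -- Decode a tag byte, a big-endian length, then that many payload bytes.
    getPacket :: Get Packet
    getPacket = do
      tag  <- getWord8
      len  <- getWord32be
      body <- getLazyByteString (fromIntegral len)
      return (Packet tag body)

    main :: IO ()
    main = print (runGet getPacket (BL.pack (42 : 0 : 0 : 0 : 2 : [65, 66])))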
The main question to ask yourself is what is the underlying string type?
String?
bytestring (strict)?
bytestring (lazy)?
unicode text?
That decision largely determines which parser toolset you'll use.
The second question to ask is: do I already have a grammar for the data type? If so, I can just use happy
The Happy parser generator
And obviously for custom data types there are a variety of good existing parsers:
XML
haxml
xml-light
hxt
hexpat
CSV
bytestring-csv
csv
JSON
json
rss/atom
feed
Just to add to Don's post: Personally, I quite like Text.ParserCombinators.ReadP (part of base) for no-nonsense quick and easy stuff. Particularly when Parsec seems like overkill.
There is a bytestringreadp library for the bytestring version, but it doesn't cover Char8 bytestrings, and I suspect attoparsec would be a better choice at this point.
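For a flavour of that quick-and-easy use of ReadP, a throwaway example of my own: splitting a key=value pair with nothing but base.

    import Data.Char (isAlphaNum)
    import Text.ParserCombinators.ReadP

    -- Split something like "timeout=30" into a (key, value) pair.
    keyValue :: ReadP (String, String)
    keyValue = do
      key <- munch1 isAlphaNum
      _   <- char '='
      val <- munch1 (const True)
      return (key, val)

    main :: IO ()
    main = print (readP_to_S (keyValue <* eof) "timeout=30")
    -- [(("timeout","30"),"")]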
I recently converted some code from Parsec to Attoparsec. Both are quite capable.
Attoparsec wins on performance and memory footprint, but Parsec provides better error reporting and has more complete documentation.
Bryan O’Sullivan’s blog post What’s in a parser? Attoparsec rewired (2/2) includes a nice performance benchmark comparing several implementations along with some comments comparing memory usage.