I am new to Scheme, but I understand recursion and a few things about parsing in general. Does anybody have experience parsing YAML (at least part of the spec) using Scheme/Lisp? At this point, I am not looking for efficiency.
Here is the source of a parser for YAML in Racket:
https://github.com/esilkensen/yaml/blob/master/yaml/parser.rkt
It is a recursive descent parser and would be easy to port to RnRS Scheme.
Documentation: http://pkg-build.racket-lang.org/doc/yaml/index.html
Related
How to build YAML parser using tatsu python parser generator?
Parsing an indentation-based language like YAML is difficult, and I have not been able to achieve this.
TatSu was used to do experiments and bootstrap the new PEG parser in Python.
You can find the solutions to INDENT/DEDENT I used in the original efforts here:
https://github.com/neogeny/pygl
{"extractorData":{"url":"http://mobcrush.com","resourceId":"VALUE","data":[{"group":[{"Userpart value":[{"text":"Galadon"}]},{"Userpart value":[{"text":"ShinKaigan"}]},{"Userpart value":[{"text":"Minecon2016"}]},{"Userpart value":[{"text":"Asater"}]},{"Userpart value":[{"text":"PixieMethod"}]},{"Userpart value":[{"text":"MrSilent"}]},{"Userpart value":[{"text":"MadeMoiselle"}]},{"Userpart value":[{"text":"RohanLive"}]},{"Userpart value":[{"text":"TheRealMcSlushie"}]},{"Userpart value":[{"text":"gibbs"}]},{"Userpart value":[{"text":"karlminer"}]},{"Userpart value":[{"text":"etowah5"}]},{"Userpart value":[{"text":"Suha"}]},{"Userpart value":[{"text":"esl_hearthstone"}]},{"Userpart value":[{"text":"Feller_Rus"}]},{"Userpart value":[{"text":"『Bel』"}]},{"Userpart value":[{"text":"Tenebray"}]},{"Userpart value":[{"text":"T3x05"}]},{"Userpart value":[{"text":"rikkrollins"}]},{"Userpart value":[{"text":"xwarpewpew"}]}]}]},"pageData":{"resourceId":"VALUE","statusCode":200,"timestamp":1474736137294},"url":"http://mobcrush.com","runtimeConfigId":"VALUE","timestamp":1474736451447,"sequenceNumber":-1}
1) Identify the type of data this is [showing us an example only helps us eliminate what it is not]. Is it JSON?
2) Get a parser for that kind of data, or build such a parser. For standard data exchange formats like JSON, there are typically parser libraries already available for major languages. If not, how to build parsers is well understood, and you can write one yourself.
[See my SO article on how to build recursive descent parsers by hand.]
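For the record, the blob above does parse as JSON, so in Python the standard library handles it directly. A minimal sketch (the filename capture.json is hypothetical; the key names are taken from the sample):

    import json

    # Load the capture shown above (assumed saved as capture.json)
    with open("capture.json") as f:
        doc = json.load(f)

    # Walk the nested structure to pull out the user names
    groups = doc["extractorData"]["data"][0]["group"]
    names = [g["Userpart value"][0]["text"] for g in groups]
    print(names)  # ['Galadon', 'ShinKaigan', 'Minecon2016', ...]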
I wish to use FParsec for a Python-like, indentation-based language.
I understand that this must be done in the lexing phase, but FParsec doesn't have a lexing phase. Is it possible to use FParsec for this, or how can I feed it tokens after a separate lexing step?
P.S.: I'm new to F#, but experienced in other languages.
Yes, it's possible.
Here is a relevant article by the FParsec author. If you want to go deeper into the subject, this paper might be worth a read. The paper points out that there are multiple packages for indentation-aware parsing that are based on Parsec, the parser combinator library that inspired FParsec.
FParsec doesn't have a separate lexing phase; instead it fuses lexing and parsing into a single phase. IMO indentation-aware parsing is better done with parser combinators (FParsec) than with parser generators (fslex/fsyacc), because you need to manually track the current indentation and report good error messages based on context.
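To make the "track the current indentation" idea concrete, here is a minimal sketch in Python rather than F#; everything below is invented for illustration (in FParsec the level would typically live in the user state), and blank lines are ignored for brevity:

    def column(line):
        # Starting column of a line = number of leading spaces
        return len(line) - len(line.lstrip(" "))

    def statements(lines, pos, level):
        """Parse consecutive lines at exactly `level` columns,
        recursing whenever a strictly deeper block follows."""
        out = []
        while pos < len(lines) and column(lines[pos]) == level:
            head = lines[pos].strip()
            pos += 1
            if pos < len(lines) and column(lines[pos]) > level:
                body, pos = statements(lines, pos, column(lines[pos]))
                out.append((head, body))   # compound statement with a body
            else:
                out.append(head)           # simple one-line statement
        return out, pos

    demo = ["for x:", "  f(x)", "  g(x)", "done"]
    print(statements(demo, 0, 0)[0])
    # [('for x:', ['f(x)', 'g(x)']), 'done']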
I'm writing a tool with its own built-in language, similar to Python. I want to make indentation meaningful in the syntax (so that tabs and spaces at the beginning of a line would represent nesting of commands).
What is the best way to do this?
I've written recursive-descent and finite automata parsers before.
The current CPython parser seems to be generated using something called ASDL.
Regarding the indentation you're asking for, it's done using special lexer tokens called INDENT and DEDENT. To replicate that, just implement those tokens in your lexer (that is pretty easy if you use a stack to store the starting columns of previous indented lines), and then plug them into your grammar as usual (like any other keyword or operator token).
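A minimal sketch of that stack technique (the token shapes below are just for illustration):

    def indent_tokens(lines):
        """Yield INDENT/DEDENT tokens by comparing each line's starting
        column against a stack of enclosing indentation levels."""
        stack = [0]                      # column 0 is the outermost level
        for line in lines:
            if not line.strip():         # blank lines carry no indentation info
                continue
            col = len(line) - len(line.lstrip(" "))
            if col > stack[-1]:          # deeper than before: open a block
                stack.append(col)
                yield ("INDENT", col)
            while col < stack[-1]:       # shallower: close blocks until we match
                stack.pop()
                yield ("DEDENT", col)
            if col != stack[-1]:
                raise IndentationError(f"unindent does not match any level: {line!r}")
            yield ("LINE", line.strip())
        while stack[-1] > 0:             # close any blocks left open at EOF
            stack.pop()
            yield ("DEDENT", 0)

    demo = ["a:", "  b", "  c:", "    d", "e"]
    for tok in indent_tokens(demo):
        print(tok)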
Check out the python compiler and in particular compiler.parse.
I'd suggest ANTLR for any lexer/parser generation ( http://www.antlr.org ).
Also, this website ( http://erezsh.wordpress.com/2008/07/12/python-parsing-1-lexing/ ) has some more information, in particular:
Python’s indentation cannot be solved with a DFA. (I’m still perplexed at whether it can even be solved with a context-free grammar).
PyPy produced an interesting post about lexing Python (they intend to solve it by post-processing the lexer output)
CPython’s tokenizer is written in C. It’s ad-hoc, hand-written, and complex. It is the only official implementation of Python lexing that I know of.
I am trying to write a simple YAML parser. I have read the spec from yaml.org, and before I start I was wondering whether it is better to write a hand-rolled parser or to use lex (flex/bison). I looked at libyaml (the C library); it doesn't seem to use lex/yacc.
YAML (excluding the flow styles) seems to be more line-oriented, so is it easier to write a hand-rolled parser, or to use flex/bison?
Thanks.
This answer is basically an answer to the question "Should I roll my own parser or use a parser generator?" and does not have much to do with YAML specifically. But it will nevertheless "answer" your question.
The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand-written recursive descent parser.
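For a sense of scale, here is roughly what the hand-written route looks like for a tiny block-mapping-only subset of YAML, in Python. This is only a sketch under many simplifying assumptions (spaces-only indentation, mappings only, scalar values kept as strings), nowhere near the full spec:

    def parse_mapping(lines, pos=0, level=0):
        """Recursive descent over block mappings: `key: value` pairs,
        where a key with no inline value introduces a nested mapping.
        Returns (dict, next_position)."""
        result = {}
        while pos < len(lines):
            line = lines[pos]
            if not line.strip() or line.lstrip().startswith("#"):
                pos += 1                      # skip blanks and comments
                continue
            indent = len(line) - len(line.lstrip(" "))
            if indent < level:
                break                         # this nested mapping is done
            if indent > level:
                raise SyntaxError(f"bad indentation at line {pos + 1}")
            key, _, value = line.strip().partition(":")
            pos += 1
            if value.strip():                 # inline scalar value
                result[key] = value.strip()
            else:                             # a nested block mapping follows
                child = (len(lines[pos]) - len(lines[pos].lstrip(" "))
                         if pos < len(lines) else level)
                if child <= level:            # nothing indented under the key
                    result[key] = None
                else:
                    result[key], pos = parse_mapping(lines, pos, child)
        return result, pos

    doc = ["server:", "  host: example.org", "  port: 8080", "name: demo"]
    print(parse_mapping(doc)[0])
    # {'server': {'host': 'example.org', 'port': '8080'}, 'name': 'demo'}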
I have written, among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples because they go against intuition, and because exterior requirements dictated the decision.
Since I have established that it is technically possible either way, why would you pick one over the other? This is a really hard question to answer; here are some thoughts on the subject:
Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little point in hand-rolling your own lexer, unless you have really exotic input formats.
Using bison or similar generators makes the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
Using a recursive descent parser makes it quite clear what happens in the parser. This gives you an easy way to gracefully handle hairy conflicts: you can write a simple if instead of rearranging the entire grammar to be LALR(1) (see the sketch after these points).
While developing the parser you can "gloss over details" with a hand-written parser; with bison this is almost impossible. In bison the grammar must work, or the generator will not do anything.
Bison is awesome at pointing out formal flaws in the grammar. Unfortunately you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.
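As promised above, a hypothetical illustration of the "simple if" point, in Python: in a Python-like language an assignment's left-hand side looks like an ordinary expression list until the "=" appears, and a grammar generator typically forces you to unify the two rules and validate the targets afterwards, while a hand-written parser can simply scan ahead. The token shapes here are invented:

    def parse_statement(toks, pos=0):
        """Decide between assignment and bare expression by peeking
        ahead for a top-level "=" token; no grammar surgery needed."""
        look = pos
        while look < len(toks) and toks[look][0] != "EQUALS":
            look += 1
        if look < len(toks):              # found "=": it's an assignment
            return ("assign", toks[pos:look], toks[look + 1:])
        return ("expr", toks[pos:])

    toks = [("NAME", "a"), ("COMMA", ","), ("NAME", "b"),
            ("EQUALS", "="), ("NUM", "1"), ("COMMA", ","), ("NUM", "2")]
    print(parse_statement(toks))
    # ('assign', [('NAME', 'a'), ('COMMA', ','), ('NAME', 'b')],
    #            [('NUM', '1'), ('COMMA', ','), ('NUM', '2')])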
This is not a definitive answer for one or the other, but it should point you in the right direction. Since it appears that you are writing the parser for fun, I think you should write both types of parser.