Lexing and parsing a CSS hierarchy

.someDiv { width:100%; height:100%; ... more properties ... }
How would I write a rule in my parser that matches the string above?
It seems rather impossible to me, since you can't define an unlimited number of properties in a rule. Could someone please clarify how you would do such a thing with FsLex and FsYacc?

If you're using FsLex and FsYacc, then you can parse the properties inside { ... } as a list of properties. Assuming you have a lexer that properly recognizes all the special characters, and a rule that parses an individual property, you can write something like:
declaration:
| navigators LCURLY propertyList RCURLY { Declaration($1, $3) }
| navigators LCURLY RCURLY { Declaration($1, []) }
propertyList:
| property SEMICOLON propertyList { $1 :: $3 }
| property SEMICOLON { [$1] }
| property { [$1] }
property:
| IDENTIFIER COLON values { Property($1, $3) }
The declaration rule parses the entire declaration (you'll need to write a parser for the various selectors that can be used in CSS, such as div.foo or #id; the grammar above calls them navigators). The propertyList rule parses one property and then calls itself recursively to parse the rest ($1 is the property and $3 the remaining list, since $2 is the SEMICOLON token itself); the property SEMICOLON alternative allows the trailing semicolon that CSS permits.
The semantic value built for propertyList is a list of the parsed properties. The property rule parses an individual property assignment, e.g. width:100% (but you'll need to finish the parsing of values, because a value can be a list or a more complex expression).
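For reference, here is a minimal sketch of the AST types that the Declaration and Property constructors above could build. It is written in OCaml (FsYacc follows ocamlyacc's conventions, and the F# version is almost identical); the names and the shape of value are assumptions of this sketch, not something fixed by the question.
(* Hypothetical AST for the grammar above; all names are illustrative. *)
type value =
| Percent of int (* e.g. 100% *)
| Ident of string (* e.g. auto *)
type property =
| Property of string * value list (* e.g. width:100% *)
type declaration =
| Declaration of string list * property list
(* e.g. Declaration (["someDiv"], [Property ("width", [Percent 100])]) *)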

Make lexer consider parser before determining tokens?

I'm writing a lexer and parser in ocamllex and ocamlyacc as follows. function_name and table_name are the same regular expression, i.e., a string containing only English letters. The only way to determine whether a string is a function_name or a table_name is to check its surroundings. For example, if such a string is surrounded by [ and ], then we know that it is a table_name. Here is the current code:
In lexer.mll,
... ...
let function_name = ['a'-'z' 'A'-'Z']+
let table_name = ['a'-'z' 'A'-'Z']+
rule token = parse
| function_name as s { FUNCTIONNAME s }
| table_name as s { TABLENAME s }
... ...
In parser.mly:
... ...
main:
| LBRACKET TABLENAME RBRACKET { Table $2 }
... ...
As I wrote | function_name as s { FUNCTIONNAME s } before | table_name as s { TABLENAME s }, the above code fails to parse [haha]: the lexer first recognizes haha as a function_name, and then the parser has no rule that accepts it. If the lexer could recognize haha as a table_name instead, the parser would match [haha] as a table.
One workaround is to be more precise in the lexer. For example, we could define let table_name_with_brackets = '[' ['a'-'z' 'A'-'Z']+ ']' and add | table_name_with_brackets as s { TABLENAMEWITHBRACKETS s } to the lexer. But I would like to know whether there are other options. Is it not possible to make the lexer and parser work together to determine the tokens and the reduction?
You should avoid trying to get the lexer to do the parser's work. The lexer should just identify lexemes; it should not try to figure out where a lexeme fits into the syntax. So in your (simplified) example, there should be only one lexical type, name. The parser will figure it out from there.
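In the simplified example, that looks like the following sketch (the NAME token and the Function constructor are mine, chosen to match the question; LBRACKET, RBRACKET and Table come from your code):
(* lexer.mll: a single token type for both uses *)
rule token = parse
| ['a'-'z' 'A'-'Z']+ as s { NAME s }
| '[' { LBRACKET }
| ']' { RBRACKET }
/* parser.mly: the grammar decides what a NAME means */
%token <string> NAME
%token LBRACKET RBRACKET
%%
main:
| LBRACKET NAME RBRACKET { Table $2 }
| NAME { Function $1 }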
But it seems, from the comments, that in the unsimplified original, the two patterns are overlapping rather than identical. That's more annoying, although it's only slightly more complicated. Basically, you need to separate out the common pattern as one lexical type, and then add the additional matches as one or two other lexical types (depending on whether or not one pattern is a strict superset of the other).
That might not be too difficult, depending on the precise relationship between the two patterns. You might be able to find a very simple solution by writing the patterns in the correct order, for example, because of the longest match rule:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
Most of the time, that's all it takes: first define the intersection of the two patterns as a base lexeme, and then add the full lexical patterns of each contextual type to provide additional matches. Your parser will then have to match name | function_name in one context and name | table_name in the other context. But that's not too bad.
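For instance, suppose (purely for illustration) that a function_name may contain digits after the first letter, while a table_name may not. Then the intersection of the two patterns is the letters-only lexeme, and the digits-bearing pattern provides the additional match:
(* lexer.mll sketch; token names are illustrative *)
let letter = ['a'-'z' 'A'-'Z']
let alnum = ['a'-'z' 'A'-'Z' '0'-'9']
rule token = parse
| letter+ as s { NAME s } (* the intersection: could be either *)
| letter alnum* as s { FUNCTION_NAME s } (* only a function_name can contain digits *)
On abc both rules match three characters, so the tie goes to the earlier rule and the parser sees NAME; on abc1 the second rule matches a longer prefix and wins, so the parser sees FUNCTION_NAME.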
Where it will fail is when an input stream cannot be unambiguously divided into lexemes. For example, suppose that in a function context a name could include a ? character, but in a table context ? is a valid postfix operator. In that case, you have to actively prevent foo? from being analysed as a single token in the table context, which means that the lexer does have to be aware of parser context.

What is the double brace syntax in ASN.1?

I'm reading the PKCS #7 ASN.1 definition, and came across this type. I can't seem to find out what {{Authenticated}} is doing in this code, or what this production would be called. I've also seen {{...}} in the PKCS #8 standard.
-- ATTRIBUTE information object class specification
ATTRIBUTE ::= CLASS {
&derivation ATTRIBUTE OPTIONAL,
&Type OPTIONAL, -- either &Type or &derivation required
&equality-match MATCHING-RULE OPTIONAL,
&ordering-match MATCHING-RULE OPTIONAL,
&substrings-match MATCHING-RULE OPTIONAL,
&single-valued BOOLEAN DEFAULT FALSE,
&collective BOOLEAN DEFAULT FALSE,
&dummy BOOLEAN DEFAULT FALSE,
-- operational extensions
&no-user-modification BOOLEAN DEFAULT FALSE,
&usage AttributeUsage DEFAULT userApplications,
&id OBJECT IDENTIFIER UNIQUE
}
WITH SYNTAX {
[SUBTYPE OF &derivation]
[WITH SYNTAX &Type]
[EQUALITY MATCHING RULE &equality-match]
[ORDERING MATCHING RULE &ordering-match]
[SUBSTRINGS MATCHING RULE &substrings-match]
[SINGLE VALUE &single-valued]
[COLLECTIVE &collective]
[DUMMY &dummy]
[NO USER MODIFICATION &no-user-modification]
[USAGE &usage]
ID &id
}
Authenticated ATTRIBUTE ::= {
contentType |
messageDigest |
-- begin added for VCE SCEP-support
transactionID |
messageType |
pkiStatus |
failInfo |
senderNonce |
recipientNonce,
-- end added for VCE SCEP-support
..., -- add application-specific attributes here
signingTime
}
SignerInfoAuthenticatedAttributes ::= CHOICE {
aaSet [0] IMPLICIT SET OF AttributePKCS-7 {{Authenticated}},
aaSequence [2] EXPLICIT SEQUENCE OF AttributePKCS-7 {{Authenticated}}
-- Explicit because easier to compute digest on sequence of attributes and then reuse
-- encoded sequence in aaSequence.
}
-- Also defined in X.501
-- Redeclared here as a parameterized type
AttributePKCS-7 { ATTRIBUTE:IOSet } ::= SEQUENCE {
type ATTRIBUTE.&id({IOSet}),
values SET SIZE (1..MAX) OF ATTRIBUTE.&Type({IOSet}{@type})
}
-- Inlined from PKCS5v2-0 since it is the only thing imported from that module
-- AlgorithmIdentifier { ALGORITHM-IDENTIFIER:InfoObjectSet } ::=
AlgorithmIdentifier { TYPE-IDENTIFIER:InfoObjectSet } ::=
SEQUENCE {
-- algorithm ALGORITHM-IDENTIFIER.&id({InfoObjectSet}),
algorithm TYPE-IDENTIFIER.&id({InfoObjectSet}),
-- parameters ALGORITHM-IDENTIFIER.&Type({InfoObjectSet}
parameters TYPE-IDENTIFIER.&Type({InfoObjectSet}
{@algorithm}) OPTIONAL }
-- Private-key information syntax
PrivateKeyInfo ::= SEQUENCE {
version Version,
-- privateKeyAlgorithm AlgorithmIdentifier {{PrivateKeyAlgorithms}},
privateKeyAlgorithm AlgorithmIdentifier {{...}},
privateKey PrivateKey,
attributes [0] Attributes OPTIONAL }
There is no ASN.1 item called a double brace; each single brace (even when nested) is a separate token. As the definition of AttributePKCS-7 above shows, it is a parameterized definition that takes an Information Object Set as its parameter. The outer pair of braces indicates parameter substitution, while the inner pair indicates that Authenticated is an Information Object Set (which is used as the parameter). The purpose of the information object set is to restrict the possible values of certain fields to those contained in the object set: in AttributePKCS-7, the table constraints restrict type to the &id fields of the objects in the set, and each element of values to the &Type of the object selected by type.
As for the {{...}}, this is similar to the above, except that the object set is an empty extensible object set (indicated by the {...}) which is being used as a parameter (indicated by the outer pair of braces).

Is it good bison practice to not have semantic values, but use side effects of actions?

I have a language where the semantic value of everything is an array of characters or an array of arrays. So I have the following YYSTYPE:
typedef struct _array {
union {
char *chars; // start of string
void *base; // start of array
};
unsigned n; // number of valid elements in above
unsigned allocated; // number of allocated elements for above
} array;
#define YYSTYPE array
and I can append an array of characters to an array of arrays with
void append(YYSTYPE *parray, YYSTYPE *string);
Suppose the grammar (SSCCE) is:
%token WORD
%%
array : WORD
| array WORD
;
So I accept a sequence of words. For each word, the semantic value becomes that array of characters, and I would like to append each of these to the array of arrays for the whole sequence.
There are several possible ways to design the actions:
Give the array symbol a semantic value of type array. If I do this, then the action for array WORD will have to copy the array $1 to $$, which is slow, so I don't like that.
Give the array symbol a semantic value of type array *. Then, in the action for array WORD, I can just append to the array *$1 and set $$ equal to $1. But I don't like this for two reasons. First, the semantic meaning is not a pointer to an array; it is the array. Second, in the action for the rule array : WORD, I will have to malloc the structure, which is slow. Yes, append sometimes does a malloc, but if I allocate enough up front, not frequently. I want to avoid any unnecessary malloc for performance reasons.
Forget about trying to have a semantic value for the symbol array at all, and use globals:
static YYSTYPE g_array;
YYSTYPE *g_parray = &g_array;
and then, the actions will just use
append(g_parray, word_array)
The way the whole grammar works, I don't need more than one g_array. The above is the fastest approach I can think of. But it is really bad design: lots of globals, no semantic values, and everything happens through side effects on globals.
So, personally I don't like any of them. Which is the commonly accepted best practice for bison?
In most cases, there is no point in using globals. More-or-less modern versions of bison have the %parse-param directive, which allows you to have a sort of 'parsing context'. The context may take care of all memory allocations, etc.
It may also reflect the current parsing state, i.e. have a notion of the 'current array' and so on. In this case, your semantic actions can rely on the context knowing where you are.
%{
typedef struct tagContext Context;
typedef struct tagCharString CharString;
void start_words(Context* ctx);
void add_word(Context* ctx, CharString* word);
%}
%union {
CharString* word;
}
%parse-param {Context* ctx}
%token<word> WORD
%start words
%%
words
: { start_words(ctx); } word
| words word
;
word
: WORD { add_word(ctx, $1); }
;
If you only parse a list of words and nothing else, you can make that list your context.
However, in a simple grammar, it is much clearer if you pass information through YYSTYPE:
%{
typedef struct tagContext Context;
typedef struct tagCharString CharString;
typedef struct tagWordList WordList;
// word_list = NULL to start a new list
WordList* add_word(Context* ctx, WordList* prefix, CharString* word);
%}
%union {
CharString* word;
WordList* word_list;
}
%parse-param {Context* ctx}
%token<word> WORD
%type<word_list> words words_opt
%start words
%%
words
: words_opt WORD { $words = add_word(ctx, $words_opt, $WORD); }
;
words_opt
: %empty { $words_opt = NULL; }
| words
;
Performance differences between the two approaches seem to be negligible.
Memory cleanup
If your input text is parsed without errors, you are always responsible for cleaning up all dynamic memory. However, if your input text causes parse errors, the parser will have to discard some tokens. There are two approaches to cleanup in this case.
First, you can keep track of all memory allocations in your context and free them all when destroying the context.
Second, you can rely on bison destructors:
%{
void free_word_list(WordList* word_list);
%}
%destructor { free_word_list($$); } <word_list>

How to define syntax

I am new to language processing and I want to create a parser with Irony for the following syntax:
name1:value1 name2:value2 name3:value ...
where name1 is the name of an XML element and value1 is the value of the element, which can also include spaces.
I have tried to modify the included samples like this:
public TestGrammar()
{
var name = CreateTerm("name");
var value = new IdentifierTerminal("value");
var queries = new NonTerminal("queries");
var query = new NonTerminal("query");
queries.Rule = MakePlusRule(queries, null, query);
query.Rule = name + ":" + value;
Root = queries;
}
private IdentifierTerminal CreateTerm(string name)
{
IdentifierTerminal term = new IdentifierTerminal(name, "!##$%^*_'.?-", "!##$%^*_'.?0123456789");
term.CharCategories.AddRange(new[]
{
UnicodeCategory.UppercaseLetter, //Lu
UnicodeCategory.LowercaseLetter, //Ll
UnicodeCategory.TitlecaseLetter, //Lt
UnicodeCategory.ModifierLetter, //Lm
UnicodeCategory.OtherLetter, //Lo
UnicodeCategory.LetterNumber, //Nl
UnicodeCategory.DecimalDigitNumber, //Nd
UnicodeCategory.ConnectorPunctuation, //Pc
UnicodeCategory.SpacingCombiningMark, //Mc
UnicodeCategory.NonSpacingMark, //Mn
UnicodeCategory.Format //Cf
});
//StartCharCategories are the same
term.StartCharCategories.AddRange(term.CharCategories);
return term;
}
but this doesn't work if the values include spaces. Can this be done (using Irony) without modifying the syntax (like adding quotes around values)?
Many thanks!
If newlines were included between key-value pairs, it would be easily achievable. I have no knowledge of "Irony", but my initial feeling is that almost no parser/lexer generator is going to deal with this given only a naive grammar description. This requires essentially unbounded lookahead.
Conceptually (because I know nothing about this product), here's how I would do it:
Tokenise based on spaces and colons (i.e. every contiguous sequence of characters that isn't a space or a colon is an "identifier" token of some sort).
You then need to make it such that every "sentence" is described from colon-to-colon:
sentence = identifier_list
| : identifier_list identifier : sentence
That's not enough to make it work, but you get the idea at least, I hope. You would need to be very careful to distinguish an identifier_list from a single identifier such that they could be parsed unambiguously. Similarly, if your tool allows you to define precedence and associativity, you might be able to get away with making ":" bind very tightly to the left, such that your grammar is simply:
sentence = identifier : identifier_list
And the behaviour of that needs to be (identifier :) identifier_list.
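I don't know whether Irony can express this, but the tight-binding idea is easy to realize with a traditional lexer/parser pair: let the lexer attach the colon to the identifier that precedes it, so a key can never be confused with a value word. Here is a sketch in ocamllex/ocamlyacc (all names are illustrative, and note the inherent ambiguity: a value word immediately followed by a colon will be taken as the next key):
(* lexer.mll *)
{ open Parser }
rule token = parse
| (['a'-'z' 'A'-'Z' '0'-'9']+ as s) ':' { KEY s } (* an identifier glued to ':' is a key *)
| [^ ' ' '\t' '\n' ':']+ as s { WORD s } (* a value word: anything up to a space or ':' *)
| [' ' '\t' '\n']+ { token lexbuf } (* skip whitespace *)
| eof { EOF }
/* parser.mly */
%token <string> KEY WORD
%token EOF
%start main
%type <(string * string) list> main
%%
main : queries EOF { $1 } ;
queries : query { [$1] } | query queries { $1 :: $2 } ;
query : KEY words { ($1, String.concat " " $2) } ;
words : WORD { [$1] } | WORD words { $1 :: $2 } ;
Rejoining the words with single spaces is lossy; keep the word list instead if the original spacing matters.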

Simple XML parser in bison/flex

I would like to create a simple XML parser using bison/flex. I don't need validation, comments, or attributes; only <tag>value</tag>, where value can be a number, a string, or another <tag>value</tag>.
So for example:
<div>
<mul>
<num>20</num>
<add>
<num>1</num>
<num>5</num>
</add>
</mul>
<id>test</id>
</div>
If it helps, I know the names of all the tags that may occur, and I know how many sub-tags a given tag can hold. Is it possible to create a bison parser that would do something like this:
- new Tag("num", 1) // tag1
- new Tag("num", 5) // tag2
- new Tag("add", tag1, tag2) // tag3
- new Tag("num", 20) // tag4
- new Tag("mul", tag4, tag3)
...
- root = top_tag
Tag & number of sub-tags:
num: 1 (only value)
str: 1 (only value)
add | sub | mul | div: 2 (num | str | tag, num | str | tag)
Could you help me with a grammar able to create an AST like the one given above?
For your requirements, I think the yax system would work well.
From the README:
The goal of the yax project is to allow the use of YACC (Gnu Bison actually) to parse/process XML documents.
The key piece of software for achieving the above goal is to provide a library that can produce an XML lexical token stream from an XML document.
This stream can be wrapped to create an instance of yylex() to feed tokens to a Bison grammar to parse and process the XML document.
Using the stream plus a Bison grammar, it is possible to carry out at least the following kinds of activities:
Validate XML documents,
Directly parse XML documents to create internal data structures,
Construct DOM trees.
I do not think that it's the best tool to use to create an XML parser, though. If I had to do this job, I would do it by hand.
The flex code will contain:
NUM matches an integer in this example.
STR matches any string that contains neither '<' nor '>'.
STOP matches any closing tag.
START matches any opening tag.
"<?"[^>]*"?>" { /* skip the <?xml ... ?> declaration */ }
"<"[a-z]+">" { return START; } /* a real parser would also save the tag name from yytext */
"</"[a-z]+">" { return STOP; }
[0-9]+ { return NUM; }
[^><]+ { return STR; } /* note: this also matches the whitespace between tags */
The bison code will look like this:
%token START STOP STR NUM
%%
simple_xml : START value STOP
;
value : simple_xml
| STR
| NUM
| value simple_xml
;
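If you also want the actions to build the Tag tree from the question, the grammar shape stays the same. Here is a sketch of the AST-building actions in ocamllex/ocamlyacc notation, where the actions are compact to show; in bison the actions would call analogous C constructors at the same points. The Ast module, the tag type, and the assumption that the lexer returns the tag name inside the START and STOP tokens are all mine:
(* ast.ml *)
type tag =
| Num of int
| Str of string
| Tag of string * tag list
/* parser.mly */
%token <string> START STOP STR
%token <int> NUM
%start xml
%type <Ast.tag> xml
%%
xml : START values STOP { Ast.Tag ($1, $2) } /* a real parser should check that $1 and $3 match */
;
values : value { [$1] }
| value values { $1 :: $2 }
;
value : xml { $1 }
| STR { Ast.Str $1 }
| NUM { Ast.Num $1 }
;
Note that with the lexer above, the whitespace between tags also comes through as STR; you would either skip whitespace-only strings in the lexer or filter out the resulting Str nodes.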
