Limit Coccinelle matches to expression of given type

A transform like the one below works for func_1(&something->field) when something is a foo_t *, but it does not catch cases where v is itself reached through a field, as in func_1(&something->v->field):
@@
typedef foo_t;
foo_t *v;
@@
- func_1(&v->field)
+ func_2(v)
On the other hand, if I use an expression like this:
@@
expression v;
@@
- func_1(&v->field)
+ func_2(v)
It works, but it is too eager: it may match cases where the type is not foo_t, just because some other type has a field of the same name.
Is there a way to get a match on expressions, but limit the expression type to be foo_t?
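One thing worth checking (a hedged suggestion rather than a confirmed fix): the declaration foo_t *v; in the first rule should already make v an expression metavariable constrained to the type foo_t *, so the nested case may only be failing because Coccinelle cannot infer the type of something->v without seeing the struct definitions involved. If those live in headers, letting spatch parse the headers can improve its type inference:
# hypothetical invocation; rule.cocci contains the typed rule from above
spatch --sp-file rule.cocci --recursive-includes src/
Both --all-includes and --recursive-includes widen how many included headers spatch parses, at the cost of slower matching.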

Related

Make lexer consider parser before determining tokens?

I'm writing a lexer and parser in ocamllex and ocamlyacc as follows. function_name and table_name are the same regular expression, i.e., a string containing only English letters. The only way to determine whether a string is a function_name or a table_name is to check its surroundings. For example, if such a string is surrounded by [ and ], then we know that it is a table_name. Here is the current code:
In lexer.mll,
... ...
let function_name = ['a'-'z' 'A'-'Z']+
let table_name = ['a'-'z' 'A'-'Z']+
rule token = parse
| function_name as s { FUNCTIONNAME s }
| table_name as s { TABLENAME s }
... ...
In parser.mly:
... ...
main:
| LBRACKET TABLENAME RBRACKET { Table $2 }
... ...
As I wrote | function_name as s { FUNCTIONNAME s } before | table_name as s { TABLENAME s }, the above code failed to parse [haha]: the lexer first considered haha a function_name, and the parser then could not find any corresponding rule for it. If the lexer could consider haha a table_name, the parser would match [haha] as a table.
One workaround is to be more precise in the lexer. For example, we could define let table_name_with_brackets = '[' ['a'-'z' 'A'-'Z']+ ']' and | table_name_with_brackets as s { TABLENAMEWITHBRACKETS s } in the lexer. But I would like to know if there are any other options. Is it not possible to make the lexer and parser work together to determine the tokens and the reductions?
You should avoid trying to get the lexer to do the parser's work. The lexer should just identify lexemes; it should not try to figure out where a lexeme fits into the syntax. So in your (simplified) example, there should be only one lexical type, name. The parser will figure it out from there.
But it seems, from the comments, that in the unsimplified original, the two patterns are overlapping rather than identical. That's more annoying, although it's only slightly more complicated. Basically, you need to separate out the common pattern as one lexical type, and then add the additional matches as one or two other lexical types (depending on whether or not one pattern is a strict superset of the other).
That might not be too difficult, depending on the precise relationship between the two patterns. You might be able to find a very simple solution by writing the patterns in the correct order, for example, because of the longest match rule:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
Most of the time, that's all it takes: first define the intersection of the two patterns as a base lexeme, and then add the full lexical patterns of each contextual type to provide the additional matches. Your parser will then have to match name | function_name in one context and name | table_name in the other. But that's not too bad.
Where it will fail is when an input stream cannot be unambiguously divided into lexemes. For example, suppose that in a function context a name could include a ? character, but in a table context the ? is a valid postfix operator. In that case, you have to actively prevent foo? from being analysed as a single token in the table context, which means that the lexer does have to be aware of parser context.
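A minimal sketch of the single-lexeme approach described above (token and constructor names are illustrative assumptions, and the %token declarations are elided):
In lexer.mll:
rule token = parse
| ['a'-'z' 'A'-'Z']+ as s { NAME s } (* one lexical type for both uses *)
| '[' { LBRACKET }
| ']' { RBRACKET }
In parser.mly:
main:
| LBRACKET NAME RBRACKET { Table $2 } /* [haha] is a table */
| NAME { Function $1 } /* bare haha is a function name */
The lexer no longer guesses; the parser decides from the surrounding brackets whether a NAME denotes a table or a function.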

Rails 5 scope like multiple partial strings

Say I have a scope like this:
scope :by_templates, ->(t) { joins(:template).where('templates.label ~* ?', t) }
How can I retrieve multiple templates with t like so?
Document.first.by_templates(%w[email facebook])
This code raises the following error:
PG::DatatypeMismatch: ERROR: argument of AND must be type boolean, not type record
LINE 1: ...template_id" WHERE "documents"."user_id" = $1 AND (templates...
PostgreSQL allows you to apply a boolean valued operator to an entire array of values using the op any(array_expr) construct:
9.23.3. ANY/SOME (array)
expression operator ANY (array expression)
expression operator SOME (array expression)
The right-hand side is a parenthesized expression, which must yield an array value. The left-hand expression is evaluated and compared to each element of the array using the given operator, which must yield a Boolean result. The result of ANY is “true” if any true result is obtained. The result is “false” if no true result is found (including the case where the array has zero elements).
PostgreSQL also supports the array constructor syntax for creating arrays:
array[value, value, ...]
Conveniently, ActiveRecord will expand a placeholder as a comma-delimited list when the value is an array.
Putting these together gives us:
scope :by_templates, ->(templates) { joins(:template).where('templates.label ~* any(array[?])', templates) }
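With that scope in place, ActiveRecord expands the array into a comma-delimited list, so the generated condition looks roughly like this (a hedged illustration, not captured output):
templates.label ~* any(array['email','facebook'])
A row is included when its label matches any one of the supplied patterns.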
As an aside, if you're using the case-insensitive regex operator (~*) as a case-insensitive comparison (i.e. no real regex pattern matching going on) then you might want to use upper instead:
# Yes, this class method is still a scope.
def self.by_templates(templates)
joins(:template).where('upper(templates.label) = any(array[?])', templates.map(&:upcase))
end
Then you could add an index to templates on upper(label) to speed things up and avoid possible issues with stray regex metacharacters in the templates. I tend to use upper case for this sort of thing because of oddities like 'ß'.upcase being 'SS' but 'SS'.downcase being 'ss'.
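For completeness, one way to create such an expression index (the index name is a hypothetical choice):
create index index_templates_on_upper_label on templates (upper(label));
PostgreSQL can then satisfy the upper(templates.label) = any(...) comparison from the index instead of evaluating upper on every row.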

Why are redundant parentheses not allowed in syntax definitions?

This syntax module is syntactically valid:
module mod1
syntax Empty =
;
And so is this one, which should be an equivalent grammar to the previous one:
module mod2
syntax Empty =
( )
;
(The resulting parser accepts only empty strings.)
Which means that you can make grammars such as this one:
module mod3
syntax EmptyOrKitchen =
( ) | "kitchen"
;
But, the following is not allowed (nested parentheses):
module mod4
syntax Empty =
(( ))
;
I would have guessed that redundant parentheses are allowed, since they are allowed in things like expressions, e.g. ((2)) + 2.
This problem came up when working with the data types for internal representation of rascal syntax definitions. The following code will create the same module as in the last example, namely mod4 (modulo some whitespace):
import Grammar;
import lang::rascal::format::Grammar;
str sm1 = definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
alt({seq([])})
],{})))))));
The problematic part of the data is on its own line - alt({seq([])}). If this code is changed to seq([]), then you get the same syntax module as mod2. If you further delete this whole expression, i.e. so that you get this:
str sm3 =
definition2rascal(\definition("unknown_main",("the-module":\module("unknown",{},{},grammar({sort("Empty")},(sort("Empty"):prod(sort("Empty"),[
], {})))))));
Then you get mod1.
So should such redundant parentheses be printed by the definition2rascal(...) function? And should it matter with regard to whether the resulting module is valid?
The reason they are not allowed is basically that we wanted to see if we could do without them. There is currently no priority relation between the symbol kinds, so in general there is no need for a bracket syntax (as there is for + and * in expressions).
The brackets already have two different semantics: one, ( ) is the epsilon symbol, and two, (Sym1 Sym2 ...) is a nested sequence. The nested sequence is defined (syntactically) to expect at least two symbols. We could, without ambiguity, introduce a third semantics for brackets around a single symbol, or relax the requirement on sequences... But we reckoned it would be confusing that in one case you would get an extra layer in the resulting parse tree (a sequence), while in the other case you would not (an ignored superfluous bracket).
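To make the two existing meanings concrete, here is an illustrative module (a sketch based on the rules just described, not taken from the question):
module mod5
syntax S =
( ) | ( "a" "b" )
;
Here ( ) matches the empty string, ( "a" "b" ) is a nested sequence of two symbols, and a single-symbol ( "a" ) is exactly the form that is rejected.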
In more detail, the problem of printing seq([]) is not so much a problem of the meta syntax but rather that the backing abstract notation is more relaxed than the concrete notation (i.e. it is a bigger language, an over-approximation). The parser generator will generate a working parser for seq([]). But there is no Rascal notation for an empty sequence, and I guess the pretty printer should throw an exception.

Write a Lex rule to parse Integer and Float

I am writing a parser for a scripting language.
I need to recognize strings, integers and floats.
I successfully recognize strings with the rule:
[a-zA-Z0-9_]+ {return STRING;}
But I have problem recognizing Integers and Floats. These are the (wrong) rules I wrote:
["+"|"-"][1-9]{DIGIT}* { return INTEGER;}
["+"|"-"]["0." | [1-9]{DIGIT}*"."]{DIGIT}+ {return FLOAT;}
How can I fix them?
Furthermore, since "abc123" is a valid string, how can I make sure that it is recognized as a string and not as the concatenation of a string ("abc") and an Integer ("123")?
First problem: There's a difference between (...) and [...]. Your regular expressions don't do what you think they do because you're using the wrong punctuation.
Beyond that:
No numeric rule recognizes 0.
Both numeric rules require an explicit sign.
Your STRING rule recognizes integers.
So, to start:
[...] encloses a set of individual characters or character ranges. It matches a single character which is a member of the set.
(...) encloses a regular expression. The parentheses are used for grouping, as in mathematics.
"..." encloses a sequence of individual characters, and matches exactly those characters.
With that in mind, let's look at
["+"|"-"][1-9]{DIGIT}*
The first bracket expression ["+"|"-"] is a set of individual characters or ranges. In this case, the set contains: ", +, " again (which has no effect, because a set contains at most one instance of each member), |, and the range "-", whose endpoints are the same character, so it contributes only ", which is already in the set. In short, the expression is equivalent to ["+|]. It will match one of those three characters; in fact, it requires one of those three characters.
The second bracket expression [1-9] matches one character in the range 1-9, so it probably does what you expected. Again, it matches exactly one character.
Finally, {DIGIT} matches the expansion of the name DIGIT. I'll assume that you have the definition:
DIGIT [0-9]
somewhere in your definitions section. (In passing, I note that you could have just used the character class [[:digit:]], which would have been unambiguous, and you would not have needed to define it.) It's followed by a *, which means that it will match zero or more repetitions of the {DIGIT} definition.
Now, an example of a string which matches that pattern:
|42
And some examples of strings which don't match that pattern:
-7 # The pattern must start with |, + or "
42 # Again, the pattern must start with |, + or "
+0 # The character following the + must be in the range [0-9]
Similarly, your float pattern, once the [...] expressions are simplified, becomes (writing out the individual pieces one per line, to make it more obvious):
["+|] # i.e. the set " + |
["0.|[1-9] # i.e. the set " 0 | [ 1 2 3 4 5 6 7 8 9
{DIGIT}* # Any number of digits
"." # A single period
] # A single ]
{DIGIT}+ # one or more digits
So here's a possible match:
"..]3
I'll skip over writing out the solution because I think you'll benefit more from doing it yourself.
Now, the other issues:
Some rule should match 0. If you don't want to allow leading zeros, you'll need to add it as a separate rule.
Use the optional operator (?) to indicate that the preceding object is optional. E.g. "foo"? matches either the three characters f, o, o (in order) or the empty string. You can use that to make the sign optional (see the short illustration at the end of this answer).
The problem is not the matching of abc123, as in your question. (F)lex always gives you the longest possible match, and the only rule which could match the starting character a is the string rule, so it will allow the string rule to continue as long as it can. It will always match all of abc123. However, it will also match 123, which you would probably prefer to be matched by your numeric rule. Here, the other (f)lex matching criterion comes into play: when there are two or more rules which could match exactly the same string, and none of the rules can match a longer string, (f)lex chooses the first rule in the file. So if you want to give numbers priority over strings, you have to put the number rule earlier in your (f)lex file than the string rule.
I hope that gives you some ideas about how to fix things.
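As a quick, hedged illustration of the optional operator mentioned above (deliberately not the full solution):
[+-]?[0-9]+ { /* optional sign, then one or more digits */ return INTEGER; }
[+-]? matches a single + or -, or nothing at all, so this rule accepts 42, -7 and +0 alike. It does not address the leading-zero question, and it must appear before the string rule so that it wins the tie for inputs like 123.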

Let bindings support punctuation enclosed in double backticks, but types don't?

Using F# in Visual Studio 2012, this code compiles:
let ``foo.bar`` = 5
But this code does not:
type ``foo.bar`` = class end
Invalid namespace, module, type or union case name
According to section 3.4 of the F# language specification:
Any sequence of characters that is enclosed in double-backtick marks (``),
excluding newlines, tabs, and double-backtick pairs themselves, is treated
as an identifier.
token ident =
| ident-text
| `` [^ '\n' '\r' '\t']+ | [^ '\n' '\r' '\t'] ``
Section 5 defines type as:
type :=
( type )
type -> type -- function type
type * ... * type -- tuple type
typar -- variable type
long-ident -- named type, such as int
long-ident<types> -- named type, such as list<int>
long-ident< > -- named type, such as IEnumerable< >
type long-ident -- named type, such as int list
type[ , ... , ] -- array type
type lazy -- lazy type
type typar-defns -- type with constraints
typar :> type -- variable type with subtype constraint
#type -- anonymous type with subtype constraint
... and Section 4.2 defines long-ident as:
long-ident := ident '.' ... '.' ident
As far as I can tell from the spec, types are named with long-idents, and long-idents can be idents. Since idents support double-backtick-quoted punctuation, it therefore seems like types should too.
So am I misreading the spec? Or is this a compiler bug?
It definitely looks like the specification is not synchronized with the actual implementation, so there is a bug on one side or the other.
When you use an identifier in double backticks, the compiler treats it as a literal name and simply generates a type (or member) with the name you specified between the backticks. It does not do any name mangling to make sure that the identifier is a valid type or member name.
This means that it is not too surprising that you cannot use identifiers that would clash with some standard meaning in the compiled code. In your example it is the dot, but here are a few other examples:
type ``Foo.Bar``() = // Dot is not allowed because it represents namespace
member x.Bar = 0
type ``Foo`1``() = // Single backtick is used to compile generic types
member x.Bar = 0
type ``Foo+Bar``() = // + is used in the name of a nested type
member x.Bar = 0
The above examples are not allowed as type names (because they clash with some standard meaning), but you can use them in let-bindings, because there are no such restrictions on variable names:
let ``foo`1`` = 0
let ``foo.bar`` = 2
let ``foo+bar`` = 1
This is definitely something that should be explained in the documentation & the specification, but I hope this helps to clarify what is going on.
