Racket, ragg: doesn't accept token stream - parsing

I'm using #lang ragg in Racket to create a parser for a simple language. Here is the grammar:
#lang ragg
program : sexpr* start* layout
sexpr : SEXPR
start : WORD WORD "[" WORD* "=" ">" sexpr "]"
layout : elem*
elem : info | text | sexpr
info : "{" text "}"
text : WORD*
And here are the first few tokens:
#(struct:token-struct SEXPR (define a 'something) #f 1 0 #f #f)
#(struct:token-struct WORD run #f 1 22 #f #f)
#(struct:token-struct WORD ram-loop #f 1 26 #f #f)
#(struct:token-struct [ [ #f 1 35 #f #f)
#(struct:token-struct WORD ram #f 1 36 #f #f)
#(struct:token-struct WORD max-ram #f 1 40 #f #f)
#(struct:token-struct = = #f 1 48 #f #f)
#(struct:token-struct > > #f 1 49 #f #f)
(edit) And the input text is: (define a 'something) run ram-loop [ram max-ram => (format "~a/~a GB RAM" ram max-ram)] {ram} {fg:#966} {cpu} Some text (leftsep #363)
(edit) And of course, the error I get when calling (parse) on the token stream:
; Encountered unexpected token #\[ ("[") while parsing #f [line=1, column=35,
; offset=#f] [,bt for context]
It seems obvious that SEXPR WORD WORD [ should be accepted by the grammar. Any idea why it is not?

Related

ANTLR Lexer matching the wrong rule

I'm working on a lexer and parser for an old object-oriented chat system (MOO, in case any readers are familiar with its language). Within this language, any of the examples below are valid floating-point numbers:
2.3
3.
.2
3e+5
The language also implements an indexing syntax for extracting one or more characters from a string or list (a set of comma-separated expressions enclosed in curly braces). The problem arises from the fact that the language supports a range operator inside the index brackets. For example:
a = foo[1..3];
I understand that ANTLR wants to match the longest possible match first. Unfortunately, this results in the lexer seeing '1..3' as two floating-point numbers (1. and .3) rather than as two integers with a range operator ('..') between them. Is there any way to solve this short of using lexer modes? Given that the values inside an indexing expression can be any valid expression, I would have to duplicate a lot of token rules (essentially all but the floating-point numbers, as I understand it). Granted, I'm new to ANTLR, so I'm sure I'm missing something, and any help is much appreciated. My lexer grammar is below:
lexer grammar MooLexer;
channels { COMMENTS_CHANNEL }
SINGLE_LINE_COMMENT
: '//' INPUT_CHARACTER* -> channel(COMMENTS_CHANNEL);
DELIMITED_COMMENT
: '/*' .*? '*/' -> channel(COMMENTS_CHANNEL);
WS
: [ \t\r\n] -> channel(HIDDEN)
;
IF
: I F
;
ELSE
: E L S E
;
ELSEIF
: E L S E I F
;
ENDIF
: E N D I F
;
FOR
: F O R;
ENDFOR
: E N D F O R;
WHILE
: W H I L E
;
ENDWHILE
: E N D W H I L E
;
FORK
: F O R K
;
ENDFORK
: E N D F O R K
;
RETURN
: R E T U R N
;
BREAK
: B R E A K
;
CONTINUE
: C O N T I N U E
;
TRY
: T R Y
;
EXCEPT
: E X C E P T
;
ENDTRY
: E N D T R Y
;
IN
: I N
;
SPLICER
: '#';
UNDERSCORE
: '_';
DOLLAR
: '$';
SEMI
: ';';
COLON
: ':';
DOT
: '.';
COMMA
: ',';
BANG
: '!';
OPEN_QUOTE
: '`';
SINGLE_QUOTE
: '\'';
LEFT_BRACKET
: '[';
RIGHT_BRACKET
: ']';
LEFT_CURLY_BRACE
: '{';
RIGHT_CURLY_BRACE
: '}';
LEFT_PARENTHESIS
: '(';
RIGHT_PARENTHESIS
: ')';
PLUS
: '+';
MINUS
: '-';
STAR
: '*';
DIV
: '/';
PERCENT
: '%';
PIPE
: '|';
CARET
: '^';
ASSIGNMENT
: '=';
QMARK
: '?';
OP_AND
: '&&';
OP_OR
: '||';
OP_EQUALS
: '==';
OP_NOT_EQUAL
: '!=';
OP_LESS_THAN
: '<';
OP_GREATER_THAN
: '>';
OP_LESS_THAN_OR_EQUAL_TO
: '<=';
OP_GREATER_THAN_OR_EQUAL_TO
: '>=';
RANGE
: '..';
ERROR
: 'E_NONE'
| 'E_TYPE'
| 'E_DIV'
| 'E_PERM'
| 'E_PROPNF'
| 'E_VERBNF'
| 'E_VARNF'
| 'E_INVIND'
| 'E_RECMOVE'
| 'E_MAXREC'
| 'E_RANGE'
| 'E_ARGS'
| 'E_NACC'
| 'E_INVARG'
| 'E_QUOTA'
| 'E_FLOAT'
;
OBJECT
: '#' DIGIT+
| '#-' DIGIT+
;
STRING
: '"' ( ESC | [ !] | [#-[] | [\]-~] | [\t] )* '"';
INTEGER
: DIGIT+;
FLOAT
: DIGIT+ [.] (DIGIT*)? (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| [.] DIGIT+ (EXPONENTNOTATION EXPONENTSIGN DIGIT+)?
| DIGIT+ EXPONENTNOTATION EXPONENTSIGN DIGIT+
;
IDENTIFIER
: (LETTER | DIGIT | UNDERSCORE)+
;
LETTER
: LOWERCASE
| UPPERCASE
;
/*
* fragments
*/
fragment LOWERCASE
: [a-z] ;
fragment UPPERCASE
: [A-Z] ;
fragment EXPONENTNOTATION
: ('E' | 'e');
fragment EXPONENTSIGN
: ('-' | '+');
fragment DIGIT
: [0-9] ;
fragment ESC
: '\\"' | '\\\\' ;
fragment INPUT_CHARACTER
: ~[\r\n\u0085\u2028\u2029];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
No, AFAIK, there is no way to solve this using lexer modes. You'll need a predicate with a bit of target-specific code. If Java is your target, that might look like this:
lexer grammar RangeTestLexer;
FLOAT
: [0-9]+ '.' [0-9]+
| [0-9]+ '.' {_input.LA(1) != '.'}?
| '.' [0-9]+
;
INTEGER
: [0-9]+
;
RANGE
: '..'
;
SPACES
: [ \t\r\n] -> skip
;
If you run the following Java code:
Lexer lexer = new RangeTestLexer(CharStreams.fromString("1 .2 3. 4.5 6..7 8 .. 9"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
    System.out.printf("%-20s `%s`\n", RangeTestLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
you'll get the following output:
INTEGER `1`
FLOAT `.2`
FLOAT `3.`
FLOAT `4.5`
INTEGER `6`
RANGE `..`
INTEGER `7`
INTEGER `8`
RANGE `..`
INTEGER `9`
EOF `<EOF>`
The { ... }? is the predicate, and the embedded code must evaluate to a boolean. In my example, the Java expression _input.LA(1) != '.' returns true if the character one step ahead of the current position is not a '.' char.
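The same one-character lookahead can be mimicked outside ANTLR. Below is a plain-Python sketch (my own illustration, not part of the answer) of a tokenizer applying the predicate's rule, with the regex lookahead `(?!\.)` standing in for `_input.LA(1) != '.'`:

```python
import re

# A FLOAT may end at "digits." only when the next character is NOT another
# '.'; the lookahead (?!\.) plays the role of the _input.LA(1) check.
FLOAT_RE = re.compile(r'\d+\.\d+|\d+\.(?!\.)|\.\d+')

def tokenize(src):
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        if src.startswith('..', i):          # RANGE wins over a trailing '.'
            tokens.append(('RANGE', '..'))
            i += 2
            continue
        m = FLOAT_RE.match(src, i)
        if m:
            tokens.append(('FLOAT', m.group()))
            i = m.end()
            continue
        m = re.match(r'\d+', src[i:])
        if m:
            tokens.append(('INTEGER', m.group()))
            i += m.end()
            continue
        raise ValueError(f'unexpected character {src[i]!r}')
    return tokens
```

Running `tokenize("1 .2 3. 4.5 6..7 8 .. 9")` reproduces the same token sequence as the Java output above: `6..7` lexes as INTEGER, RANGE, INTEGER while `.2`, `3.`, and `4.5` remain FLOATs.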

Unambiguous grammar for expressions with let and addition

What is an unambiguous grammar equivalent to the following ambiguous grammar for a language of expressions with let and addition?
E ⇒ let id = E in E
E ⇒ E + E
E ⇒ num
The ambiguity should be solved so that:
addition is left associative
addition has higher precedence than let expressions when it appears on the right
addition has lower precedence than let expressions when it appears on the left
Using braces to show the grouping of sub-expressions, the following illustrates how expressions should be interpreted:
num + num + num => { num + num } + num
let id = num in num + num => let id = num in { num + num }
num + let id = num in num => num + { let id = num in num }
Consider the expression
E1 + E2
E1 cannot have the form let ID = E3 because let ID = E3 + E2 must be parsed as let ID = (E3 + E2). This restriction is recursive: it also cannot have the form E4 + let ID = E3.
E2 can have the form let ID = E3 but it cannot have the form E3 + E4 (because E1 + E3 + E4 must be parsed as (E1 + E3) + E4). Only E1 can have the form E3 + E4.
It's straightforward (but repetitive) to translate these restrictions to BNF:
Expr ⇒ Sum
Sum ⇒ SumNoLet '+' Atom
| Atom
SumNoLet ⇒ SumNoLet '+' AtomNoLet
| AtomNoLet
AtomNoLet ⇒ num
| id
| '(' Expr ')'
Atom ⇒ AtomNoLet
| 'let' id '=' Expr 'in' Expr
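As a sanity check, here is a small recursive-descent sketch in Python (my own illustration, not from the original answer) of the addition-only grammar above; it also parses the `in E` part of each let, as in the original ambiguous grammar. Expressions come back as nested tuples:

```python
# Illustrative recursive-descent parser for the addition/let grammar.
# Tokens are whitespace-separated strings; "num" stands for a number token.
# Because a let's body is parsed greedily, a let can only ever be the final
# (rightmost) operand of '+' -- exactly the restriction the *NoLet
# nonterminals encode.

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):                # Expr -> Sum
        return self.sum()

    def sum(self):                 # Sum -> SumNoLet '+' Atom | Atom
        node = self.atom()
        while self.peek() == '+':
            self.eat('+')
            node = ('+', node, self.atom())   # left associative
        return node

    def atom(self):                # Atom -> AtomNoLet | 'let' id '=' Expr 'in' Expr
        if self.peek() == 'let':
            self.eat('let')
            name = self.eat()                 # id
            self.eat('=')
            bound = self.expr()
            self.eat('in')
            body = self.expr()                # greedy: swallows any trailing '+'
            return ('let', name, bound, body)
        return self.atom_no_let()

    def atom_no_let(self):         # AtomNoLet -> num | id | '(' Expr ')'
        if self.peek() == '(':
            self.eat('(')
            node = self.expr()
            self.eat(')')
            return node
        return self.eat()
```

For example, `Parser("num + let a = num in num".split()).expr()` yields `('+', 'num', ('let', 'a', 'num', 'num'))`, matching the third grouping example above.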
To make the pattern clearer, we can add the * operator:
Expr ⇒ Sum
Sum ⇒ SumNoLet '+' Prod
| Prod
SumNoLet ⇒ SumNoLet '+' ProdNoLet
| ProdNoLet
Prod ⇒ ProdNoLet '*' Atom
| Atom
ProdNoLet ⇒ ProdNoLet '*' AtomNoLet
| AtomNoLet
AtomNoLet ⇒ num
| id
| '(' Expr ')'
Atom ⇒ AtomNoLet
| 'let' id '=' Expr 'in' Expr
It is possible to implement this in bison (or other similar parser generators) using precedence declarations. But the precedence solution is harder to reason about, and can be confusing to incorporate into more complicated grammars.

Racket - define one character with token-char

I am working on a project for a class, and we are tasked with writing a scanner for numbers, symbols, comments, arithmetic operators, parentheses, and EOF in both Python and Racket. I am working on the Racket version, and I have written the following line to define one or more characters as a symbol:
[(any-char) (token-CHAR (string->character lexeme))]
I have the following line to define one or more digits as a number:
[(:+ digit) (token-NUM (string->number lexeme))]
I am very new to Racket (this is my third program), so I am not exactly sure how to approach this; any suggestions are greatly appreciated. I have scoured the Racket documentation but wasn't able to find what I was looking for.
Thanks!
Here is a minimal getting-started example - heavily commented.
#lang racket
;;; IMPORT
;; Import the lexer tools
(require parser-tools/yacc
         parser-tools/lex
         (prefix-in : parser-tools/lex-sre) ; names from lex-sre are prefixed with :
                                            ; to avoid name collisions
         syntax/readerr)
;;; REGULAR EXPRESSIONS
;; Names for regular expressions matching letters and digits.
;; Note that :or are prefixed with a : due to (prefix-in : ...) above
(define-lex-abbrevs
  [letter (:or (:/ "a" "z") (:/ #\A #\Z))]
  [digit  (:/ #\0 #\9)])
;;; TOKENS
;; Tokens such as numbers (and identifiers and strings) carry a value
;; In the example only the NUMBER token is used, but you may need more.
(define-tokens value-tokens (NUMBER IDENTIFIER STRING))
;; Tokens that don't carry a value.
(define-empty-tokens op-tokens (newline := = < > + - * / ^ EOF))
;;; LEXER
;; Here the lexer (aka the scanner) is defined.
;; The construct lexer-src-pos evaluates to a function which scans an input port
;; returning one position-token at a time.
;; A position token contains besides the actual token also source location information
;; (i.e. you can see where in the file the token was read)
(define lex
  (lexer-src-pos
   [(eof)                                       ; input: end of file
    'EOF]                                       ; output: the symbol EOF
   [(:or #\tab #\space #\newline)               ; input: whitespace
    (return-without-pos (lex input-port))]      ; output: the next token
                                                ;         (i.e. skip the whitespace)
   [#\newline                                   ; input: newline
    (token-newline)]                            ; output: a newline-token
                                                ; note: (token-newline) returns 'newline
   [(:or ":=" "+" "-" "*" "/" "^" "<" ">" "=")  ; input: an operator
    (string->symbol lexeme)]                    ; output: the corresponding symbol
   [(:+ digit)                                  ; input: digits
    (token-NUMBER (string->number lexeme))]))   ; output: a NUMBER token whose value is
                                                ;         the number
                                                ; note: (token-value token) returns the number
;;; TEST
(define input (open-input-string "123+456"))
(lex input) ; (position-token (token 'NUMBER 123) (position 1 #f #f) (position 4 #f #f))
(lex input) ; (position-token '+ (position 4 #f #f) (position 5 #f #f))
(lex input) ; (position-token (token 'NUMBER 456) (position 5 #f #f) (position 8 #f #f))
(lex input) ; (position-token 'EOF (position 8 #f #f) (position 8 #f #f))
;; Let's make it a little easier to play with the lexer.
(define (string->tokens s)
  (port->tokens (open-input-string s)))

(define (port->tokens in)
  (define token (lex in))
  (if (eq? (position-token-token token) 'EOF)
      '()
      (cons token (port->tokens in))))
(map position-token-token (string->tokens "123*45/3")) ; strip positions
; Output:
; (list (token 'NUMBER 123)
;       '*
;       (token 'NUMBER 45)
;       '/
;       (token 'NUMBER 3))
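For comparison, here is a rough Python counterpart of the toy scanner above (my own sketch; the token shapes are an assumption, not parser-tools output). NUMBER tokens carry their numeric value, and operators come back as plain strings:

```python
import re

# One alternation per token class, with leading whitespace skipped;
# ':=' is listed before the single-character operators so it wins.
TOKEN_RE = re.compile(r'\s*(?:(?P<NUMBER>\d+)|(?P<OP>:=|[=<>+\-*/^]))')

def string_to_tokens(s):
    s = s.strip()
    tokens, pos = [], 0
    while pos < len(s):
        m = TOKEN_RE.match(s, pos)
        if not m:
            raise SyntaxError(f'bad character at position {pos}: {s[pos]!r}')
        if m.group('NUMBER') is not None:
            tokens.append(('NUMBER', int(m.group('NUMBER'))))
        else:
            tokens.append(m.group('OP'))
        pos = m.end()
    return tokens
```

Calling `string_to_tokens("123*45/3")` yields `[('NUMBER', 123), '*', ('NUMBER', 45), '/', ('NUMBER', 3)]`, mirroring the stripped-positions output above.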

Drupal incorrect node URL

My website uses Drupal. On the first page I have a list of entries, and each entry has its own URL.
In some cases (I can't work out when), my links look like:
/node/1%2C157
Instead of:
/node/1157
In my view I found that my content is displayed by this line:
<?php print render($page['content']); ?>
I need to understand where my content is generated so I can fix this problem. Or maybe I can fix it from the admin panel?
(I'm using the Pathauto and Path modules to rewrite URLs; this works well for other content, but some nodes have this problem. I tried to regenerate the aliases, and to remove and generate them again, but for these nodes nothing changed.)
The %2C occurs in the URL because of a comma. Check whether you are creating any URLs with a comma in them; that is the most likely cause.
Here is a list of URL-encoded characters:
URL Encoded Characters
backspace %08
tab %09
linefeed %0A
carriage return %0D
space %20
! %21
" %22
# %23
$ %24
% %25
& %26
' %27
( %28
) %29
* %2A
+ %2B
, %2C
- %2D
. %2E
/ %2F
0 %30
1 %31
2 %32
3 %33
4 %34
5 %35
6 %36
7 %37
8 %38
9 %39
: %3A
; %3B
< %3C
= %3D
> %3E
? %3F
@ %40
A %41
B %42
C %43
D %44
E %45
F %46
G %47
H %48
I %49
J %4A
K %4B
L %4C
M %4D
N %4E
O %4F
P %50
Q %51
R %52
S %53
T %54
U %55
V %56
W %57
X %58
Y %59
Z %5A
[ %5B
\ %5C
] %5D
^ %5E
_ %5F
` %60
a %61
b %62
c %63
d %64
e %65
f %66
g %67
h %68
i %69
j %6A
k %6B
l %6C
m %6D
n %6E
o %6F
p %70
q %71
r %72
s %73
t %74
u %75
v %76
w %77
x %78
y %79
z %7A
{ %7B
| %7C
} %7D
~ %7E
¢ %A2
£ %A3
¥ %A5
¦ %A6
§ %A7
« %AB
¬ %AC
¯ %AF
° %B0
± %B1
² %B2
´ %B4
µ %B5
» %BB
¼ %BC
½ %BD
¿ %BF
À %C0
Á %C1
 %C2
à %C3
Ä %C4
Å %C5
Æ %C6
Ç %C7
È %C8
É %C9
Ê %CA
Ë %CB
Ì %CC
Í %CD
Î %CE
Ï %CF
Ð %D0
Ñ %D1
Ò %D2
Ó %D3
Ô %D4
Õ %D5
Ö %D6
Ø %D8
Ù %D9
Ú %DA
Û %DB
Ü %DC
Ý %DD
Þ %DE
ß %DF
à %E0
á %E1
â %E2
ã %E3
ä %E4
å %E5
æ %E6
ç %E7
è %E8
é %E9
ê %EA
ë %EB
ì %EC
í %ED
î %EE
ï %EF
ð %F0
ñ %F1
ò %F2
ó %F3
ô %F4
õ %F5
ö %F6
÷ %F7
ø %F8
ù %F9
ú %FA
û %FB
ü %FC
ý %FD
þ %FE
ÿ %FF
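To confirm the diagnosis, Python's standard library shows that a comma percent-encodes to %2C, reproducing the broken path segment from the question (a quick illustration of the encoding, unrelated to Drupal itself):

```python
from urllib.parse import quote, unquote

# "1,157" -- the number 1157 formatted with a thousands separator --
# percent-encodes its comma as %2C, giving exactly the broken URL.
print(quote("1,157"))            # 1%2C157
print(unquote("/node/1%2C157"))  # /node/1,157
```

So the node ID is most likely being formatted with a thousands separator somewhere before the alias is built.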

Parsing columns of data with parsec

I'm writing a parser to scan columns of numbers, like this:
T LIST2 LIST3 LIST4
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906
5 246 344 501
6 929 138 474
The first line contains the names of the lists, and I would like my program to parse exactly as many columns of data as there are titles (to reject tables whose number of columns doesn't match the number of titles).
I wrote this program:
import Text.Parsec

title = do
  tit <- many1 alphaNum
  return tit

digits = do
  dig <- many1 digit
  return dig

parseSeries = do
  spaces
  titles <- title `sepBy` spaces
  let nb = length titles
  dat <- endBy (count (nb-1) (digits `sepBy` spaces)) endOfLine
  spaces
  return (titles, concat dat)

main = do
  fichier <- readFile "test_list3.txt"
  putStrLn fichier
  case parse parseSeries "(stdin)" fichier of
    Left err -> do
      putStrLn "!!! Error !!!"
      print err
    Right (tit, resu) -> do
      mapM_ putStrLn tit
      mapM_ putStrLn (concat resu)
but when I try to parse a file with this kind of data, I get the following error:
!!! Error !!!
"(stdin)" (line 26, column 1):
unexpected end of input
expecting space or letter or digit
I'm a newbie at parsing and I don't understand why it fails. Do you have an idea of what is wrong with my parser?
Your program is doing something different from what you expect. The key part is right here:
parseSeries = do
  spaces
  titles <- title `sepBy` spaces
  let nb = length titles
  -- The following is the incorrect part
  dat <- endBy (count (nb-1) (digits `sepBy` spaces)) endOfLine
  spaces
  return (titles, concat dat)
I believe what you actually wanted was:
parseSeries = do
  spaces
  titles <- title `sepBy` spaces
  let nb = length titles
  let parseRow = do
        column <- digits
        columns <- count (nb - 1) (spaces *> digits)
        newline
        return (column : columns)
  dat <- many parseRow
  return (titles, dat)
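The corrected parser reads one row at a time and insists that each row has exactly nb numbers before the newline. A rough Python sketch of the same validation (my own illustration, not a translation of the Haskell):

```python
# Split the table on whitespace and reject any row whose width differs from
# the number of titles, like parseRow's `count (nb - 1)` does in the answer.
def parse_series(text):
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    titles, rows = lines[0], lines[1:]
    for row in rows:
        if len(row) != len(titles):
            raise ValueError(f'expected {len(titles)} columns, got {len(row)}: {row}')
        if not all(cell.isdigit() for cell in row):
            raise ValueError(f'non-numeric cell in row: {row}')
    return titles, [[int(cell) for cell in row] for row in rows]
```

On the sample data above this returns the four titles and the six rows of integers, and it raises on any table whose rows don't match the header width.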
