How do I represent an alphanumeric string in DCG?

I am trying to write a parser for propositional calculus encoded as S-expressions.
I have made some progress:
expression --> op.
op --> ['('], bin-op, bool, bool, [')'].
op --> ['('], unary-op, bool, [')'].
bool --> tok.
bool --> op.
bin-op --> ["IFF"].
bin-op --> ["IF"].
bin-op --> ["XOR"].
unary-op --> ["NOT"].
tok --> ["a"].
In swipl, I get an appropriate response from calling phrase:
?- phrase(expression, Ls).
Ls = ['(', "IFF", "a", "a", ')']
However this is only for the tok "a". Is there a way to say "tok is any alphanumeric string" in DCG? I found this but I'm unsure how to apply it to what I'm doing.

If you just want to parse, then the following token will work.
tok([A|B], B) :- an_code(A).
alpha_numeric(X) :-
between(0'0, 0'9, X); between(0'A, 0'Z, X); between(0'a, 0'z, X).
an_code(A) :- atom_codes(A, C), maplist(alpha_numeric, C).
?- phrase(expression, ['(', "IFF", "A1", "1A", ')']).
true
?- phrase(expression, ['(', "IFF", ".A1", "1A", ')']).
false.
?- phrase(expression, ['(', "IFF", ".A1", "(1A", ')']).
false.
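For readers who want to see the same recognition logic outside Prolog, here is a minimal Python sketch of the grammar above. It is not a translation of the DCG machinery. The names `is_tok`, `parse_bool`, `parse_op` and `recognizes` are made up for this illustration, and restricting tokens to ASCII letters and digits is an assumption that mirrors `alpha_numeric/1`.

```python
BIN_OPS = {"IFF", "IF", "XOR"}
UNARY_OPS = {"NOT"}


def is_tok(token):
    """True for a non-empty string made only of ASCII letters and digits."""
    return isinstance(token, str) and token.isascii() and token.isalnum()


def parse_bool(tokens, pos):
    """Return the position just after one bool starting at pos, or None."""
    if pos < len(tokens) and tokens[pos] == "(":
        return parse_op(tokens, pos)
    if pos < len(tokens) and is_tok(tokens[pos]):
        return pos + 1
    return None


def parse_op(tokens, pos):
    """Return the position just after one parenthesised op, or None."""
    if pos + 1 >= len(tokens) or tokens[pos] != "(":
        return None
    head = tokens[pos + 1]
    if head in BIN_OPS:
        arity = 2
    elif head in UNARY_OPS:
        arity = 1
    else:
        return None
    pos += 2
    for _ in range(arity):
        pos = parse_bool(tokens, pos)
        if pos is None:
            return None
    if pos < len(tokens) and tokens[pos] == ")":
        return pos + 1
    return None


def recognizes(tokens):
    """True if the whole token list is exactly one expression."""
    return parse_op(tokens, 0) == len(tokens)
```

Because "(" is never alphanumeric, each position has at most one applicable rule, so this sketch needs no backtracking.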
With an_code defined as follows, you can generate formulas too:
an_code(A) :- var(A) ->
length(C,L), L >= 1,
maplist(alpha_numeric, C),
string_codes(A, C);
atom_codes(A, C), maplist(alpha_numeric, C).
?- phrase(expression, Ls).
Ls = ['(', "IFF", "0", "0", ')'] ;
Ls = ['(', "IFF", "0", "1", ')'] ;
Ls = ['(', "IFF", "0", "2", ')'] ;
?- nth0(1, Ls, "XOR"), phrase(expression, Ls).
Ls = ['(', "XOR", "0", "0", ')'] ;
Ls = ['(', "XOR", "0", "1", ')'] ;
Ls = ['(', "XOR", "0", "2", ')']
?- nth0(1, Ls, "NOT"), phrase(expression, Ls).
Ls = ['(', "NOT", "0", ')'] ;
Ls = ['(', "NOT", "1", ')'] ;
Ls = ['(', "NOT", "2", ')']
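The enumeration order seen above (shortest strings first, digits before upper case before lower case) can be sketched in Python as well. This is only an illustration of the idea behind `length(C,L), L >= 1, maplist(alpha_numeric, C)`. The name `alnum_tokens` is hypothetical.

```python
from itertools import count, islice, product
import string

# Same order as alpha_numeric/1: 0-9, then A-Z, then a-z.
ALNUM = string.digits + string.ascii_uppercase + string.ascii_lowercase


def alnum_tokens():
    """Yield every alphanumeric string, shortest first (never terminates)."""
    for length in count(1):
        for chars in product(ALNUM, repeat=length):
            yield "".join(chars)


first_three = list(islice(alnum_tokens(), 3))
print(first_three)
```

Since the generator is infinite, it has to be consumed lazily, for example with `islice` as shown.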
In the generative version, some of the predicates used are SWI-Prolog built-ins, so they may not work with other implementations.
The SWI-Prolog built-in char_type/2 can also play the role of alpha_numeric, via char_type(C, alnum). Note that alnum also accepts non-ASCII letters, so it is broader than the between/3 version. The following is DCG-style code using SWI-Prolog predicates.
tok -->
[A],
{ string_codes(A, AC),
maplist([C]>>char_type(C, alnum), AC)
}.

Related

How do I merge the content of SPSS variables from different columns

I want to create five new variables K1 K2 K3 K4 K5 that hold the non-empty values from the table below, in their order of entry, as shown in Fig 2.
SN  ID1  ID2  ID3  ID4  ID5  IE1  IE2  IE3  IE4  IE5
1   a    b    c              d    e
2   b    a         f         c    k
Fig 2
SN K1 K2 K3 K4 K5
1 a b c d e
2 b a f c k
Here's a possible way to do it:
(first recreating your example data to demonstrate on:)
data list list/ SN (f1) ID1 to ID5 IE1 to IE5 (10a1).
begin data
1, "a", "b", "c", , , "d", "e", , ,
2, "b", "a", , "f", , "c", "k", , ,
end data.
This recreates your example data. Now you can run the following syntax, which yields the results you expected:
string K1 to K5 (a1).
vector K=K1 to K5.
compute #x=1.
do repeat id=ID1 to IE5.
do if id<>"".
compute K(#x)=id. /* correction made here .
compute #x=#x+1.
end if.
end repeat.
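The logic of the loop above (walk the columns left to right and copy each non-empty value into the next free K slot) can be sketched in Python for comparison. The function name `compact` is made up for this example. One assumed difference: this sketch silently drops values beyond the fifth, whereas the SPSS vector index would go out of range.

```python
def compact(row, width=5):
    """Collect the non-empty values of a row, in order, padded to `width`."""
    values = [value for value in row if value != ""]
    return (values + [""] * width)[:width]


# ID1..ID5 followed by IE1..IE5, as in the example data.
row1 = ["a", "b", "c", "", "", "d", "e", "", "", ""]
row2 = ["b", "a", "", "f", "", "c", "k", "", "", ""]
print(compact(row1))
print(compact(row2))
```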

String pattern recognition without training set

I have multiple strings, which are created based on a few (mostly) known variables and a few unknown templates. I'd like to know what those templates were to extract the variable parts from these strings. After that I can relatively easily infer the meaning of each substring, so only the pattern recognition is the question here. For example:
"76 (q) h"
"a x q y 123"
"c x e y 73"
"3 (e) z"
...
# pattern recognition: examples -> templates
"{1} x {2} y {3}"
"{1} ({2}) {3}"
# clusters based on template type
"{1} x {2} y {3}" -> ["a x q y 123", "c x e y 73", ...]
"{1} ({2}) {3}" -> ["76 (q) h", "3 (e) z", ...]
# inference: substrings -> extracted variables
"76 (q) h" -> ["76", "q", "h"] -> {x: "h", y: "q", z: 76}
"a x q y 123" -> ["a", "q", "123"] -> {x: "a", y: "q", z: 123}
"c x e y 73" -> ["c", "e", "73"] -> {x: "c", y: "e", z: 73}
"3 (e) z" -> ["3", "e", "z"] -> {x: "z", y: "e", z: 3}
I have found a similar question: Intelligent pattern matching in string, but in my case there is no way to train the parser with positives. Any idea how to solve this?
It turned out that what I need is called sequential pattern mining. There are many algorithms, for example SPADE, PrefixSpan, CloSpan and BIDE. What I need is an algorithm that works with gaps too, or one that finds the frequent substrings, which I can then concatenate with wildcards. Selecting the proper pattern from the frequent closed patterns found is far from obvious and I am still working on it, but I am a lot closer now than I was 2 months ago.
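The mining step itself is the hard part and is not shown here. What can be sketched is the step after it: once candidate templates exist, turning each one into a regular expression gives both the clustering and the substring extraction from the example. This is a rough Python sketch. `template_to_regex` and `cluster` are hypothetical names, and treating each placeholder as a run of non-space characters is an assumption.

```python
import re


def template_to_regex(template):
    """Compile '{1} x {2} y {3}' into a regex with one group per placeholder."""
    literal_parts = re.split(r"\{\d+\}", template)
    pattern = r"(\S+)".join(re.escape(part) for part in literal_parts)
    return re.compile("^" + pattern + "$")


def cluster(strings, templates):
    """Map each template to the (string, extracted substrings) pairs it matches."""
    compiled = [(template, template_to_regex(template)) for template in templates]
    clusters = {template: [] for template in templates}
    for text in strings:
        for template, regex in compiled:
            match = regex.match(text)
            if match:
                clusters[template].append((text, list(match.groups())))
                break  # first matching template wins
    return clusters


examples = ["76 (q) h", "a x q y 123", "c x e y 73", "3 (e) z"]
templates = ["{1} x {2} y {3}", "{1} ({2}) {3}"]
result = cluster(examples, templates)
```

With the four example strings, each template collects two of them, and the extracted substrings match the lists shown in the question.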

Match self-define token in PARSE

I am working on a string-transforming problem. The requirement is like this:
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
==>
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', to_date('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss'), '', '105210', null)}
Note: the '2014-10-09 11:40:44' is transformed to to_date('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss').
My code looks like below:
date: use [digit][
digit: charset "0123456789"
[4 digit "-" 2 digit "-" 2 digit space 2 digit ":" 2 digit ":" 2 digit]
]
parse line [ to date to end]
but I got this error:
** Script error: PARSE - invalid rule or usage of rule: digit
** Where: parse do either either either -apply-
** Near: parse line [to date to end]
I have made some tests:
probe parse "SSS 2016-01-01 00:00:00" [thru 3 "S" space date to end] ;true
probe parse "SSS 2016-01-01 00:00:00" [ to date to end] ; the error above
As the position of date value is not the same in all my data set, how can I reach it and match it and make the corresponding change?
I did as below:
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
d: [2 digit]
parse/all line [some [p1: {'} 4 digit "-" d "-" d " " d ":" d ":" d {'} p2: (insert p2 ")" insert p1 "to_date(" ) | skip]]
>> {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC??', 'B', 'admin', to_date('2014-10-09 11:40:44'), '', '105210', null)}
TO and THRU have historically not allowed arbitrary rules as their parameters. See #2129:
"The syntax of TO and THRU is currently restricted by design, for really significant performance reasons..."
This was relaxed in Red. So for example, the following will work there:
parse "aabbaabbaabbccc" [
thru [
some "a" (prin "a") some "b" (prin "b") some "c" (prin "c")
]
]
However, it outputs:
abababababc
This shows that it really doesn't have a better answer than just "naively" applying the parse rule iteratively at each step. Looping the PARSE engine is not as efficient as atomically running a TO/THRU for which faster methods of seeking exist (basic string searches, for instance). And the repeated execution of code in parentheses may not line up with what was actually intended.
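To make the "naive" behaviour concrete, here is a small Python simulation of trying the rule at every start position. It is not how Red or Rebol are implemented. The helpers `some` and `naive_thru` are invented for this sketch, and appending to a log stands in for the `(prin ...)` side effects.

```python
def some(text, pos, ch):
    """Match one or more `ch` at pos; return the new position or None."""
    end = pos
    while end < len(text) and text[end] == ch:
        end += 1
    return end if end > pos else None


def naive_thru(text, steps):
    """Try the sequence `some steps[0], some steps[1], ...` at each start.

    Returns (end position or None, everything "printed" along the way).
    """
    log = []
    for start in range(len(text) + 1):
        pos = start
        for ch in steps:
            pos = some(text, pos, ch)
            if pos is None:
                break  # rule failed here; side effects so far remain in log
            log.append(ch)
        else:
            return pos, "".join(log)
    return None, "".join(log)


end, printed = naive_thru("aabbaabbaabbccc", "abc")
print(printed)
```

The failed attempts at positions 0, 1, 4 and 5 each leave an "ab" behind before the attempt at position 8 finally succeeds, which reproduces the output above.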
Still...it seems better to allow it. Then let users worry about when their code is slow and performance tune it if it matters. So odds are that the Ren-C branch of Rebol will align with Red in this respect, and allow arbitrary rules.
I got it working in an indirect way:
date: use [digit][
digit: charset "0123456789"
[4 digit "-" 2 digit "-" 2 digit space 2 digit ":" 2 digit ":" 2 digit]
]
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
parse line [
thru "(" vals: (
blk: parse/all vals ","
foreach val blk [
if parse val [" '" date "'"][
;probe val
replace line val rejoin [ { to_date(} at val 2 {, 'yyyy-mm-dd hh24:mi:ss')}]
]
]
)
to end
(probe line)
]
The output:
{INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', to_date('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss'), '', '105210', null)}
Here is a true Rebol2 solution:
line: {INSERT INTO `pub_company` VALUES ('1', '0', 'ABC??', 'B', 'admin', '2014-10-09 11:40:44', '', '105210', null)}
date: use [digit space][
space: " "
digit: charset "0123456789"
[4 digit "-" 2 digit "-" 2 digit space 2 digit ":" 2 digit ":" 2 digit]
]
>> parse/all line [ some [ [da: "'" date (insert da "to_date (" ) 11 skip de: (insert de " 'yyyy-mm-dd hh24:mi:ss'), ") ] | skip ] ]
== true
>> probe line
{INSERT INTO `pub_company` VALUES ('1', '0', 'ABC??', 'B', 'admin', to_date ('2014-10-09 11:40:44', 'yyyy-mm-dd hh24:mi:ss'), '', '105210', null)}
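For comparison only, the same transformation is a one-line substitution in a language with regular expressions. The Python sketch below assumes the date always appears as a single-quoted `yyyy-mm-dd hh:mm:ss` literal. `wrap_dates` is a hypothetical name.

```python
import re

# A quoted timestamp such as '2014-10-09 11:40:44'; the group excludes the quotes.
DATE_LITERAL = re.compile(r"'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'")


def wrap_dates(line):
    """Wrap every quoted timestamp in to_date(..., 'yyyy-mm-dd hh24:mi:ss')."""
    return DATE_LITERAL.sub(r"to_date('\1', 'yyyy-mm-dd hh24:mi:ss')", line)


line = ("INSERT INTO `pub_company` VALUES ('1', '0', 'ABC大学', 'B', 'admin', "
        "'2014-10-09 11:40:44', '', '105210', null)")
print(wrap_dates(line))
```

Because the pattern is searched for anywhere in the line, the position of the date value does not matter, which was the original difficulty with TO.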

How to do a parser in Prolog?

I would like to do a parser in prolog. This one should be able to parse something like this:
a = 3 + (6 * 11);
For now I only have this grammar done. It's working but I would like to improve it in order to have id such as (a..z)+ and digit such as (0..9)+.
parse(-ParseTree, +Program, []):-
parsor(+Program, []).
parsor --> [].
parsor --> assign.
assign --> id, [=], expr, [;].
id --> [a] | [b].
expr --> term, (add_sub, expr ; []).
term --> factor, (mul_div, term ; []).
factor --> digit | (['('], expr, [')'] ; []).
add_sub --> [+] | [-].
mul_div --> [*] | [/].
digit --> [0] | [1] | [2] | [3] | [4] | [5] | [6] | [7] | [8] | [9].
Secondly, I would like to store something in the ParseTree variable in order to print the ParseTree like this:
PARSE TREE:
assignment
ident(a)
assign_op
expression
term
factor
int(1)
mult_op
term
factor
int(2)
add_op
expression
term
factor
left_paren
expression
term
factor
int(3)
sub_op
...
And this is the predicate I'm going to use to print the ParseTree:
output_result(OutputFile,ParseTree):-
open(OutputFile,write,OutputStream),
write(OutputStream,'PARSE TREE:'),
nl(OutputStream),
writeln_term(OutputStream,0,ParseTree),
close(OutputStream).
writeln_term(Stream,Tabs,int(X)):-
write_tabs(Stream,Tabs),
writeln(Stream,int(X)).
writeln_term(Stream,Tabs,ident(X)):-
write_tabs(Stream,Tabs),
writeln(Stream,ident(X)).
writeln_term(Stream,Tabs,Term):-
functor(Term,_Functor,0), !,
write_tabs(Stream,Tabs),
writeln(Stream,Term).
writeln_term(Stream,Tabs1,Term):-
functor(Term,Functor,Arity),
write_tabs(Stream,Tabs1),
writeln(Stream,Functor),
Tabs2 is Tabs1 + 1,
writeln_args(Stream,Tabs2,Term,1,Arity).
writeln_args(Stream,Tabs,Term,N,N):-
arg(N,Term,Arg),
writeln_term(Stream,Tabs,Arg).
writeln_args(Stream,Tabs,Term,N1,M):-
arg(N1,Term,Arg),
writeln_term(Stream,Tabs,Arg),
N2 is N1 + 1,
writeln_args(Stream,Tabs,Term,N2,M).
write_tabs(_,0).
write_tabs(Stream,Num1):-
write(Stream,'\t'),
Num2 is Num1 - 1,
write_tabs(Stream,Num2).
writeln(Stream,Term):-
write(Stream,Term),
nl(Stream).
write_list(_Stream,[]).
write_list(Stream,[Ident = Value|Vars]):-
write(Stream,Ident),
write(Stream,' = '),
format(Stream,'~1f',Value),
nl(Stream),
write_list(Stream,Vars).
I hope someone will be able to help me. Thank you !
Here's an enhancement of your parser as written which can get you started. It's an elaboration of the notions that @CapelliC indicated.
parser([]) --> [].
parser(Tree) --> assign(Tree).
assign([assignment, ident(X), '=', Exp]) --> id(X), [=], expr(Exp), [;].
id(X) --> [X], { atom(X) }.
expr([expression, Term]) --> term(Term).
expr([expression, Term, Op, Exp]) --> term(Term), add_sub(Op), expr(Exp).
term([term, F]) --> factor(F).
term([term, F, Op, Term]) --> factor(F), mul_div(Op), term(Term).
factor([factor, int(N)]) --> num(N).
factor([factor, Exp]) --> ['('], expr(Exp), [')'].
add_sub(Op) --> [Op], { memberchk(Op, ['+', '-']) }.
mul_div(Op) --> [Op], { memberchk(Op, ['*', '/']) }.
num(N) --> [N], { number(N) }.
I might have a couple of niggles in here, but the key elements I've added to your code are:
Replaced digit with num which accepts any Prolog term N for which number(N) is true
Used atom(X) to identify a valid identifier
Added an argument to hold the result of parsing the given expression item
As an example:
| ?- phrase(parser(Tree), [a, =, 3, +, '(', 6, *, 11, ')', ;]).
Tree = [assignment,ident(a),=,[expression,[term,[factor,int(3)]],+,[expression,[term,[factor,[expression,[term,[factor,int(6)],*,[term,[factor,int(11)]]]]]]]]] ? ;
This may not be an ideal representation of the parse tree. It may need some adjustment per your needs, which you can do by modifying what I've shown a little. And then you can write a predicate which formats the parse tree as you like.
You could also consider, instead of a list structure, an embedded Prolog term structure as follows:
parser([]) --> [].
parser(Tree) --> assign(Tree).
assign(assignment(ident(X), '=', Exp)) --> id(X), [=], expr(Exp), [;].
id(X) --> [X], { atom(X) }.
expr(expression(Term)) --> term(Term).
expr(expression(Term, Op, Exp)) --> term(Term), add_sub(Op), expr(Exp).
term(term(F)) --> factor(F).
term(term(F, Op, Term)) --> factor(F), mul_div(Op), term(Term).
factor(factor(int(N))) --> num(N).
factor(factor(Exp)) --> ['('], expr(Exp), [')'].
add_sub(Op) --> [Op], { memberchk(Op, ['+', '-']) }.
mul_div(Op) --> [Op], { memberchk(Op, ['*', '/']) }.
num(N) --> [N], { number(N) }.
Which results in something like this:
| ?- phrase(parser(T), [a, =, 3, +, '(', 6, *, 11, ')', ;]).
T = assignment(ident(a),=,expression(term(factor(int(3))),+,expression(term(factor(expression(term(factor(int(6)),*,term(factor(int(11)))))))))) ? ;
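The same tree shape can be reproduced with a hand-written recursive descent parser, which may help when checking what each grammar rule contributes. The Python sketch below is an illustration, not part of the Prolog answer. All function names are made up, and the tokenizer's `[a-z]+` and `[0-9]+` patterns are an assumption taken from the question's wish for multi-character identifiers and numbers.

```python
import re

TOKEN = re.compile(r"\s*([a-z]+|[0-9]+|[=+\-*/();])")


def tokenize(source):
    """Split source text into identifier, number and punctuation tokens."""
    source = source.rstrip()
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def expect(tokens, wanted):
    """Consume `wanted` from the front of tokens or fail."""
    if not tokens or tokens[0] != wanted:
        raise SyntaxError(f"expected {wanted!r}")
    return tokens[1:]


def assign(tokens):
    if not tokens or not tokens[0].isalpha():
        raise SyntaxError("expected identifier")
    name, rest = tokens[0], expect(tokens[1:], "=")
    tree, rest = expr(rest)
    return ("assignment", ("ident", name), "=", tree), expect(rest, ";")


def expr(tokens):
    left, rest = term(tokens)
    if rest and rest[0] in ("+", "-"):
        right, after = expr(rest[1:])
        return ("expression", left, rest[0], right), after
    return ("expression", left), rest


def term(tokens):
    left, rest = factor(tokens)
    if rest and rest[0] in ("*", "/"):
        right, after = term(rest[1:])
        return ("term", left, rest[0], right), after
    return ("term", left), rest


def factor(tokens):
    if tokens and tokens[0].isdigit():
        return ("factor", ("int", int(tokens[0]))), tokens[1:]
    inner, rest = expr(expect(tokens, "("))
    return ("factor", inner), expect(rest, ")")


def parse(source):
    tree, rest = assign(tokenize(source))
    if rest:
        raise SyntaxError("trailing tokens")
    return tree


tree = parse("a = 3 + (6 * 11);")
```

The nested tuples follow the term structure shown above, with the functor name as the first element of each tuple.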
A recursive rule for id//0, made a bit more generic:
id --> [First], {char_type(First,lower)}, id ; [].
Building the tree could be done 'by hand', augmenting each non terminal with the proper term, like
...
assign(assign(Id, Expr)) --> id(Id), [=], expr(Expr), [;].
...
id//0 could become id//1
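id(id(Chars)) --> id_chars(Chars).  % id_chars//1: helper added here so that Rest stays a plain list
id_chars([First|Rest]) --> [First], {memberchk(First, [a,b])}, ( id_chars(Rest) ; {Rest=[]} ).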
If you're going to code such parsers frequently, a rewrite rule can be easily implemented...

Prolog - Parsing

I'm new to the language prolog and have been given an assignment regarding parsing in prolog. I need some help in solving the problem.
In the assignment we have the grammar:
Expr ::= + Expr Expr | * Expr Expr | Num | Xer
Xer ::= x | ^ x Num
Num ::= 2 | 3 | .... an integer (bigger than 1) ...
The token ^ is the same as in math: 5^2 equals 25.
Parse needs to work both ways: a call with an instantiated list should generate an Ast, while
a call with an instantiated Ast should generate the corresponding prefix list.
My assignment says that I need to write a prefix parser that does this:
Example (with the value of Ast removed):
?- parse([+, *, 2, x, ^, x, 5 ], Ast), parse(L, Ast).
Ast = ...,
L = [+, *, 2, x, ^, x, 5]
I would also like to know what the parse tree will look like.
Prolog has a particular formalism to handle context-free grammars directly: DCGs (Definite Clause Grammars). Your example translates almost immediately into a DCG:
expr --> [+], expr, expr | [*], expr, expr | num | xer.
xer --> [x] | [^], [x], num.
num --> [2] | [3] | [4] | [5].
Now, you already can test sentences:
?- phrase(expr, [+,*,2,x,^,x,5]).
true
; false.
?- phrase(expr, [+,*,*,2,x,^,x,5]).
false.
You can even generate all possible sentences like so:
?- length(L, N), phrase(expr, L).
L = [2], N = 1
; L = [3], N = 1
; ... .
And, finally, you can add the abstract syntax tree to your definition.
expr(plus(A,B)) --> [+], expr(A), expr(B).
expr(mul(A,B)) --> [*], expr(A), expr(B).
expr(Num) --> num(Num).
expr(Xer) --> xer(Xer).
xer(var(x)) --> [x].
xer(pow(var(x),N)) --> [^], [x], num(N).
num(num(2)) --> [2].
num(num(3)) --> [3].
num(num(4)) --> [4].
num(num(5)) --> [5].
So now you can use it as desired:
?- phrase(expr(AST), [+,*,2,x,^,x,5]), phrase(expr(AST),L).
AST = plus(mul(num(2),var(x)),pow(var(x),num(5))),
L = [+,*,2,x,^,x,5]
; false.
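In Prolog the DCG runs in both directions for free. In a language without that property the two directions have to be written separately, which makes the correspondence between token list and AST explicit. The following Python sketch shows that under some assumptions: `parse`, `unparse` and the tuple-based AST are invented for the illustration, and Num is taken to be any integer greater than 1, as the grammar in the question states.

```python
def _is_num(token):
    """Num in the grammar: an integer bigger than 1 (bool is excluded)."""
    return isinstance(token, int) and not isinstance(token, bool) and token > 1


def _expr(tokens):
    """Parse one Expr from the front of tokens; return (ast, remaining)."""
    if not tokens:
        raise ValueError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if head == "+" or head == "*":
        left, rest = _expr(rest)
        right, rest = _expr(rest)
        return ("plus" if head == "+" else "mul", left, right), rest
    if head == "x":
        return ("var", "x"), rest
    if head == "^":
        if len(rest) < 2 or rest[0] != "x" or not _is_num(rest[1]):
            raise ValueError("expected: ^ x Num")
        return ("pow", ("var", "x"), ("num", rest[1])), rest[2:]
    if _is_num(head):
        return ("num", head), rest
    raise ValueError(f"unexpected token {head!r}")


def parse(tokens):
    """Prefix token list -> AST."""
    ast, rest = _expr(list(tokens))
    if rest:
        raise ValueError("trailing tokens")
    return ast


def unparse(ast):
    """AST -> prefix token list (the inverse of parse)."""
    tag = ast[0]
    if tag == "plus":
        return ["+"] + unparse(ast[1]) + unparse(ast[2])
    if tag == "mul":
        return ["*"] + unparse(ast[1]) + unparse(ast[2])
    if tag == "pow":
        return ["^", "x", ast[2][1]]
    if tag == "var":
        return ["x"]
    if tag == "num":
        return [ast[1]]
    raise ValueError(f"unknown node {tag!r}")


ast = parse(["+", "*", 2, "x", "^", "x", 5])
tokens = unparse(ast)
```

The AST has the same shape as the Prolog term `plus(mul(num(2),var(x)),pow(var(x),num(5)))`.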
Just a nitpick: The interface predicate to DCGs is phrase/2 not parse/2.
