Handling information from a Prolog sentence parser
I have created a sentence parser in Prolog. It successfully parses sentences that are entered as:
?- sentence([input,sentence,here],Parse).
This is the code I'm using to parse the sentence:
np([X|T],np(det(X),NP2),Rem):-          /* Det NP2 */
    det(X),
    np2(T,NP2,Rem).
np(Sentence,Parse,Rem):-                /* NP2 */
    np2(Sentence,Parse,Rem).
np(Sentence,np(NP,PP),Rem):-            /* NP PP */
    np(Sentence,NP,Rem1),
    pp(Rem1,PP,Rem).

np2([H|T],np2(noun(H)),T):-             /* Noun */
    noun(H).
np2([H|T],np2(adj(H),Rest),Rem):-
    adj(H),
    np2(T,Rest,Rem).

pp([H|T],pp(prep(H),Parse),Rem):-       /* PP NP */
    prep(H),
    np(T,Parse,Rem).

vp([H|[]], vp(verb(H))):-               /* Verb */
    verb(H).
vp([H|T], vp(verb(H), Rem)):-           /* VP PP */
    vp(H, Rem),
    pp(T, Rem, _).
vp([H|T], vp(verb(H), Rem)):-           /* Verb NP */
    verb(H),
    np(T, Rem, _).
I should mention that the output would be:

sentence(np(det(a), np2(adj(very), np2(adj(young), np2(noun(boy))))),
         vp(verb(loves), np(det(a), np2(adj(manual), np2(noun(problem)))))).
Using the predefined vocabulary: det(a), adj(very), adj(young), noun(boy), verb(loves), det(a), adj(manual), noun(problem).
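Spelled out as Prolog facts, that vocabulary would look like this (a minimal sketch; the duplicate det(a) collapses into a single fact, and the prep/1 facts that the pp rules consult are not listed in the question):

det(a).
adj(very).
adj(young).
adj(manual).
noun(boy).
noun(problem).
verb(loves).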
What I want to do is pass the parsed output to predicates that would separate the words into three different categories, which are "subject, verb, and object".
(1) The subject will hold the first two adjectives and then a noun.
(2) The verb will hold the verb from the "verb phrase".
(3) And the object will hold the adjectives and nouns in the "verb phrase".
All determiners should be ignored.
For example, I would want a predicate that would look for adjectives in the output.
I have tried many things to try and get this working but none work. Any help will be much appreciated.
I am making a second attempt, then.
the output would be: sentence(np(det(a), np2(adj(very), np2(adj(young), np2(noun(boy))))), vp(verb(loves), np(det(a), np2(adj(manual), np2(noun(problem)))))). [...] What I want to do is pass the parsed output to predicates that would separate the words into three different categories, which are "subject, verb, and object".
You can write procedures like the following, which map from your structures to lists of words.
handle_sent(sentence(NP1,vp(V,NP2)),Subj,Verb,Obj) :-
    handle_np(NP1,Subj),
    handle_verb(V,Verb),
    handle_np(NP2,Obj).

handle_verb(verb(V),[V]).

handle_np(np(_,np2(adj(A),np2(noun(N)))),[A,N]).
handle_np(np(_,np2(adj(A1),np2(adj(A2),np2(noun(N))))),[A1,A2,N]).
This produces:
?- handle_sent(...,Subj,Verb,Obj).
Subj = [very,young,boy]
Verb = [loves]
Obj = [manual,problem]
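If a noun phrase may carry any number of adjectives, writing one handle_np clause per adjective count quickly gets tedious. A recursive walk over the nested np2 structure covers all cases, including a bare noun; a minimal sketch (handle_np2 is a helper name introduced here, not part of the answer above):

handle_np(np(_,NP2),Words) :-
    handle_np2(NP2,Words).

% Base case: the innermost np2 wraps just the noun.
handle_np2(np2(noun(N)),[N]).
% Recursive case: peel off one adjective and descend.
handle_np2(np2(adj(A),Rest),[A|Words]) :-
    handle_np2(Rest,Words).

With the parse term above, this still yields Subj = [very,young,boy].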
The DCG below produces this behaviour:
?- s(Sem,[a,young,boy,loves,a,manual,problem],[]).
Sem = [noun(boy),verb(loves),noun(problem)]
There are some problems in your grammar. Your third np clause calls itself directly, without consuming any input first, which causes an infinite loop (left recursion). The grammar you posted also does not seem to be able to produce your stated output (the "very young" part). Anyway, here is the DCG:
s(Sem) -->
    np(Sem1), vp(Sem2),
    { append(Sem1,Sem2,Sem) }.

np(Sem) -->
    [W], { det(W) },
    np2(Sem).
np(Sem) -->
    np2(Sem).
np(Sem) -->
    np2(Sem1),
    pp(Sem2), { append(Sem1,Sem2,Sem) }.

np2([noun(W)]) -->
    [W], { noun(W) }.
np2(Sem) -->
    [W], { adj(W) },
    np2(Sem).

pp(Sem) -->
    [W], { prep(W) },
    np(Sem).

vp([verb(W)]) -->
    [W], { verb(W) }.
vp(Sem) -->
    [W], { verb(W) },
    np(Sem0), { Sem = [verb(W)|Sem0] }.
vp(Sem) -->
    [W], { verb(W) },
    pp(Sem0), { Sem = [verb(W)|Sem0] }.
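For reference, the same query can also be posed through phrase/2, the standard interface for running a DCG, which expands to the s/3 call shown earlier:

?- phrase(s(Sem), [a,young,boy,loves,a,manual,problem]).
Sem = [noun(boy),verb(loves),noun(problem)]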
Addition: if you want to handle modification (e.g. adjectives), then there are simple obvious solutions that quickly become impractical, and then there are more general techniques, like adding a logical variable to the np.
np(X,Sem) -->
    [W], { det(W) },
    np2(X,Sem).
np(X,Sem) -->
    np2(X,Sem).
np(X,Sem) -->
    np2(X,Sem1),
    pp(Sem2), { append(Sem1,Sem2,Sem) }.

np2(X,[noun(X,W)]) -->
    [W], { noun(W) }.
np2(X,[adj(X,W)|Sem]) -->
    [W], { adj(W) },
    np2(X,Sem).
This variable (X) is never instantiated; it only serves to link the parts of the noun phrase meaning together.
?- s(Sem,[a,young,boy,loves,a,manual,problem],[]).
Sem = [adj(_A,young),noun(_A,boy),verb(loves),adj(_B,manual),noun(_B,problem)]
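Those shared variables can then be queried to recover which adjectives modify which noun. A minimal sketch (noun_adjs is a helper name introduced here; the == test is needed because the link variables stay unbound, so plain unification would conflate them):

% All adjectives whose link variable is identical to the noun's.
noun_adjs(Noun, Sem, Adjs) :-
    member(noun(X,Noun), Sem),
    findall(A, ( member(adj(Y,A),Sem), X == Y ), Adjs).

?- noun_adjs(problem,
             [adj(_A,young),noun(_A,boy),verb(loves),adj(_B,manual),noun(_B,problem)],
             Adjs).
Adjs = [manual]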
There are various additional possibilities. Good books are Gazdar & Mellish, Natural Language Processing in Prolog; Norvig, Paradigms of Artificial Intelligence Programming (if you speak Lisp); as well as Pereira & Shieber, Prolog and Natural-Language Analysis.
Addition #2: after reading your question again, and this other question, I realised that you actually want three separate lists. No problem.
s(L1,L2,L3) -->
    np(_,L1), vp(L2,L3).

np(X,L) -->
    [W], { det(W) },
    np2(X,L).
np(X,L) -->
    np2(X,L).
np(X,L) -->
    np2(X,L),
    pp(_).

np2(X,[noun(X,W)]) -->
    [W], { noun(W) }.
np2(X,[adj(X,W)|L]) -->
    [W], { adj(W) },
    np2(X,L).

pp(L) -->
    [W], { prep(W) },
    np(_,L).

vp([verb(W)],[]) -->
    [W], { verb(W) }.
vp([verb(W)],L) -->
    [W], { verb(W) },
    np(_,L).
vp([verb(W)],L) -->
    [W], { verb(W) },
    pp(L).
Output:
| ?- s(L1,L2,L3,[a,young,boy,loves,a,manual,problem],[]).
L1 = [adj(_A,young),noun(_A,boy)],
L2 = [verb(loves)],
L3 = [adj(_B,manual),noun(_B,problem)] ?
Now maybe you don't need the logical variables, but on the other hand, you could have more complicated modifiers, like "a young boy loves a manual problem involving red bolts and white cubes". Then the variables would keep track of which adjective modifies which noun.
Related
How can I make a time parsing predicate work in both directions?
Using SWI-Prolog I have made this simple predicate that relates a time in hh:mm format to a time term:

time_string(time(H,M), String) :-
    number_string(H,Hour),
    number_string(M,Min),
    string_concat(Hour,":",S),
    string_concat(S,Min,String).

The predicate though can only work in one direction.

?- time_string(time(10,30),String).
String = "10:30".    % This is perfect.

Unfortunately this query fails:

?- time_string(Time,"10:30").
ERROR: Arguments are not sufficiently instantiated
ERROR: In:
ERROR:   [11] number_string(_8690,_8692)
ERROR:   [10] time_string(time(_8722,_8724),"10:30") at /tmp/prolcompDJBcEE.pl:74
ERROR:    [9] toplevel_call(user:user: ...) at /usr/local/logic/lib/swipl/boot/toplevel.pl:1107

It would be really nice if I didn't have to write a whole new predicate to answer this query. Is there a way I could do this?
Well, going from the structured term time(H,M) to the string String is easier than going from the unstructured String to the term time(H,M). Your predicate works in the "generation" direction; for the other direction, you want to parse the String.

In this case, parsing is computationally easy and can be done without search/backtracking, which is nice! Use Prolog's "Definite Clause Grammar" syntax, which is "just" a nice way to write predicates that process a "list of stuff". In this case the list of stuff is a list of characters (atoms of length 1). (For the relevant page from SWI-Prolog, see here.)

With some luck, the DCG code can run backwards/forwards, but this is generally not the case. Real code meeting some demands of efficiency or causality may force it so that under the hood of a single predicate, you first branch by "processing direction", and then run through rather different code structures to deliver the goods. So here: the code immediately "decays" into the parse and generate branches. Prolog does not yet manage to behave fully constraint-based; you just have to do some things before others. Anyway, let's do this:

:- use_module(library(dcg/basics)).

% ---
% "Generate" direction; note that String may be bound to something,
% in which case this clause also verifies whether generating "HH:MM"
% from time(H,M) indeed yields (whatever is denoted by) String.
% ---

process_time(time(H,M),String) :-
    integer(H),                    % Demand that H,M are valid integers inside limits
    integer(M),
    between(0,23,H),
    between(0,59,M),
    !,                             % Guard passed, commit to this code branch
    phrase(time_g(H,M),Chars,[]),  % Build Chars from the time/2 term
    string_chars(String,Chars).    % Merge Chars into a string, unify with String

% ---
% "Parse" direction.
% ---

process_time(time(H,M),String) :-
    string(String),                % Demand that String be a valid string; no demands on H,M
    !,                             % Guard passed, commit to this code branch
    string_chars(String,Chars),    % Explode String into characters
    phrase(time_p(H,M),Chars,[]).  % Parse Chars into H and M

% ---
% "Generate" DCG
% ---

time_g(H,M) --> hour_g(H), [':'], minute_g(M).

hour_g(H)   --> { divmod(H,10,V1,V2), digit_int(D1,V1), digit_int(D2,V2) },
                digit(D1), digit(D2).
minute_g(M) --> { divmod(M,10,V1,V2), digit_int(D1,V1), digit_int(D2,V2) },
                digit(D1), digit(D2).

% ---
% "Parse" DCG
% ---

time_p(H,M) --> hour_p(H), [':'], minute_p(M).

hour_p(H)   --> digit(D1), digit(D2),
                { digit_int(D1,V1), digit_int(D2,V2), H is V1*10+V2, between(0,23,H) }.
minute_p(M) --> digit(D1), digit(D2),
                { digit_int(D1,V1), digit_int(D2,V2), M is V1*10+V2, between(0,59,M) }.

% ---
% Do I really have to code this? Oh well!
% ---

digit_int('0',0).
digit_int('1',1).
digit_int('2',2).
digit_int('3',3).
digit_int('4',4).
digit_int('5',5).
digit_int('6',6).
digit_int('7',7).
digit_int('8',8).
digit_int('9',9).

% ---
% Let's add plunit tests!
% ---

:- begin_tests(hhmm).

test("parse 1", true(T == time(0,0)))   :- process_time(T,"00:00").
test("parse 2", true(T == time(12,13))) :- process_time(T,"12:13").
test("parse 3", true(T == time(23,59))) :- process_time(T,"23:59").
test("generate", true(S == "12:13"))    :- process_time(time(12,13),S).
test("verify", true)                    :- process_time(time(12,13),"12:13").
test("complete", true(H == 12))         :- process_time(time(H,13),"12:13").
test("bad parse", fail)                 :- process_time(_,"66:66").
test("bad generate", fail)              :- process_time(time(66,66),_).

:- end_tests(hhmm).

That's a lot of code. Does it work?

?- run_tests.
% PL-Unit: hhmm ........ done
% All 8 tests passed
true.
Given the simplicity of the pattern, a DCG could be deemed overkill, but it actually provides easy access to the atomic ingredients, which we can feed into some declarative arithmetic library. For instance:

:- module(hh_mm_bi, [hh_mm_bi/2, hh_mm_bi//1]).
:- use_module(library(dcg/basics)).
:- use_module(library(clpfd)).

hh_mm_bi(T,S) :- phrase(hh_mm_bi(T),S).

hh_mm_bi(time(H,M)) --> n2(H,23), ":", n2(M,59).

n2(V,U) --> d(A), d(B), { V #= A*10+B, V #>= 0, V #=< U }.

d(V) --> digit(D), { V #= D-0'0 }.

Some tests:

?- hh_mm_bi(T,`23:30`).
T = time(23, 30).

?- hh_mm_bi(T,`24:30`).
false.

?- phrase(hh_mm_bi(T),S).
T = time(0, 0),
S = [48, 48, 58, 48, 48] ;
T = time(0, 1),
S = [48, 48, 58, 48, 49] ;
...

edit

library(clpfd) is not the only choice we have for declarative arithmetic. Here is another shot, using library(clpBNR), but it requires you to install the appropriate pack, using

?- pack_install(clpBNR).

After this is done, another solution functionally equivalent to the one above could be:

:- module(hh_mm_bnr, [hh_mm_bnr/2, hh_mm_bnr//1]).
:- use_module(library(dcg/basics)).
:- use_module(library(clpBNR)).

hh_mm_bnr(T,S) :- phrase(hh_mm_bnr(T),S).

hh_mm_bnr(time(H,M)) --> n2(H,23), ":", n2(M,59).

n2(V,U) --> d(A), d(B), { V::integer(0,U), {V == A*10+B} }.

d(V) --> digit(D), { {V == D-0'0} }.

edit

The comment (now removed) by @DavidTonhofer has made me think that a far simpler approach is available, moving the 'generation power' into d//1:

:- module(hh_mm, [hh_mm/2, hh_mm//1]).

hh_mm(T,S) :- phrase(hh_mm(T),S).

hh_mm(time(H,M)) --> n2(H,23), ":", n2(M,59).

n2(V,U) --> d(A), d(B), { V is A*10+B, V >= 0, V =< U }.

d(V) --> [C], { member(V,[0,1,2,3,4,5,6,7,8,9]), C is V+0'0 }.
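As a quick sanity check, the last (hh_mm) version runs in both directions on the code lists it uses (a sketch; exact top-level formatting varies between SWI-Prolog versions):

?- hh_mm(T,`09:45`).
T = time(9, 45).

?- hh_mm(time(9,45),Codes), atom_codes(Atom,Codes).
Codes = [48, 57, 58, 52, 53],
Atom = '09:45'.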
time_string(time(H,M),String) :-
    hour(H),
    minute(M),
    number_string(H,Hs),
    number_string(M,Ms),
    string_concat(Hs,":",S),
    string_concat(S,Ms,String).

hour(H)   :- between(0,11,H).
minute(M) :- between(0,59,M).

/*
?- time_string(time(10,30),B).
B = "10:30".

?- time_string(time(H,M),"10:30").
H = 10,
M = 30 ;
false.

?- time_string(time(H,M),S).
H = M, M = 0,
S = "0:0" ;
H = 0, M = 1,
S = "0:1" ;
H = 0, M = 2,
S = "0:2" ;
H = 0, M = 3,
S = "0:3"
% etc.
*/
Yet another answer, avoiding DCGs as overkill for this task. Or rather, the two separate tasks involved here: not every relation can be expressed in a single Prolog predicate, especially not every relation on something as extra-logical as SWI-Prolog's strings.

So here is the solution for one of the tasks, computing strings from times (this is your code, renamed):

time_string_(time(H,M), String) :-
    number_string(H,Hour),
    number_string(M,Min),
    string_concat(Hour,":",S),
    string_concat(S,Min,String).

For example:

?- time_string_(time(11, 59), String).
String = "11:59".

Here is a simple implementation of the opposite transformation:

string_time_(String, time(H, M)) :-
    split_string(String, ":", "", [Hour, Minute]),
    number_string(H, Hour),
    number_string(M, Minute).

For example:

?- string_time_("11:59", Time).
Time = time(11, 59).

And here is a predicate that chooses which of these transformations to use, depending on which arguments are known. The exact condition will depend on the cases that can occur in your application, but it seems reasonable to say that if the string is indeed a string, we want to try to parse it:

time_string(Time, String) :-
    (   string(String)
    ->  % Try to parse the existing string.
        string_time_(String, Time)
    ;   % Hope that Time is a valid time term.
        time_string_(Time, String)
    ).

This will translate both ways:

?- time_string(time(11, 59), String).
String = "11:59".

?- time_string(Time, "11:59").
Time = time(11, 59).
(f)lex the difference between PRINTA$ and PRINT A$
I am parsing BASIC:

530 FOR I=1 TO 9:C(I,1)=0:C(I,2)=0:NEXT I

The patterns that are used in this case are:

FOR     { return TOK_FOR; }
TO      { return TOK_TO; }
NEXT    { return TOK_NEXT; }

(many lines later...)

[A-Za-z_#][A-Za-z0-9_]*[\$%\!#]?  { yylval.s = g_string_new(yytext); return IDENTIFIER; }

(many lines later...)

[ \t\r\l]   { /* eat non-string whitespace */ }

The problem occurs when the spaces are removed, which was common in the era of 8k RAM. So the line that is actually found in Super Star Trek is:

530 FORI=1TO9:C(I,1)=0:C(I,2)=0:NEXTI

Now I know why this is happening: "FORI" is longer than "FOR", and it is a valid IDENTIFIER in my pattern, so it matches IDENTIFIER. The original rule in MS BASIC was that variable names could be only two characters, so there was no *, and the match would fail. But this version also supports GW BASIC and Atari BASIC, which allow variables with long names. So "FORI" is a legal variable name in my scanner, and it matches as the longest hit.

When I look at the manual, the only similar example deliberately returns an error. It seems what I need is "match the ID, but only if it's not the same as a defined %token". Is there such a thing?
It's easy to recognise keywords even if they have an identifier concatenated. What's tricky is deciding in what contexts you should apply that technique.

Here's a simple pattern for recognising keywords, using trailing context:

tail    [[:alnum:]]*[$%!#]?

%%

FOR/{tail}         { return TOK_FOR; }
TO/{tail}          { return TOK_TO; }
NEXT/{tail}        { return TOK_NEXT; }
  /* etc. */
[[:alpha:]]{tail}  { /* Handle an ID */ }

Effectively, that just extends the keyword match without extending the matched token. But I doubt the problem is so simple. How should FORFORTO be tokenised, for example?
ANTLR4 - parsing regex literals in JavaScript grammar
I'm using ANTLR4 to generate a lexer for some JavaScript preprocessor (basically it tokenizes a JavaScript file and extracts every string literal). I used a grammar originally made for ANTLR3, and imported the relevant parts (only the lexer rules) for v4.

I have just one single issue remaining: I don't know how to handle corner cases for RegEx literals, like this:

log(Math.round(v * 100) / 100 + ' msec/sample');

The / 100 + ' msec/ is interpreted as a RegEx literal, because the lexer rule is always active. What I would like is to incorporate this logic (C# code; I would need JavaScript, but I simply don't know how to adapt it):

/// <summary>
/// Indicates whether regular expression (yields true) or division expression recognition (false) in the lexer is enabled.
/// These are mutually exclusive, and the decision which is active in the lexer is based on the previous on-channel token.
/// When the previous token can be identified as a possible left operand for a division this results in false, otherwise true.
/// </summary>
private bool AreRegularExpressionsEnabled
{
    get
    {
        if (Last == null)
        {
            return true;
        }

        switch (Last.Type)
        {
            // identifier
            case Identifier:
            // literals
            case NULL:
            case TRUE:
            case FALSE:
            case THIS:
            case OctalIntegerLiteral:
            case DecimalLiteral:
            case HexIntegerLiteral:
            case StringLiteral:
            // member access ending
            case RBRACK:
            // function call or nested expression ending
            case RPAREN:
                return false;

            // otherwise OK
            default:
                return true;
        }
    }
}

This rule was present in the old grammar as an inline predicate, like this:

RegularExpressionLiteral
    : { AreRegularExpressionsEnabled }?=> DIV RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;

But I don't know how to use this technique in ANTLR4. In the ANTLR4 book, there are some suggestions about solving this kind of problem at the parser level (chapter 12.2, context-sensitive lexical problems), but I don't want to use a parser. I want just to extract all the tokens, leave everything untouched except for the string literals, and keep the parsing out of my way.

Any suggestion would be really appreciated, thanks!
I'm posting here the final solution, developed by adapting the existing one to the new syntax of ANTLR4, and addressing the differences in JavaScript syntax. I'm posting just the relevant parts, to give a clue to someone else about a working strategy.

The rule was edited as follows:

RegularExpressionLiteral
    : DIV {this.isRegExEnabled()}? RegularExpressionFirstChar RegularExpressionChar* DIV IdentifierPart*
    ;

The function isRegExEnabled is defined in a @members section on top of the lexer grammar, as follows:

@members {
EcmaScriptLexer.prototype.nextToken = function() {
    var result = antlr4.Lexer.prototype.nextToken.call(this, arguments);
    if (result.channel !== antlr4.Lexer.HIDDEN) {
        this._Last = result;
    }
    return result;
}

EcmaScriptLexer.prototype.isRegExEnabled = function() {
    var la = this._Last ? this._Last.type : null;
    return la !== EcmaScriptLexer.Identifier &&
        la !== EcmaScriptLexer.NULL &&
        la !== EcmaScriptLexer.TRUE &&
        la !== EcmaScriptLexer.FALSE &&
        la !== EcmaScriptLexer.THIS &&
        la !== EcmaScriptLexer.OctalIntegerLiteral &&
        la !== EcmaScriptLexer.DecimalLiteral &&
        la !== EcmaScriptLexer.HexIntegerLiteral &&
        la !== EcmaScriptLexer.StringLiteral &&
        la !== EcmaScriptLexer.RBRACK &&
        la !== EcmaScriptLexer.RPAREN;
}
}

As you can see, two functions are defined. One is an override of the lexer's nextToken method, which wraps the existing nextToken and saves the last non-comment, non-whitespace token for reference. Then, the semantic predicate invokes isRegExEnabled, checking whether the last significant token is compatible with the presence of a RegEx literal. If it's not, it returns false.

Thanks to Lucas Trzesniewski for the comment: it pointed me in the right direction, and to Patrick Hulsmeijer for the original work on v3.
Why won't my JavaCC lexer/parser accept this input?
I am creating a lexer/parser which should accept strings that belong to an infinite set of languages. One such string is "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>". The set of languages is defined as follows:

Base language, L0
A string from L0 consists of several blocks separated by space characters. At least one block must be present. A block is an odd-length sequence of lowercase letters (a-z). No spaces are allowed before the first block or after the last one. The number of spaces between blocks must be odd.
Example of a string belonging to L0: zyx abcba   m xyzvv. There is one space character between zyx and abcba, there are three spaces between abcba and m, and only one between m and xyzvv. No other space characters are present in the string.

Language L1
A string from L1 consists of several blocks separated by space characters. At least one block must be present. There are two kinds of blocks. A block of the first kind must be an even-length sequence of uppercase letters (A-Z). A block of the second kind must have the shape <2U>...</2U>, where ... stands for any string from L0. No spaces are allowed before the first block or after the last one. The number of spaces between blocks must be odd.
Example of a string belonging to L1: YZ     <2U>abc   zzz</2U> ABBA <2U>kkkkk</2U> KM. Note that five spaces separate YZ and <2U>abc zzz</2U>, and three spaces divide abc from zzz. Otherwise single spaces are used as separators. There is no space in front of YZ and no space follows KM.

Language L2
A string from L2 consists of several blocks separated by space characters. At least one block must be present. There are two kinds of blocks. A block of the first kind must be an odd-length sequence of lowercase letters (a-z). A block of the second kind must have the shape <2L>...</2L>, where ... stands for any string from L1. No spaces are allowed before the first block or after the last one. The number of spaces between blocks must be odd.
Example of a string belonging to L2: abc <2L>AA ZZ <2U>a bcd</2U></2L> z <2L><2U>abcde</2U></2L>. Single spaces are used as separators inside the sentence given above, but any other odd number of spaces would also lead to a valid L2 sentence.

Languages L{2k + 1}, k > 0
A string from L{2k + 1} consists of several blocks separated by space characters. At least one block must be present. There are two kinds of blocks. A block of the first kind must be an even-length sequence of uppercase letters (A-Z). A block of the second kind must have the shape <2U>...</2U>, where ... stands for any string from L{2k}. No spaces are allowed before the first block or after the last one. The number of spaces between blocks must be odd.

Languages L{2k + 2}, k > 0
A string from L{2k + 2} consists of several blocks separated by space characters. At least one block must be present. There are two kinds of blocks. A block of the first kind must be an odd-length sequence of lowercase letters (a-z). A block of the second kind must have the shape <2L>...</2L>, where ... stands for any string from L{2k + 1}. No spaces are allowed before the first block or after the last one. The number of spaces between blocks must be odd.

The code for my lexer/parser is as follows:

PARSER_BEGIN(Assignment)

/** A parser which determines if user's input belongs to any one of the set of acceptable languages. */
public class Assignment
{
    public static void main(String[] args)
    {
        try
        {
            Assignment parser = new Assignment(System.in);
            parser.Start();
            System.out.println("YES"); // If the user's input belongs to any of the set of acceptable languages, then print YES.
        }
        catch (ParseException e)
        {
            System.out.println("NO"); // If the user's input does not belong to any of the set of acceptable languages, then print NO.
        }
    }
}

PARSER_END(Assignment)

//** A token which matches any lowercase letter from the English alphabet. */
TOKEN :
{
    < #L_CASE_LETTER: ["a"-"z"] >
}

//* A token which matches any uppercase letter from the English alphabet. */
TOKEN :
{
    < #U_CASE_LETTER: ["A"-"Z"] >
}

//** A token which matches an odd number of lowercase letters from the English alphabet. */
TOKEN :
{
    < ODD_L_CASE_LETTER: <L_CASE_LETTER> (<L_CASE_LETTER> <L_CASE_LETTER>)* >
}

//** A token which matches an even number of uppercase letters from the English alphabet. */
TOKEN :
{
    < EVEN_U_CASE_LETTERS: (<U_CASE_LETTER> <U_CASE_LETTER>)+ >
}

//* A token which matches the string "<2U>". */
TOKEN :
{
    < OPEN_UPPER: "<2U>" >
}

//* A token which matches the string "</2U>". */
TOKEN :
{
    < CLOSE_UPPER: "</2U>" >
}

//* A token which matches the string "<2L>". */
TOKEN :
{
    < OPEN_LOWER: "<2L>" >
}

//* A token which matches the string "</2L>". */
TOKEN :
{
    < CLOSE_LOWER: "</2L>" >
}

//* A token which matches an odd number of white spaces. */
TOKEN :
{
    < ODD_WHITE_SPACE: " " (" " " ")* >
}

//* A token which matches an EOL character. */
TOKEN :
{
    < EOL: "\n" | "\r" | "\r\n" >
}

/** This production matches strings which belong to the base language L^0. */
void Start() :
{}
{
    LOOKAHEAD(3) <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)* <EOL> <EOF>
    | NextLanguage()
    | LOOKAHEAD(3) NextLanguageTwo()
    | EvenLanguage()
}

/** This production matches strings which belong to language L^1. */
void NextLanguage() :
{}
{
    (<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>
    | (<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())* <EOL> <EOF>
}

/** This production matches either an even number of uppercase letters, or a string from L^0, encased within the tags <2U> and </2U>. */
void UpperOrPseudoStart() :
{}
{
    <EVEN_U_CASE_LETTERS>
    | <OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>
}

/** This production matches strings from L^0, in a similar way to Start(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoStart() :
{}
{
    <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> <ODD_L_CASE_LETTER>)*
}

/** This production matches strings which belong to language L^2. */
void NextLanguageTwo() :
{}
{
    (<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>
    | (<OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>)+ (<ODD_WHITE_SPACE> LowerOrPseudoNextLanguage())* <EOL> <EOF>
}

/** This production matches either an odd number of lowercase letters, or a string from L^1, encased within the tags <2L> and </2L>. */
void LowerOrPseudoNextLanguage() :
{}
{
    <ODD_L_CASE_LETTER>
    | <OPEN_LOWER> PseudoNextLanguage() <CLOSE_LOWER>
}

/** This production matches strings from L^1, in a similar way to NextLanguage(); however, the strings that it matches do not have EOL or EOF characters after them. */
void PseudoNextLanguage() :
{}
{
    (<OPEN_UPPER> (PseudoStart()) <CLOSE_UPPER>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*
    | (<EVEN_U_CASE_LETTERS>)+ (<ODD_WHITE_SPACE> UpperOrPseudoStart())*
}

/** This production matches strings which belong to any of the languages L^{2k + 2}, where k > 0 (the infinite set of even languages). */
void EvenLanguage() :
{}
{
    (<ODD_L_CASE_LETTER>)+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>
    | (CommonPattern())+ (<ODD_WHITE_SPACE> EvenLanguageAuxiliary())* <EOL> <EOF>
}

/** This production is an auxiliary production that helps when parsing strings from any of the even set of languages. */
void EvenLanguageAuxiliary() :
{}
{
    CommonPattern()
    | <ODD_L_CASE_LETTER>
}

void CommonPattern() :
{}
{
    <OPEN_LOWER> <EVEN_U_CASE_LETTERS> <ODD_WHITE_SPACE> <OPEN_UPPER> <ODD_L_CASE_LETTER> (<ODD_WHITE_SPACE> CommonPattern())+ <CLOSE_UPPER> <CLOSE_LOWER>
}

Several times now, I have inputted the string "a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>". However, each time, NO is printed out on the terminal. I have looked through my code carefully several times, checking the order in which I think the input string should be parsed; but I haven't been able to find any errors in my logic or reasons why the string isn't being accepted. Could I have some suggestions as to why it isn't being accepted, please?
The following steps helped to solve the problem. Run the following:

javacc -debug_parser Assignment.jj
javac Assignment*.java

Then, run the lexer/parser (by typing java Assignment) and input the string:

a <2L>AA <2U>a <2L>AA <2U>a</2U></2L></2U></2L>

The resulting trace of parser actions shows that the production NextLanguageTwo() is called on this string, rather than the desired EvenLanguage() production. Tracing through NextLanguageTwo() shows that it matches the first eight tokens in the input string. So, using a lookahead of 9, although inefficient, causes the input string to be accepted. That is, modify the Start() production by changing the second lookahead value (just above the call to NextLanguageTwo()) from 3 to 9.
Are any of your inputs being accepted? I have copied your code over to my computer and have found that, for any correct input (as far as I can tell from the definition of your language), it always outputs 'NO'.
How to get such pattern matching of regular expression in lex
Hi, I want to check a specific pattern with a regular expression, but I have failed to do so. Input should be like:

noun wordname:wordmeaning

I'm successful in getting noun and wordname, but couldn't design a pattern for the word meaning. My code is:

int state;
char *meaning;
char *wordd;

^verb   { state = VERB; }
^adj    { state = ADJ; }
^adv    { state = ADV; }
^noun   { state = NOUN; }
^prep   { state = PREP; }
^pron   { state = PRON; }
^conj   { state = CONJ; }

    //my try but failed
[:\a-z]     { meaning=yytext; printf(" Meaning is getting detected %s", meaning); }

[a-zA-Z]+   { word=yytext; }

Example input:

noun john:This is a name

Now word should be equal to john, and meaning should be equal to This is a name.
Agreeing that lex states (also known as start conditions) are the way to go (odd, but there are no useful tutorials). Briefly, your application can be organized as states, using one for "noun", one for "john" and one for the definition (after the colon):

- At the top of the lex file, declare the states, e.g.,

  %s TYPE NAME VALUE

  The capitals are not necessary, but since you are defining constants, that is a good convention.

- Next to the patterns, put those state names in < > brackets to tell lex that the patterns are used only in those states. You can list more than one state, comma-separated, when it matters. But your lex file probably does not need that. One state is predefined: INITIAL.

- Your program switches states using the BEGIN() macro, in actions, e.g.,

  { BEGIN(TYPE); }

- If your input is well-formed, it's simple: as each "type" is recognized, it begins the NAME state.

- In the NAME state, your lexer looks for whatever you think a name should be, e.g.,

  <NAME>[[:alpha:]][[:alnum:]]+  { my_name = strdup(yytext); }

- The name ends with a colon, so

  <NAME>":"  { BEGIN(VALUE); }

- The value is then everything until the end of the line, e.g.,

  <VALUE>.*  { my_value = strdup(yytext); BEGIN(INITIAL); }

- Whether you switch to INITIAL or TYPE depends on what other things you might add to your lexer (such as ignoring comment lines and whitespace).

Further reading:
- Start conditions (flex documentation)
- Introduction to Flex