antlr4 assistance - parsing

I'm new to antlr4 and am trying to write code that looks through a .txt file, finds keywords (set to "PARTY" for testing), and then stores everything after each keyword, stopping at a new line (excluding the '|' symbol).
I'm running the code in IntelliJ with the antlr4 plugin, and for some reason it reads the first line, builds a parse tree for it, and then stops.

According to your grammar, each line should start with one or more occurrences of the keyword PARTY, but your first and second line don't start with that. That's why it's complaining about the "missing P".
Another problem is that since you're hiding the NWL tokens, you can't use them in the grammar. If you want newlines to be significant in your grammar, you should not hide them. In other words you should remove the {channel=HIDDEN;} bit.
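As a point of comparison, here is a minimal Python sketch of the behaviour the question describes, written without ANTLR. The function name `extract_after_keyword` and the sample text are made up for illustration; it assumes the keyword may appear anywhere on a line and that lines without it are simply ignored, which is the line-oriented behaviour a grammar only gets once newlines are kept as visible tokens.

```python
def extract_after_keyword(text, keyword="PARTY"):
    """Collect whatever follows `keyword` on each line, dropping '|' symbols."""
    results = []
    for line in text.splitlines():       # the newline is the record separator
        idx = line.find(keyword)
        if idx == -1:
            continue                      # lines without the keyword are skipped
        rest = line[idx + len(keyword):]  # everything after the keyword
        results.append(rest.replace("|", "").strip())
    return results


# Hypothetical sample input: only two of the four lines contain the keyword.
sample = "HEADER line\nPARTY | Alice Smith\nnotes\nPARTY | Bob Jones\n"
print(extract_after_keyword(sample))
```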

Related

Antlr: lookahead and lookbehind examples

I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input or preceded by a space, and followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
EDIT:
Per the comments, here's some context. The problem is related to this problem: Antlr: how to match everything between the other recognized tokens? My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered, and run a totally different tokenizer on it. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't, in advance, describe what is an ID. Each human language is different. I want to combine, in stages, a single query-language tokenizer, and then apply a human-language tokenizer to what's left.
ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.
What you should consider instead is NLP, a different approach to processing natural language. It involves more than just skipping things between two known tokens.
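For the narrower question of lookahead and lookbehind, the recognition rule can be sketched with regular-expression lookarounds. This is Python's `re` module, not ANTLR syntax, and it simplifies the rule to "not directly adjacent to a letter or digit"; the names `OPERATOR` and `find_operators` are made up for illustration. Lookarounds assert context without consuming it, so the parentheses stay available as separate tokens.

```python
import re

# An operator counts only when it is not glued to a letter or digit on either side.
# (?<!...) looks behind and (?!...) looks ahead; neither consumes any characters.
OPERATOR = re.compile(r"(?<![A-Za-z0-9])(AND|OR|NOT)(?![A-Za-z0-9])")


def find_operators(text):
    """Return the operators recognized in `text`, in order of appearance."""
    return [m.group(1) for m in OPERATOR.finditer(text)]


for example in ["x AND y", "(x)AND(y)", "NOT x", "NOT(x)", "xANDy", "abcNOTdef"]:
    print(repr(example), find_operators(example))
```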

Stanford CoreNLP merge tokens

I found the powerful RegexNER and its superset TokensRegex from Stanford CoreNLP.
There are some rules that should give me fine results, like the pattern for PERSONs with titles:
"g. Meho Mehic" or "gdin. N. Neko" (g. and gdin. are abbrevs in Bosnian for mr.).
I'm having some trouble with the existing tokenizer. It splits some strings into two tokens and leaves others as one. For example, the token "g." is left as a single word, <word>g.</word>, while the token "gdin." is split into two tokens: <word>gdin</word> and <word>.</word>.
That causes trouble with my regex, I have to deal with one-token and multi-token cases (note the two "maybe-dot"s), RegexNER example:
( /g\.?|gdin\.?/ /\./? ([{ word:/[A-Z][a-z]*\.?/ }]+) ) PERSON
This also causes another issue, with sentence splitting: some sentences are not recognized correctly, so the regex fails. For example, when a sentence contains "gdin." the tokenizer splits it in two, so the dot ends a (non-existent) sentence. I managed to bypass this with ssplit.isOneSentence = true for now.
Questions:
Do I have to make my own tokenizer, and how? (to merge some tokens like "gdin.")
Are there any settings I missed that could help me with this?
OK, I thought about this for a bit and can actually think of something pretty straightforward for your case. One thing you could do is add "gdin" to the list of titles in the tokenizer.
The tokenizer rules are in edu.stanford.nlp.process.PTBLexer.flex (look at line 741)
I do not really understand the tokenizer that well, but clearly there is a list of job titles in there, so those must be cases where it will not split off the period.
This will of course require you to work with a custom build of Stanford CoreNLP.
You can get the full code at our GitHub: https://github.com/stanfordnlp/CoreNLP
There are instructions on the main page for building a jar with all of the main Stanford CoreNLP classes. I think if you just run the ant process it will automatically generate the new PTBLexer.java based on PTBLexer.flex.
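If a custom build is not an option, a different technique is to re-join the split tokens after tokenization. The sketch below is plain Python over a list of token strings. It does not use any CoreNLP API, and the function name `merge_abbreviations` and the abbreviation set are assumptions for illustration. It also does nothing for sentence splitting, which would still see the separate dot unless the merge happens before that step.

```python
def merge_abbreviations(tokens, abbreviations=frozenset({"g", "gdin"})):
    """Re-join a known abbreviation with a '.' token that was split off from it."""
    merged = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        followed_by_dot = i + 1 < len(tokens) and tokens[i + 1] == "."
        if token in abbreviations and followed_by_dot:
            merged.append(token + ".")  # "gdin", "." becomes "gdin."
            i += 2                      # skip the dot that was just absorbed
        else:
            merged.append(token)
            i += 1
    return merged


print(merge_abbreviations(["gdin", ".", "N.", "Neko"]))
print(merge_abbreviations(["g.", "Meho", "Mehic"]))
```

With every title reduced to a single token, the pattern no longer needs the second "maybe-dot".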

Incremental Parsing from Handle in Haskell

I'm trying to interface Haskell with a command line program that has a read-eval-print loop. I'd like to put some text into an input handle, and then read from an output handle until I find a prompt (and then repeat). The reading should block until a prompt is found, but no longer. Instead of coding up my own little state machine that reads one character at a time until it constructs a prompt, it would be nice to use Parsec or Attoparsec. (One issue is that the prompt changes over time, so I can't just check for a constant string of characters.)
What is the best way to read the appropriate amount of data from the output handle and feed it to a parser? I'm confused because most of the handle-reading primitives require me to decide beforehand how much data I want to read. But it's the parser that should decide when to stop.
You seem to have two questions wrapped up in here. One is about incremental parsing, and one is about incremental reading.
Attoparsec supports incremental parsing directly. See the IResult type in Data.Attoparsec.Text. Parsec, alas, doesn't. You can run your parser on what you have, and if it gives an error, add more input and try again, but you really don't know whether the error was an unrecoverable parse error or just a need for more input.
In your case, REPLs usually read one line at a time. Hence you can use hGetLine to read a line, pass it to Attoparsec, and if it parses, evaluate it; if not, get another line.
If you want to see all this in action, I do this kind of thing in Plush.Job.Output, but with three small differences: 1) I'm parsing byte streams, not strings. 2) I've set it up to pull as much as is available from the input and parse as many items as I can. 3) I'm reading directly from file descriptors. But the same structure should help you do it in your situation.
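The read-a-bit, try-to-parse, read-more loop can be sketched independently of Haskell. The Python below is only an illustration of that control flow, not Attoparsec's API. The prompt pattern `PROMPT`, the function `read_until_prompt`, and the simulated chunks are all assumptions, with an iterator of strings standing in for the output handle.

```python
import re

# Hypothetical prompt shape such as "repl1> " or "repl2> "; \Z anchors it to the
# end of the data read so far, so the prompt text itself may vary over time.
PROMPT = re.compile(r"\w+> \Z")


def read_until_prompt(chunks, prompt=PROMPT):
    """Feed chunks to the matcher until the buffer ends with a prompt.

    Returns (output_before_prompt, prompt_text). The matcher, not the reader,
    decides when enough input has arrived.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        match = prompt.search(buffer)
        if match:
            return buffer[:match.start()], match.group(0)
    raise EOFError("stream ended before a prompt was seen")


# Simulated handle: the prompt arrives split across two reads.
stream = iter(["resu", "lt: 42\nrep", "l2> "])
print(read_until_prompt(stream))
```

One caveat of this sketch: a chunk of ordinary output that happens to end in something prompt-shaped would be accepted early, so a real prompt pattern should be as specific as possible.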

xtext - how to set code areas, with excluded grammar check

Summary: I wish to set left and right margin for my own language - areas of code in which grammar check is excluded.
Background: Using xtext, I am trying to create a nice COBOL editor. So far I have finished the grammar and run into a problem with margins and comments.
I can include the left margin within the grammar: after ‘\n’, up to 6 !‘\n’ chars.
That does not solve my problem, though. SLComment starts with '*' placed at the 7th position from the left. I would be able to catch that with a ‘\n’ '*' -> ‘\n’ rule, once I somehow exclude the first 6 chars in each line.
I can’t just leave it as ‘*’ -> ‘\n’ and delegate the position check to validation, because it messes up the multiply rule, which of course uses ‘*’. Placing the comment rule just after the grammar margin rule isn’t a solution either, since that way I can’t catch the margin within the first line of the code.
I also know that I won’t solve the right margin problem (excluding the area after position 78, for example) using grammar rules.
I guess there is a way to intervene in the text that xtext is checking, but I haven’t found a solution or a hint on how to accomplish this.
I also tried to find out whether this can be done through preprocessing somehow, but failed to find any hint on how to do so.
Or maybe it is possible to use two grammars at once, where the extra one gets each line and hides the margins?
I hope I was able to describe the problem I’m facing and what I have tried so far.
I guess you can use something like a LINE_START_COMMENT terminal rule that captures the first 6 characters of every line, and then hide it across the grammar.
terminal LINE_START_COMMENT:
// Up to 6 chars that are NOT a new line
'\n' (!'\n')? (!'\n')? (!'\n')? (!'\n')? (!'\n')? (!'\n')?
;
Then, using this approach I guess you should also modify your full line comment rule, to something like:
terminal FULL_LINE_COMMENT:
LINE_START_COMMENT '*'
(LINE_START_COMMENT '*' | !(LINE_START_COMMENT) )+
(LINE_START_COMMENT)?
;
Here I have defined the FULL_LINE_COMMENT to end after the following LINE_START_COMMENT. To check for several consecutive commented lines, you have to accept as a comment anything that's not the comment terminator (in this case the LINE_START_COMMENT) or the LINE_START_COMMENT with a new '*' after it.
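The preprocessing idea raised in the question can be sketched outside xtext as well. The Python below is not an xtext API, and how such a step would be wired into the editor is left open. The function name `mask_margins` and the default column numbers are assumptions: 6 for the left margin as in the question, and 72 for the right margin where the question mentions 78 as an example. Margins and comment lines are overwritten with spaces rather than removed, so that character offsets stay the same for error markers.

```python
def mask_margins(line, left=6, right=72):
    """Blank out the margins of one fixed-format line, preserving its length."""
    body = line.rstrip("\n")
    if body[left:left + 1] == "*":
        return " " * len(body)             # '*' in column 7: whole line is a comment
    left_pad = " " * min(left, len(body))  # columns 1..left
    right_pad = " " * max(0, len(body) - right)
    return left_pad + body[left:right] + right_pad


# Hypothetical sample lines; a '*' outside column 7 is kept for the multiply rule.
for src in ["000100* a comment line", "000200 COMPUTE X = A * B."]:
    print(repr(mask_margins(src)))
```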

Tex command which affects the next complete word

Is it possible to have a TeX command which will take the whole next word (or the next letters up to but not including the next punctuation symbol) as an argument and not only the next letter or {} group?
I’d like to have a \caps command on certain acronyms but don’t want to type curly brackets over and over.
First of all create your command, for example
\def\capsimpl#1{{\sc #1}}% Your main macro
The solution to catch a space or punctuation:
\catcode`\#=11
\def\addtopunct#1{\expandafter\let\csname punct#\meaning#1\endcsname\let}
\addtopunct{ }
\addtopunct{.} \addtopunct{,} \addtopunct{?}
\addtopunct{!} \addtopunct{;} \addtopunct{:}
\newtoks\capsarg
\def\caps{\capsarg{}\futurelet\punctlet\capsx}
\def\capsx{\expandafter\ifx\csname punct#\meaning\punctlet\endcsname\let
\expandafter\capsend
\else \expandafter\continuecaps\fi}
\def\capsend{\expandafter\capsimpl\expandafter{\the\capsarg}}
\def\continuecaps#1{\capsarg=\expandafter{\the\capsarg#1}\futurelet\punctlet\capsx}
\catcode`\#=12
@Debilski - I wrote something similar to your active * code for the acronyms in my thesis. I activated < and then used \def<#1> to print the acronym, as well as the expansion if it's the first time it's encountered. I also went a bit off the deep end by allowing expansions to be defined in-line and using the .aux files to send the expansions "back in time" if they're used before they're declared, or to report errors if an acronym is never declared.
Overall, it seemed like it would be a good idea at the time - I rarely needed < to be catcode 12 in my actual text (since all my macros were in a separate .sty file), and I made it behave in math mode, so I couldn't foresee any difficulties. But boy was it brittle... I don't know how many times I accidentally broke my build by changing something seemingly unrelated. So all that to say, be very careful activating characters that are even remotely commonly-used.
On the other hand, with XeTeX and higher Unicode characters, it's probably a lot safer, and there are generally easy ways to type these extra characters, such as setting up a multi (or compose) key (I usually map either numlock or one of the Windows keys to this), so that e.g. multi-!-! produces ¡. Or if you're running Emacs, you can use C-\ to switch into TeX input mode briefly to insert Unicode by typing the TeX command for it (though this is a pain for actually typing TeX documents, since it intercepts your actual \'s, and please please don't try defining your own escape character!).
Regarding whitespace after commands: see package xspace, and TeX FAQ item Commands gobble following space.
Now, why this is very difficult: as you noted yourself, it seems things like that can only be done by changing catcodes. Catcodes are assigned to characters when TeX reads them, and TeX reads one line at a time, so you cannot do anything with other spaces on the same line, IMHO. There might be a way around this, but I do not see it.
Dangerous code below!
This code will do what you want only at the end of the line, so if what you want is more "fluent" typing without brackets, but you are willing to hit 'return' after each acronym (and not run any auto-indent later), you can use this:
\def\caps{\begingroup\catcode`^^20 =11\mcaps}
\def\mcaps#1{\def\next##1 {\sc #1##1\catcode`^^20 =10\endgroup\ }\next}
One solution might be setting another character as active and using this one for escaping. This does not remove the need for a closing character but avoids typing the \caps macro, thus making it overall easier to type.
Therefore under very special circumstances, the following works.
\catcode`\*=\active
\def*#1*{\textsc{\MakeTextLowercase{#1}}}
Now follows an *Acronym*.
Unfortunately, this makes use of \section*{} impossible without additional macro definitions.
In XeTeX, it seems to be possible to exploit Unicode characters for this, so one could define
\catcode`\•=\active
\def•#1•{\textsc{\MakeTextLowercase{#1}}}
Now follows an •Acronym•.
Which should reduce the effects on other commands but of course needs to have the character ‘•’ mapped to the keyboard somewhere to be of use.

Resources