How to generate different tokens from the same string using GNU Flex - flex-lexer

I have started learning GNU Flex from the Flex manual page. I am trying to use Flex to write a lexer that can tokenize documents and text.
From the manual I couldn't work out how a lexer can return several different token types from the same string.
For example, I have the string my_name#email.com. I want to get the following tokens:
email = name#email.com
name = my_name
domain = email.com
Is it possible using GNU Flex?
Update: My task is to tokenize a text string, which consists of various tokens (not only email addresses).
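One common way to do this with a lexer is to match the whole email with a single rule and then hand back the extra tokens from a queue on subsequent calls, since Flex's yylex() returns one token at a time. A minimal Python sketch of that token-queue idea (the token names and the `tokenize` helper are illustrative, not Flex API):

```python
import re
from collections import deque

# One pattern matches the whole lexeme; named groups carve out the parts.
EMAIL = re.compile(r"(?P<name>\w+)#(?P<domain>[\w.]+)")

def tokenize(text):
    """Yield (type, value) pairs. One match emits several tokens,
    mimicking a Flex rule that pushes extra tokens onto a queue."""
    queue = deque()
    for m in EMAIL.finditer(text):
        queue.append(("EMAIL", m.group(0)))
        queue.append(("NAME", m.group("name")))
        queue.append(("DOMAIN", m.group("domain")))
    while queue:
        yield queue.popleft()

tokens = list(tokenize("my_name#email.com"))
```

In real Flex code the queue would live in the lexer's user code section, and yylex() would drain it before scanning further input.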

Related

Difference between registerDocumentSemanticTokensProvider and setMonarchTokensProvider in Monaco Editor?

I am new to Monaco Editor, and I found on the official website that if you want to register your own semantic token highlighting, there are two ways: the native method registerDocumentSemanticTokensProvider, or setMonarchTokensProvider, provided by Monarch.
So I am wondering: is there any difference between these two methods, and in general, which one is better, or in other words, which one should I use to provide a language's semantic tokens?
The API setMonarchTokensProvider takes an interface which describes how to tokenize the input (much like what a lexer does in a usual parser/lexer setup, but in a declarative manner, using regular expressions).
Semantic tokens are a step above this, as they attach additional semantic meaning to a (lexer) token. As an example: a lexer (or that Monarch token provider) classifies input as tokens of type number, string, id, etc. The semantic tokens provider can then take ids and determine whether they actually represent classes, variables, etc.

Antlr: common token definitions

I'd like to define common token constants in a single central Antlr file. This way I can define several different lexers and parsers and mix and match them at runtime. If they all share a common set of token definitions, then they'll work fine.
In other words, I want to see public static final int WORD = 2; in each lexer, so they all agree that a "2" is a WORD.
I created a file named CommonTokenDefs.g4 and added a section like this:
tokens {
WORD, NUMBER
}
and included
options { tokenVocab = CommonTokenDefs; }
in each of my other .g4 files. It doesn't work. A .g4 file that includes the tokenVocab will assign a different constant int if it defines a token type, and worse, in its .tokens file it will include duplicate constants!
FOO=1
BAR=2
WORD=1
NUMBER=2
Doing an import CommonTokenDefs; doesn't work either, because if I define a token type in the lexer, and it's already in CommonTokenDefs then I get a "token name FOO is already defined" error.
How do I create a common vocabulary across lexers and parsers?
Importing a grammar means merging it. The imported grammar is not a separate instance; it enriches the grammar into which it is imported. And the importing grammar numbers its tokens based on what is defined in it (and then adds the tokens from the imported grammar).
The only solution I see here is to use a single lexer grammar in all your parsers, if that is possible. You can implement certain variations in your lexer by using different base lexers (ANTLR option: superClass), but that is of course limited and in particular doesn't allow you to add more tokens.
Update
Actually, there is a way to make it work as you want. In addition to the import statement (which is used to import grammars) there is the tokenVocab grammar option, which loads a *.tokens file containing assignments of numeric values to tokens. With such a token vocabulary you can predefine which value ANTLR should use for each token, and hence ensure that certain tokens always get the same numeric value. See the generated *.tokens files for the required format.
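For reference, a *.tokens file is just NAME=value lines, one per token. A hypothetical shared CommonTokenDefs.tokens pinning the values from the question might look like:

WORD=1
NUMBER=2
FOO=3
BAR=4

Each grammar then loads it via options { tokenVocab = CommonTokenDefs; } and uses the same numbering.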
I use *.tokens files to assign numeric values such that certain keywords are placed in a continuous value range, which allows for efficient checks later, like:
if (token >= KW_1 && token <= KW_100) ...
which wouldn't be possible if ANTLR freely assigned a value to each of the keywords.

How to match a sentence in Lua

I am trying to create a regex which attempts to match a sentence.
Here is a snippet.
local utf8 = require 'lua-utf8'
function matchsent(text)
local text = text
for sent in utf8.gmatch(text, "[^\r\n]+\.[\r\n ]") do
print(sent)
print('-----')
end
end
However, it does not work like it would in Python, for example. I know that Lua uses a different set of patterns and that its pattern-matching capabilities are limited, but why does the regex above give me a syntax error? And what would a sentence-matching pattern look like in Lua?
Note that Lua uses Lua patterns, which are not regular expressions, as they cannot match every regular language. They can hardly be used to split a text into sentences, since you'd need to account for various abbreviations, spacing, case, etc. To split a text into sentences you need an NLP package rather than one or two patterns, due to the complexity of the task.
Regarding
why does the regex above give me a syntax error?
you need to escape special symbols with a % symbol in Lua patterns. See an example code:
function matchsent(text)
    for sent in string.gmatch(text, '[^\r\n]+%.[\r\n ]') do
        print(sent)
        print("---")
    end
end

matchsent("Some text here.\nShow me")

Parsing words inside of Lex

I'm new to lex (or flex) and I have a probably simple question. I want to recognize when a user types in "show <name>", retrieve the name, and store it as a variable. Can I do this with some lex keywords or something? Or would just passing it to a method and splitting at the space be easiest?
Side note: the name could include spaces.
Flex is a tool that is used to create a lexical analyzer. The role of the lexical analyzer, be it generated by Flex or otherwise, is to split the input into tokens. That is, it takes the input stream of characters, s-h-o-w-space, and recognizes that it starts with the token show.
Other things, such as storing variable names and values, are better done elsewhere.
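To illustrate the split the answer describes, here is a plain Python stand-in for a pair of Flex rules (one recognizing the `show` keyword, one capturing the rest of the line as the name); the function name and token tags are made up for this sketch:

```python
def tokenize_command(line):
    """Split 'show <name>' into a keyword token and a name token.
    The name may contain spaces, so only the first space splits."""
    keyword, _, rest = line.partition(" ")
    tokens = [("KEYWORD", keyword)]
    if rest:
        tokens.append(("NAME", rest))
    return tokens

tokens = tokenize_command("show my variable")
```

In actual Flex you would return a KEYWORD token from the `show` rule and let a later rule (or the parser) collect the name, including any embedded spaces.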

Convert Token Numbers to Strings in ANTLR4

I'm trying to use ANTLR4 to build a sort of autocomplete system using the getExpectedTokens() function, which can be called when the parser encounters an error. getExpectedTokens() returns an IntervalSet containing the token numbers of all tokens acceptable at that point in the parse. Is there some mapping from the token numbers back to the actual tokens themselves? (So, for example, if one of the expected tokens is a keyword, that keyword can be displayed to the user in some way.)
These token names are accessible through the parser's vocabulary:
parser.getVocabulary().getLiteralName(token_num) will return the literal string for the token.
Using getSymbolicName() worked for me.
So you could do parser.getVocabulary().getSymbolicName(tokenType) where tokenType is an int.
