How and what does Firefox (Hunspell) do to clean text before spellchecking words? - firefox-addon

I am trying to clean text in the exact way that Firefox does before spell checking individual words for a Firefox extension I'm building (my addon uses nspell, a JavaScript implementation of Hunspell, since Firefox doesn't expose the Hunspell instance it uses via the extension API).
I've looked at the Firefox gecko cloned codebase, i.e. in the mozSpellChecker.h file and other related files by searching for "spellcheck" but I cannot seem to find out how they are cleaning text.
Reverse engineering it has been a major PITA, I have this so far:
// cleans text and strips out unwanted symbols/patterns before we use it
// returns an empty string if content undefined
function cleanText (content, filter = true) {
if (!content) {
console.warn(`MultiDict: cannot clean falsy or undefined content: "${content}"`)
return ''
}
// ToDo: first split string by spaces in order to properly ignore urls
const rxUrls = /^(http|https|ftp|www)/
const rxSeparators = /[\s\r\n.,:;!?_<>{}()[\]"`´^$°§½¼³%&¬+=*~#|/\\]/
const rxSingleQuotes = /^'+|'+$/g
// split all content by any character that should not form part of a word
return content.split(rxSeparators)
.reduce((acc, string) => {
// remove any number of single quotes that do not form part of a word i.e. 'y'all' > y'all
string = string.replace(rxSingleQuotes, '')
// we never want empty strings, so skip them
if (string.length < 1) {
return acc
}
// for when we're just cleaning the text of punctuation (i.e. not filtering out emails, etc)
if (!filter) {
return acc.concat([string])
}
// filter out emails, URLs, numbers, and strings less than 2 characters in length
if (!string.includes('#') && !rxUrls.test(string) && isNaN(string) && string.length > 1) {
return acc.concat([string])
}
return acc
}, [])
}
But I'm still seeing big differences between content when testing things like - well - the text area used to create this question.
To be clear: I'm looking for the exact method(s) and matches and rules that Firefox uses to clean text, and since it's open source it should be somewhere, but I can't seem to find it!

I believe you want the functions in mozInlineSpellWordUtil.cpp.
From the header:
/**
* This class extracts text from the DOM and builds it into a single string.
* The string includes whitespace breaks whereever non-inline elements begin
* and end. This string is broken into "real words", following somewhat
* complex rules; for example substrings that look like URLs or
* email addresses are treated as single words, but otherwise many kinds of
* punctuation are treated as word separators. GetNextWord provides a way
* to iterate over these "real words".
*
* The basic operation is:
*
* 1. Call Init with the weak pointer to the editor that you're using.
* 2. Call SetPositionAndEnd to to initialize the current position inside the
* previously given range and set where you want to stop spellchecking.
* We'll stop at the word boundary after that. If SetEnd is not called,
* we'll stop at the end of the root element.
* 3. Call GetNextWord over and over until it returns false.
*/
You can find the complete source here, but it is fairly complex. For example, here is the method used to classify parts of the text as email addresses or urls, but it's over 50 lines long just to handle that.
Writing a spell checker seems trivial in principal, but as you can see from the source, it is a major endeavor. I'm not saying you shouldn't try, but as you've likely discovered, the devil is in the details of the edge cases.
Just as one example, when you're deciding what constitutes a word boundary or not, you have to decide which characters to ignore, including characters outside of the ASCII range. For example, here you can see the MONGOLIAN TODO SOFT HYPHEN being handled like the ASCII hyphen character:
// IsIgnorableCharacter
//
// These characters are ones that we should ignore in input.
inline bool IsIgnorableCharacter(char ch) {
return (ch == static_cast<char>(0xAD)); // SOFT HYPHEN
}
inline bool IsIgnorableCharacter(char16_t ch) {
return (ch == 0xAD || // SOFT HYPHEN
ch == 0x1806); // MONGOLIAN TODO SOFT HYPHEN
}
Again, I'm not trying to dissuade you from working on this project, but tokenizing text into discrete words in a way that will work within the context of HTML and in a multilingual environment, is a major endeavor.

Related

Flex scanning, differentiating between string (with single spaces) and padding (more than one space)

I am having trouble with flex to scan lines that looks something like this
DESCRIPTION This is the device description
I would like the line to be scanned such that DESCRIPTION is one token and "This is the device description" is the other.
I have been playing endlessly with my rules but cannot seem to get it to work.
From the documentation I think I want to implement a rule using
`r/s'
an r but only if it is followed by an s
where spaces are only accepted is they are followed by something that is not a while space. I have no idea how to write this rule with flex's syntax. In my mind the rule should be something like
[a-zA-Z](" "/[a-zA-Z0-9]|[a-zA-Z0-9])* return IDENTIFIER;
But this is invalid.
I can get the lines to chop up each word but I cannot get the rules to differentiate between 1 space and 1 < spaces. Halp.
This is not really a good match for flex, since the recognition of tokens is context-dependent. You can achieve context-dependent scanning using start conditions but excessive use of start conditions is often an indication that some other scanning mechanism would be better.
Regardless of how you do it, the key is figuring out exactly how to decide on the token division. Consider the following four lines, for example:
DEVICE This is the device
MODE This is the mode
DESCRIPTION This is the device description
UNDOCUMENTED FIELD
Of course, it is possible that the corner cases represented by the third and fourth lines never show up in any of your inputs.
If the first token cannot include whitespace, then the problem is relatively simple, although you still need a start condition (and I'm going to assume you read the documentation linked above):
%x WHITE WORDS
%%
/* Possibly should be [[:alpha:]] instead of [[:upper:]] */
[[:upper:]]+ { /* copy yytext */; BEGIN(WHITE); return KEYWORD; }
/* Handle other possible line beginnings */
<WHITE>\n { /* Blank descriptive text */; BEGIN(INITIAL); }
<WHITE>[ \t]+ { BEGIN(WORDS); }
<WHITE>. { /* Something not correct in this line */; ... }
<WORDS>.+ { /* copy yytext */; BEGIN(INITIAL); return DESCRIPTION; }
<WORDS>\n { BEGIN(INITIAL); }
If there might be whitespace in the first token but never two spaces in a row, you could replace the first pattern above with:
[[:alpha:]]+( [[:alpha:]]+)*
which will match any sequence of words (consisting only of letters) where there is exactly one space between successive words. Like the original pattern above, this will end on the first non-alphabetic character found. That error will be detected by the rules in <WHITE>, because any non-whitespace character encountered when that start condition becomes active will be handled by the start condition's default rule (the <WHITE>. rule).
My opinion is that you are using the wrong horse here. lex (flex) should be only used for lexical analysis and yacc (or bison) for syntactic one. Saying that one single character is not a separator but multiple are is not appropriate for a lexer.
My opinion is that lex should only reports words and padding and that yacc should later re-combine words that are not separated by padding elements.
The lex part would be as simple as:
[[:alnum:]_]+ {
// printf("WORD: >%s<\n", yytext); // for debugging
return WORD;
}
[[:blank:]]{2,} {
// printf("PADDING: >%s<\n", yytext);
return PADDING;
}
and the yacc part would contain:
elt: PADDING
| ident
ident: WORD
| ident WORD
action are omitted here because they depend too much on your actual processing.

Finding error in JavaCC parser/lexer code

I am writing a JavaCC parser/lexer which is meant to recognise all input strings in the following language L:
A string from L consists of several blocks separated by space characters.
At least one block must be present (i.e., no input consisting only of some number of white spaces is allowed).
A block is an odd-length sequence of lowercase letters (a-z).
No spaces are allowed before the first block or after the last one.
The number of spaces between blocks must be odd.
The I/O specifications include the following specification:
If the input does represent a string from L, then the word YES must be printed out to System.out, ending with the EOL character.
If the input is not in L, then only a single line with the word NO needs
to be printed out to System.out, also ending with the EOL character.
In addition, a brief error message should be printed out on System.err explaining the reason why the input is not in L.
Issue:
This is my current code:
PARSER_BEGIN(Assignment)
/** A parser which determines if user's input belongs to the langauge L. */
public class Assignment {
public static void main(String[] args) {
try {
Assignment parser = new Assignment(System.in);
parser.Input();
if(parser.Input()) {
System.out.println("YES"); // If the user's input belongs to L, print YES.
} else if(!(parser.Input())) {
System.out.println("NO");
System.out.println("Empty input");
}
} catch (ParseException e) {
System.out.println("NO"); // If the user's input does not belong to L, print NO.
}
}
}
PARSER_END(Assignment)
//** A token which matches any lowercase letter from the English alphabet. */
TOKEN :
{
< ID: (["a"-"z"]) >
}
//* A token which matches a single white space. */
TOKEN :
{
<WHITESPACE: " ">
}
/** This production is the basis for the construction of strings which belong to language L. */
boolean Input() :
{}
{
<ID>(<ID><ID>)* ((<WHITESPACE>(<WHITESPACE><WHITESPACE>)*)<ID>(<ID><ID>)*)* ("\n"|"\r") <EOF>
{
System.out.println("ABOUT TO RETURN TRUE");
return true;
}
|
{
System.out.println("ABOUT TO RETURN FALSE");
return false;
}
}
The issue that I am having is as follows:
I am trying to write code which will ensure that:
If the user's input is empty, then the text NO Empty input will be printed out.
If there is a parsing error because the input does not follow the description of L above, then only the text NO will be printed out.
At the moment, when I input the string "jjj jjj jjj", which, by definition, is in L (and I follow this with a carriage return and an EOF [CTRL + D]), the text NO Empty input is printed out.
I did not expect this to happen.
In an attempt to resolve the issue I wrote the ...TRUE and ...FALSE print statements in my production (see code above).
Interestingly enough, I found that when I inputted the same string of js, the terminal printed out the ...TRUE statement once, immediately followed by two occurrences of the ...FALSE statement.
Then the text NO Empty input was printed out, as before.
I have also used Google to try to find out if I am incorrectly using the OR symbol | in my production Input(), or if I am not using the return keyword properly, either. However, this has not helped.
Could I please have hint(s) for resolving this issue?
You're calling the Input method three times. The first time it will read from stdin until it reaches the end of the stream. This will successfully parse the input and return true. The other two times, the stream will be empty, so it will fail and return false.
You shouldn't call a rule multiple times unless you actually want it to be applied multiple times (which only makes sense if the rule only consumes part of the input rather than going until the end of the stream). Instead when you need the result in multiple places, just call the method once and store the result in a variable.
Or in your case you could just call it once in the if and no variable would even be needed:
Assignment parser = new Assignment(System.in);
if(parser.Input()) {
System.out.println("YES"); // If the user's input belongs to L, print YES.
} else {
System.out.println("NO");
System.out.println("Empty input");
}
When the input is jjj jjj jjj followed by a newline or carriage return (but not both), your main method invokes Parser.Input three times.
The first time, your parser consumes all the input and returns true.
The second and third times, all the input having already been consumed, the parser returns false.
Once the input is consumed, the lexer will just keep returning <EOF> tokens.

Regular Expressions in iOS [duplicate]

I'm creating a regexp for password validation to be used in a Java application as a configuration parameter.
The regexp is:
^.*(?=.{8,})(?=..*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=]).*$
The password policy is:
At least 8 chars
Contains at least one digit
Contains at least one lower alpha char and one upper alpha char
Contains at least one char within a set of special chars (##%$^ etc.)
Does not contain space, tab, etc.
I’m missing just point 5. I'm not able to have the regexp check for space, tab, carriage return, etc.
Could anyone help me?
Try this:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])(?=\S+$).{8,}$
Explanation:
^ # start-of-string
(?=.*[0-9]) # a digit must occur at least once
(?=.*[a-z]) # a lower case letter must occur at least once
(?=.*[A-Z]) # an upper case letter must occur at least once
(?=.*[##$%^&+=]) # a special character must occur at least once
(?=\S+$) # no whitespace allowed in the entire string
.{8,} # anything, at least eight places though
$ # end-of-string
It's easy to add, modify or remove individual rules, since every rule is an independent "module".
The (?=.*[xyz]) construct eats the entire string (.*) and backtracks to the first occurrence where [xyz] can match. It succeeds if [xyz] is found, it fails otherwise.
The alternative would be using a reluctant qualifier: (?=.*?[xyz]). For a password check, this will hardly make any difference, for much longer strings it could be the more efficient variant.
The most efficient variant (but hardest to read and maintain, therefore the most error-prone) would be (?=[^xyz]*[xyz]), of course. For a regex of this length and for this purpose, I would dis-recommend doing it that way, as it has no real benefits.
simple example using regex
public class passwordvalidation {
public static void main(String[] args) {
String passwd = "aaZZa44#";
String pattern = "(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])(?=\\S+$).{8,}";
System.out.println(passwd.matches(pattern));
}
}
Explanations:
(?=.*[0-9]) a digit must occur at least once
(?=.*[a-z]) a lower case letter must occur at least once
(?=.*[A-Z]) an upper case letter must occur at least once
(?=.*[##$%^&+=]) a special character must occur at least once
(?=\\S+$) no whitespace allowed in the entire string
.{8,} at least 8 characters
All the previously given answers use the same (correct) technique to use a separate lookahead for each requirement. But they contain a couple of inefficiencies and a potentially massive bug, depending on the back end that will actually use the password.
I'll start with the regex from the accepted answer:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])(?=\S+$).{8,}$
First of all, since Java supports \A and \z I prefer to use those to make sure the entire string is validated, independently of Pattern.MULTILINE. This doesn't affect performance, but avoids mistakes when regexes are recycled.
\A(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])(?=\S+$).{8,}\z
Checking that the password does not contain whitespace and checking its minimum length can be done in a single pass by using the all at once by putting variable quantifier {8,} on the shorthand \S that limits the allowed characters:
\A(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])\S{8,}\z
If the provided password does contain a space, all the checks will be done, only to have the final check fail on the space. This can be avoided by replacing all the dots with \S:
\A(?=\S*[0-9])(?=\S*[a-z])(?=\S*[A-Z])(?=\S*[##$%^&+=])\S{8,}\z
The dot should only be used if you really want to allow any character. Otherwise, use a (negated) character class to limit your regex to only those characters that are really permitted. Though it makes little difference in this case, not using the dot when something else is more appropriate is a very good habit. I see far too many cases of catastrophic backtracking because the developer was too lazy to use something more appropriate than the dot.
Since there's a good chance the initial tests will find an appropriate character in the first half of the password, a lazy quantifier can be more efficient:
\A(?=\S*?[0-9])(?=\S*?[a-z])(?=\S*?[A-Z])(?=\S*?[##$%^&+=])\S{8,}\z
But now for the really important issue: none of the answers mentions the fact that the original question seems to be written by somebody who thinks in ASCII. But in Java strings are Unicode. Are non-ASCII characters allowed in passwords? If they are, are only ASCII spaces disallowed, or should all Unicode whitespace be excluded.
By default \s matches only ASCII whitespace, so its inverse \S matches all Unicode characters (whitespace or not) and all non-whitespace ASCII characters. If Unicode characters are allowed but Unicode spaces are not, the UNICODE_CHARACTER_CLASS flag can be specified to make \S exclude Unicode whitespace. If Unicode characters are not allowed, then [\x21-\x7E] can be used instead of \S to match all ASCII characters that are not a space or a control character.
Which brings us to the next potential issue: do we want to allow control characters? The first step in writing a proper regex is to exactly specify what you want to match and what you don't. The only 100% technically correct answer is that the password specification in the question is ambiguous because it does not state whether certain ranges of characters like control characters or non-ASCII characters are permitted or not.
You should not use overly complex Regex (if you can avoid them) because they are
hard to read (at least for everyone but yourself)
hard to extend
hard to debug
Although there might be a small performance overhead in using many small regular expressions, the points above outweight it easily.
I would implement like this:
bool matchesPolicy(pwd) {
if (pwd.length < 8) return false;
if (not pwd =~ /[0-9]/) return false;
if (not pwd =~ /[a-z]/) return false;
if (not pwd =~ /[A-Z]/) return false;
if (not pwd =~ /[%#$^]/) return false;
if (pwd =~ /\s/) return false;
return true;
}
Thanks for all answers, based on all them but extending sphecial characters:
#SuppressWarnings({"regexp", "RegExpUnexpectedAnchor", "RegExpRedundantEscape"})
String PASSWORD_SPECIAL_CHARS = "##$%^`<>&+=\"!ºª·#~%&'¿¡€,:;*/+-.=_\\[\\]\\(\\)\\|\\_\\?\\\\";
int PASSWORD_MIN_SIZE = 8;
String PASSWORD_REGEXP = "^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[" + PASSWORD_SPECIAL_CHARS + "])(?=\\S+$).{"+PASSWORD_MIN_SIZE+",}$";
Unit tested:
Password Requirement :
Password should be at least eight (8) characters in length where the system can support it.
Passwords must include characters from at least two (2) of these groupings: alpha, numeric, and special characters.
^.*(?=.{8,})(?=.*\d)(?=.*[a-zA-Z])|(?=.{8,})(?=.*\d)(?=.*[!##$%^&])|(?=.{8,})(?=.*[a-zA-Z])(?=.*[!##$%^&]).*$
I tested it and it works
For anyone interested in minimum requirements for each type of character, I would suggest making the following extension over Tomalak's accepted answer:
^(?=(.*[0-9]){%d,})(?=(.*[a-z]){%d,})(?=(.*[A-Z]){%d,})(?=(.*[^0-9a-zA-Z]){%d,})(?=\S+$).{%d,}$
Notice that this is a formatting string and not the final regex pattern. Just substitute %d with the minimum required occurrences for: digits, lowercase, uppercase, non-digit/character, and entire password (respectively). Maximum occurrences are unlikely (unless you want a max of 0, effectively rejecting any such characters) but those could be easily added as well. Notice the extra grouping around each type so that the min/max constraints allow for non-consecutive matches. This worked wonders for a system where we could centrally configure how many of each type of character we required and then have the website as well as two different mobile platforms fetch that information in order to construct the regex pattern based on the above formatting string.
This one checks for every special character :
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=\S+$).*[A-Za-z0-9].{8,}$
Java Method ready for you, with parameters
Just copy and paste and set your desired parameters.
If you don't want a module, just comment it or add an "if" as done by me for special char
//______________________________________________________________________________
/**
* Validation Password */
//______________________________________________________________________________
private static boolean validation_Password(final String PASSWORD_Arg) {
boolean result = false;
try {
if (PASSWORD_Arg!=null) {
//_________________________
//Parameteres
final String MIN_LENGHT="8";
final String MAX_LENGHT="20";
final boolean SPECIAL_CHAR_NEEDED=true;
//_________________________
//Modules
final String ONE_DIGIT = "(?=.*[0-9])"; //(?=.*[0-9]) a digit must occur at least once
final String LOWER_CASE = "(?=.*[a-z])"; //(?=.*[a-z]) a lower case letter must occur at least once
final String UPPER_CASE = "(?=.*[A-Z])"; //(?=.*[A-Z]) an upper case letter must occur at least once
final String NO_SPACE = "(?=\\S+$)"; //(?=\\S+$) no whitespace allowed in the entire string
//final String MIN_CHAR = ".{" + MIN_LENGHT + ",}"; //.{8,} at least 8 characters
final String MIN_MAX_CHAR = ".{" + MIN_LENGHT + "," + MAX_LENGHT + "}"; //.{5,10} represents minimum of 5 characters and maximum of 10 characters
final String SPECIAL_CHAR;
if (SPECIAL_CHAR_NEEDED==true) SPECIAL_CHAR= "(?=.*[##$%^&+=])"; //(?=.*[##$%^&+=]) a special character must occur at least once
else SPECIAL_CHAR="";
//_________________________
//Pattern
//String pattern = "(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])(?=\\S+$).{8,}";
final String PATTERN = ONE_DIGIT + LOWER_CASE + UPPER_CASE + SPECIAL_CHAR + NO_SPACE + MIN_MAX_CHAR;
//_________________________
result = PASSWORD_Arg.matches(PATTERN);
//_________________________
}
} catch (Exception ex) {
result=false;
}
return result;
}
Also You Can Do like This.
public boolean isPasswordValid(String password) {
String regExpn =
"^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])(?=\\S+$).{8,}$";
CharSequence inputStr = password;
Pattern pattern = Pattern.compile(regExpn,Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(inputStr);
if(matcher.matches())
return true;
else
return false;
}
Use Passay library which is powerful api.
I think this can do it also (as a simpler mode):
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=])[^\s]{8,}$
[Regex Demo]
easy one
("^ (?=.* [0-9]) (?=.* [a-z]) (?=.* [A-Z]) (?=.* [\\W_])[\\S]{8,10}$")
(?= anything ) ->means positive looks forward in all input string and make sure for this condition is written .sample(?=.*[0-9])-> means ensure one digit number is written in the all string.if not written return false
.
(?! anything ) ->(vise versa) means negative looks forward if condition is written return false.
close meaning ^(condition)(condition)(condition)(condition)[\S]{8,10}$
String s=pwd;
int n=0;
for(int i=0;i<s.length();i++)
{
if((Character.isDigit(s.charAt(i))))
{
n=5;
break;
}
else
{
}
}
for(int i=0;i<s.length();i++)
{
if((Character.isLetter(s.charAt(i))))
{
n+=5;
break;
}
else
{
}
}
if(n==10)
{
out.print("Password format correct <b>Accepted</b><br>");
}
else
{
out.print("Password must be alphanumeric <b>Declined</b><br>");
}
Explanation:
First set the password as a string and create integer set o.
Then check the each and every char by for loop.
If it finds number in the string then the n add 5. Then jump to the
next for loop. Character.isDigit(s.charAt(i))
This loop check any alphabets placed in the string. If its find then
add one more 5 in n. Character.isLetter(s.charAt(i))
Now check the integer n by the way of if condition. If n=10 is true
given string is alphanumeric else its not.
Sample code block for strong password:
(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[^a-zA-Z0-9])(?=\\S+$).{6,18}
at least 6 digits
up to 18 digits
one number
one lowercase
one uppercase
can contain all special characters
RegEx is -
^(?:(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[##$%^&+=]).*)[^\s]{8,}$
at least 8 digits {8,}
at least one number (?=.*\d)
at least one lowercase (?=.*[a-z])
at least one uppercase (?=.*[A-Z])
at least one special character (?=.*[##$%^&+=])
No space [^\s]
A more general answer which accepts all the special characters including _ would be slightly different:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[\W|\_])(?=\S+$).{8,}$
The difference (?=.*[\W|\_]) translates to "at least one of all the special characters including the underscore".

Parsing RTF non-breaking space

I am making a simple parser from RTF to HTML.
I have the following raw RTF:
who\\~nursed\\~and
According to the RTF specification \~ is the keyword for a non-breaking space.
The end of a keyword is marked by a Delimiter which is defined as follows:
A space. This serves only to delimit a control word and is ignored in subsequent processing.
A numeric digit or an ASCII minus sign (-), which indicates that a numeric parameter is associated with the control word. The subsequent digital sequence is then delimited by any character other than an ASCII digit (commonly another control word that begins with a backslash). The parameter can be a positive or negative decimal number. The range of the values for the number is nominally –32768 through 32767, i.e., a signed 16-bit integer. A small number of control words take values in the range −2,147,483,648 to 2,147,483,647 (32-bit signed integer). These control words include \binN, \revdttmN, \rsidN related control words and some picture properties like \bliptagN. Here N stands for the numeric parameter. An RTF parser must allow for up to 10 digits optionally preceded by a minus sign. If the delimiter is a space, it is discarded, that is, it’s not included in subsequent processing.
Any character other than a letter or a digit. In this case, the delimiting character terminates the control word and is not part of the control word. Such as a backslash “\”, which means a new control word or a control symbol follows.
As i understand it, the highlighted part above, is the rule used in this particular instance. But if that is the case, then my parser would read until the ~ sign, and conclude that since this is not a letter or a digit, it is not part of the keyword.
This currently results in the following output:
who~nursed~and
I have the following code for reading a keyword:
public GetKeyword(index: number): KeywordSet {
var keywordarray: string[] = [];
var valuearray: string[] = [];
index++;
while (index < this.m_input.length) {
var remainint = this.m_input.substr(index);
//Keep going until we hit a delimiter
if (this.m_input[index] == " ") {
index++;
break;
} else if (this.IsNumber(this.m_input[index])) {
valuearray.push(this.m_input[index]);
} else if (this.IsDelimiter(this.m_input[index])) {
break;
} else keywordarray.push(this.m_input[index]);
index++;
}
var value: number = null;
if (valuearray.length > 0) value = parseInt(valuearray.join(""));
var keywordset = new KeywordSet(keywordarray.join(""), index, value);
return keywordset;
}
private IsDelimiter(char: string): boolean {
if (char == "*" || char == "'") return false;
return !this.IsLetterOrDigit(char);
}
When GetKeyword() reaches "~" it recognises it as a delimiter, and stops reading, resulting in an empty keyword as return value.
I do not have an AST constructed for this. Don't think it is necessary for this?
The quote in your question describes the syntax of an entity called control word but the \~ is actually a different entity called control symbol. Control symbols have a different syntax:
Control Symbol
A control symbol consists of a backslash followed by a single, non-alphabetical character. For example, \~ (backslash tilde) represents a non-breaking space. Control symbols do not have delimiters, i.e., a space following a control symbol is treated as text, not a delimiter.
See page 9 of Rich Text Format (RTF) Specification, version 1.9.1.

Is it legitimate for a tokenizer to have a stack?

I have designed a new language for that I want to write a reasonable lexer and parser.
For the sake of brevity, I have reduced this language to a minimum so that my questions are still open.
The language has implicit and explicit strings, arrays and objects. An implicit string is just a sequence of characters that does not contain <, {, [ or ]. An explicit string looks like <salt<text>salt> where salt is an arbitrary identifier (i.e. [a-zA-Z][a-zA-Z0-9]*) and text is an arbitrary sequence of characters that does not contain the salt.
An array starts with [, followed by objects and/or strings and ends with ].
All characters within an array that don't belong to an array, object or explicit string do belong to an implicit string and the length of each implicit string is maximal and greater than 0.
An object starts with { and ends with } and consists of properties. A property starts with an identifier, followed by a colon, then optional whitespaces and then either an explicit string, array or object.
So the string [ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ] represents an array with 6 items: " name:test ", "<html>[]</html>", " ", { name: "test" }, "bla" and " " (the object is notated in json).
As one can see, this language is not context free due to the explicit string (that I don't want to miss). However, the syntax tree is nonambiguous.
So my question is: Is a property a token that may be returned by a tokenizer? Or should the tokenizer return T_identifier, T_colon when he reads an object property?
The real language allows even prefixes in the identifier of a property, e.g. ns/name:<a<test>a> where ns is the prefix for a namespace.
Should the tokenizer return T_property_prefix("ns"), T_property_prefix_separator, T_property_name("name"), T_property_colon or just T_property("ns/name") or even T_identifier("ns"), T_slash, T_identifier("name"), T_colon?
If the tokenizer should recognize properties (which would be useful for syntax highlighters), he must have a stack, because name: is not a property if it is in an array. To decide whether bla: in [{foo:{bar:[test:baz]} bla:{}}] is a property or just an implicit string, the tokenizer must track when he enters and leave an object or array.
Thus, the tokenizer would not be a finite state machine any more.
Or does it make sense to have two tokenizers - the first, which separates whitespaces from alpha-numerical character sequences and special characters like : or [, the second, which uses the first to build more semantical tokens? The parser could then operate on top of the second tokenizer.
Anyways, the tokenizer must have an infinite lookahead to see when an explicit string ends. Or should the detection of the end of an explicit string happen inside the parser?
Or should I use a parser generator for my undertaking? Since my language is not context free, I don't think there is an appropriate parser generator.
Thanks in advance for your answers!
flex can be requested to provide a context stack, and many flex scanners use this feature. So, while it may not fit with a purist view of how a scanner scans, it is a perfectly acceptable and supported feature. See this chapter of the flex manual for details on how to have different lexical contexts (called "start conditions"); at the very end is a brief description of the context stack. (Don't miss the sentence which notes that you need %option stack to enable the stack.) [See Note 1]
Slightly trickier is the requirement to match strings with variable end markers. flex does not have any variable match feature, but it does allow you to read one character at a time from the scanner input, using the function input(). That's sufficient for your language (at least as described).
Here's a rough outline of a possible scanner:
%option stack
%x SC_OBJECT
%%
/* initial/array context only */
[^][{<]+ yylval = strdup(yytext); return STRING;
/* object context only */
<SC_OBJECT>{
[}] yy_pop_state(); return '}';
[[:alpha:]][[:alnum:]]* yylval = strdup(yytext); return ID;
[:/] return yytext[0];
[[:space:]]+ /* Ignore whitespace */
}
/* either context */
<*>{
[][] return yytext[0]; /* char class with [] */
[{] yy_push_state(SC_OBJECT); return '{';
"<"[[:alpha:]][[:alnum:]]*"<" {
/* We need to save a copy of the salt because yytext could
* be invalidated by input().
*/
char* salt = strdup(yytext);
char* saltend = salt + yyleng;
char* match = salt;
/* The string accumulator code is *not* intended
* to be a model for how to write string accumulators.
*/
yylval = NULL;
size_t length = 0;
/* change salt to what we're looking for */
*salt = *(saltend - 1) = '>';
while (match != saltend) {
int ch = input();
if (ch == EOF) {
yyerror("Unexpected EOF");
/* free the temps and do something */
}
if (ch == *match) ++match;
else if (ch == '>') match = salt + 1;
else match = salt;
/* Don't do this in real code */
yylval = realloc(yylval, ++length);
yylval[length - 1] = ch;
}
/* Get rid of the terminator */
yylval[length - yyleng] = 0;
free(salt);
return STRING;
}
. yyerror("Invalid character in object");
}
I didn't test that thoroughly, but here is what it looks like with your example input:
[ name:test <xml<<html>[]</html>>xml> {name:<b<test>b>}<b<bla>b> ]
Token: [
Token: STRING: -- name:test --
Token: STRING: --<html>[]</html>--
Token: STRING: -- --
Token: {
Token: ID: --name--
Token: :
Token: STRING: --test--
Token: }
Token: STRING: --bla--
Token: STRING: -- --
Token: ]
Notes
In your case, unless you wanted to avoid having a parser, you don't actually need a stack since the only thing that needs to be pushed onto the stack is an object context, and a stack with only one possible value can be replaced with a counter.
Consequently, you could just remove the %option stack and define a counter at the top of the scan. Instead of pushing the start condition, you increment the counter and set the start condition; instead of popping, you decrement the counter and reset the start condition if it drops to 0.
%%
/* Indented text before the first rule is inserted at the top of yylex */
int object_count = 0;
<*>[{] ++object_count; BEGIN(SC_OBJECT); return '{';
<SC_OBJECT[}] if (!--object_count) BEGIN(INITIAL); return '}'
Reading the input one character at a time is not the most efficient. Since in your case, a string terminate must start with >, it would probably be better to define a separate "explicit string" context, in which you recognized [^>]+ and [>]. The second of these would do the character-at-a-time match, as with the above code, but would terminate instead of looping if it found a non-matching character other than >. However, the simple code presented may turn out to be fast enough, and anyway it was just intended to be good enough to do a test run.
I think the traditional way to parse your language would be to have the tokenizer return T_identifier("ns"), T_slash, T_identifier("name"), T_colon for ns/name:
Anyway, I can see three reasonable ways you could implement support for your language:
Use lex/flex and yacc/bison. The tokenizers generated by lex/flex do not have stack so you should be using T_identifier and not T_context_specific_type. I didn't try the approach so I can't give a definite comment on whether your language could be parsed by lex/flex and yacc/bison. So, my comment is try it to see if it works. You may find information about the lexer hack useful: http://en.wikipedia.org/wiki/The_lexer_hack
Implement a hand-built recursive descent parser. Note that this can be easily built without separate lexer/parser stages. So, if the lexemes depend on context it is easy to handle when using this approach.
Implement your own parser generator which turns lexemes on and off based on the context of the parser. So, the lexer and the parser would be integrated together using this approach.
I once worked for a major network security vendor where deep packet inspection was performed by using approach (3), i.e. we had a custom parser generator. The reason for this is that approach (1) doesn't work for two reasons: firstly, data can't be fed to lex/flex and yacc/bison incrementally, and secondly, HTTP can't be parsed by using lex/flex and yacc/bison because the meaning of the string "HTTP" depends on its location, i.e. it could be a header value or the protocol specifier. The approach (2) didn't work because data can't be fed incrementally to recursive descent parsers.
I should add that if you want to have meaningful error messages, a recursive descent parser approach is heavily recommended. My understanding is that the current version of gcc uses a hand-built recursive descent parser.

Resources