I'd like to write a parser for hashtags. I have been reading the blog
entries on parsing on the opa blog, but they didn't cover recursive
parsers and constructions of lists a lot.
Hashtags are used by some social networks (Twitter, Diaspora*)
to tag a post. They consist of a hash sign (#) and an alphanumeric
string such as "interesting" or "funny". One example of a post using
hashtags:
Oh #Opa, you're so #lovely! (Are you a relative of #Haskell?)
Parsing that would result in ["Opa", "lovely", "Haskell"].
I have tried to do it, but it doesn't quite what I want. (It could
either only parse one hashtag and nothing else, would fail in an endless
loop or fail because there was input it didn't understand...)
Additionally, here is a Haskell version that implements it.
To begin with a remark: by posing question in Haskell-terms you're effectively looking for somebody who knows Opa and Haskell hence decreasing chances of finding a person to answer the question ;). Ok, I'm saying it half jokingly as your comments help a lot but still I'd rather see the question phrased in plain English.
I think a solution keeping the structure of the Haskell one would be something like this:
parse_tags =
hashtag = parser "#" tag=Rule.alphanum_string -> tag
notag = parser (!"#" .)* -> void
Rule.parse_list_sep(true, hashtag, notag)
Probably the main 'trick' is to use the Rule.parse_list_sep function to parse a list. I suggest you take a look at the implementation of some functions in the Rule module to get inspiration and learn more about parsing in Opa.
Of course I suggest testing this function, for instance with the following code:
_ =
test(s) =
res =
match Parser.try_parse(parse_tags, s) with
| {none} -> "FAILURE"
| {some=tags} -> "{tags}"
println("Parsing '{s}' -> {res}")
do test("#123 #test #this-is-not-a-single-tag, #lastone")
do test("#how#about#this?")
void
which will give the following output:
Parsing '#123 #test #this-is-not-a-single-tag, #lastone' -> [123, test, this, lastone]
Parsing '#how#about#this?' -> FAILURE
I suspect that you will need to fine tune this solution to really conform to what you want but it should give you a good head start (I hope).
The following work, just using plain parsers
hashtag = parser "#" tag=Rule.alphanum_string -> tag
list_tag = parser
| l=((!"#" .)* hashtag -> hashtag)* .* -> l
parsetag(s) = Parser.parse(list_tag, s)
do println("{parsetag("")}")
do println("{parsetag("aaabbb")}")
do println("{parsetag(" #tag1 #tag2")}")
do println("{parsetag("#tag1 #tag2 ")}")
do println("{parsetag("#tag1#tag2 ")}")
Related
Could someone help me with using context free grammars. Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow so I was looking for a different more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, the help doesn't get me far as well. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First this will help: the ~ in Rascal CFG notation is not in the language, the negation of a character class is written like so: ![\n].
To use a context-free grammar in Rascal goes in three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#TopNonTerminal, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[TopNonTerminal], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / deepmatch to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree#\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree. It uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t#\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error methods like writing a regex, but even more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character" you will need to use the longest match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160
Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?
In the data-dependent version of the Rascal parser this is even easier and faster but we haven't released this yet. For now I'd write a generic rule with a post-parse filter, like so:
rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
ok
rascal>syntax LexiconWord = word: Word w;
ok
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is use to build a tree
>>>>>>> if ("<w>" notin lexicon)
>>>>>>> filter; // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
rascal>
Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like so to make it unambiguous.
Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.
A completely different answer would be to dynamically update the grammar and generate a parser from this. This involves working against the internal grammar representation of Rascal like so:
set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newTyp = type(sort("Word"), grammar);
This newType is a reified grammar + type for the definition of the lexicon, and which can now be used like so:
import ParseTree;
if (type[Word] staticGrammar := newType) {
parse(staticGrammar, "aap");
}
Now having written al this, two things:
I think this may trigger unknown bugs since we did not test dynamic parser generation, and
For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.
Is it possible to stem words without using Regex in F#?
I want to know how can I write a F# function which inputs a string and stems it.
eg.
input = "going"
output = "go"
I can't find a way to write the code without using the regex: .*ing\b and replace function which would be almost like doing in C# without any advantage.
Semi pseudo code of what I am trying to write is:
let stemming word =
match word
|(word-"ing")+ing -> (word-"ing")
A quick bit of googling reveals just how complex stemming is:
http://en.wikipedia.org/wiki/Stemming
The standard seems to be the "Porter Algorithm", it seems several people have ported it to .NET, I count two C# versions and a VB.net version on the "The Porter Stemming Algorithm" homepage:
http://tartarus.org/martin/PorterStemmer/
I would use one of these libraries from F# to do the stemming.
Here is a function applying the simplest stemming rule:
let (|Suffix|_|) (suffix: string) (s: string) =
if s.EndsWith(suffix) then
Some(s.Substring(0, s.Length - suffix.Length))
else
None
let stem = function
| Suffix "ing" s -> s
| _ -> failwith "Not ending with ing"
Parameterized active patterns makes pattern matching more readable and more convenient in this case. If stemming rules get complicated, you could update active patterns to keep the stem function unchanged.
Is there a good reason why the type of Prelude.read is
read :: Read a => String -> a
rather than returning a Maybe value?
read :: Read a => String -> Maybe a
Since the string might fail to be parseable Haskell, wouldn't the latter be be more natural?
Or even an Either String a, where Left would contain the original string if it didn't parse, and Right the result if it did?
Edit:
I'm not trying to get others to write a corresponding wrapper for me. Just seeking reassurance that it's safe to do so.
Edit: As of GHC 7.6, readMaybe is available in the Text.Read module in the base package, along with readEither: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Text-Read.html#v:readMaybe
Great question! The type of read itself isn't changing anytime soon because that would break lots of things. However, there should be a maybeRead function.
Why isn't there? The answer is "inertia". There was a discussion in '08 which got derailed by a discussion over "fail."
The good news is that folks were sufficiently convinced to start moving away from fail in the libraries. The bad news is that the proposal got lost in the shuffle. There should be such a function, although one is easy to write (and there are zillions of very similar versions floating around many codebases).
See also this discussion.
Personally, I use the version from the safe package.
Yeah, it would be handy with a read function that returns Maybe. You can make one yourself:
readMaybe :: (Read a) => String -> Maybe a
readMaybe s = case reads s of
[(x, "")] -> Just x
_ -> Nothing
Apart from inertia and/or changing insights, another reason might be that it's aesthetically pleasing to have a function that can act as a kind of inverse of show. That is, you want that read . show is the identity (for types which are an instance of Show and Read) and that show . read is the identity on the range of show (i.e. show . read . show == show)
Having a Maybe in the type of read breaks the symmetry with show :: a -> String.
As #augustss pointed out, you can make your own safe read function. However, his readMaybe isn't completely consistent with read, as it doesn't ignore whitespace at the end of a string. (I made this mistake once, I don't quite remember the context)
Looking at the definition of read in the Haskell 98 report, we can modify it to implement a readMaybe that is perfectly consistent with read, and this is not too inconvenient because all the functions it depends on are defined in the Prelude:
readMaybe :: (Read a) => String -> Maybe a
readMaybe s = case [x | (x,t) <- reads s, ("","") <- lex t] of
[x] -> Just x
_ -> Nothing
This function (called readMaybe) is now in the Haskell prelude! (As of the current base -- 4.6)
I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which convert the raw strings of character in to something that the program can more readily understand (commonly called tokens).
Simple example, lex would convert:
if (a + b > 2) then
In to:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
The parse takes that stream of tokens, and attempts to make yet more sense out of them. In this case, it would try and match up those tokens to an IF_STATEMENT. To the parse, the IF _STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above in to:
if_statement
|
v
boolean_expression.operator = GREATER_THAN
| |
| v
V numeric_constant.string="2"
expression.operator = PLUS_SIGN
| |
| v
v identifier.string = "b"
identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what a IF looks like, and know what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and the parse could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components, the operator (GREATER_THAN), and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right is points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC constant because we told it about strings of numbers. If it wasn't numbers, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decided for us, can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile in to code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assuming that we have an interpreter.
public void execute_if_statment(ParseTreeNode node) {
// We already know we have a IF_STATEMENT node
Value value = evaluate_expression(node.getBooleanExpression());
if (value.getBooleanResult() == true) {
// we do the "then" part of the code
}
}
public Value evaluate_expression(ParseTreeNode node) {
Value result = null;
if (node.isConstant()) {
result = evaluate_constant(node);
return result;
}
if (node.isIdentifier()) {
result = lookupIdentifier(node);
return result;
}
Value leftSide = evaluate_expression(node.getLeftSide());
Value rightSide = evaluate_expression(node.getRightSide());
if (node.getOperator() == '+') {
if (!leftSide.isNumber() || !rightSide.isNumber()) {
throw new RuntimeError("Must have numbers for adding");
}
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
int sum = l + r;
return new Value(sum);
}
if (node.getOperator() == '>') {
if (leftSide.getType() != rightSide.getType()) {
throw new RuntimeError("You can only compare values of the same type");
}
if (leftSide.isNumber()) {
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
boolean greater = l > r;
return new Value(greater);
} else {
// do string compare instead
}
}
}
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is the execute_if_statement will evaluate it's main expression. Even tho we wanted only BOOLEAN_EXPRESION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is ALSO an expression, so you can see how we immediately try and evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constants value, if it's an identifier, we look it up as a variable (and that would be a good place to throw a "I can't find the variable 'a'" message), otherwise we're back to the left side/right side thing.
I hope you can see how a simple evaluator can work once you have a token stream from a parser. Note how during evaluation, the major elements of the language are in place, otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have a, for example, PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parse is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find once you get rolling, the rest become pretty straightforward and it's almost "magic" that it all works in the end.
Note, pardon the formatting, but underscores are messing things up -- I hope it's still clear.
I would recommend antlr.org for information and the 'free' tool I would use for any parser use.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser- or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc(the UNIX standard programs for parsing).