Is it possible to stem words without using Regex in F#?
I want to know how I can write an F# function that takes a string and stems it.
e.g.
input = "going"
output = "go"
I can't find a way to write this without using the regex .*ing\b and a replace function, which would be almost the same as doing it in C#, with no advantage from F#.
Semi-pseudocode of what I am trying to write:
let stemming word =
    match word
    | (word-"ing")+ing -> (word-"ing")
A quick bit of googling reveals just how complex stemming is:
http://en.wikipedia.org/wiki/Stemming
The standard seems to be the "Porter Algorithm", and several people have ported it to .NET: I count two C# versions and a VB.NET version on the "The Porter Stemming Algorithm" homepage:
http://tartarus.org/martin/PorterStemmer/
I would use one of these libraries from F# to do the stemming.
Here is a function applying the simplest stemming rule:
let (|Suffix|_|) (suffix: string) (s: string) =
    if s.EndsWith(suffix) then
        Some(s.Substring(0, s.Length - suffix.Length))
    else
        None
let stem = function
    | Suffix "ing" s -> s
    | _ -> failwith "Not ending with ing"
Parameterized active patterns make pattern matching more readable and more convenient in this case. If the stemming rules get complicated, you can update the active patterns and keep the stem function unchanged.
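As a rough sketch of how the active pattern lets rules be added without reshaping stem, here is a version with two extra suffix rules. The rule set, the length guards, the ordinal comparison, and the choice to return unmatched words unchanged are all assumptions made for illustration; this is not the Porter algorithm.

```fsharp
open System

// Partial active pattern from the answer above. The ordinal comparison is an
// added assumption so that matching does not depend on the current culture.
let (|Suffix|_|) (suffix: string) (s: string) =
    if s.EndsWith(suffix, StringComparison.Ordinal) then
        Some(s.Substring(0, s.Length - suffix.Length))
    else
        None

// Hypothetical rule set: strip one suffix, but only when more than one
// character would remain (so "sing" is not reduced to "s").
let stem word =
    match word with
    | Suffix "ing" s when s.Length > 1 -> s
    | Suffix "ed" s when s.Length > 1 -> s
    | Suffix "s" s when s.Length > 1 -> s
    | _ -> word // leave unrecognized words unchanged instead of throwing

printfn "%s" (stem "going")  // go
printfn "%s" (stem "walked") // walk
printfn "%s" (stem "sing")   // sing
```

Returning the word unchanged rather than calling failwith is a design choice that makes the function total, which is usually more convenient when mapping it over a list of words.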
The overall type structure in my current F# code is working very well. However, I want some perspective on whether I am doing something incorrectly or following some kind of anti-pattern. I very often find that a particular piece of logic expects one specific type, but has to pull it out of a more general type: a discriminated union that unifies a bunch of distinct types which all go through common layers of processing.
Essentially I need particular versions of this function:
'GeneralDiscriminatedUnionType -> 'SpecificCaseType
I find myself repeating many statements like the following:
let checkPromptUpdated (PromptUpdated prompt) = prompt
This is the simplest way that I've found to do this; however, every one of these produces a compiler warning, which makes sense: there would be a problem if the function were called with a different case than the expected one. This is fair, but so far I have 40 to 50 of these.
So I started trying out the following, which is actually better because it raises a proper exception on incorrect usage (the two definitions are equivalent):
let checkPromptUpdated input = match input with | PromptUpdated prompt -> prompt | _ -> invalidOp "Expecting Prompt"
let checkPromptUpdated = function | PromptUpdated prompt -> prompt | _ -> invalidOp "Expecting Prompt"
However, this looks a lot messier and I'm trying to find out if anyone has any suggestions prior to me doing this messiness all over.
Is there some way to apply this wider logic to a more general function that could then allow me to write this 50 to 100x in a cleaner and more direct and readable way?
This question is just a matter of trying to write cleaner code.
This is an example of a DU that I'm trying to write functions for to be able to pull the particular typed values from the cases:
type StateEvent =
    | PromptUpdated of Prompt
    | CorrectAnswerUpdated of CorrectAnswer
    | DifficultyUpdated of Difficulty
    | TagsUpdated of Tag list
    | NotesUpdated of Notes
    | AuthorUpdated of Author
If the checkPromptUpdated function only works on events that are of the PromptUpdated case, then I think the best design is that the function should be taking just a value of type Prompt (instead of a value of type StateEvent) as an argument:
let checkPromptUpdated prompt =
    // do whatever checks you need using 'prompt'
Of course, this means that the pattern matching will get moved from this function to a function that calls it - or further - to a place where you actually receive StateEvent and need to handle all the other cases too. But that is exactly what you want - once you pattern match on the event, you can work with the more specific types like Prompt.
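A minimal sketch of that design, assuming made-up payload types (the question does not show how Prompt or Difficulty are defined) and made-up validation rules. The union is matched once, in a function that handles every case, and the specific checks only ever see the specific types:

```fsharp
open System

// Hypothetical payload types, defined here only so the example compiles.
type Prompt = Prompt of string
type Difficulty = Difficulty of int

// Reduced version of the union from the question.
type StateEvent =
    | PromptUpdated of Prompt
    | DifficultyUpdated of Difficulty

// These take the specific type, so no wildcard case is needed and the
// compiler has nothing to warn about. The rules themselves are invented.
let checkPrompt (Prompt text) = not (String.IsNullOrWhiteSpace text)
let checkDifficulty (Difficulty level) = level >= 1 && level <= 5

// The only place that matches on StateEvent; every case is handled.
let isValid event =
    match event with
    | PromptUpdated prompt -> checkPrompt prompt
    | DifficultyUpdated difficulty -> checkDifficulty difficulty

printfn "%b" (isValid (PromptUpdated (Prompt "What is 2 + 2?"))) // true
printfn "%b" (isValid (DifficultyUpdated (Difficulty 9)))        // false
```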
This works for me
let (TypeUWantToExtractFrom unwrappedValue) = wrappedValue
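Note that a let pattern like this is still an incomplete match: the compiler reports the same warning, and the binding throws a MatchFailureException at run time if the value holds a different case. If extraction from the general type really is needed, one alternative is to return an option. The sketch below assumes a cut-down union with string payloads, and tryPrompt is a hypothetical name:

```fsharp
// Reduced, hypothetical version of the union with plain string payloads.
type StateEvent =
    | PromptUpdated of string
    | NotesUpdated of string

// Total function: no compiler warning and no exception for other cases.
let tryPrompt event =
    match event with
    | PromptUpdated prompt -> Some prompt
    | _ -> None

// The caller decides what a missing prompt means.
let promptOrEmpty event =
    tryPrompt event |> Option.defaultValue ""

printfn "%A" (tryPrompt (PromptUpdated "Capital of France?")) // Some "Capital of France?"
printfn "%A" (tryPrompt (NotesUpdated "check spelling"))      // None
```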
Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow, so I am looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a Java file with Rascal?
I have no idea how to use this, and the help doesn't get me far either. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: ~ is not part of Rascal's grammar notation; the negation of a character class is written like so: ![\n].
Using a context-free grammar in Rascal takes three steps:
First, write it, for example like the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Second, use it to parse input, like so:
// This is the basic parse command, but be careful: it will not accept spaces and newlines before or after the program text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Third, once you have the tree, you can use visit and the / deep match to extract information from it, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms for extracting information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect the locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree); this uses |unknown:///| for small sub-trees, like literals and character classes, which have not been annotated for efficiency's sake:
[ t@\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd try out some things and look at more examples. Writing a grammar is a nice thing to do, but like writing a regex it takes some trial and error, only more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character" you will need to use the longest match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160
Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?
In the data-dependent version of the Rascal parser this is even easier and faster but we haven't released this yet. For now I'd write a generic rule with a post-parse filter, like so:
rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
ok
rascal>syntax LexiconWord = word: Word w;
ok
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is used to build a tree
>>>>>>> if ("<w>" notin lexicon)
>>>>>>> filter; // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
rascal>
Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like so to make it unambiguous.
Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.
A completely different answer would be to dynamically update the grammar and generate a parser from this. This involves working against the internal grammar representation of Rascal like so:
set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newType = type(sort("Word"), grammar);
This newType is a reified grammar + type for the definition of the lexicon, which can now be used like so:
import ParseTree;
if (type[Word] staticGrammar := newType) {
    parse(staticGrammar, "aap");
}
Now, having written all this, two things:
I think this may trigger unknown bugs since we did not test dynamic parser generation, and
For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.
I'd like to write a parser for hashtags. I have been reading the blog
entries on parsing on the opa blog, but they didn't cover recursive
parsers and constructions of lists a lot.
Hashtags are used by some social networks (Twitter, Diaspora*)
to tag a post. They consist of a hash sign (#) and an alphanumeric
string such as "interesting" or "funny". One example of a post using
hashtags:
Oh #Opa, you're so #lovely! (Are you a relative of #Haskell?)
Parsing that would result in ["Opa", "lovely", "Haskell"].
I have tried to do it, but it doesn't quite do what I want. (It would
either parse only one hashtag and nothing else, get stuck in an endless
loop, or fail because there was input it didn't understand...)
Additionally, here is a Haskell version that implements it.
To begin with a remark: by posing the question in Haskell terms you're effectively looking for somebody who knows both Opa and Haskell, hence decreasing the chances of finding a person to answer it ;). OK, I'm saying that half jokingly, as your comments help a lot, but I'd still rather see the question phrased in plain English.
I think a solution keeping the structure of the Haskell one would be something like this:
parse_tags =
  hashtag = parser "#" tag=Rule.alphanum_string -> tag
  notag = parser (!"#" .)* -> void
  Rule.parse_list_sep(true, hashtag, notag)
Probably the main 'trick' is to use the Rule.parse_list_sep function to parse a list. I suggest you take a look at the implementation of some functions in the Rule module to get inspiration and learn more about parsing in Opa.
Of course I suggest testing this function, for instance with the following code:
_ =
  test(s) =
    res =
      match Parser.try_parse(parse_tags, s) with
      | {none} -> "FAILURE"
      | {some=tags} -> "{tags}"
    println("Parsing '{s}' -> {res}")
  do test("#123 #test #this-is-not-a-single-tag, #lastone")
  do test("#how#about#this?")
  void
which will give the following output:
Parsing '#123 #test #this-is-not-a-single-tag, #lastone' -> [123, test, this, lastone]
Parsing '#how#about#this?' -> FAILURE
I suspect that you will need to fine tune this solution to really conform to what you want but it should give you a good head start (I hope).
The following works, using just plain parsers:
hashtag = parser "#" tag=Rule.alphanum_string -> tag
list_tag = parser
  | l=((!"#" .)* hashtag -> hashtag)* .* -> l
parsetag(s) = Parser.parse(list_tag, s)
do println("{parsetag("")}")
do println("{parsetag("aaabbb")}")
do println("{parsetag(" #tag1 #tag2")}")
do println("{parsetag("#tag1 #tag2 ")}")
do println("{parsetag("#tag1#tag2 ")}")
I'm getting stymied by the way "dot notation" works with objects and records when trying to program in a point-free functional style (which I think is a great, concise way to use a functional language that curries by default).
Is there an operator or function I'm missing that lets me do something like:
(.) object method instead of object.method?
(From what I was reading about the new ? operator, I think it works like this. Except it requires definition and gets into the whole dynamic binding thing, which I don't think I need.)
In other words, can I apply a method to its object as an argument like I would apply a normal function to its argument?
Short answer: no.
Longer answer: you can of course create let-bound functions in a module that call a method on a given type... For example in the code
let l = [1;2;3]
let h1 = l.Head
let h2 = List.head l
there is a sense in which "List.head" (called "List.hd" in early F# releases) is the version of what you want for ".Head on a list". Or locally, you can always do e.g.
let AnotherWay = (fun (l:list<_>) -> l.Head)
let h3 = AnotherWay l
But there is nothing general, since there is no good way to 'name' an arbitrary instance method on a given type; 'AnotherWay' shows a way to "make a function out of the 'Head' property on a 'list<_>' object", but you need such boilerplate for every instance method you want to treat as a first-class function value.
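To illustrate that boilerplate and what it buys, here is a small sketch. The helper names head, toUpper and length are hypothetical; each one wraps a single instance member in a let-bound function so that it can be composed point-free:

```fsharp
// Hypothetical wrappers: one let-bound function per instance member.
let head (l: list<'a>) = l.Head
let toUpper (s: string) = s.ToUpperInvariant()
let length (s: string) = s.Length

// Once wrapped, the members compose like any other function.
let firstWordLength = head >> toUpper >> length

printfn "%d" (firstWordLength ["hello"; "world"])                  // 5
printfn "%A" (List.map (toUpper >> length) ["ab"; "cde"])          // [2; 3]
```

More recent compilers (F# 8 and later) also accept the shorthand _.Head for fun x -> x.Head, which covers much of what the proposal below describes.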
I have suggested creating a language construct to generalize this:
With regards to language design suggestions, what if
SomeType..Foo optArgs // note *two* dots
meant
fun (x : SomeType) -> x.Foo optArgs
?
In which case you could write
list<_>..Head
as a way to 'functionize' this instance property, but if we ever do anything in that arena in F#, it would be post-VS2010.
If I understand your question correctly, the answer is: no, you can't. Dot (.) is not an operator in F#; it is built into the language, so it can't be used as a function.