How to create a Recipe / Ingredient parser - parsing

Given a line such as
1 pound of Beef
I want to extract the ingredient. Initially im only interested in the ingredient name.
Ive looked at rubys famous time parser Chronic and like its use of regexs.
def self.scan_for_month_names(token)
scanner = {/^jan\.?(uary)?$/ => :january,
/^feb\.?(ruary)?$/ => :february,
/^mar\.?(ch)?$/ => :march,
/^apr\.?(il)?$/ => :april,
/^may$/ => :may,
/^jun\.?e?$/ => :june,
/^jul\.?y?$/ => :july,
/^aug\.?(ust)?$/ => :august,
/^sep\.?(tember)?$/ => :september,
/^oct\.?(ober)?$/ => :october,
/^nov\.?(ember)?$/ => :november,
/^dec\.?(ember)?$/ => :december}
scanner.keys.each do |scanner_item|
return Chronic::RepeaterMonthName.new(scanner[scanner_item]) if scanner_item =~ token.word
end
return nil
end
However in my case Id probably have to create over 300 regexs for each individual ingredient.
I'd also have to take into account of synonyms such as Cilantro & Corriander Leaf
Ive never done parsing before but is the use of regexs here still the best way to go. I cant think of any other reasonable alternative.

Firstly, I'm assuming that the ingredients don't always take the form of QUANTITY UNIT of INGREDIENT - otherwise, this would be a very trivial task (just copy the substring after of
This is a difficult problem - the solution will not be simple.
I think using regex may not be the best approach here:
As you mention, you'll have to write a lot of expressions for each
ingredient
Your list of possible ingredients will always be limited
by the regex list, and you can't detect new ingredients without
compiling more.
it will be very difficult to parse some ingredients(cheese, 1 pound (parmesan))
I think that natural language processing is the way to go here. You have unstructured input, but in a very restricted context.
Perhaps counter-intuitively, I think the best way to find the ingredient may very well be to not look for it - look for everything else instead. If you assume that a line will always have
a numeral (quantity)
a unit (pounds, teaspoons, etc)
a ingredient
and that it's pretty easy to detect numerals and units, it should be straightforward to recognize those first and then extract the ingredient.
If you use a part-of-speech tagger, it's easy to identify relevant words:
[('1', 'LS'), ('pound', 'NN'), ('of', 'IN'), ('Beef', 'NNP')]
From there, you may want to use a classifier. For that, you'll need to label the ingredients manually on a good quantity of lines (say, hundreds). Some possibly good features to use:
position of the word in the line
presence in a precomputed ingredient dictionary (possibly using some partial string matching metric like Levenshtein's
output of part-of-speech tagger
words immediately before and after (if you have an 'of' before the word, there's a high probability it's a ingredient
I'm sure you'll be able to find countless others after working on a few lines.
Finally, I expect that some lines will be very difficult to work on. 1 pound of parmesan cheese, 1 pound of emmentaler: you'd have to infer that the second ingredient is a cheese, too.
As to software, if you can choose the language to use, python has the fantastic Natural Language Toolkit. I can't vouch for toolkits in other languages, but maybe someone else will.

I think I would start by running a series of regex checks against each line, and adjust the parsed text as you go. For example (pseudocode):
First, check for instruction:
/^(add|fold in|stir in|etc...)/
If you found an instruction, remove it from the line, set a flag, and continue:
instruction = $1
this_line = this_line.substring(instruction.length())
If an instruction was found, check to see if there was a subsequent instruction (like "and cover" or "and set aside")
/\b(and\s)(.*)$/
If found, strip that and insert it before the next line of the recipe
instruction = instruction.substring(0, instuction.length - $1.length - $2.length)
splice $2 into the array of lines immediately following this one
Next, maybe you'll check for a preposition:
/((?in)to\s(.+)/
If found, you might use that to check for equipment names, bowls, measuring cups, etc.
Even if you don't use it, you can probably remove it from the string you're parsing, to improve your matching.
Finally, the real work is done with the text that's left:
Check against /^(\d+\s+(?a\s)?\w+)\s*(?of\s*)?(.+)$/
Which should give you $1 containing the unit of measure and $2 containing the ingredient.
Lather. Rinse. Repeat.
After that, do whatever magic your app does with this information.

First of all, I suggest doing some searching to see if someone else has already created a solution to this problem which is good enough for you to use, rather than reinventing the wheel.
For instance, you may find this project to be interesting. It uses machine learning to attempt to parse ingredient phrases, including type of ingredients and amounts.
Other interesting projects also come up when googling for "ingredient parser".
If you are really determined to write this yourself, then I suggest that you do some research into the category of software tools known as a "parser generator", which is a tool which will allow you to write the language you want to recognize in an abstract form (a "grammar"), and then will generate code in your language of choice which will parse text according to that grammar and will identify specific subconstructs within it it very efficiently (much more efficiently than could be done by hundreds of regular expression matches).
For instance, a grammar used as input to a parser generator might look something like this:
// I am making up the following syntax for demonstration purposes, but it illustrates the
// sort of things that one could specify in a grammar, and is not terribly different from
// the grammar languages that real parser generators use.
//
// Note that everything in the curly braces is code to be inserted into the generated parser.
// Each such code block will be invoked when the preceding parsing rule is matched.
%declare { bool organic=false; bool dried=false; bool smoked=false; }
INGREDIENT ::= "organic" INGREDIENT { organic=true; }
| INGREDIENT "(" "organic" ")" { organic=true; }
| "dried" INGREDIENT { dried=true; }
| "smoked" INGREDIENT { smoked=true; }
| AMOUNT "of" INGREDIENT
| INGREDIENT "(" AMOUNT ")"
| BASE_INGREDIENT
BASE_INGREDIENT ::= ( WORD )* {
doSomethingWithBaseIngredient(organic, dried, smoked, $BASE_INGREDIENT);
}
AMOUNT ::= NUMBER ( VOLUME_UNIT | WEIGHT_UNIT )
VOLUME_UNIT ::= "cup" | "liter"
WEIGHT_UNIT ::= "mg" | "kg" | "pound"
NUMBER ::= [0-9]+
WORD ::= [a-zA-Z]+
... and so forth.
The parser generator, when run, would take this grammar as input, and would generate code in your desired programming language as output. This code would parse input text according to the grammar and would also set variables and/or call functions of yours as desired when certain parsing rules are matched. The parsers generated by such tools often use special parsing techniques (often involving large tables, state machines, and so forth) to parse very efficiently in a single pass without having to do any more work than necessary, and avoiding backtracking when possible.
Some common examples of parser generators are lexx/yacc, bison, and Antlr. Many others exist. (Personally, I have gotten good results with Antlr in the past, and am particularly fond of the fact that it can generate parsers in many different programming languages.) Many of these parser generators are mostly intended for use by compiler writers, but that does not mean that they can't be used for other purposes, such as recognizing the various forms that ingredients in recipes take.
This article provides an overview of parser generators, and this article contains a table of various parser generators and their attributes (output languages, etc.) as well as links on where to find more.

Related

Extracting from .bib file with Raku (previously aka Perl 6)

I have this .bib file for reference management while writing my thesis in LaTeX:
#article{garg2017patch,
title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
journal={Journal of Cosmetic Dermatology},
year={2017},
publisher={Wiley Online Library}
}
#article{hauso2008neuroendocrine,
title={Neuroendocrine tumor epidemiology},
author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
journal={Cancer},
volume={113},
number={10},
pages={2655--2664},
year={2008},
publisher={Wiley Online Library}
}
#article{siperstein1997laparoscopic,
title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
journal={Surgery},
volume={122},
number={6},
pages={1147--1155},
year={1997},
publisher={Elsevier}
}
If anyone wants to know what bib file is, you can find it detailed here.
I'd like to parse this with Perl 6 to extract the key along with the title like this:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Can you please help me to do this, maybe in two ways:
Using basic Perl 6
Using a Perl 6 Grammar
TL;DR
A complete and detailed answer that does just exactly as #Suman asks.
An introductory general answer to "I want to parse X. Can anyone help?"
A one-liner in a shell
I'll start with terse code that's perfect for some scenarios[1], and which someone might write if they're familiar with shell and Raku basics and in a hurry:
> raku -e 'for slurp() ~~ m:g / "#article\{" (<-[,]>+) \, \s+
"title=\{" (<-[}]>+) \} / -> $/ { put "$0: $1\n" }' < derm.bib
This produces precisely the output you specified:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Same single statement, but in a script
Skipping shell escapes and adding:
Whitespace.
Comments.
► use tio.run to run the code below
for slurp() # "slurp" (read all of) stdin and then
~~ m :global # match it "globally" (all matches) against
/ '#article{' (<-[,]>+) ',' \s+ # a "nextgen regex" that uses (`(...)`) to
'title={' (<-[}]>+) '}' / # capture the article id and title and then
-> $/ { put "$0: $1\n" } # for each article, print "article id: title".
Don't worry if the above still seems like pure gobbledygook. Later sections explain the above while also introducing code that's more general, clean, and readable.[2]
Four statements instead of one
my \input = slurp;
my \pattern = rule { '#article{' ( <-[,]>+ ) ','
'title={' ( <-[}]>+ ) }
my \articles = input .match: pattern, :global;
for articles -> $/ { put "$0: $1\n" }
my declares a lexical variable. Raku supports sigils at the start of variable names. But it also allows devs to "slash them out" as I have done.
my \pattern ...
my \pattern = rule { '#article{' ( <-[,]>+ ) ','
'title={' ( <-[}]>+ ) }
I've switched the pattern syntax from / ... / in the original one-liner to rule { ... }. I did this to:
Eliminate the risk of pathological backtracking
Classic regexes risk pathological backtracking. That's fine if you can just kill a program that's gone wild, but click the link to read how bad it can get! 🤪 We don't need backtracking to match the .bib format.
Communicate that the pattern is a rule
If you write a good deal of pattern matching code, you'll frequently want to use rule { ... }. A rule eliminates any risk of the classic regex problem just described (pathological backtracking), and has another superpower. I'll cover both aspects below, after first introducing the adverbs corresponding to those superpowers.
Raku regexes/rules can be (often are) used with "adverbs". These are convenient shortcuts that modify how patterns are applied.
I've already used an adverb in the earlier versions of this code. The "global" adverb (specified using :global or its shorthand alias :g) directs the matching engine to consume all of the input, generating a list of as many matches as it contains, instead of returning just the first match.
While there are shorthand aliases for adverbs, some are used so repeatedly that it's a lot tidier to bundle them up into distinct rule declarators. That's why I've used rule. It bundles up two adverbs appropriate for matching many data formats like .bib files:
:ratchet (alias :r)
:sigspace (alias :s)
Ratcheting (:r / :ratchet) tells the compiler that when an "atom" (a sub-pattern in a rule that is treated as one unit) has matched, there can be no going back on that. If an atom further on in the pattern in the same rule fails, then the whole rule immediately fails.
This eliminates any risk of the "pathological backtracking" discussed earlier.
Significant space handling (:s / :sigspace) tells the compiler that an atom followed by literal spacing that is in the pattern indicates that a "token" boundary pattern, aka ws should be appended to the atom.
Thus this adverb deals with tokenizing. Did you spot that I'd dropped the \s+ from the pattern compared to the original one in the one-liner? That's because :sigspace, which use of rule implies, takes care of that automatically:
say 'x#y x # y' ~~ m:g:s /x\#y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g /x \# y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g:s /x \# y/; # (「x#y」 「x # y」) <-- two matches
You might wonder why I've reverted to using / ... / to show these two examples. Turns out that while you can use rule { ... } with the .match method (described in the next section), you can't use rule with m. No problem; I just used :s instead to get the desired effect. (I didn't bother to use :r for ratcheting because it makes no difference for this pattern/input.)
To round out this dive into the difference between classic regexes (which can also be written regex { ... }) and rule rules, let me mention the other main option: token. The token declarator implies the :ratchet adverb, but not the :sigspace one. So it also eliminates the pathological backtracking risk of a regex (or / ... /) but, just like a regex, and unlike a rule, a token ignores whitespace used by a dev in writing out the rule's pattern.
my \articles = input .match: pattern, :global
This line uses the method form (.match) of the m routine used in the one-liner solution.
The result of a match when :global is used is a list of Match objects rather than just one. In this case we'll get three, corresponding to the three articles in the input file.
for articles -> $/ { put "$0: $1\n" }
This for statement successively binds a Match object corresponding to each of the three articles in your sample file to the symbol $/ inside the code block ({ ... }).
Per Raku doc on $/, "$/ is the match variable, so it usually contains objects of type Match.". It also provides some other conveniences; we take advantage of one of these conveniences related to numbered captures:
The pattern that was matched earlier contained two pairs of parentheses;
The overall Match object ($/) provides access to these two Positional captures via Positional subscripting (postfix []), so within the for's block, $/[0] and $/[1] provide access to the two Positional captures for each article;
Raku aliases $0 to $/[0] (and so on) for convenience, so most devs use the shorter syntax.
Interlude
This would be a good time to take a break. Maybe just a cuppa, or return here another day.
The last part of this answer builds up and thoroughly explains a grammar-based approach. Reading it may provide further insight into the solutions above and will show how to extend Raku's parsing to more complex scenarios.
But first...
A "boring" practical approach
I want to parse this with Raku. Can anyone help?
Raku may make writing parsers less tedious than with other tools. But less tedious is still tedious. And Raku parsing is currently slow.
In most cases, the practical answer when you want to parse well known formats and/or really big files is to find and use an existing parser. This might mean not using Raku at all, or using an existing Raku module, or using an existing non-Raku parser in Raku.
A suggested starting point is to search for the file format on modules.raku.org or raku.land. Look for a publicly shared parsing module already specifically packaged for Raku for the given file format. Then do some simple testing to see if you have a good solution.
At the time of writing there are no matches for 'bib'.
Even if you don't know C, there's almost certainly a 'bib' parsing C library already available that you can use. And it's likely to be the fastest solution. It's typically surprisingly easy to use an external library in your own Raku code, even if it's written in another programming language.
Using C libs is done using a feature called NativeCall. The doc I just linked may well be too much or too little, but please feel free to visit the freenode IRC channel #raku and ask for help. (Or post an SO question.) We're friendly folk. :)
If a C lib isn't right for a particular use case, then you can probably still use packages written in some other language such as Perl, Python, Ruby, Lua, etc. via their respective Inline::* language adapters.
The steps are:
Install a package (that's written in Perl, Python or whatever);
Make sure it runs on your system using a compiler of the language it's written for;
Install the appropriate Inline language adapter that lets Raku run packages in that other language;
Use the "foreign" package as if it were a Raku package containing exported Raku functions, classes, objects, values, etc.
(At least, that's the theory. Again, if you need help, please pop on the IRC channel or post an SO question.)
The Perl adapter is the most mature so I'll use that as an example. Let's say you use Perl's Text::BibTex packages and now wish to use Raku with that package. First, setup it up as it's supposed to be per its README. Then, in Raku, write something like:
use Text::BibTeX::BibFormat:from<Perl5>;
...
#blocks = $entry.format;
Explanation of these two lines:
The first line is how you tell Raku that you wish to load a Perl module.
(It won't work unless Inline::Perl5 is already installed and working. But it should be if you're using a popular Raku bundle. And if not, you should at least have the module installer zef so you can run zef install Inline::Perl5.)
The last line is just a mechanical Raku translation of the #blocks = $entry->format; line from the SYNOPSIS of the Perl package Text::BibTeX::BibFormat.
A Raku grammar / parser
OK. Enough "boring" practical advice. Let's now try have some fun creating a grammar based Raku parser good enough for the example from your question.
► use glot.io to run the code below
unit grammar bib;
rule TOP { <article>* }
rule article { '#article{' $<id>=<-[,]>+ ','
<kv-pairs>
'}'
}
rule kv-pairs { <kv-pair>* % ',' }
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
With this grammar in place, we can now write something like:
die "Use CommaIDE?" unless bib .parsefile: 'derm.bib';
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }
to generate exactly the same output as the previous solutions.
When a match or parse fails, by default Raku just returns Nil, which is, well, rather terse feedback.
There are several nice debugging options to figure out what's going on with a regex or grammar, but the best option by far is to use CommaIDE's Grammar-Live-View.
If you haven't already installed and used Comma, you're missing one of the best parts of using Raku. The features built in to the free version of Comma ("Community Edition") include outstanding grammar development / tracing / debugging tools.
Explanation of the 'bib' grammar
unit grammar bib;
The unit declarator is used at the start of a source file to tell Raku that the rest of the file declares a named package of code of a particular type.
The grammar keyword specifies a grammar. A grammar is like a class, but contains named "rules" -- not just named methods, but also named regexs, tokens, and rules. A grammar also inherits a bunch of general purpose rules from a base grammar.
rule TOP {
Unless you specify otherwise, parsing methods (.parse and .parsefile) that are called on a grammar start by calling the grammar's rule named TOP (declared with a rule, token, regex, or method declarator).
As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't risk pathological backtracking.)
But I've used a rule. Like token patterns, rules also avoid the pathological backtracking risk. But, in addition rules interpret some whitespace in the pattern to be significant, in a natural manner. (See this SO answer for precise details.)
rules are typically appropriate towards the top of the parse tree. (Tokens are typically appropriate towards the leaves.)
rule TOP { <article>* }
The space at the end of the rule (between the * and pattern closing }) means the grammar will match any amount of whitespace at the end of the input.
<article> invokes another named rule in this grammar.
Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.
rule article { '#article{' $<id>=<-[,]>+ ','
<kv-pairs>
'}'
}
If you compare this article pattern with the ones I wrote for the earlier Raku rules based solutions, you'll see various changes:
Rule in original one-liner
Rule in this grammar
Kept pattern as simple as possible.
Introduced <kv-pairs> and closing }
No attempt to echo layout of your input.
Visually echoes your input.
<[...]> is the Raku syntax for a character class, like[...] in traditional regex syntax. It's more powerful, but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in the [^,] syntax of ye olde regex. So <-[,]>+ attempts a match of one or more characters, none of which are ,.
$<id>=<-[,]>+ tells Raku to attempt to match the quantified "atom" on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key <id> within the current match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.
rule kv-pairs { <kv-pair>* % ',' }
This pattern illustrates one of several convenient Raku regex features. It declares you want to match zero or more kv-pairs separated by commas.
(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be clearer.
I could have used the ~ approach in the /.../ regex in the one-liner, and vice-versa. But I wanted this grammar solution to continue the progression toward illustrating "better practice" idioms.
Constructing / deconstructing the parse tree
for $<article> { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }`
$<article>, $<id> etc. refer to named match objects that are stored somewhere in the "parse tree". But how did they get there? And exactly where is "there"?
Returning to the top of the grammar:
rule TOP {
If a .parse is successful, a single 'TOP' level match object is returned. (After a parse is complete the variable $/ is also bound to that top match object.) During parsing a tree will have been formed by hanging other match objects off this top match object, and then others hung off those, and so on.
Addition of match objects to a parse tree is done by adding either a single generated match object, or a list of them, to either a Positional (numbered) or Associative (named) capture of a "parent" match object. This process is explained below.
rule TOP { <article>* }
<article> invokes a match of the rule named article. An invocation of the rule <article> has two effects:
Raku tries to match the rule.
If it matches, Raku captures that match by generating a corresponding match object and adding it to the parse tree under the key <article> of the parent match object. (In this case the parent is the top match object.)
If the successfully matched pattern had been specified as just <article>, rather than as <article>*, then only one match would have been attempted, and only one value, a single match object, would have been generated and added under the key <article>.
But the pattern was <article>*, not merely <article>. So Raku attempts to match the article rule as many times as it can. If it matches at all, then a list of one or more match objects is stored as the value of the <article> key. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)
$<article> is short for $/<article>. It refers to the value stored under the <article> key of the current match object (which is stored in $/). In this case that value is a list of 3 match objects corresponding to the 3 articles in the input.
rule article { '#article{' $<id>=<-[,]>+ ','
Just as the top match object has several match objects hung off of it (the three captures of article matches that are stored under the top match object's <article> key), so too do each of those three article match objects have their own "child" match objects hanging off of them.
To see how that works, let's consider just the first of the three article match objects, the one corresponding to the text that starts "#article{garg2017patch,...". The article rule matches this article. As it's doing that matching, the $<id>=<-[,]>+ part tells Raku to store the match object corresponding to the id part of the article ("garg2017patch") under that article match object's <id> key.
Hopefully this is enough (quite possibly way too much!) and I can at last exhaustively (exhaustingly?) explain the last line of code, which, once again, was:
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }`
At the level of the for, the variable $/ refers to the top of the parse tree generated by the parse that just completed. Thus $<article>, which is shorthand for $/<article>, refers to the list of three article match objects.
The for then iterates over that list, binding $/ within the lexical scope of the -> $/ { ... } block to each of those 3 article match objects in turn.
The $<id> bit is shorthand for $/<id>, which inside the block refers to the <id> key within the article match object that $/ has been bound to. In other words, $<id> inside the block is equivalent to $<article><id> outside the block.
The $<kv-pairs><kv-pair>[0]<value> follows the same scheme, albeit with more levels and a positional child (the [0]) in the midst of all the key (named/ associative) children.
(Note that there was no need for the article pattern to include a $<kv-pairs>=<kv-pairs> because Raku just presumes a pattern of the form <foo> should store its results under the key <foo>. If you wish to disable that, write a pattern with a non-alpha character as the first symbol. For example, use <.foo> if you want to have exactly the same matching effect as <foo> but just not store the matched input in the parse tree.)
Phew!
When the automatically generated parse tree isn't what you want
As if all the above were not enough, I need to mention one more thing.
The parse tree strongly reflects the tree structure of the grammar's rules calling one another from the top rule down to leaf rules. But the resulting structure is sometimes inconvenient.
Often one still wants a tree, but a simpler one, or perhaps some non-tree data structure.
The primary mechanism for generating exactly what you want from a parse, when the automatic results aren't suitable, is make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)
In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree, such as an AST.
Footnotes
[1] Basic Raku is good for exploratory programming, spikes, one-offs, PoCs and other scenarios where the emphasis is on quickly producing working code that can be refactored later if need be.
[2] Raku's regexes/rules scale up to arbitrary parsing, as introduced in the latter half of this answer. This contrasts with past generations of regex which could not.[3]
[3] That said, ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ remains a great and relevant read. Not because Raku rules can't parse (X)HTML. In principle they can. But for a task as monumental as correctly handling full arbitrary in-the-wild XHTML I would strongly recommend you use an existing parser written expressly for that purpose. And this applies generally for existing formats; it's best not to reinvent the wheel. But the good news with Raku rules is that if you need to write a full parser, not just a bunch of regexes, you can do so, and it need not involve going insane!

Predictive editor for Rascal grammar

I'm trying to write a predictive editor for a grammar written in Rascal. The heart of this would be a function taking as input a list of symbols and returning as output a list of symbol types, such that an instance of any of those types would be a syntactically legal continuation of the input symbols under the grammar. So if the input list was [4,+] the output might be [integer]. Is there a clever way to do this in Rascal? I can think of imperative programming ways of doing it, but I suspect they don't take proper advantage of Rascal's power.
That's a pretty big question. Here's some lead to an answer but the full answer would be implementing it for you completely :-)
Reify an original grammar for the language you are interested in as a value using the # operator, so that you have a concise representation of the grammar which can be queried easily. The representation is defined over the modules Type, ParseTree which extends Type and Grammar.
Construct the same representation for the input query. This could be done in many ways. A kick-ass, language-parametric, way would be to extend Rascal's parser algorithm to return partial trees for partial input, but I believe this would be too much hassle now. An easier solution would entail writing a grammar for a set of partial inputs, i.e. the language grammar with at specific points shorter rules. The grammar will be ambiguous but that is not a problem in this case.
Use tags to tag the "short" rules so that you can find them easily later: syntax E = #short E "+";
Parse with the extended and now ambiguous grammar;
The resulting parse trees will contain the same representation as in ParseTree that you used to reify the original grammar, except in that one the rules are longer, as in prod(E, [E,+,E],...)
then select the trees which serve you best for the goal of completion (which use the #short tag), and extract their productions "prod", which look like this prod(E,[E,+],...). For example using the / operator: [candidate : /candidate:prod(_,_,/"short") := trees], and you could use a cursor position to find candidates which are close by instead of all short trees in there.
Use list matching to find prefixes in the original grammar, like if (/match:prod(_,[*prefix, predicted, *postfix],_) := grammar) ..., prefix is your query as extracted from the #short rules. predicted is your answer and postfix is whatever would come after.
yield the predicted symbol back as a type for the user to read: "<type(predicted, ())>" (will pretty print it nicely even if it's some complex regexp type and does the quoting right etc.)

Using ANTLR to analyze and modify source code; am I doing it wrong?

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
CallExpression
Identifier('foo')
ArgumentList
ArrayLiteral
StringLiteral('a')
StringLiteral('b')
StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming is parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do that this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling that "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems include typically source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit, you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
from callExpression in program.Children.OfType<CallExpression>()
where callExpression.Children[0].Text == "foo"
select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
tokenStream.Replace(
(array.Children[i] as StringLiteral).Token.TokenIndex,
i.ToString());
}
var rewrittenCode = tokenStream.ToString();
Have you looked at the string template library. It is by the same person who wrote ANTLR and they are intended to work together. It sounds like it would suit do what your looking for ie. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR

YACC|BISON :How do I manipulate parse tree?

The goal of my application is to validate an sql code and generate,in the mean time, from that code a formatted one with some modification.For example this where clause :
where e.student_name= c.contact_name and ( c.address = " nefta"
or c.address=" tozeur ") and e.age <18
we will have as formatted output something like that :
where e.student_name= c.contact_name and (c.address=trim("nefta")
or c.address=trim("tozeur") ) and e.age <18
I hope I've explained my aim well
The problem is grammars may contain recursive rules which make the rewrite task unreliable ; for instance in my sql grammar i have this :
search_condition : search_condition OR search_condition{clbck_or}
| search_condition AND search_condition{clbck_and}
| NOT search_condition {clbck_not}
| '(' search_condition ')'{clbck__}
| predicate {clbck_pre}
;
Knowing that I specified a precedence priority to solve shift reduce problems
%left OR
%left AND
%left NOT
So back on the last example ; my clause where will be consumed this way:
c.address="nefta"or c.address="tozeur" -> search_condition
(c.address="nefta"or c.address="tozeur")->search_condition
e.student_name= c.contact_name and (c.address="nefta"or c.address="tozeur")-> search_condition
... and e.age<18-> search_condition
You can by the way understand that it's tough to rebuild the input stream referring to callbacks triggered by each reduction cause the order is not the same.
Any help for this problem ?
Your question is a bit vague, so I'm guessing that you actually print in your clbck_or (), etc. The "common" way to which wildplasser has alluded is to use "semantic values", i. e. (untested):
search_condition : search_condition OR search_condition{$$ = clbck_or($1, $3);}
| search_condition AND search_condition{$$ = clbck_and($1, $3);}
| NOT search_condition {$$ = clbck_not($2);}
| '(' search_condition ')'{$$ = clbck__($2);}
| predicate {$$ = clbck_pre($1);}
;
If you're using Bison, the manual has a fine example in the section "Infix Notation Calculator: `calc'". With strings and C, you will have to add some memory handling.
Bison is good at parsing, and with some manual help, good at building a custom syntax tree. After that, its up to you to do what you want with the tree. The good news is you can do whatever you want. The bad news is you still have to build a lot of machinery to do what you want. Your basic problem of regenerating source code is called "prettyprinting"; see my SO answer on how to prettyprint to understand what it takes to do this, including all the peccadillos of lexical syntax (you don't to lose the escapes in your literal strings, right?). You didn't at all address how to find the construct you wanted to change in the tree, or how you'd smash the tree to change it.
If you don't want to do all of that, then what you really want is a program transformation system, which is good at parsing, building a syntax tree for you (so you don't have to think about it, SQL is pretty big grammar), will let you find patterns in the tree in terms of SQL syntax you are used to, make tree changes without knowing much about the shape of the tree, and can finally regenerate valid source text by prettyprint as I describe in my answer link above. (A program transformation systems essentially includes a parser as a subroutine).
Our DMS Software Reengineering Toolkit is such a program transformation system. It has a set of predefined language definitions including SQL2011 and means for configuring for a particular dialect.
Using DMS source-to-source syntax rules, you could carry out the change in your example with the following rule:
domain SQL;
rule trim_c_members(f: identifier, s: string):condition->condition
= " c.\f = \s " -> " c.\f = trim(\s) ";
This is DMS Rule language (meta) syntax to describe a rewrite on ("domain") SQL code.
The rule has a name (because in complex application there's lot of rules) and it
as syntactic place holders "f" and "s"; it rewrites only conditions in the code.
The quotes are RSL meta-quotes; stuff inside is SQL with RSL metavariables "\f"
and "\s"; stuff outside is RSL rule syntax. What the rule says is,
"for any condition on a variable explicitly named 'c', with any field f,
if that field is compared by equality to some literal string, then replace
the literal string by 'trim' applied to the literal string".
I left out some code that basically says, "apply this rule to the entire tree, and don't apply it twice in the same place". That "strategy" is one of many built into DMS.
There's the question of how does the rule work. that is accomplished by DMS applying the SQL parser to the meta-quoted strings, to produce "pattern" syntax trees with placeholders where the metavariables are written. The left hand side pattern tree is then matched against the target tree with placeholder referring to subtrees; the right hand tree is spliced in where the left tree matched, and the placeholder subtrees transferred. So, you the programmer see surface sytax that you know and love; the tool works with trees and so it isn't confused by text.
Now, I don't think my rule matches exactly your intent, but that's partly because I can't guess your actual intent. You can write other rules if this isn't what you wanted.
This rule is purely driven by syntax; one can add a semantic predicate (not shown) if you want more complicated conditions to apply to the rule (e.g, the variable has to be ones only in certain scopes you define), and that gets messier to say. But it is much simpler and far easier to read than C code that climbs over the AST (notice you didn't see the AST here?) and tries to figure all this out.
The parsing and prettyprinting happens before and after rule application; there's a lot of machinery required to implement all that, but that machinery is built into DMS (e.g., it has something like [but more powerful] than Bison built in), and for predefined domains such as SQL, all the pretty printing works has been preconfigured, too.
If you want to get a better sense of what it takes to go full cycle with DMS (define your own language parser, define a pretty printer, define complicated rules), here's a nice and complete example of defining and symbolically simplifying calculus using DMS.

What is parsing in terms that a new programmer would understand? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
I am a college student getting my Computer Science degree. A lot of my fellow students really haven't done a lot of programming. They've done their class assignments, but let's be honest here those questions don't really teach you how to program.
I have had several other students ask me questions about how to parse things, and I'm never quite sure how to explain it to them. Is it best to start just going line by line looking for substrings, or just give them the more complicated lecture about using proper lexical analysis, etc. to create tokens, use BNF, and all of that other stuff? They never quite understand it when I try to explain it.
What's the best approach to explain this without confusing them or discouraging them from actually trying.
I'd explain parsing as the process of turning some kind of data into another kind of data.
In practice, for me this is almost always turning a string, or binary data, into a data structure inside my Program.
For example, turning
":Nick!User#Host PRIVMSG #channel :Hello!"
into (C)
struct irc_line {
char *nick;
char *user;
char *host;
char *command;
char **arguments;
char *message;
} sample = { "Nick", "User", "Host", "PRIVMSG", { "#channel" }, "Hello!" }
Parsing is the process of analyzing text made of a sequence of tokens to determine its grammatical structure with respect to a given (more or less) formal grammar.
The parser then builds a data structure based on the tokens. This data structure can then be used by a compiler, interpreter or translator to create an executable program or library.
(source: wikimedia.org)
If I gave you an english sentence, and asked you to break down the sentence into its parts of speech (nouns, verbs, etc.), you would be parsing the sentence.
That's the simplest explanation of parsing I can think of.
That said, parsing is a non-trivial computational problem. You have to start with simple examples, and work your way up to the more complex.
What is parsing?
In computer science, parsing is the process of analysing text to determine if it belongs to a specific language or not (i.e. is syntactically valid for that language's grammar). It is an informal name for the syntactic analysis process.
For example, suppose the language a^n b^n (which means same number of characters A followed by the same number of characters B). A parser for that language would accept AABB input and reject the AAAB input. That is what a parser does.
In addition, during this process a data structure could be created for further processing. In my previous example, it could, for instance, to store the AA and BB in two separate stacks.
Anything that happens after it, like giving meaning to AA or BB, or transform it in something else, is not parsing. Giving meaning to parts of an input sequence of tokens is called semantic analysis.
What isn't parsing?
Parsing is not transform one thing into another. Transforming A into B, is, in essence, what a compiler does. Compiling takes several steps, parsing is only one of them.
Parsing is not extracting meaning from a text. That is semantic analysis, a step of the compiling process.
What is the simplest way to understand it?
I think the best way for understanding the parsing concept is to begin with the simpler concepts. The simplest one in language processing subject is the finite automaton. It is a formalism to parsing regular languages, such as regular expressions.
It is very simple, you have an input, a set of states and a set of transitions. Consider the following language built over the alphabet { A, B }, L = { w | w starts with 'AA' or 'BB' as substring }. The automaton below represents a possible parser for that language whose all valid words starts with 'AA' or 'BB'.
A-->(q1)--A-->(qf)
/
(q0)
\
B-->(q2)--B-->(qf)
It is a very simple parser for that language. You start at (q0), the initial state, then you read a symbol from the input, if it is A then you move to (q1) state, otherwise (it is a B, remember the remember the alphabet is only A and B) you move to (q2) state and so on. If you reach (qf) state, then the input was accepted.
As it is visual, you only need a pencil and a piece of paper to explain what a parser is to anyone, including a child. I think the simplicity is what makes the automata the most suitable way to teaching language processing concepts, such as parsing.
Finally, being a Computer Science student, you will study such concepts in-deep at theoretical computer science classes such as Formal Languages and Theory of Computation.
Have them try to write a program that can evaluate arbitrary simple arithmetic expressions. This is a simple problem to understand but as you start getting deeper into it a lot of basic parsing starts to make sense.
Parsing is about READING data in one format, so that you can use it to your needs.
I think you need to teach them to think like this. So, this is the simplest way I can think of to explain parsing for someone new to this concept.
Generally, we try to parse data one line at a time because generally it is easier for humans to think this way, dividing and conquering, and also easier to code.
We call field to every minimum undivisible data. Name is field, Age is another field, and Surname is another field. For example.
In a line, we can have various fields. In order to distinguish them, we can delimit fields by separators or by the maximum length assign to each field.
For example:
By separating fields by comma
Paul,20,Jones
Or by space (Name can have 20 letters max, age up to 3 digits, Jones up to 20 letters)
Paul 020Jones
Any of the before set of fields is called a record.
To separate between a delimited field record we need to delimit record. A dot will be enough (though you know you can apply CR/LF).
A list could be:
Michael,39,Jordan.Shaquille,40,O'neal.Lebron,24,James.
or with CR/LF
Michael,39,Jordan
Shaquille,40,O'neal
Lebron,24,James
You can say them to list 10 nba (or nlf) players they like. Then, they should type them according to a format. Then make a program to parse it and display each record. One group, can make list in a comma-separated format and a program to parse a list in a fixed size format, and viceversa.
Parsing to me is breaking down something into meaningful parts... using a definable or predefined known, common set of part "definitions".
For programming languages there would be keyword parts, usable punctuation sequences...
For pumpkin pie it might be something like the crust, filling and toppings.
For written languages there might be what a word is, a sentence, what a verb is...
For spoken languages it might be tone, volume, mood, implication, emotion, context
Syntax analysis (as well as common sense after all) would tell if what your are parsing is a pumpkinpie or a programming language. Does it have crust? well maybe it's pumpkin pudding or perhaps a spoken language !
One thing to note about parsing stuff is there are usually many ways to break things into parts.
For example you could break up a pumpkin pie by cutting it from the center to the edge or from the bottom to the top or with a scoop to get the filling out or by using a sledge hammer or eating it.
And how you parse things would determine if doing something with those parts will be easy or hard.
In the "computer languages" world, there are common ways to parse text source code. These common methods (algorithims) have titles or names. Search the Internet for common methods/names for ways to parse languages. Wikipedia can help in this regard.
In linguistics, to divide language into small components that can be analyzed. For example, parsing this sentence would involve dividing it into words and phrases and identifying the type of each component (e.g.,verb, adjective, or noun).
Parsing is a very important part of many computer science disciplines. For example, compilers must parse source code to be able to translate it into object code. Likewise, any application that processes complex commands must be able to parse the commands. This includes virtually all end-user applications.
Parsing is often divided into lexical analysis and semantic parsing. Lexical analysis concentrates on dividing strings into components, called tokens, based on punctuationand other keys. Semantic parsing then attempts to determine the meaning of the string.
http://www.webopedia.com/TERM/P/parse.html
Simple explanation: Parsing is breaking a block of data into smaller pieces (tokens) by following a set of rules (using delimiters for example),
so that this data could be processes piece by piece (managed, analysed, interpreted, transmitted, ets).
Examples: Many applications (like Spreadsheet programs) use CSV (Comma Separated Values) file format to import and export data. CSV format makes it possible for the applications to process this data with a help of a special parser.
Web browsers have special parsers for HTML and CSS files. JSON parsers exist. All special file formats must have some parsers designed specifically for them.

Resources