Java CUP and JFlex Interaction - parsing

I am considering to use the CUP parser generator for a project. In order to correctly parse some constructs of the language I am going to be compiling, I will need the lexer (generated by JFlex) to use information from the symbol table (not parse table -- I mean the table in which I will be storing information about identifiers) of the parser to generate the correct token type when its next_token() method is invoked. Since information in the symbol table depends statically on the program text, this will only work if the next_token() method is invoked "in lockstep" with the parser. In other words, this will work if the parser calls the lexer whenever it needs another token, but not if (for example) there is a parellel thread that is invoking the lexer and buffering tokens in a queue.
The question is thus: How does CUP call the lexer? Does it call it whenever it needs the next token? I could of course just write a CUP grammar specification and inspect the generated parser's source file to see what's going on, but that may be more work than necessary. I couldn't find any information on this on relevant websites.
Thanks a lot for any help you can offer!

I finished implementing my parser and scanner a while ago. Here's what I found:
CUP does indeed invoke the scanner as and when needed. It has always buffered one more token ahead of what has been recognized so far (the lookahead token). There is no fancy buffering of tokens ahead of time.
That being said, it can be tricky to set lexer states during parsing, as this can give rise to many grammar conflicts. I guess this is to do with the way CUP represents semantic actions embedded within productions. This forced me to abandon my initial design nonetheless, but not for the reason I was dreading.
Hope this helps someone!

Maybe this reply could be too late for you, but it could be useful for other users. The first thing to know is that a Parser couldn't do anything without a Scanner. As a matter of fact, the first parameter of the constructor of the parser is the scanner.
After the compilation of the .cup file, you will have, as output, a .java file that has the same name of the .cup one. Let's suppose its name is Parser.
So in the main class of your project you have to add the following lines:
TmpParser p = new TmpParser (new Scanner (new Reader (s)));
p.parse();
You should post this code into a try-catch block. With the method parse, the Parser starts its action and also it calls the next_token method of the Scanner, in order to recognize the token and verify if the grammar rules you wrote are right or not.

I don't know how late I'm to answer this question,
But I'm building 1 parser as a part of my course work..
I'm Using Lex and CUP for lexer and Parser, respectively. I'm also including my main class which calls parser which scans as in when required on get Token call
So My driver class will be :
// construct the lexer,
Yylex lexer = new Yylex(new FileReader(filename));
// create the parser
Parser parser = new Parser(lexer);
// and parse
Parser intern calls:
Parser.parse() {
...
this.cur_token = this.scan();
...
}
public Symbol scan() throws Exception {
Symbol sym = this.getScanner().next_token();
return sym != null ? sym : this.getSymbolFactory().newSymbol("END_OF_FILE", this.EOF_sym());
}
parser.parse();

Related

Identifier terminal except certain keywords

I'm using Irony framework and I have:
IdentifierTerminal variable = new IdentifierTerminal("variable");
a terminal for identifying an entry terminal.
This variable terminal can hold any string, except for a predefined list of reserved strings.
This identifier does not start or any with quotes or double quotes.
I want something like:
IdentifierTerminal variable = any contiguos string EXCEPT "event", "delegate";
How can I enforce this rule for this terminal?
Have you declared your keywords explicitly? If not, this page: https://en.wikibooks.org/wiki/Irony_-_Language_Implementation_Kit/Grammar/Terminals#Keywords will show you how. It seems that you don't need to explicitly say that an identifier cannot be a keyword, as the parser is able to figure this out. I found the following quote
Finally, in most cases, the Irony scanner does not need to distinguish between keywords and identifiers. It simply tokenizes all alpha-numeric words as identifiers, leaving it to the parser to differentiate them from each other. Some LEX-based solutions try to put too much responsibility on the scanner and make it recognize not only the token itself, but also its role in the surrounding context. This, in turn, requires extensive look-aheads, making the scanner code quite complex. In the author's opinion, identifying the token role is the responsibility of the parser, not the scanner. The scanner should recognize tokens without using any contextual information.
source: http://www.codeproject.com/Articles/22650/Irony-NET-Compiler-Construction-Kit

How do you use positional in Scala parser with files?

I'd like to get position information using the positional parser and Positional trait. I'd like to read the file Im parsing into a string (or something I can convert to a string) and then parse it while retaining the position information. Here's what I've found on 'positional':
https://wiki.scala-lang.org/display/SW/Parsing
...which off-handedly mentions StreamReader and CharArrayReader. Are there other options? Reading my file into something that can be used with a CharArrayReader may be what I need. If so, how does that work? If not, what should I be doing?
(FYI, StreamReader is out because I want to read and keep the file long before I parse it, not because I want to waste memory.)
Direct your favorite browser to the ScalaDoc HTML and navigate to the package scala.util.parsing.input where you'll find the available positioned readers:
CharArrayReader
CharSequenceReader
PagedSeqReader
StreamReader
Addendum
Creating one of these readers, CharArrayReader, e.g., is simple:
val car = new CharArrayReader("sample input")
Now you have the Reader to feed your parser.

How to get the string that caused parse error?

Suppose I have this code:
(handler-case (read ...)
(parse-error (condition)
(format t "What text was I reading last to get this error? ~s~&"
(how-to-get-this-text? condition))))
I can only see the parse-namestring accessors, but it gives the message of the error, not the text it was parsing.
EDIT
In my case the problem is less generic, so an alternative solution not involving the entire string that failed to parse can be good too.
Imagine this example code I'm trying to parse:
prefix(perhaps (nested (symbolic)) expressions))suffix
In some cases I need to stop on "suffix" and in others, I need to continue, the suffix itself has no other meaning but just being an indicator of the action the parser should take next.
READ parses from a stream, not a string. The s-expression can be arbitrarily long. Should READ keep a string of what's been read?
What you might need is a special stream. In standard Common Lisp there is no mechanism for user defined streams. But in real life every implementation has such extensible streams. See for example 'gray streams'.
http://www.sbcl.org/1.0/manual/Gray-Streams.html
There's no standard function to do it. You might be able to brute-force something with read-from-string, but whatever you do, it will require some extra work.

When use Jena to read RDF(N-Triple),it throws out "com.hp.hpl.jena.shared.InvalidPropertyURIException "

I download an N-Triple file from dbpedia,but when I wanted to read it in to Jena model,some exceptions throw out.Below is a part of this file:
<http://dbpedia.org/resource/Jacky_Cheung>
<http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u9AD4\u91CD"#zh .
<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u8EAB\u9AD8"#zh .
<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u8840\u578B"#zh .
<http://dbpedia.org/resource/Jacky_Cheung> <http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA> "\u8A9E\u8A00"#zh .
The exception throws out is:
Exception in thread "main" com.hp.hpl.jena.shared.InvalidPropertyURIException: http://dbpedia.org/resource/Template:%E8%97%9D%E4%BA%BA
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.splitTag(BaseXMLWriter.java:393)
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.startElementTag(BaseXMLWriter.java:368)
at com.hp.hpl.jena.xmloutput.impl.Unparser$3.wTypeStart(Unparser.java:671)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyEltValueString(Unparser.java:488)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyEltValue(Unparser.java:473)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyElt(Unparser.java:339)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wPropertyEltStar(Unparser.java:811)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wTypedNodeOrDescriptionLong(Unparser.java:797)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wTypedNodeOrDescription(Unparser.java:727)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wDescription(Unparser.java:686)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wObj(Unparser.java:642)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wObjStar(Unparser.java:317)
at com.hp.hpl.jena.xmloutput.impl.Unparser.wRDF(Unparser.java:298)
at com.hp.hpl.jena.xmloutput.impl.Unparser.write(Unparser.java:200)
at com.hp.hpl.jena.xmloutput.impl.Abbreviated.writeBody(Abbreviated.java:143)
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.writeXMLBody(BaseXMLWriter.java:500)
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:472)
at com.hp.hpl.jena.xmloutput.impl.Abbreviated.write(Abbreviated.java:128)
at com.hp.hpl.jena.xmloutput.impl.BaseXMLWriter.write(BaseXMLWriter.java:458)
at com.hp.hpl.jena.rdf.model.impl.ModelCom.write(ModelCom.java:277)
at jena.ReadRDF.main(ReadRDF.java:45)
Java Result: 1
The problem is caused by "%E8%97%9D%E4%BA%BA",when use URIref.decode() to decode URI with this string,"%E8%97%9D%E4%BA%BA" represents tow Chinese characters.
But when I use Sesame to read this N-Triple file,it is OK without any problem.
My questions are that whether any way to solve this problem in Jena,and why dbpedia choose N-Triple to be the default RDF syntax?.It works bad with Non-ASCII languages.
Also ,I want to know that,if I want to publish my RDF data as Linked data,but the URIs of resources come with some Chinese and Japanese,should I decode the URIs at first?
Well, your question isn't completely clear because you asked about "reading in a Jena model" but the stacktrace you quoted actually starts with a call to the writer.
Jena, in general, tries very hard to conform to the relevant RDF recommendations from W3C and IETF. In particular, it tries to not generate any URI's which do not conform to the rules for valid URI's. This is compounded in the case of writing XML, because most RDF identifiers are not legal XML element ID's, meaning that you have to split the URI somewhere and use XML namespaces to make the full identifier. Not all RDF toolkits are as particular as Jena is about conforming to some of the rules in the standards.
Things you can try:
do you need to call Model.write() as part of your loading process? You should be able to load and process a model, without the check for legal URI's being invoked.
try writing the output using Turtle format, rather than XML. Turtle doesn't have the same restrictions as XML, and it's a heck of a lot easier for humans to read as well.
if there are particular ill-formed URI's in the data you are loading, look to see if there is a newer version of the data. Illegal URI's in dbpedia has been an issue in the past. If the illegal URI's are still there in the latest version, notify the dbpedia team about them.
try pre-processing your data to remove triples containing illegal URI's before they enter your processing chain.
As for URI's containing Chinese and Japanese characters, Jena conforms to the IRI spec, so as long as your URI's conform to that you should be OK.

Antlr3 - HIDDEN token in the parser

Can you use a token defined in the lexer in a hidden channel in a single rule of the parser as if it were a normal token?
The generated code is Java...
thanks
When you construct a CommonTokenStream, you tell it what channel to use. Tokens on other channels will not be visible to the parser.
Yes you can use a hidden token in the Parser.
We do this all the time. The only problem is that you need to know when to look for it.
Antlr has a few pieces of terminology that it uses.
A Hidden token just travels on a separate stream. The user can always check for hidden tokens by calling getHiddenAfter or getHiddenBefore on a currently matched token.
Note: There may be more than one token hidden, before or after, a matched token so you should iterate through them.
A Discarded token is actually removed when you tell the lexer to discard it. It will never be seen by you again.

Resources