Extracting FOAF information from Jena

I'm new here, and I have a problem with FOAF. I used Jena to create a FOAF description like this:
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
<foaf:phone>12312312312</foaf:phone>
<foaf:nick>L</foaf:nick>
<foaf:name>zhanglu</foaf:name>
But I want the FOAF to look like this:
<foaf:Person>
<foaf:phone>12312312312</foaf:phone>
<foaf:nick>L</foaf:nick>
<foaf:name>zhanglu</foaf:name>
</foaf:Person>
What can I do?
This is my source code:
Model m = ModelFactory.createDefaultModel();
m.setNsPrefix("foaf", FOAF.NS);
Resource r = m.createResource(NS);  // NS is the person's URI, defined elsewhere
r.addLiteral(FOAF.name, "zhanglu");
r.addProperty(FOAF.nick, "L");
r.addProperty(FOAF.phone, "123123123");
r.addProperty(RDF.type, FOAF.Person);
FileOutputStream f = new FileOutputStream(fileName);
m.write(f);  // defaults to the basic RDF/XML writer
Can anyone tell me? Thanks.

First thing to say is that the two forms that you quote have exactly the same meaning to RDF - that is, they produce exactly the same set of triples when parsed into an RDF graph. For this reason, it's generally not worth worrying about the exact syntax of the XML produced by the writer. RDF/XML is, in general, not a friendly syntax to read. If you just want to serialize the Model, so that you can read it in again later, I would suggest Turtle syntax as it's more compact and easier for humans to read and understand.
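For illustration, the same data in Turtle would look something like this (the subject URI here is invented for the example; Jena can emit Turtle with m.write(f, "TURTLE")):
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://example.org/zhanglu>
    a foaf:Person ;
    foaf:name "zhanglu" ;
    foaf:nick "L" ;
    foaf:phone "12312312312" .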
However, there is one reason you might want to care specifically about the XML serialization, which is if you want the file to be part of an XML processing pipeline (e.g. XSLT or similar). In this case, you can produce the format you want by changing the last line of your example:
m.write( f, "RDF/XML-ABBREV" );
or, equivalently,
m.write( f, FileUtils.langXMLAbbrev );

Related

Parsing and pretty printing the same file format in Haskell

I was wondering if there is a standard, canonical way in Haskell to write not only a parser for a specific file format, but also a writer.
In my case, I need to parse a data file for analysis. However, I also simulate data to be analyzed and save it in the same file format. I could write a parser using Parsec or something equivalent, and also write functions that produce the text output in the required format, but whenever I change my file format I would have to change two functions in my code. Is there a better way to achieve this goal?
Thank you,
Dominik
The BNFC-meta package https://hackage.haskell.org/package/BNFC-meta-0.4.0.3
might be what you're looking for:
"Specifically, given a quasi-quoted LBNF grammar (as used by the BNF Converter) it generates (using Template Haskell) a LALR parser and pretty printer for the language."
Update: found this package that also seems to fulfill the objective (not tested yet): http://hackage.haskell.org/package/syntax

Using ANTLR to analyze and modify source code; am I doing it wrong?

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
    CallExpression
        Identifier('foo')
        ArgumentList
            ArrayLiteral
                StringLiteral('a')
                StringLiteral('b')
                StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming a parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern-directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of building it yourself. This is harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling than "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure yourself instead of getting on with your task. Other useful features of practical program transformation systems typically include source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (as in our tool, the DMS Software Reengineering Toolkit), you'd be able to write parts of your example code changes using DMS transforms like these:
domain ECMAScript.

tag replace; -- says this is a special kind of temporary tree

rule barize(function_name:IDENTIFIER, list:expression_list, b:body):
    expression -> expression
    = " \function_name ( '[' \list ']' ) "
   -> " \function_name ( \firstarg\(\function_name\), \replace\(\list\) ) ";

rule replace_unit_list(s:character_literal):
    expression_list -> expression_list
    = " \replace\(\s\) " -> " compute_index_for\(\s\) ";

rule replace_long_list(s:character_list, list:expression_list):
    expression_list -> expression_list
    = " \replace\(\s\,\list\) " -> " compute_index_for\(\s\),\list ";
with rule-external "meta" procedures "firstarg" (which knows how to compute "bar" given the identifier "foo"; I'm guessing that's what you want to do) and "compute_index_for" (which, given a string literal, knows what integer to replace it with).
Individual rewrite rules have parameter lists "(...)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and a right-hand side acting as a replacement, both usually quoted in metaquotes ("), which separate rewrite-rule-language text from target-language (e.g. JavaScript) text. There are lots of meta-escapes (\) inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, which represent whatever type of tree the parameter stands for; or an external meta-procedure call (such as firstarg; you'll note that its argument list is metaquoted!); or finally, a "tag" such as "replace", which is a peculiar kind of tree that represents the intent to do more transformations later.
This particular set of rules works by replacing a candidate function call with the barized version, carrying the additional intent, "replace", to transform the list. The other two transformations realize that intent by transforming "replace" away: they process the elements of the list one at a time and push the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop.)
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it turns out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;

// Find the call to foo() in the tree produced by the rewriting grammar.
var functionCall = (
    from callExpression in program.Children.OfType<CallExpression>()
    where callExpression.Children[0].Text == "foo"
    select callExpression
).Single();

var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;

// Insert the new first argument, then replace each string literal in the
// array with its index; the token indexes carried by the tree nodes tell
// the TokenRewriteStream where to apply the edits.
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
    tokenStream.Replace(
        (array.Children[i] as StringLiteral).Token.TokenIndex,
        i.ToString());
}

var rewrittenCode = tokenStream.ToString();
Have you looked at the StringTemplate library? It is by the same person who wrote ANTLR, and the two are intended to work together. It sounds like it would do what you're looking for, i.e. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR

Which templating languages output HTML *as a tree of nodes*?

HTML is, first of all, a tree of nodes. It's not just text.
However, most templating engines handle their input and output as if it were just text; they don't care what happens around their tags, their {$foo}'s and <% bar() %>'s, and they don't care what they are outputting. Sometimes they happen to produce correct HTML, but that's just a coincidence; they didn't aim for it. All they wanted was to replace some funny marks in the text stream with their evaluation.
There are a few templating engines which do treat their output as a set of nodes; XSLT and Haml come to mind. For some tasks this has advantages: for example, you can automatically reformat (delete all empty text nodes, auto-indent, word-wrap). The result is guaranteed to be correct XML/SGML as long as you stick to operations that can't break it. Also, such a templating engine quotes strings automatically, and differently in text nodes and in attributes, because it always knows whether you're writing an attribute or a text node.
Moreover, it can conditionally remove a node from the output, because it knows where the node begins and ends, and it can do other non-trivial node operations.
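As a quick sketch of what that buys you, here is the same idea with Python's standard xml.etree.ElementTree (a tree API rather than a real templating engine, so this is just an illustration):
import xml.etree.ElementTree as ET

# Build the page as nodes, not as text.
html = ET.Element('html')
body = ET.SubElement(html, 'body')
p = ET.SubElement(body, 'p')
p.text = 'Fish & chips < 5'   # escaped automatically on output

print(ET.tostring(html, encoding='unicode'))
# <html><body><p>Fish &amp; chips &lt; 5</p></body></html>

body.remove(p)   # trivial: the tree knows where the node begins and ends
print(ET.tostring(html, encoding='unicode'))
# <html><body /></html>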
You might not like XSLT for its verbosity or its functional style, but it really helps that your template is xmllint-able XML and that your output is good SGML/XML.
So the question is: which template engines do you know that treat their output as a set of correct nodes, not just unstructured text?
I know XSLT, Haml and some obscure Python-based one. Moar!
"Which template engines do you know that treat their output as a set of correct nodes"
Surprisingly, ASP.NET does! You can change the HTML output of the page through a kind of DOM if you want: http://en.wikipedia.org/wiki/Asp.net#Rendering_technique
HTML combinator libraries, e.g.:
Wing Beats (F#)
SharpDOM (C#)
CityLizard (.NET, any language)
BlazeHtml (Haskell)
Ocsigen / Eliom (OCaml)
XML literals as a language feature can easily act as an "HTML template language":
VB.NET XML literals
Hasic (uses VB.NET XML literals)
Scala XML literals
Hamlet (uses Template Haskell)
though these are all embedded DSLs, dependent on their host language
There is a standard way of representing XML (or a subset of it such as XHTML) in Scheme known as SXML. This is one of the things that, in my opinion, make Scheme a good language for web development. It is possible to build up the contents of a document as a native Scheme list, and then render this to (correct) XHTML in one function call.
Here is an example that takes a simple text string, and wraps it as the contents of a one-paragraph HTML page. So the function as-page is acting as a template; its output is a Scheme list which can be easily translated to its equivalent HTML. Unbalanced or malformed tags are not possible with this approach.
(use-modules (sxml simple))
(define (as-page txt)
  `(html
     (head (title "A web page"))
     (body (p ,txt))))
(as-page "It works!!!!")
;; $2 = (html (head (title "A web page")) (body (p "It works!!!!")))
(sxml->xml (as-page "It works!!!!"))
;; $3 = <html><head><title>A web page</title></head><body><p>It works!!!!</p></body></html>
TAL (originally part of Zope but now implemented in a variety of languages) is XML-based. It's very logical and intention-revealing to work with: instead of shoving in a heap of text, you're telling the template something like "set the href attribute of this link to http://google.com/ and set its text content to 'Search Google'". You don't have to manage which strings need to be escaped: generally, if you intend something to be interpreted as markup, you put it in a template, and if you don't intend it to be interpreted as markup, you feed it in as an attribute value or text content and TAL will escape it correctly.
Basically, all templating engines which use XML as their file format (for defining templates) qualify. By using XML, they enforce that the template file must be well-formed.
[EDIT] Examples are: Genshi (Python) or JSP 2.0 (Java).
With the Nagare web framework, the views are always a tree of XML nodes, built directly in Python.
The tree can then be manipulated in Python, transformed with XSL, serialized to HTML or XHTML, etc.
(The 'nagare.namespaces' package comes with the Nagare project but can be used in any Python application.)
Example:
>>> from nagare.namespaces import xhtml_base
>>> h = xhtml_base.Renderer()    # the XHTML tree builder
>>>
>>> # Building the tree of nodes
>>> with h.html:
...     with h.body:
...         h << h.h1('Hello world')
...
>>> tree = h.root                # the tree root element
>>>
>>> print tree.write_xmlstring() # tree serialized to XML
<html><body><h1>Hello world</h1></body></html>
I maintain a list of push-style templating systems here:
http://perlmonks.org/?node_id=674273
And I am in the process of evaluating various Perl templating systems for their separation index:
http://bit.ly/bXaYt7
But the tree-based one is written by me, HTML::Seamstress -
http://search.cpan.org/dist/HTML-Seamstress/
The term "push-style" comes from Terence Parr's Paper "Enforcing Strict Model-View
Separation in Template Engines" -
http://www.cs.usfca.edu/~parrt/papers/mvc.templates.pdf
Also, http://snapframework.com/docs/tutorials/heist from Haskell's Snap framework seems to fit.
TAL is absolutely not push-style. It may be XML-based but it is pull-style (the most degenerate form of push-style).

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored in arrays. (This is just a sample; the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user, and they define the areas via tags. So it might look something like this:
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.
The basic question is: how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the basic structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain you much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.
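For instance, if the file were XML-ified along the lines of the tags already sketched, reading it takes only a few lines with a stock parser; here's a sketch with Python's standard xml.etree.ElementTree (the element names are invented to match the example):
import xml.etree.ElementTree as ET

doc = '''<file>
    <name>TheFileName</name>
    <values>5 3 1 6 1 3</values>
    <important-value>3.45</important-value>
</file>'''

root = ET.fromstring(doc)
name = root.findtext('name')                                # 'TheFileName'
values = [int(v) for v in root.findtext('values').split()]  # [5, 3, 1, 6, 1, 3]
important = float(root.findtext('important-value'))         # 3.45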
It depends on what control you have over the format of the file you are parsing. If you have no control then a parser-generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control). It's quite a bit of work but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format then create it with as much markup as necessary. I would always create such a file in XML as there are so many tools for processing it (not only the parsing, but also XPath, databases, etc.) In general we use ANTLR to parse semi-structured information into XML.
If you don't need for the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
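For example, the same data expressed as JSON parses in one call with a stock library; a minimal sketch with Python's standard json module (field names invented to match the example):
import json

doc = '{"name": "TheFileName", "values": [5, 3, 1, 6, 1, 3], "important": 3.45}'
data = json.loads(doc)
print(data['name'])     # TheFileName
print(data['values'])   # [5, 3, 1, 6, 1, 3]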
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.
Well, if it's "much more complicated", then, yes, a parser generator would be helpful. But, since you don't show the actual format of your file, how could anybody know what might be the right tool for the job?
I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.
If the format of the file is up to the user, can you even define a grammar for it?
It seems like you just want a lexer at best. Using ANTLR just for the lexer part is possible, but seems like overkill.

I have to read invoice data from a convoluted ASCII file, how would you guard against future changes?

I have to read invoice ASCII files that are structured in a really convoluted way, for example:
55651108 3090617.10.0806:46:32101639Example Company Construction Company Example Road. 9 9524 Example City
There's actually additional stuff in there, but I don't want to confuse you any further.
I know I'm doomed if the client can't offer a better structure. For instance, 30906 is an iterative number that grows, and 101639 is the CustomerId. The whitespace between "Example Company" and "Construction Company" is of variable length. The field "Example Company" could itself contain whitespace of variable length, for instance "Microsoft Corporation Redmond". Same with the other fields. So there's no clear way to extract data from the latter part.
But that's not the question; I got carried away. My question is as follows:
If the input were somewhat structured and well defined, how would you guard against future changes in its structure? How would you design and implement a reader?
I was thinking of using a simple EAV model in my DB, and using text or XML templates that describe the input, the entity names, and their value types. I would parse the invoice files according to the templates.
"If the input was somewhat structured and well defined, how would you guard against future changes in its structure. How would you design and implement a reader?"
You must define the layout in a way you can flexibly pick it apart.
Here's a Python version:
class Field(object):
    """One fixed-width field: a name and a size in characters."""
    def __init__(self, name, size):
        self.name = name
        self.size = size
        self.offset = None  # filled in by Record

    def convert(self, raw):
        # Subclasses override this to parse dates, amounts, etc.
        return raw

class Record(object):
    """A record layout; computes each field's offset from the sizes."""
    def __init__(self, *fieldList):
        self.fields = fieldList
        self.fieldMap = {}
        offset = 0
        for f in self.fields:
            f.offset = offset
            offset += f.size
            self.fieldMap[f.name] = f

    def parse(self, aLine):
        self.buffer = aLine

    def get(self, aField):
        fld = self.fieldMap[aField]
        # Slice end is exclusive: size characters starting at offset.
        return fld.convert(self.buffer[fld.offset:fld.offset + fld.size])

    def __getattr__(self, aField):
        return self.get(aField)
Now you can define records:
myRecord = Record(
    Field('aField', 8),
    Field('filler', 1),
    Field('another', 5),
    Field('somethingElse', 8),
)
This gives you a fighting chance of picking apart some input in a reasonably flexible way.
myRecord.parse(input)
myRecord.get('aField')
Once you can parse, adding conversions is a matter of subclassing Field to define the various types (dates, amounts, etc.)
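A minimal sketch of that idea, assuming the convert hook on Field shown above (the field names and formats here are invented for the example):
from datetime import datetime

class DateField(Field):
    # Parses fixed-width text like '17.10.08' into a date.
    def convert(self, raw):
        return datetime.strptime(raw.strip(), '%d.%m.%y').date()

class AmountField(Field):
    # Parses text like '00123.45' into a float.
    def convert(self, raw):
        return float(raw)

invoice = Record(
    Field('invoiceNumber', 8),
    DateField('invoiceDate', 8),
    AmountField('total', 8),
)
invoice.parse('5565110817.10.08 0123.45')
invoice.get('invoiceDate')   # datetime.date(2008, 10, 17)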
I believe that a template describing the entity names and the value types is a good idea. Something like a "schema" for a text file.
What I would try to do is separate the reader from the rest of the application as much as possible. So the question really is: how do you define an interface that can accommodate changes in the parameter list? This may not always be possible, but if you rely on an interface to read the data, you can change the implementation of the reader without affecting the rest of the system.
Well, your file format looks much like the French protocol called Etebac, used between banks and their customers.
It's a fixed-width text format.
The best you can do is use some kind of unpack function:
$ perl -MData::Dumper -e 'print Dumper(unpack("A8 x A5 A8 A8 A6 A30 A30", "55651108 3090617.10.0806:46:32101639Example Company Construction Company Example Road. 9 9524 Example City"))'
$VAR1 = '55651108';
$VAR2 = '30906';
$VAR3 = '17.10.08';
$VAR4 = '06:46:32';
$VAR5 = '101639';
$VAR6 = 'Example Company';
$VAR7 = 'Construction Company';
What you should do is check, for every input field, that it is what it's supposed to be (that it looks like XX.XX.XX or YY:YY:YY, or that it does not start with a space) and abort if it doesn't.
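A minimal sketch of that kind of per-field check in Python (the patterns are invented to match the fields above):
import re

def check(value, pattern):
    # Abort if a field doesn't look the way it's supposed to.
    if re.fullmatch(pattern, value) is None:
        raise ValueError('unexpected field content: %r' % value)

check('17.10.08', r'\d{2}\.\d{2}\.\d{2}')   # date
check('06:46:32', r'\d{2}:\d{2}:\d{2}')     # time
check('101639', r'\S.*')                    # must not start with a space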
I'd have a database of invoice data, with tables such as Company, Invoices, Invoice_Items. It depends on the complexity: do you wish to record your orders as well, and then link invoices to the orders, and so on? But I digress...
I'd have an in-memory model of the database model, but that's a given. If XML output and input were needed, I would have an XML serialisation of the model for supplying the invoices as data elsewhere, and a SAX parser to read it back in. Some APIs might make this trivial to do, or maybe you just want to expose a web service to your repository if you are going to have clients reading from you.
As for reading in the text files: there isn't much information about them. Why would their format change? Where are they coming from? Are you replacing this system, or will it keep on running, with you as just a new backend that it feeds? You say the number of spaces is variable: is that just because the format is fixed-width columns? I would create a reader that reads them into your model, and hence your database schema.
