Populate an existing ontology from a CSV file using Jena

How can I read an ontology (an OWL file) into a Jena OntModel, populate that model from a CSV file, and then write the populated OntModel back out to an OWL file?

There are three parts to your question:
reading an OWL file into a Jena Model
converting a CSV file into RDF
writing the contents of a Jena Model out to a file
The first and third of these are easy with Jena (see Model.read() and Model.write() methods, and the FileManager for some additional convenience support for reading from different locations).
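For example, a minimal sketch of the read and write steps (the file names are placeholders, and the org.apache.jena package prefix assumes Jena 3.x; older releases used the com.hp.hpl.jena prefix):

import java.io.FileOutputStream;
import org.apache.jena.ontology.OntModel;
import org.apache.jena.ontology.OntModelSpec;
import org.apache.jena.rdf.model.ModelFactory;

public class ReadWriteOwl {
    public static void main(String[] args) throws Exception {
        // Read the existing ontology into an OntModel (plain in-memory OWL, no inference)
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
        model.read("ontology.owl");

        // ... populate the model from the CSV here (see the sketch further below) ...

        // Write the populated model back out as RDF/XML
        try (FileOutputStream out = new FileOutputStream("populated.owl")) {
            model.write(out, "RDF/XML");
        }
    }
}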
The second part is the tricky one. Typically, when converting a CSV file to RDF, we assume that each row represents one RDF resource and its properties. You have three tasks to achieve:
Determining the URI that represents the resource, based on some key in the row of data
Determining the URI of the RDF property that represents the value of a given column
Mapping each column value to an appropriate resource URI or literal value.
For example, consider the following CSV:
id,name,age,occupation
2718,fred,107,ninja
We can use the first row of the CSV to suggest RDF predicate names. foaf:name and foaf:age would be appropriate choices for the name and age columns, but we may need a new predicate in our own namespace, such as http://example.com/vocab#occupation, for the occupation column. The resource URI will be based on whatever the key is for the data, in this case the id column, suggesting that the URI for the resource denoted by the first data row will be http://example.com/data/employee/2718. Finally we have to map the data: the name is just a string, the age is an integer, and the occupation is a resource. Given those choices, we may end up with output like:
<http://example.com/data/employee/2718>
    a foaf:Person ;
    foaf:name "fred" ;
    foaf:age "107"^^xsd:integer ;
    example_com:occupation <http://dbpedia.org/resource/Ninja> .
The W3C R2RML specification defines a standardised mapping language for performing these kinds of translations, and various implementations of R2RML are available. Of course, if your mapping is fairly stable, it would be perfectly straightforward just to write some code to perform the translation from CSV for your particular input data.
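To make that hand-written approach concrete for the example row above, here is a rough sketch, assuming Jena 3.x, a naive comma-split of the CSV (no quoted fields), and the namespaces used in the example; the occupation lookup is simplified to minting a URI in our own namespace rather than resolving a DBpedia resource:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class CsvToRdf {
    static final String DATA_NS  = "http://example.com/data/employee/";
    static final String VOCAB_NS = "http://example.com/vocab#";
    static final String FOAF_NS  = "http://xmlns.com/foaf/0.1/";

    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        Property name       = model.createProperty(FOAF_NS + "name");
        Property age        = model.createProperty(FOAF_NS + "age");
        Property occupation = model.createProperty(VOCAB_NS + "occupation");
        Resource personCls  = model.createResource(FOAF_NS + "Person");

        try (BufferedReader in = new BufferedReader(new FileReader("employees.csv"))) {
            in.readLine();                       // skip the header row
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                Resource employee = model.createResource(DATA_NS + cols[0]);
                employee.addProperty(RDF.type, personCls);
                employee.addProperty(name, cols[1]);
                employee.addProperty(age, model.createTypedLiteral(cols[2], XSDDatatype.XSDinteger));
                // In real data, mapping the occupation string to an external resource
                // (e.g. a DBpedia URI) would be a lookup; here we mint a local URI.
                employee.addProperty(occupation, model.createResource(VOCAB_NS + cols[3]));
            }
        }
        model.write(System.out, "TURTLE");
    }
}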

Related

Partial deserialization with Apache Avro

Is it possible to deserialize a subset of fields from a large object serialized using Apache Avro without deserializing all the fields? I'm using GenericDatumReader and the GenericRecord contains all the fields.
I'm pretty sure you can't do it using GenericDatumReader, but my question is whether it is possible given the binary format of Avro.
Conceptually, binary serialization of Avro data is in-order and depth-first. As you traverse the data, record fields are serialized one after the other, lists are serialized from the top to the bottom, etc.
Within one object, there are no markers to separate fields, no tags to identify specific fields, and no index into the binary data to help quickly scan to specific fields.
Depending on your schema, you could write custom code to skip some kinds of data ... for example, if a field is a LIST of FIXED bytes, you could read the size of the list and just jump over the data to the next field. This is pretty specific and wouldn't work for most Avro types though (notably integers are variable length when encoded).
Even in that unlikely case, I don't believe there are any helpers in the Java SDK that would be useful.
In brief, Avro isn't designed to do that, and you're probably not going to find a satisfactory way to do a projection on your schema without deserializing the entire object. If you have a collection of records, column-oriented persistence like Parquet is probably the right thing to do!
It is possible if the fields you want to read occur first in the record. We do this in some cases where we want to read only the header fields of an object, not the full data which follows.
You can create a "subset" schema containing just those first fields, and pass this to GenericDatumReader. Avro will deserialise those fields, and anything which comes after will be ignored, because the schema doesn't "know" about it.
But this won't work for the general case where you want to pick out fields from within the middle of a record.
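For what it is worth, a rough sketch of the subset-schema approach, assuming an Avro data file employees.avro, a hand-written reader schema employee-header.avsc that keeps only the leading fields (using the same record name as the full schema), and an id field kept in that subset (all names are illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SubsetRead {
    public static void main(String[] args) throws Exception {
        // Reader ("subset") schema containing only the first fields of the record
        Schema readerSchema = new Schema.Parser().parse(new File("employee-header.avsc"));

        // The writer schema comes from the data file itself; schema resolution
        // populates the fields present in the reader schema and ignores the rest.
        GenericDatumReader<GenericRecord> datumReader = new GenericDatumReader<>(readerSchema);
        try (DataFileReader<GenericRecord> fileReader =
                new DataFileReader<>(new File("employees.avro"), datumReader)) {
            for (GenericRecord record : fileReader) {
                System.out.println(record.get("id"));   // a field kept in the subset schema
            }
        }
    }
}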

How to define large set of properties of a node without having to type them all?

I have imported a CSV file into Neo4j. I have been trying to define a large number of properties (all the columns) for each node. How can I do that without having to type in each name?
I have been trying this:
USING PERIODIC COMMIT
load csv WITH headers from "file:///frozen_catalog.csv" AS line
//Creating nodes for each product id with its properties
CREATE (product:product{id : line.`o_prd`,
Gross_Price_Average: TOINT(line.`Gross_Price_Average`),
O_PRD_SPG: TOINT(line.`O_PRD_SPG`)});
You can add properties from maps. For example:
LOAD CSV WITH HEADERS FROM "http://data.neo4j.com/northwind/products.csv" AS row
MERGE (P:Product {productID: row.productID})
SET P += row
http://neo4j.com/docs/developer-manual/current/cypher/clauses/set/#set-adding-properties-from-maps
The LOAD CSV command cannot perform automatic type conversion to integers on certain fields; that must be done explicitly (though you can avoid having to mention every other field explicitly by using the map projection feature to transform your line data before setting it via stdob--'s suggestion).
You may want to take a look at Neo4j's import tool, as this will allow you to specify field type in headers, which should perform type conversion for you.
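For instance, a header along these lines (hypothetical, reusing the column names from the question and assuming o_prd is the node ID column) tells the import tool which fields to treat as integers:
o_prd:ID,Gross_Price_Average:int,O_PRD_SPG:int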
That said, 77 columns is a lot of data to all store on individual nodes. You may want to take another look at your data and figure out if some of those properties would be better modeled as nodes with their own label with relationships to your product nodes. You mentioned some of these were categorical properties. Categories are well suited to be modeled separately as nodes instead of as properties, and maybe some of your other properties would work better as nodes as well.

Difference between Direct Mapping and R2RML

I've tried to figure out what the differences between the two RDB-to-RDF mapping languages, Direct Mapping and R2RML, are.
I understand that both languages produce RDF that stands for a virtual RDF graph, which can be accessed via SPARQL.
So what's the point in having two W3C languages/standards doing the same thing?
The two standards don't do the same thing.
Direct Mapping is a default, convention-based algorithm to convert relational data into RDF graphs. It defines how tables, primary keys, relationships, etc. are converted.
On the other hand, R2RML is a language with which you can create your own mappings, including the Direct Mapping. For example, it gives you various ways to construct URIs, map tables to RDF classes, or map custom SQL SELECT statements instead of tables.
R2RML defines a relaxed variant of the Direct Mapping intended as a default mapping for further customization.
So, R2RML actually includes a definition of the Direct Mapping. Implementing tools can generate mappings from an existing database, which can then be further adjusted.
RDB-to-RDF mapping tools like D2RQ and SPIDER use a mapping language to expose a relational database as RDF on the fly, which means the data are converted to RDF as they are accessed. Data can either be converted directly without any user customization, or users can specify the columns and the mapping predicates themselves. The former is called direct mapping and is usually used for simple relational databases; for a relational database with a complex structure, the R2RML language is used for mapping.

How to create dynamic parser?

I want to create something I call a dynamic parser.
My project input is a data file such as XML, Excel, CSV, and so on, and I must parse it, extract its records and fields, and finally save them to a SQL Server database.
My problem is that the fields of each record are dynamic, so I cannot write the parser at development time; I must provide the parser at run time. By dynamic I mean that a user selects each record's fields using a Web UI. So, I know the number of fields in each record at run time, along with some information about each field, such as its name.
I discussed this type of project in question titled 'Design Pattern for Custom Fields in Relational Database'.
I also looked at parser generators, but I did not get enough information about them, and I don't know whether they are really related to my problem or not.
Is there any design pattern for this type of problem?
If you know the number of fields and the field names, then extract the data from the file and build the insert query using string concatenation.
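As a rough sketch (assuming a JDBC connection, a target table whose column names match the field names chosen in the Web UI, and hypothetical names throughout), you can concatenate the column list while still binding the values as parameters:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DynamicInsert {
    // Builds and executes an INSERT for whatever fields were selected at run time.
    // Column names are concatenated into the SQL text (validate them against the
    // user's UI selection, since they come from user input); values are bound as parameters.
    static void insertRecord(Connection conn, String table,
                             List<String> columns, Map<String, Object> record) throws Exception {
        String columnList   = String.join(", ", columns);
        String placeholders = columns.stream().map(c -> "?").collect(Collectors.joining(", "));
        String sql = "INSERT INTO " + table + " (" + columnList + ") VALUES (" + placeholders + ")";

        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int i = 1;
            for (String col : columns) {
                ps.setObject(i++, record.get(col));     // one placeholder per selected field
            }
            ps.executeUpdate();
        }
    }
}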

Use CSV to populate Neo4j

I am very new to Neo4j and still learning this graph database. I need to load a CSV file into a Neo4j database. I have been trying for two days, but I haven't been able to find good information on reading a CSV file into Neo4j. Please suggest sample code or blog posts on reading a CSV file into Neo4j.
Example:
Suppose I have a CSV file like the one below; how can I read it into Neo4j?
id name language
1 Victor Richards West Frisian
2 Virginia Shaw Korean
3 Lois Simpson Belarusian
4 Randy Bishop Hiri Motu
5 Lori Mendoza Tok Pisin
You may want to try https://github.com/sroycode/neo4j-import
This populates data directly from a pair of CSV files (entries must be comma-separated).
To build: (you need maven)
sh build.sh
The nodes file has a mandatory field id and any other fields you like
NODES.txt
id,name,language
1,Victor Richards,West Frisian
2,Virginia Shaw,Korean
3,Lois Simpson,Belarusian
The relationships file has 3 mandatory fields: from, to, and type. Assuming you have a field age (a long integer) and a field info, the relationships file will look like:
RELNS.txt
from,to,type,age#long,info
1,2,KNOWS,10,known each other from school
1,3,CLUBMATES,5,member of country club
Running:
sh run.sh graph.db NODES.txt RELNS.txt
will create graph.db in the current folder which you can copy to the neo4j data folder.
Note:
If you are using a Neo4j version later than 1.6.*, please add this line to conf/neo4j.properties:
allow_store_upgrade = true
Have fun.
Please take a look at https://github.com/jexp/batch-import
It can be used as a starting point.
There is nothing available to generically load CSV data into Neo4j because the source and destination data structures are different: CSV data is tabular whereas Neo4j holds graph data.
In order to achieve such an import, you will need to add a separate step to translate your tabular data into some form of graph (e.g. a tree) before it can be loaded into Neo4j. Taking the tree structure further as an example, the page below shows how XML data can be converted into Cypher which may then be directly executed against a Neo4j instance.
http://geoff.nigelsmall.net/xml2graph/
Please feel free to use this tool if it helps (bear in mind it can only deal with small files) but this will of course require you to convert your CSV to XML first.
Cheers
Nigel
There is probably no ready-made CSV importer for Neo4j, so you must import it yourself.
I usually do it via Gremlin's g.loadGraphML() function.
http://docs.neo4j.org/chunked/snapshot/gremlin-plugin.html#rest-api-load-a-sample-graph
I parse my data with an external script into the XML syntax and load the resulting XML file. You can view the syntax here:
https://raw.github.com/tinkerpop/gremlin/master/data/graph-example-1.xml
Parsing a 100 MB file takes a few minutes.
In your case, what you need is a simple bipartite graph with vertices consisting of users and languages, and "speaks" edges. If you know some programming, create user nodes with the properties id and name, unique language nodes with the property name, and relationships connecting each user to the appropriate language. Note that user nodes may be duplicated, whereas language nodes can't be.
I believe your question is too generic. What does your CSV file contain? The logical meaning of the contents of a CSV file can vary a great deal. As an example, consider two columns of IDs, which represent entities connected to each other:
3921 584
831 9891
3841 92
...
In this case you could either write a BatchInserter code snippet, which would import it faster; see http://docs.neo4j.org/chunked/milestone/batchinsert.html.
Or you could import using the regular GraphDatabaseService with transaction sizes of a couple of thousand inserts for performance. See how to set up and use the embedded graph database at http://docs.neo4j.org/chunked/milestone/tutorials-java-embedded.html.
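For the two-ID-column example above, a minimal BatchInserter sketch might look like the following. It assumes the 1.9/2.x-era API that those docs describe (the API has changed in later releases), and the pairs.csv file name and CONNECTED_TO relationship type are purely illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class CsvBatchInsert {
    public static void main(String[] args) throws Exception {
        BatchInserter inserter = BatchInserters.inserter("graph.db");
        RelationshipType connectedTo = DynamicRelationshipType.withName("CONNECTED_TO");
        Map<Long, Long> seen = new HashMap<>();           // CSV id -> internal node id
        try (BufferedReader in = new BufferedReader(new FileReader("pairs.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] ids = line.trim().split("\\s+"); // two whitespace-separated ids per line
                long from = nodeFor(inserter, seen, Long.parseLong(ids[0]));
                long to   = nodeFor(inserter, seen, Long.parseLong(ids[1]));
                inserter.createRelationship(from, to, connectedTo, null);
            }
        } finally {
            inserter.shutdown();                          // flushes the store to disk
        }
    }

    // Create a node the first time an id is seen, reuse it afterwards.
    static long nodeFor(BatchInserter inserter, Map<Long, Long> seen, long csvId) {
        return seen.computeIfAbsent(csvId, id -> inserter.createNode(MapUtil.map("id", id)));
    }
}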
