SAREF core-based ontology and multiple measurements

I have multiple devices, each of which can provide multiple measurement instances. How do I take this cardinality into account when designing a SAREF-based ontology? I'm using Protégé.
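For illustration, one way such a cardinality constraint could be expressed is with an OWL restriction on the measurement property. Below is a minimal sketch using owlready2, with placeholder class names and a placeholder IRI rather than the official SAREF namespace, so treat it as illustrative only:

    from owlready2 import Thing, ObjectProperty, get_ontology

    # Placeholder IRI; a real project would import the actual SAREF core ontology.
    onto = get_ontology("http://example.org/my-saref-extension.owl")

    with onto:
        class Device(Thing): pass
        class Measurement(Thing): pass

        class makesMeasurement(ObjectProperty):
            domain = [Device]
            range = [Measurement]

        class TemperatureSensor(Device): pass

        # Cardinality restriction: every TemperatureSensor makes at least 2 measurements.
        TemperatureSensor.is_a.append(makesMeasurement.min(2, Measurement))

    onto.save(file="my-saref-extension.owl", format="rdfxml")

Opening the saved file in Protégé shows the restriction in the class description; the equivalent restriction can also be added directly in Protégé's class editor.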

Related

Limited number of clients used in federated learning

I just started studying federated learning and want to apply it to a certain dataset, and some questions have come up.
My data contains records of 3 categories, each of which has 3 departments. I am planning to build 3 different federated learning models, one per category, and treat the three departments of that category as the distributed clients.
Is this possible, or does building federated learning models require thousands of clients?
Thanks
Difficult to say from what you have provided in your question. Usually, when building a federated learning system, you are extending your centralized approach to one with data split/partitioned between segregated clients. Depending on the type of data you have, the type of task you are trying to solve, and the amount of data required to solve the task in a centralized approach, these factors (along with others) will determine how many clients you can use and how much data is required at each client. Additionally, the aggregation method that you wish to use to combine the parameters from different clients will affect this. I suggest experimenting with different client numbers and partitioning methods and seeing what suits your needs.
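As a toy illustration of that aggregation step, here is a minimal federated-averaging (FedAvg-style) sketch with three clients and synthetic data. A real setup would use a framework such as TensorFlow Federated or Flower; the model, learning rate, and round count here are arbitrary:

    import numpy as np

    # Toy FedAvg-style sketch: 3 clients (e.g. the 3 departments), each with its
    # own synthetic data; the "model" is just a linear-regression weight vector.
    rng = np.random.default_rng(0)
    clients = [(rng.normal(size=(50, 4)), rng.normal(size=50)) for _ in range(3)]

    def local_update(w, X, y, lr=0.01, epochs=5):
        # a few gradient-descent steps on the client's own data
        for _ in range(epochs):
            grad = X.T @ (X @ w - y) / len(y)
            w = w - lr * grad
        return w

    w_global = np.zeros(4)
    for _ in range(10):  # communication rounds
        local_weights = [local_update(w_global, X, y) for X, y in clients]
        sizes = [len(y) for _, y in clients]
        # aggregation: average client models, weighted by local data size
        w_global = np.average(local_weights, axis=0, weights=sizes)

    print(w_global)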

How to combine ontologies?

I am a beginner in the field of ontologies, ontology alignment, and composition of ontologies. What is the purpose of composing ontologies, on what basis is it performed, and how?
One of the main advantages of using ontologies is knowledge sharing. Different people from various backgrounds might develop ontologies for the same domain. This will often result in different labels for the same concepts or relations. In order to take advantage of having multiple ontologies in the same domain, for example to build a more comprehensive and expressive domain ontology, ontology matching/alignment comes into play. In ontology matching, a mapping between the concepts and relations of the various ontologies is created.
For example, before the National Cancer Institute came up with the first version of their cancer ontology, there were multiple ontologies modelling cancer. They started by combining the various available ontologies into a central, more reliable ontology.
There are various algorithms for ontology matching. The algorithms normally are categorised based on:
input
process
output
Broadly speaking, you can match either on an element-to-element basis or based on structure. The resources that can be used for matching include linguistic resources such as WordNet for semantic matching, domain-specific resources, statistical approaches, taxonomies, various models, etc. There is a great deal of research in this area, so you should really consider searching Google Scholar.
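To get a feel for the element-level case, here is a toy label matcher based purely on string similarity. The ontology labels and the acceptance threshold are made up; real alignment systems combine this with structural and semantic evidence such as WordNet:

    from difflib import SequenceMatcher
    from itertools import product

    # Toy element-level matcher: compares concept labels from two (made-up)
    # ontologies by string similarity only.
    onto_a = ["Person", "Publication", "Author", "Conference"]
    onto_b = ["Human", "Paper", "Writer", "ConferenceEvent"]

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    candidates = [(a, b, similarity(a, b)) for a, b in product(onto_a, onto_b)]
    for a, b, score in sorted(candidates, key=lambda c: -c[2]):
        if score > 0.6:  # arbitrary acceptance threshold
            print(f"{a} -> {b} (score={score:.2f})")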

Neo4J Performance Benchmarking

I have created a basic implementation of a high-level client over Neo4j (https://github.com/impetus-opensource/Kundera/tree/trunk/kundera-neo4j) and want to compare its performance with the native Neo4j driver (and maybe Spring Data too). This way I would be able to determine the overhead my library adds over the native driver.
I plan to create an extension of YCSB for Neo4J.
My question is: what should be considered the basic unit to be written into Neo4j (a single node, or a couple of nodes joined by an edge)?
What's the current practice in the Neo4j world? How do people who benchmark Neo4j performance do it?
There's already been some work for benchmarking Neo4J with Gatling: http://maxdemarzi.com/2013/02/14/neo4j-and-gatling-sitting-in-a-tree-performance-t-e-s-t-ing/
You could maybe adapt it.
See graphdb-benchmarks
The project graphdb-benchmarks is a benchmark between popular graph databases. Currently the framework supports Titan, OrientDB, Neo4j and Sparksee. The purpose of this benchmark is to examine the performance of each graph database in terms of execution time. The benchmark is composed of four workloads: Clustering, Massive Insertion, Single Insertion and Query Workload. Every workload has been designed to simulate common operations in graph database systems.
Clustering Workload (CW): CW consists of a well-known community detection algorithm for modularity optimization, the Louvain Method. We adapt the algorithm on top of the benchmarked graph databases and employ cache techniques to take advantage of both graph database capabilities and in-memory execution speed. We measure the time the algorithm needs to converge.
Massive Insertion Workload (MIW): we create the graph database and configure it for massive loading, then populate it with a particular dataset. We measure the time for the creation of the whole graph.
Single Insertion Workload (SIW): we create the graph database and load it with a particular dataset. Every object insertion (node or edge) is committed directly and the graph is constructed incrementally. We measure the insertion time per block, which consists of one thousand edges and the nodes that appear during the insertion of these edges.
Query Workload (QW): Execute three common queries:
FindNeighbours (FN): finds the neighbours of all nodes.
FindAdjacentNodes (FA): finds the adjacent nodes of all edges.
FindShortestPath (FS): finds the shortest path between the first node and 100 randomly picked nodes.
One way to performance-test is to use e.g. http://gatling-tool.org/. There is work underway to create benchmark frameworks at http://ldbc.eu. Otherwise, benchmarking is highly dependent on your domain, dataset, and the queries you are trying to run. Maybe you could start at https://github.com/neo4j/performance-benchmark and improve on it?
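If you just need a rough number for the "two nodes joined by an edge" unit mentioned in the question, a micro-benchmark along the lines of the single-insertion idea above could look like this sketch using the official Neo4j Python driver; the URI, credentials, labels, and unit count are placeholders:

    from time import perf_counter
    from neo4j import GraphDatabase  # official Neo4j Python driver (v5.x)

    # Hypothetical micro-benchmark: treat "two nodes joined by an edge" as the
    # write unit and time N such units; URI and credentials are placeholders.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def write_unit(tx, i):
        tx.run("CREATE (a:Item {id: $a})-[:LINKS]->(b:Item {id: $b})", a=2 * i, b=2 * i + 1)

    N = 1000
    with driver.session() as session:
        start = perf_counter()
        for i in range(N):
            session.execute_write(write_unit, i)  # one transaction per unit
        elapsed = perf_counter() - start

    print(f"{N} node-pair units in {elapsed:.2f}s ({N / elapsed:.0f} units/s)")
    driver.close()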

What is the advantage of RDF and triple stores over Neo4j?

Neo4j is a really fast and scalable graph database; it seems that it can be used in business projects, and it is free, too!
At the same time, there are no RDF triple stores that work well with large data or deliver high-speed access. What is more, free RDF triple stores perform even worse.
So what is the advantage of RDF and RDF triple stores over Neo4j?
The advantage of using a triple store for RDF rather than Neo4j is that that's what triple stores are designed for. Neo4j is pretty good for many use cases, but in my experience its performance for loading and querying RDF is well below that of all dedicated RDF databases.
It's a fallacy that RDF databases don't scale or are not fast. Sure, they're not yet up to the performance and scale levels of relational databases, but relational databases have a 50-year head start. Many triple stores scale into the billions of triples, provide 'standard' enterprise features, and provide great performance for many use cases.
If you're going to use RDF for a project, use a triple store; it's going to provide the best performance and set of features/APIs for working with RDF to build your application.
RDF and SPARQL are standards, so you have a choice of multiple implementations, and can migrate your data from one RDF store to another.
Additionally, version 1.1 of the SPARQL query language is quite sophisticated (more expressive than most SQL implementations) and can do all kinds of queries that would require a lot of code to be written in Neo4j.
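For example, a transitive traversal that would take explicit traversal code elsewhere is a one-liner with a SPARQL 1.1 property path. A small rdflib sketch with made-up inline data:

    from rdflib import Graph

    # SPARQL 1.1 property-path demo with rdflib; the data is a made-up inline graph.
    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/> .
        ex:alice ex:knows ex:bob .
        ex:bob   ex:knows ex:carol .
        ex:carol ex:knows ex:dave .
    """, format="turtle")

    # Transitive traversal in one line: everyone Alice can reach via ex:knows.
    query = """
        PREFIX ex: <http://example.org/>
        SELECT ?person WHERE { ex:alice ex:knows+ ?person }
    """
    for row in g.query(query):
        print(row.person)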
If you are going for graph mining (e.g. graph traversal) over triples, Neo4j is a good choice. For loading large numbers of triples, you might want to use its BatchInserter, which is fairly fast.
So I think it's all about your use case. Both technologies can and do overlap.
In my mind, it is mostly about the use case. Do you want a full knowledge graph, including the whole ecosystem of the Semantic Web? Then go for the triple store.
If you need a general-purpose graph (e.g. to store big data as a graph), use the property graph model. My reasoning is that the underlying philosophies are very different, and this starts with how the data is stored, which has implications for your usage scenario.
Let's do some off-the-top-of-my-head bullet points here to compare (a small code sketch contrasting the two models follows these bullets). Please take it with a grain of salt, as this is not a benchmark paper, just an experience-based five-minute write-up.
Property graph (Neo4j):
Think of nodes/edges as documents
Implemented on top of e.g. linked lists and key-value stores (deep searches and large data, e.g. via Gremlin)
Support for OWL/RDF, but not natively (as I see it, on a meta layer)
Really great when it comes to having the data in the graph and doing ML (it stores data as linked lists, which gives you nice vectors out of the box for ML)
Made for large data at scale.
Use cases (focus is on the data entities and not their classes):
Social graphs and other scenarios where you need deep traversal
Large data graphs, where you have a lot of documents that need to be searched in a schema-free graph manner.
Analyzing customer funnels from click data, etc. You want to move out of your relational schema because, actually, you are in a graph use case...
Triple store (e.g. RDF4J):
Think of data in maximum normal form, as triples (no redundant data at all)
Triples are stored with a context (named graphs); works heavily with indexes
Good for broad searches and specific knowledge extraction; deep searches are sometimes cumbersome
Scale is impressive and can reach trillions of triples with fast performance. But I would not recommend storing big data (e.g. time series) in the graph. The reason is the special way indexes are used; in order to scale horizontally, you may need to consider working with subgraphs...
Support for the whole ecosystem (SPARQL, SHACL, SWRL, etc.), which is a big plus
Use cases:
It's really about knowledge graphs. Do you need shape testing, rule evaluation, inference, and reasoning? Go for it, because you get to focus on the ontology and class structure!
Also, e.g., IoT scenarios where you want to model relations for logistics and smart factories, while the telemetry is stored somewhere else and only referenced in the graph.
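The sketch mentioned above: the same toy fact ("sensor1 produced a reading of 21.5") expressed once as a property-graph write (Cypher text) and once as RDF triples via rdflib. Names, namespaces, and values are made up, and this is only meant to show the difference in data shape:

    from rdflib import Graph, Literal, Namespace, RDF

    # The same toy fact, once as a property-graph write (Cypher text) and once
    # as RDF triples via rdflib.
    cypher = "CREATE (d:Device {name: 'sensor1'})-[:MEASURED]->(:Reading {value: 21.5})"

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.sensor1, RDF.type, EX.Device))
    g.add((EX.sensor1, EX.measured, EX.reading1))
    g.add((EX.reading1, EX.value, Literal(21.5)))

    print(cypher)                        # nodes with properties, typed relationship
    print(g.serialize(format="turtle"))  # everything as subject-predicate-object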
I have heard rumors that it takes a whole day to load 10M triples into Neo4j (it is actually the slowest one, because it's not built primarily for RDF).
Sesame and 4store are the fastest ones, but Jena has a powerful API.

Data mart vs. reporting cube: what are the differences?

The terms are used all over the place, and I don't know of crisp definitions. I'm pretty sure I know what a data mart is. And I've created reporting cubes with tools like Business Objects and Cognos.
I've also had folks tell me that a datamart is more than just a collection of cubes.
I've also had people tell me that a datamart is a reporting cube, nothing more.
What are the distinctions you understand?
Cube can (and arguably should) mean something quite specific: OLAP artifacts presented through an OLAP server such as MS Analysis Services or Oracle (née Hyperion) Essbase. However, it also gets used much more loosely. OLAP cubes of this sort use cube-aware query tools which use a different API from that of a standard relational database. Typically OLAP servers maintain their own optimised data structures (known as MOLAP), although they can be implemented as a front end to a relational data source (known as ROLAP) or in various hybrid modes (known as HOLAP).
I try to be specific and use 'cube' specifically to refer to cubes on OLAP servers such as SSAS.
Business Objects works by querying data from one or more sources (which could be relational databases, OLAP cubes, or flat files) and creating an in-memory data structure called a MicroCube which it uses to support interactive slice-and-dice activities. Analysis Services and MSQuery can make a cube (.cub) file which can be opened by the AS client software or Excel and sliced-and-diced in a similar manner. IIRC, recent versions of Business Objects can also open .cub files.
To be pedantic, I think Business Objects sits in a 'semi-structured reporting' space somewhere between a true OLAP system such as ProClarity and ad-hoc reporting tools such as Report Builder, Oracle Discoverer or Brio. Round trips to the Query Panel make it somewhat clunky as a pure stream-of-thought OLAP tool, but it does offer a level of interactivity that traditional reports don't. I see the sweet spot of Business Objects as sitting in two places: ad-hoc reporting by staff not necessarily familiar with SQL, and providing a scheduled report delivered in an interactive format that allows some drill-down into the data.
'Data Mart' is also a fairly loosely used term and can mean any user-facing data access medium for a data warehouse system. The definition may or may not include the reporting tools and metadata layers, reporting layer tables or other items such as Cubes or other analytic systems.
I tend to think of a data mart as the database from which the reporting is done, particularly if it is a readily definable subsystem of the overall data warehouse architecture. However, it is quite reasonable to think of it as the user-facing reporting layer, particularly if there are ad-hoc reporting tools such as Business Objects or OLAP systems that allow end-users to get at the data directly.
The term "data mart" has become somewhat ambiguous, but it is traditionally associated with a subject-oriented subset of an organization's information systems. Data mart does not explicitly imply the presence of a multi-dimensional technology such as OLAP and data mart does not explicitly imply the presence of summarized numerical data.
A cube, on the other hand, tends to imply that data is presented using a multi-dimensional nomenclature (typically an OLAP technology) and that the data is generally summarized as intersections of multiple hierarchies. (i.e. the net worth of your family vs. your personal net worth and everything in between) Generally, “cube” implies something very specific whereas “data mart” tends to be a little more general.
I suppose in OOP speak you could accurately say that a data mart “has-a” cube, “has-a” relational database, “has-a” nifty reporting interface, etc… but it would be less correct to say that any one of those individually “is-a” data mart. The term data mart is more inclusive.
A data mart is a collection of data for a specific business process. It is irrelevant how the data is stored. A cube stores data in a special, multi-dimensional way, unlike a table with rows and columns. A cube is to an OLAP database what a table is to a traditional database. A data mart can have tables or cubes. Cubes make analysis faster because they pre-calculate aggregations ahead of time.
As the name suggests, a cube is a structured multidimensional data set (typically pictured as three dimensions, each representing one side of a cube). A data mart is just a container and not a structure by itself, although it contains data sets organized flatly (as tables) into dimensions and facts.
The structure of a cube makes it easy to visualize or conceptualize data along various dimensions of a cube. Thus most business analysts or developers find it easy to query and interact with the cube.
Since a data mart is just a container with a bunch of tables, users need to first conceptualize and understand the dimensional structures before querying and analyzing the data.
Data mart traditionally has meant static data, usually date/time oriented, used by analysts for statistics, budgeting, performance and sales reporting, and other planning activities.
A cube is an OLAP database that pretty exhaustively converts OLTP data into a static, date/time-oriented schema and uses a query language that is not SQL but is built specifically for answering data-mart-type questions. It uses terms like measures, dimensions, and star schema rather than tables, columns, and rows. The most familiar analogy might be pivot tables in a spreadsheet.
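To make the pivot-table analogy concrete, here is a tiny pandas sketch with made-up numbers: one measure (sales) summarized at the intersections of two dimensions (region and quarter), which is essentially what a cube pre-computes:

    import pandas as pd

    # Pivot-table analogy with made-up numbers: one measure (sales) summarized
    # at the intersections of two dimensions (region x quarter).
    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "East", "West"],
        "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
        "sales":   [100, 150, 80, 120, 60, 40],
    })

    cube_like = pd.pivot_table(sales, values="sales", index="region",
                               columns="quarter", aggfunc="sum")
    print(cube_like)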
Remember:
Data Warehousing is the process of taking data from legacy and transaction database systems and transforming it into organized information in a user-friendly format to encourage data analysis and support fact-based business decision making.
A Data Warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
Kimball, for example, has consistently defined a data mart as a process-oriented subset of the overall organization's data, built on a foundation of atomic data, and dependent only on the physics of the data-measurement events, not on the anticipated user's questions.
Data marts are based on the source of data, not on a department’s view of data.
Data marts contain all atomic detail needed to support drilling down to the lowest level.
Data marts can be centrally controlled or decentralized.
Correct definition: process based, atomic data foundation, data measurement.
Misguided definition: department based, aggregate data only, user question based.
To me, a data mart is just a place where data gets dumped in a relatively flat, unusable format.
A cube takes that data and makes it dance.
I agree with Matthew. We tend to use the term 'data mart' for any data source that stores generic data and mappings used across various applications in an enterprise. We don't store measurable data in a data mart, so I see a data mart as one of multiple data sources for a cube. This, however, is just how we do it. I am sure there is nothing preventing you from storing measurable data in a data mart.
