Reason over a rather large ontology

I have a rather large ontology (about 80 GB). I think the reasoner brings the whole ontology into memory for the reasoning process, doesn't it?
Is there any way to reason over an 80 GB ontology with 16 GB of RAM?

Before you even start reasoning, the ontology will need to be loaded in full by Protégé.
On top of that, the reasoner will require further memory to do the actual reasoning. How much memory the reasoner needs is highly dependent on the expressivity of the ontology (e.g., EL++ vs. OWL DL), the characteristics of the ontology (e.g., lots of OR branching will require more memory), and the reasoner you plan to use. The following paper may be useful in this regard.
It may be worth thinking about the following:
(1) What are the inferences that you hope to derive through reasoning? If only a small number of entailments can be expected, it may not be worth the effort.
(2) Is it possible to break up the ontology into smaller modules in such a way that reasoning over a module will still give the required entailments? (See the sketch after this list.)
(3) If your ontology contains a large number of individuals, what you may need instead is a triple store over which you can execute rules, rather than a reasoner.
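Regarding (2), here is a rough sketch of what locality-based module extraction could look like with the OWL API (class and package names follow the OWL API tools module; the file name and seed IRI are placeholders, so treat this as an illustration rather than a recipe). Note that extracting the module still means loading the full ontology once; the gain is that the reasoner afterwards only has to work on the much smaller module.

```java
import java.io.File;
import java.util.Set;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLEntity;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import uk.ac.manchester.cs.owlapi.modularity.ModuleType;
import uk.ac.manchester.cs.owlapi.modularity.SyntacticLocalityModuleExtractor;

// Rough sketch: extract a locality-based module for a small seed signature so
// that reasoning only has to touch a fraction of the full ontology.
// The ontology file and the seed class IRI are placeholders.
public class ExtractModule {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology =
                manager.loadOntologyFromOntologyDocument(new File("big-ontology.owl"));

        // Seed signature: the entities whose entailments we actually care about.
        OWLEntity seedClass = manager.getOWLDataFactory()
                .getOWLClass(IRI.create("http://example.org/onto#MyClass"));
        Set<OWLEntity> signature = Set.of(seedClass);

        // STAR modules preserve all entailments over the seed signature.
        SyntacticLocalityModuleExtractor extractor =
                new SyntacticLocalityModuleExtractor(manager, ontology, ModuleType.STAR);
        Set<OWLAxiom> module = extractor.extract(signature);

        System.out.println("Module size: " + module.size() + " axioms");
    }
}
```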


Why is the concurrency performance of the R-tree bad?

Many papers mention that the concurrency performance is bad, but they don't explain why it is bad.
Can anyone give me a hint?
Querying is not problematic at all.
Only changes to the tree are somewhat expensive and require locking. They are worse than, e.g., B-trees, because you may need to update the bounding boxes all the way up to the root. The R*-tree also does a special kind of reinsertion to balance the tree. But overall, it's comparable to the B-tree and any other page-oriented tree: you need locking for the pages you write to.
When you look at the insertion/deletion strategies of R-trees (or the R*-tree, R+-tree, ...), you can see that an overflow/underflow of a node can easily cascade through large parts of the tree (this is called rebalancing). This requires locking a lot of nodes, which is obviously bad for concurrency.
Instead of locking you can attempt a copy-on-write strategy, but that would also be expensive because of copying a lot of nodes and a significant probability of conflicts with other writing threads.
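To make the write path concrete, here is a deliberately simplified sketch (the Node and Rect classes are hypothetical, not any library's API; splits, reinsertion, and child lists are omitted) of why a single insert may need to lock and update ancestors all the way up to the root:

```java
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch, not a real R-tree: illustrates how an insert propagates
// bounding-box updates (and therefore locks) from a leaf towards the root.
final class RTreeSketch {
    static final class Rect {
        double minX, minY, maxX, maxY;
        Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean contains(Rect r) {
            return minX <= r.minX && minY <= r.minY && maxX >= r.maxX && maxY >= r.maxY;
        }
        void expandToInclude(Rect r) {
            minX = Math.min(minX, r.minX); minY = Math.min(minY, r.minY);
            maxX = Math.max(maxX, r.maxX); maxY = Math.max(maxY, r.maxY);
        }
    }

    static final class Node {
        final Rect mbr;                        // minimum bounding rectangle
        final Node parent;                     // null for the root
        final ReentrantLock lock = new ReentrantLock();
        Node(Rect mbr, Node parent) { this.mbr = mbr; this.parent = parent; }
    }

    // Readers only descend; a writer may end up locking a whole leaf-to-root path.
    static void insert(Node leaf, Rect entry) {
        leaf.lock.lock();
        try {
            leaf.mbr.expandToInclude(entry);   // grow the leaf's box (child list omitted)
        } finally {
            leaf.lock.unlock();
        }
        // Propagate the enlarged bounding box towards the root; every ancestor
        // whose MBR no longer covers the new entry must be locked and updated.
        for (Node n = leaf.parent; n != null; n = n.parent) {
            n.lock.lock();
            try {
                if (n.mbr.contains(entry)) break;   // no further enlargement needed
                n.mbr.expandToInclude(entry);
            } finally {
                n.lock.unlock();
            }
        }
    }
}
```

The overflow handling and reinsertion mentioned above would enlarge the locked region further, which is exactly what hurts concurrent writers.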

What to look for when optimizing already heavily optimized embedded C++ code in terms of memory?

What should I look for, beyond the obvious, to further reduce the memory footprint of already heavily optimized embedded DSP code?
I need to reduce memory usage by at least 10 percent.
In DSP applications, the requirements for precision and/or quantization of data types and saved intermediate data can often be analysed. If the minimum requirements are not whole multiples of 8 bits (i.e., value ranges that are not powers of 256), it might be possible to reformat and pack the data elements into non-byte-aligned structures or arrays to save data memory. Of course, this comes with the trade-off of a higher computational cost and code footprint for accessing said data, which may or may not be significant in your application. A small packing sketch follows below.
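As an illustration of the packing idea (written in Java here purely for brevity; the same shift-and-mask approach applies in C or C++ on an embedded target, and the 12-bit width is just an example), a buffer of unsigned 12-bit samples stored back to back needs 1.5 bytes per sample instead of 2:

```java
// Sketch: stores unsigned 12-bit samples back to back, so 1000 samples take
// 1500 bytes instead of the 2000 bytes needed for 16-bit storage.
final class Packed12BitBuffer {
    private final byte[] data;

    Packed12BitBuffer(int sampleCount) {
        data = new byte[(sampleCount * 12 + 7) / 8];   // round up to whole bytes
    }

    void set(int index, int value) {
        int bit = index * 12;                 // absolute bit offset of the sample
        int byteIdx = bit >>> 3;
        int shift = bit & 7;                  // 0 or 4 for 12-bit fields
        int v = value & 0xFFF;
        // Read the two bytes covering the field, clear the field, write it back.
        int merged = ((data[byteIdx] & 0xFF) | ((data[byteIdx + 1] & 0xFF) << 8))
                & ~(0xFFF << shift);
        merged |= v << shift;
        data[byteIdx] = (byte) merged;
        data[byteIdx + 1] = (byte) (merged >>> 8);
    }

    int get(int index) {
        int bit = index * 12;
        int byteIdx = bit >>> 3;
        int shift = bit & 7;
        int merged = (data[byteIdx] & 0xFF) | ((data[byteIdx + 1] & 0xFF) << 8);
        return (merged >>> shift) & 0xFFF;
    }
}
```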

What is the difference between the YAGO and DBpedia taxonomies?

Both are widely used to type DBpedia resources, but it seems that YAGO has many more classes or concepts organized using the rdfs:subClassOf predicate. Despite this, it is not clear whether, for example, that class hierarchy is a DAG (as in DBpedia), how many classes it contains, etc.
DBpedia is a community effort to extract structured information from Wikipedia. In this sense, both YAGO and DBpedia share the same goal of generating a structured ontology. The projects differ in their foci. In YAGO, the focus is on precision, the taxonomic structure, and the spatial and temporal dimension. For a detailed comparison of the projects, see Chapter 10.3 of our AI journal paper "YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia".
[Link: http://resources.mpi-inf.mpg.de/yago-naga/yago/publications/aij.pdf]

Neo4j partition

Is there a way to physically separate Neo4j partitions?
Meaning the following query will go to node1:
Match (a:User:Facebook)
While this query will go to another node (maybe hosted in Docker):
Match (b:User:Google)
This is the use case:
I want to store the data of several clients in Neo4j, hopefully lots of them.
Now, I'm not sure what the best design for that is, but it has to fulfill a few conditions:
No mixed data should be returned from a Cypher query (it's really hard to make sure that no developer will forget the ":Partition1" label, for example, in a Cypher query).
The performance of one client shouldn't affect another client. For example, if one client has lots of data and another client has a small amount of data, or if a "heavy" query of one client is currently running, I don't want the "light" queries of another client to suffer from slow performance.
In other words, storing everything on one node will, I think, run into scalability problems at some point in the future, when I have more clients.
By the way, is it common to have a few clusters?
Also, what's the advantage of partitioning over creating a different label for each client, for example Users_client_1, Users_client_2, etc.? (A sketch of that label-per-client approach is shown below.)
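For example, the label-per-client alternative would look roughly like this with the official Neo4j Java driver (connection details, credentials, and label names are made up):

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

// Sketch of the label-per-client idea. Note how every single query has to
// repeat the per-client label, which is exactly what is easy to forget.
public class LabelPerClient {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {

            // Client 1's users live under their own label ...
            session.run("CREATE (:Users_client_1 {name: $name})",
                    Values.parameters("name", "alice"));

            // ... and every read must remember that label too.
            for (Record r : session.run("MATCH (u:Users_client_1) RETURN u.name AS name").list()) {
                System.out.println(r.get("name").asString());
            }
        }
    }
}
```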
Short answer: no, there isn't.
Neo4j has high availability (HA) clusters where you can make a copy of your entire graph on many machines and then serve many requests against that copy quickly, but they don't partition a really huge graph so that some of it is stored here and other parts there, connected by one query mechanism.
More detailed answer: graph partitioning is a hard problem, subject to ongoing research. You can read more about it over at wikipedia, but the gist is that when you create partitions, you're splitting your graph up into multiple different locations, and then needing to deal with the complication of relationships that cross partitions. Crossing partitions is an expensive operation, so the real question when partitioning is, how do you partition such that the need to cross partitions in a query comes up as infrequently as possible?
That's a really hard question, since it depends not only on the data model but on the access patterns, which may change.
Here's how bad the situation is (quote taken from the Wikipedia article):
Typically, graph partition problems fall under the category of NP-hard
problems. Solutions to these problems are generally derived using
heuristics and approximation algorithms.[3] However, uniform graph
partitioning or a balanced graph partition problem can be shown to be
NP-complete to approximate within any finite factor.[1] Even for
special graph classes such as trees and grids, no reasonable
approximation algorithms exist,[4] unless P=NP. Grids are a
particularly interesting case since they model the graphs resulting
from Finite Element Model (FEM) simulations. When not only the number
of edges between the components is approximated, but also the sizes of
the components, it can be shown that no reasonable fully polynomial
algorithms exist for these graphs.
Not to leave you with too much doom and gloom: plenty of people have partitioned big graphs. Facebook and Twitter do it every day, so you can read about FlockDB on the Twitter side or avail yourself of the relevant Facebook research. But to summarize and cut to the chase: it depends on your data, and most people who partition design a custom partitioning strategy; it's not something software does for them.
Finally, other architectures (such as Apache Giraph) can auto-partition in some sense; if you store a graph on top of Hadoop, and Hadoop already automagically scales across a cluster, then technically this is partitioning your graph for you, automagically. Cool, right? Well... cool until you realize that you still have to execute graph traversal operations all over the place, which may perform very poorly because all of those partitions have to be traversed, which is exactly the performance situation you're usually trying to avoid by partitioning wisely in the first place.

What is the advantage of RDF and triple stores over Neo4j?

Neo4j is a really fast and scalable graph database; it seems that it can be used in business projects, and it is free, too!
At the same time, there are no RDF triple stores that work well with large data or deliver high-speed access. What is more, free RDF triple stores perform even worse.
So what is the advantage of RDF and RDF triple stores over Neo4j?
The advantage of using a triple store for RDF rather than Neo4j is that that's what triple stores are designed for. Neo4j is pretty good for many use cases, but in my experience its performance for loading and querying RDF is well below that of dedicated RDF databases.
It's a fallacy that RDF databases don't scale or are not fast. Sure, they're not yet up to the performance and scale levels of relational databases, but relational databases have had a 50-year head start. Many triple stores scale into the billions of triples, provide 'standard' enterprise features, and provide great performance for many use cases.
If you're going to use RDF for a project, use a triple store; it's going to provide the best performance and set of features/APIs for working with RDF to build your application.
RDF and SPARQL are standards, so you have a choice of multiple implementations, and can migrate your data from one RDF store to another.
Additionally, version 1.1 of the SPARQL query language is quite sophisticated (more expressive than most SQL implementations) and can do all kinds of queries that would require a lot of code to be written in Neo4j.
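For instance, here is a sketch using Apache Jena's ARQ API (the data file and vocabulary are placeholders) of a SPARQL 1.1 property path query that follows a chain of foaf:knows edges of arbitrary length in a single declarative pattern:

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.query.ResultSetFormatter;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

// Sketch: a SPARQL 1.1 property-path query with Apache Jena.
// "data.ttl" and the starting resource are placeholders for your own data.
public class PropertyPathExample {
    public static void main(String[] args) {
        Model model = RDFDataMgr.loadModel("data.ttl");
        // foaf:knows+ follows "knows" edges of any length (>= 1), i.e. a
        // transitive-closure style traversal expressed declaratively.
        String query =
            "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
            "SELECT ?person WHERE { " +
            "  <http://example.org/alice> foaf:knows+ ?person . " +
            "}";
        try (QueryExecution qe = QueryExecutionFactory.create(query, model)) {
            ResultSet results = qe.execSelect();
            ResultSetFormatter.out(System.out, results);
        }
    }
}
```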
If you are going for graph mining (e.g., graph traversal) over triples, Neo4j is a good choice. For loading large numbers of triples, you might want to use its BatchInserter, which is fairly fast (a rough sketch follows below).
So I think it's all about your use case. Both technologies can and do overlap.
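For illustration, a rough sketch of the BatchInserter approach (package and method names follow the Neo4j 3.x embedded API and differ in other versions; the store path and vocabulary are made up):

```java
import java.io.File;
import java.io.IOException;
import java.util.Map;

import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

// Sketch of bulk-loading triple-like data with the BatchInserter:
// each RDF resource becomes a node, each triple a relationship.
public class TripleBatchLoad {
    public static void main(String[] args) throws IOException {
        BatchInserter inserter = BatchInserters.inserter(new File("target/graph.db"));
        try {
            Label resource = Label.label("Resource");
            RelationshipType knows = RelationshipType.withName("knows");

            long alice = inserter.createNode(Map.of("uri", "http://example.org/alice"), resource);
            long bob   = inserter.createNode(Map.of("uri", "http://example.org/bob"), resource);
            inserter.createRelationship(alice, bob, knows, Map.of());
        } finally {
            inserter.shutdown();   // flushes everything to the store
        }
    }
}
```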
In my mind, it's mostly about the use case. Do you want a full knowledge graph including all the ecosystems from the semantic web? Then go for the triple store.
If you need a general-purpose graph (e.g., to store big data as a graph), use the property graph model. My reasoning is that the underlying philosophies are very different, and this starts with how the data is stored, which has implications for your usage scenario.
Let's do some off-the-top-of-my-head bullet points here to compare. Please take them with a grain of salt, as this is not a benchmark paper, just an experience-based five-minute write-up.
Property graph (Neo4j):
Think of nodes/edges as documents.
Implemented on top of, e.g., linked lists and key-value stores (deep searches, large data, e.g. via Gremlin).
Support for OWL/RDF, but not natively (as I see it, it sits on a meta layer).
Really great when it comes to having the data in the graph and doing ML (it stores it as linked lists, which gives you nice vectors, which is cool for ML out of the box).
Made for large data at scale.
Use Cases: (focus is on the data entities and not their classes)
Social Graphs and other scenarios where you need deep traversal
Large data graphs, where you have a lot of documents that need to be searched in a schema-free, graph-like manner.
Analyzing customer funnels from click data, etc. You want to move out of your relational schema because, actually, you are in a graph use case...
Triple store (e.g., RDF4J):
Think of data in maximum normal form as triples (no redundant data at all).
Triples are stored with their context (named graph); the store works heavily with indexes.
Good for broad searches and specific knowledge extraction. Deep searches are sometimes cumbersome.
Scale is impressive, and triple stores can scale to trillions of nodes with fast performance. But I would not recommend storing big data (e.g., time series) in the graph. The reason is the special way indexes are used; in order to scale horizontally, you may have to consider working with subgraphs...
Support for all the ecosystem standards like SPARQL, SHACL, SWRL, etc.; this is a big plus in case you need them.
Use cases:
It's really about knowledge graphs. Do you need shape testing, rule evaluation, inference, and reasoning? Go for it because you have to focus on the ontology and class structure!
Also, e.g., you have IoT and want to model relations for logistics and a smart factory, while the telemetry is stored somewhere else and only referenced in the graph.
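To give a flavor of the triple-centric style described above, here is a minimal RDF4J sketch (in-memory store; the example IRIs are made up):

```java
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.memory.MemoryStore;

// Minimal RDF4J sketch: everything is a triple, and querying goes through SPARQL.
public class Rdf4jExample {
    public static void main(String[] args) {
        Repository repo = new SailRepository(new MemoryStore());
        repo.init();
        ValueFactory vf = SimpleValueFactory.getInstance();
        IRI alice = vf.createIRI("http://example.org/alice");
        IRI knows = vf.createIRI("http://example.org/knows");
        IRI bob   = vf.createIRI("http://example.org/bob");

        try (RepositoryConnection conn = repo.getConnection()) {
            conn.add(alice, knows, bob);   // one statement = one triple
            try (TupleQueryResult result = conn
                    .prepareTupleQuery("SELECT ?who WHERE { ?who <http://example.org/knows> ?other }")
                    .evaluate()) {
                while (result.hasNext()) {
                    System.out.println(result.next().getValue("who"));
                }
            }
        } finally {
            repo.shutDown();
        }
    }
}
```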
I have heard rumors that it takes a whole day to load 10M triples into Neo4j (it is actually the slowest one, because it's not built primarily for RDF).
Sesame and 4store are the fastest ones, but Jena has a powerful API.
