I'm trying to get my head around two very different approaches to data sharing: OData and Semantic Web/Linked Data. Is there a good comparison of the two?
As I understand it, OData combines syndication/CRUD (AtomPub), serialisation formats (XML, JSON), a data model, a query language, and some semantics/conventions governing use of those existing technologies. It's primarily intended for exposing data from one system so that others can consume it.
Linked Data is a data model, a rigorous commitment to URIs, an (optional?) serialisation format (RDF/XML), but (correct me if I'm wrong) doesn't say anything about transport, CRUD, etc. It seems intended to allow inferencing across lots of little chunks of data drawn from a wide variety of sources. (Not something of major importance to us right now - we would be synchronising large slabs of data between a small number of sources, and wanting to preserve provenance information).
I'm interested in technologies for sharing data between certain data management platforms, some of which I work on directly. OData seems more appealing as it's very straightforward to explain to developers: implement this API, follow that Atom standard, serialise the data like this. We're already doing something very similar for one platform: sharing XML-serialised data on an Atom feed, with URL parameters used to filter.
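To make the "URL parameters used to filter" idea concrete, here is a minimal sketch of how an OData-style query URL might be assembled in Python. The service root `https://example.org/odata`, the entity set `Samples` and the `Depth` property are made up for illustration; only the `$filter`, `$orderby` and `$top` query options and the `gt` operator come from OData itself.

```python
from urllib.parse import urlencode, quote


def odata_url(service_root, entity_set, filter_expr=None, orderby=None, top=None):
    """Build a query URL using the OData $filter/$orderby/$top options."""
    params = {}
    if filter_expr is not None:
        params["$filter"] = filter_expr
    if orderby is not None:
        params["$orderby"] = orderby
    if top is not None:
        params["$top"] = str(top)
    url = f"{service_root.rstrip('/')}/{entity_set}"
    if params:
        # Leave "$" readable in option names; percent-encode spaces in values.
        url += "?" + urlencode(params, safe="$", quote_via=quote)
    return url


# Hypothetical service and entity set, for illustration only.
example = odata_url("https://example.org/odata", "Samples",
                    filter_expr="Depth gt 100", top=10)
print(example)
```

A consumer would then issue a plain HTTP GET against that URL and parse the Atom or JSON payload that comes back.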
By contrast, my past experiences working with RDF have given me an impression of brittle, opaque (massive slabs of RDF/XML), inaccessible (using SPARQL vs SQL) technology - but perhaps I'm confusing the experience of working with a triplestore like Jena with simply exposing an existing database via a linked data API.
Any pointers, comments etc on the differences and similarities between these two approaches in terms of scope, technologies, ease, future potential etc would be great.
I think discussing this in depth is not really what Stack Overflow is meant for, but here are some pointers to interesting discussions of the differences and overlap:
Oh - it is data on the Web
Microsoft, OData and RDF
One of the key differences seems to be that OData has no means to link data from different sources to each other. Essentially, you're still stuck in a silo.
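As a rough illustration of what "linking data from different sources" means in the RDF world, the sketch below merges triples from two sources by following a shared-identity link. It uses plain Python tuples rather than a real RDF library, and every URI except `owl:sameAs` is invented for the example.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Triples are (subject, predicate, object); the example.* URIs are made up.
source_a = [
    ("http://a.example/site/42", "http://schema.example/name", "Lake Site"),
    ("http://a.example/site/42", SAME_AS, "http://b.example/loc/7"),
]
source_b = [
    ("http://b.example/loc/7", "http://schema.example/elevation", "310"),
]


def describe(uri, triples):
    """Collect the properties of `uri`, following owl:sameAs links one hop."""
    aliases = {uri}
    for s, p, o in triples:
        if p == SAME_AS and s == uri:
            aliases.add(o)
        if p == SAME_AS and o == uri:
            aliases.add(s)
    return {p: o for s, p, o in triples if s in aliases and p != SAME_AS}


merged = describe("http://a.example/site/42", source_a + source_b)
print(merged)
```

Because both sources identify things by URI, simply concatenating the two triple lists is enough to combine them; no shared schema or join key has to be agreed in advance.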
It might also be interesting to check out the various attempts to convert data between the two approaches. See, among others, http://answers.semanticweb.com/questions/1298/has-anyone-written-a-mapping-from-odata-to-rdf .
OData may be easier, but it's not better by any means. SPARQL and RDF (forget RDF/XML; better to look at Turtle) satisfy everything in OData while also providing many more cutting-edge features, such as:
Federation Extensions
Linked Data
Reasoning and Inference (for the more brave)
Equally, the software supporting the standards is actually quite sophisticated. Most people interested in OData generally come from a Microsoft background, so take a look at dotNetRdf
Here's a comparison matrix:
http://uoccou.wordpress.com/2011/02/17/linked-data-odata-gdata-datarss-comparison-matrix/
Unfortunately the table formatting is pretty horrible, but the content is useful.
I find the blaze ecosystem* amazing because it covers most data engineering use cases. There was definitely a lot of interest in these projects during 2015-2016, but of late they have been neglected, judging by the commits on the GitHub repos.
So my questions to the community are:
- What happened in 2016 that resulted in lost interest?
- Are there other python based libraries that have replaced blaze?
blaze ecosystem:
Blaze: An interface to query data on different storage systems
Dask: Parallel computing through task scheduling and blocked algorithms
Datashape: A data description language
DyND: A C++ library for dynamic, multidimensional arrays
Odo: Data migration between different storage systems
references:
http://blaze.pydata.org/
I can give some part of the picture, although others were more involved.
Blaze was both an umbrella project for incubating data-engineering ideas into released oss packages, and a package itself focussing on symbolic manipulations of data-frames and translating these into various backend execution engines, particularly database services. Critically, Blaze wanted to be the (start of a) solution for a very broad range of problems! In particular, the translation layer became very large and hard to maintain and by trying to cater to all, limited the range of operations that the symbolic layer could offer.
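To illustrate the general idea of a symbolic layer that is translated to different backends, here is a toy sketch. It is not Blaze's API: `Field`, `Compare`, `to_sql` and `to_python` are hypothetical names, and the expression language supports only two comparison operators.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """A symbolic reference to a column; comparisons build expressions."""
    name: str

    def __gt__(self, value):
        return Compare(self, ">", value)

    def __lt__(self, value):
        return Compare(self, "<", value)


@dataclass(frozen=True)
class Compare:
    field: Field
    op: str
    value: object


OPS = {">": operator.gt, "<": operator.lt}


def to_sql(expr, table):
    """Translate the expression into a parameterised SQL query."""
    return f"SELECT * FROM {table} WHERE {expr.field.name} {expr.op} ?", (expr.value,)


def to_python(expr):
    """Translate the same expression into a predicate over dict rows."""
    return lambda row: OPS[expr.op](row[expr.field.name], expr.value)


expr = Field("price") > 100
sql, params = to_sql(expr, "ticks")
rows = [{"price": 50}, {"price": 150}]
kept = [row for row in rows if to_python(expr)(row)]
```

Even in this toy, each new operation has to be implemented once per backend, which hints at why a translation layer covering many backends grows quickly.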
In terms of an umbrella project, Blaze was a success. Many ideas that started in Blaze percolated into the ecosystem. Probably the most prominent single project to come out of Blaze is Dask, which, while originally planned as an execution layer for Blaze, implements an even larger API of data-frame operations, as well as other high-level collections and arbitrary graph manipulation. Even fully symbolic optimisations exist in Dask, though this is perhaps not as complete. Other Anaconda-stable projects such as numba and bokeh were influenced by the Blaze effort, but I'll not talk about them here.
As far as datashape/dynd go, this is a somewhat crowded space with many other related projects (xnd, uarray, etc.) and ideas that can be loosely thought of as "numpy 2" (i.e., a more comprehensive, flexible representation of complex data layouts and their description). This has not really been adopted by the community yet; almost everything uses numpy's type system (a notable exception being what Arrow does internally).
Finally, for data formats and Odo, I encourage you to consider Intake, which can be seen as a successor. It offers much more functionality, such as data source cataloguing, and does so by limiting its scope to the read side. The big web of interactions that is Odo was also a many-to-many problem that became hard to maintain; by keeping things simpler, Intake is hoping to become the de facto layer over data loading libraries and the main way to describe the location, description and parametrisation of data. Odo is not dead, though, so if file conversion is exactly what you need, you can still use it.
I was looking for a project similar to odo for loading csv data to various sources. An odo issue (https://github.com/blaze/odo/issues/614) recommended d6tstack, which appears to be currently maintained.
In practice, it is often just as easy to roll your own csv loader, in which case the tableschema project is very handy. It automates inferring datatypes from csv files.
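For a sense of what inferring datatypes from a CSV file involves, here is a minimal standard-library sketch. It does not use or mimic the tableschema API; the function name, the type labels and the rule "pick the narrowest type that fits every non-empty cell" are assumptions made for the example.

```python
import csv
import io
from datetime import date


def _parses(convert, value):
    """Return True if `convert` accepts the string `value`."""
    try:
        convert(value)
        return True
    except ValueError:
        return False


# Candidate types from most to least specific; "string" is the fallback.
CANDIDATES = [
    ("integer", int),
    ("number", float),
    ("date", date.fromisoformat),
]


def infer_column_types(csv_text):
    """Map each column to the narrowest type that fits all non-empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if name in columns and value:  # skip empty or missing cells
                columns[name].append(value)
    result = {}
    for name, values in columns.items():
        result[name] = "string"
        for type_name, convert in CANDIDATES:
            if values and all(_parses(convert, v) for v in values):
                result[name] = type_name
                break
    return result


sample = "id,price,day,label\n1,2.5,2020-01-02,a\n2,3,2020-01-03,\n"
types = infer_column_types(sample)
```

A real loader would sample rows rather than read the whole file and would handle more formats, but the inferred mapping is already enough to generate a `CREATE TABLE` statement.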
I know that this question may not be suitable for SO, but please let this question be here for a while. Last time my question was moved to cross-validated, it froze; no more views or feedback.
I came across a question that does not make much sense to me: how can IFC models be interrogated via NLP? Consider IFC models as semantically rich structured data. IFC defines an EXPRESS-based entity-relationship model consisting of entities organized into an object-based inheritance hierarchy. Examples of entities include building elements, geometry, and basic constructs.
How could NLP be used for this type of data? I don't see how NLP is relevant at all.
In general, I would suggest that using NLP techniques to "interrogate" already (quite formally) structured data like EXPRESS would be overkill at best and a time / maintenance sinkhole at worst. In general, the strengths of NLP (human language ambiguity resolution, coreference resolution, text summarization, textual entailment, etc.) are wholly unnecessary when you already have such an unambiguous encoding as this. If anything, you could imagine translating this schema directly into a Prolog application for direct logic queries, etc. (which is quite a different direction than NLP).
I did some searches to try to find the references you may have been referring to. The only item I found was Extending Building Information Models Semiautomatically Using Semantic Natural Language Processing Techniques:
... the authors propose a new method for extending the IFC schema to incorporate CC-related information, in an objective and semiautomated manner. The method utilizes semantic natural language processing techniques and machine learning techniques to extract concepts from documents that are related to CC [compliance checking] (e.g., building codes) and match the extracted concepts to concepts in the IFC class hierarchy.
So in this example, at least, the authors are not "interrogating" the IFC schema with NLP, but rather using it to augment existing schemas with additional information extracted from human-readable text. This makes much more sense. If you want to post the actual URL or reference that contains the "NLP interrogation" phrase, I should be able to comment more specifically.
Edit:
The project grant abstract you referenced does not contain much in the way of details, but they have this sentence:
... The information embedded in the parametric 3D model is intended for facility or workplace management using appropriate software. However, this information also has the potential, when combined with IoT sensors and cognitive computing, to be utilised by healthcare professionals in Ambient Assisted Living (AAL) environments. This project will examine how as-constructed BIM models of healthcare facilities can be interrogated via natural language processing to support AAL. ...
I can only speculate on the following reason for possibly using an NLP framework for this purpose:
While BIM models include Industry Foundation Classes (IFCs) and aecXML, there are many dozens of other formats, many of them proprietary. Some are CAD-integrated and others are standalone. Rather than pay for many proprietary licenses (some of these enterprise products are quite expensive), and/or spend the time to develop proper structured query behavior for the various diverse file format specifications (which may not be publicly available in proprietary cases), the authors have chosen a more automated, general solution to extract the content they are looking for (which I assume must be textual or textual tags in nearly all cases). This would almost be akin to a search engine "scraping" websites and looking for key words or phrases and synonyms to them, etc. The upside is they don't have to explicitly code against all the different possible BIM file formats to get good coverage, nor pay out large sums of money. The downside is they open up new issues and considerations that come with NLP, including training, validation, supervision, etc. And NLP will never have the same level of accuracy you could obtain from a true structured query against a known schema.
I have to integrate various legacy applications with some newly introduced parts. They are silos of information and have been built at different times with varying architectures. At times these applications may need to get data from other systems, if it exists, and display it to the user within their own screens based on business needs.
I was looking to see if it's possible to implement a generic federation engine that abstracts the aggregation of data from various other OData endpoints and provides a single version of the truth.
A simplistic example could be as below.
I am not really looking to do an ETL here as that may introduce some data related side effects in terms of staleness etc.
Can someone share some ideas as to how this can be achieved, or point me to any article on the net that shows such a concept?
Regards
Kiran
Officially, the answer is to use either the reflection provider or a custom provider.
Support for multiple data sources (odata)
Allow me to expose entities from multiple sources
To decide between the two approaches, take a look at this article.
If you decide that you need to build a custom provider, the referenced article also contains links to a series of other articles that will help you through the learning process.
Your project seems non-trivial, so in addition I recommend looking at other resources like the WCF Data Services Toolkit to help you along.
By the way, from an architecture standpoint, I believe your idea is sound. Yes, you may have some domain logic behind OData endpoints, but I've always believed this logic should be thin as OData is primarily used as part of data access layers, much like SQL (as opposed to service layers which encapsulate more behavior in the traditional sense). Even if that thin logic requires your aggregator to get a little smart, it's likely that you'll always be able to get away with it using a custom provider.
That being said, if the aggregator itself encapsulates a lot of behavior (as opposed to simply aggregating and re-exposing raw data), you should consider using another protocol that is less data-oriented (but keep using the OData backends in that service). Since domain logic is normally heavily specific, there's very rarely a one-size-fits-all type of protocol, so you'd naturally have to design it yourself.
However, if the aggregated data is exposed mostly as-is or with essentially structural changes (little to no behavior besides assembling the raw data), I think using OData again for that central component is very appropriate.
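As a language-neutral illustration of an aggregator that merely assembles raw data, here is a small Python sketch. It is not WCF Data Services code: `federate`, `crm` and `billing` are hypothetical names, and the two functions stand in for live calls to OData endpoints, so that data is fetched at query time rather than copied by ETL.

```python
def federate(key, sources):
    """Merge records from several sources on `key`, fetching at call time.

    `sources` maps a source name to a zero-argument callable returning a
    list of dicts. Earlier sources win when two supply the same field.
    """
    merged = {}
    for source_name, fetch in sources.items():
        for record in fetch():
            entry = merged.setdefault(record[key], {"_sources": []})
            entry["_sources"].append(source_name)
            for field, value in record.items():
                entry.setdefault(field, value)
    return list(merged.values())


# Hypothetical stand-ins for two OData endpoints; a real version would
# issue HTTP GET requests here instead of returning literals.
def crm():
    return [{"CustomerId": 1, "Name": "Acme"}]


def billing():
    return [{"CustomerId": 1, "Balance": 250.0},
            {"CustomerId": 2, "Balance": 0.0}]


customers = federate("CustomerId", {"crm": crm, "billing": billing})
```

The `_sources` list records where each merged record came from, which is the kind of thin, structural logic that fits comfortably behind a data-oriented protocol.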
Obviously, and as you can see in the comments to your question, not everybody would agree with all of this -- so as always, take it with a grain of salt.
I use Jena and TDB to store RDF, and I want to do some inference on it. But the RDF data is big, and Jena's OWL reasoner has to load all the data into memory.
So I want to find a reasoner that can reason without loading all the data into memory. Is there one?
Not really. DL reasoning is computationally difficult at even low scale. With lots of data, that's just not going to work with the existing approaches. Doing it over secondary storage is still an open research problem afaik.
However, the various profiles of OWL exist to address this issue. They all have different computational complexities, which are all 'easier' than DL making them much more amenable to reasoning at scale. In particular, QL is designed for query time reasoning which in my experience tends to have a very small memory footprint and RL can be implemented with a standard rule reasoner.
So if you don't need to use DL, then I'd go with a tool that supports one of the profiles and you should get pretty good mileage out of that.
For reference, you might find this document about the computational complexities of the various OWL dialects interesting.
If you are prepared to take a subset of OWL, there are things you can do in a stream processing fashion without loading all your RDF data in memory and which will materialise all the inferred triples.
As an example, have a look at RIOT's infer command:
http://incubator.apache.org/jena/documentation/io/riot.html#inference
Source code here:
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/tags/jena-arq-2.9.0-incubating/src/main/java/riotcmd/infer.java
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/tags/jena-arq-2.9.0-incubating/src/main/java/org/openjena/riot/pipeline/inf/InferenceProcessorRDFS.java
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/tags/jena-arq-2.9.0-incubating/src/main/java/org/openjena/riot/pipeline/inf/InferenceSetupRDFS.java
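The streaming idea can be sketched in a few lines of Python. This is an illustration of the technique for one RDFS rule (type propagation along `rdfs:subClassOf`), not a description of RIOT's implementation; the function names and the prefixed-name strings are assumptions. Only the (small) schema is held in memory, while the data triples pass through one at a time.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def superclass_closure(schema_triples):
    """Map each class to all of its transitive superclasses."""
    direct = {}
    for s, p, o in schema_triples:
        if p == SUBCLASS_OF:
            direct.setdefault(s, set()).add(o)
    closure = {}
    for cls in direct:
        seen, stack = set(), list(direct[cls])
        while stack:
            parent = stack.pop()
            if parent not in seen:  # also guards against subclass cycles
                seen.add(parent)
                stack.extend(direct.get(parent, ()))
        closure[cls] = seen
    return closure


def infer_stream(data_triples, closure):
    """Yield every input triple plus the rdf:type triples it implies."""
    for s, p, o in data_triples:
        yield (s, p, o)
        if p == RDF_TYPE:
            for parent in sorted(closure.get(o, ())):
                yield (s, RDF_TYPE, parent)


schema = [("Dog", SUBCLASS_OF, "Mammal"), ("Mammal", SUBCLASS_OF, "Animal")]
data = iter([("rex", RDF_TYPE, "Dog"), ("rex", "name", "Rex")])
output = list(infer_stream(data, superclass_closure(schema)))
```

Because `infer_stream` is a generator, the data can come from a file or a database cursor of any size. The output may contain duplicates if the input already states an inferred triple.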
It is trivial to take RIOT's infer and run it in parallel with something like MapReduce; an example is here:
https://github.com/castagna/tdbloader4/blob/f5363fa49d16a04a362898c1a5084ade620ee81b/src/main/java/org/apache/jena/tdbloader4/InferDriver.java
A different approach, which uses MapReduce to apply the RDFS and OWL ter Horst rules and materialise all the derived statements, is here:
http://www.few.vu.nl/~jui200/webpie.html
Perhaps, you can look at the parts of OWL you are interested in and check if you can do it in a streaming fashion. If so, you can take RIOT's infer and extend it, adding the parts of OWL you are interested in. That would be a nice contribution to Apache Jena (get back in touch on the jena-dev mailing list if you want to do that).
WebPIE is a clever and interesting project, but as you can see, a bit more complex and it's a research project (with all that this implies from a long-term support and maintenance point of view). However, if it is OWL ter Horst you want/need, WebPIE would do.
You could even put in the effort to fork WebPIE and contribute it to an open source project, if others are interested in using it.
You might be interested to look also at Ymris (but this is currently sleeping... zzzzz):
https://svn.apache.org/repos/asf/incubator/jena/Import/Jena-SVN/Ymris/trunk/
You may want to try GRAKN.AI, they perform reasoning in real time on persisted data in distributed systems.
I'm currently learning F# and I'm exploring using it to analyse financial time-series. Can anyone recommend a good data structure to store time-series data in?
F# offers a rich selection of native types and I'm looking for some simple combination that would provide an elegant, succinct and efficient solution.
I'm looking to store tick data, which consists of millions of records, each with a time stamp and several (~5-20) fields of numerical and textual data, with possible missing values.
My first thoughts are perhaps a sequence of tuples or records, but I was wondering if someone could kindly suggest something that has worked well in the real world.
EDIT:
A few extra points for clarification:
The common operations that I'm likely to require are:
Time based lookup - i.e. find the most recent data point at a given time
Time based joins
Appends
(Updates and deletes are going to be rare. )
I should make it clear I'm exploring using F# primarily as an interactive tool for research, with the ability to compile as a (really big) added bonus.
ANOTHER EDIT:
I should also have mentioned that my role and use of F# and this data is purely within research, not development. The intention is that once we understand the data (and what we want to do with it) better, we can later specify tools for our developers to build, such as data warehouses, at which point we'd start using their data structures.
However, I am concerned that our models are computationally intensive, use a lot of memory and can't always be coded in a recursive manner, so we may end up having to query out large chunks anyway.
I should also say that I've always used Matlab or R for these sorts of tasks before but I'm now interested in F# as it offers that interactive, high level flexibility for Research but the same code can be used in production.
My apologies for not giving this context information at the start (It's my first question), I can see now that it helps people form their answers.
My thanks again to everyone that's taken the time to help me.
It really sounds like your data should be stored and queried in a relational database. (Where is it currently stored? Loading millions of records with several fields into memory must be an expensive operation, and could leave you with stale data and difficulty persisting changes.) You could then use the F# LINQ to SQL implementation (which I believe you can find in the Power Pack) to have F# expressions translated to SQL expressions.
Here's a link from Don Syme about LINQ Support in F# Power Pack: http://blogs.msdn.com/b/dsyme/archive/2009/10/23/a-quick-refresh-on-query-support-in-the-f-power-pack.aspx
The best choice of data structure depends upon what operations you want to do on it.
The simplest would be an array of structs. This has the advantages of fast random lookup, good space efficiency for an uncompressed representation and good locality. If there is sharing between substructures (like the strings) then intern them to make sure they get shared.
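As a rough sketch of the sorted-array approach, here is how the operations listed in the question (appends, time-based lookup, time-based joins) might look. It is written in Python rather than F#, and `Tick`, `TickSeries` and `asof_join` are hypothetical names; the same shape should carry over to an F# array of structs with a binary search.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Tick(NamedTuple):
    time: int             # e.g. epoch milliseconds
    price: float
    size: Optional[int]   # None marks a missing value


class TickSeries:
    """Ticks kept in time order so lookups can use binary search."""

    def __init__(self):
        self._times = []
        self._ticks = []

    def append(self, tick):
        if self._times and tick.time < self._times[-1]:
            raise ValueError("ticks must be appended in time order")
        self._times.append(tick.time)
        self._ticks.append(tick)

    def as_of(self, time):
        """Most recent tick at or before `time`, or None if there is none."""
        i = bisect_right(self._times, time)
        return self._ticks[i - 1] if i else None

    def __iter__(self):
        return iter(self._ticks)


def asof_join(left, right):
    """Pair each tick in `left` with the latest `right` tick at or before it."""
    return [(tick, right.as_of(tick.time)) for tick in left]


trades = TickSeries()
for t in (Tick(10, 1.0, 5), Tick(20, 1.1, None), Tick(30, 1.2, 7)):
    trades.append(t)

quotes = TickSeries()
quotes.append(Tick(15, 0.9, None))

joined = asof_join(trades, quotes)
```

Appends are amortised constant time and each lookup is logarithmic, which suits a workload where updates and deletes are rare.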
Alternatives might be a seq that is loaded from disk on demand, a singly-linked list that allows you to prepend elements quickly, or a balanced binary tree that allows operations like insertion at random locations efficiently.