Is rdflib's ConjunctiveGraph with IOMemory store thread-safe?

I am pulling multiple datasets from multiple sources and want to parallelize that using the threading library in Python 3.8.2.
All datasets end in different contexts in a ConjunctiveGraph, which uses an IOMemory store.
I saw that a ConcurrentStore exists, but there is no example, it is barely documented, and it does not implement the Store interface, which makes it unusable as-is.
While reading through the code of IOMemory, I saw places where the existence of elements in collections is checked and then things are added, with no locking code around them that I could see.
So is IOMemory already thread-safe, or do I need to 'fix' ConcurrentStore?

Is the answer perhaps dependent on whether the underlying Python objects are themselves thread-safe? Python's set() underlies the rdflib Graph.
You could also perhaps create independent Graphs and then merge them all into a ConjunctiveGraph or a Dataset once they have each been fully populated. That may bypass the issue.
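A minimal sketch of that merge-afterwards approach (the context names, URLs and Turtle format below are placeholders, not part of your setup):

```python
# Each worker thread parses into its own private Graph; the merge into the
# shared ConjunctiveGraph happens single-threaded afterwards, so the IOMemory
# store is never touched concurrently.
from concurrent.futures import ThreadPoolExecutor
from rdflib import ConjunctiveGraph, Graph, URIRef

SOURCES = {  # placeholder contexts and source URLs
    URIRef("urn:ctx:one"): "http://example.org/data1.ttl",
    URIRef("urn:ctx:two"): "http://example.org/data2.ttl",
}

def fetch(url):
    g = Graph()                       # private to one thread
    g.parse(url, format="turtle")
    return g

with ThreadPoolExecutor() as pool:
    futures = {ctx: pool.submit(fetch, url) for ctx, url in SOURCES.items()}

cg = ConjunctiveGraph()               # IOMemory-backed by default
for ctx, fut in futures.items():
    target = cg.get_context(ctx)
    for triple in fut.result():
        target.add(triple)            # single-threaded, no locking needed
```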
Any work on ConcurrentStore would be appreciated though!

Related

Where is the pydata BLAZE project heading?

I find the blaze ecosystem* amazing because it covers most of the data engineering use cases. There was definitely a lot of interest in these projects during 2015-2016, but they have since been neglected; I say this looking at the commits on the GitHub repos.
So my questions to the community are:
- What happened in 2016 that resulted in lost interest?
- Are there other python based libraries that have replaced blaze?
* blaze ecosystem:
- Blaze: An interface to query data on different storage systems
- Dask: Parallel computing through task scheduling and blocked algorithms
- Datashape: A data description language
- DyND: A C++ library for dynamic, multidimensional arrays
- Odo: Data migration between different storage systems
references:
http://blaze.pydata.org/
I can give some part of the picture, although others were more involved.
Blaze was both an umbrella project for incubating data-engineering ideas into released oss packages, and a package itself focussing on symbolic manipulations of data-frames and translating these into various backend execution engines, particularly database services. Critically, Blaze wanted to be the (start of a) solution for a very broad range of problems! In particular, the translation layer became very large and hard to maintain, and by trying to cater to everything it limited the range of operations that the symbolic layer could offer.
In terms of an umbrella project, Blaze was a success. Many ideas that started in Blaze percolated into the ecosystem. Probably the most prominent single project to come out of Blaze is Dask, which, while originally planned as an execution layer for Blaze, implements an even larger API of data-frame operations, as well as other high-level collections and arbitrary graph manipulation. Even fully symbolic optimisations exist in Dask, though this is perhaps not as complete. Other Anaconda-stable projects such as numba and bokeh were influenced by the Blaze effort, but I'll not talk about them here.
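To give a flavour of what that looks like today, here is a tiny Dask sketch (the file pattern and column names are made up):

```python
# Parallel, out-of-core dataframe work with Dask; the API looks like pandas,
# but the computation is broken into tasks over chunks of the CSV files.
import dask.dataframe as dd

df = dd.read_csv("events-*.csv")                      # placeholder file pattern
result = df.groupby("user").amount.mean().compute()   # executes the task graph
print(result.head())
```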
As far as datashape/dynd go, this is a somewhat crowded space with many other related projects (xnd, uarray, etc.) and ideas that can be loosely thought of as "numpy 2" (i.e., a more comprehensive, flexible representation of complex data layouts and their description). This has not really been adopted by the community yet; almost everything uses numpy's type system (a notable exception being what arrow does internally).
Finally, for data formats and Odo, I encourage you to consider Intake, which can be seen as a successor; it can offer much more functionality, such as data source cataloging, and it does this by limiting the scope of operation to the read side. The big web of interactions that is Odo was also a many-to-many problem that became hard to maintain, and by keeping things simpler, Intake is hoping to become the de-facto layer over data loading libraries and the main way to describe the location, description and parametrisation of data. Odo is not dead, though, so if file conversion is exactly what you need, you can still use it.
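A rough sketch of the Intake idea (the catalog file and source name here are invented for illustration):

```python
# Describe data sources once in a YAML catalog, then load them read-only.
import intake

cat = intake.open_catalog("catalog.yml")   # placeholder catalog file
df = cat.user_events.read()                # hypothetical source entry -> dataframe

# Or ad hoc, without a catalog:
df2 = intake.open_csv("data_*.csv").read()
```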
I was looking for a project similar to odo for loading csv data to various sources. An odo issue (https://github.com/blaze/odo/issues/614) recommended d6tstack, which appears to be currently maintained.
In practice, it is often just as easy to roll your own csv loader, in which case the tableschema project is very handy. It automates inferring datatypes from csv files.
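For instance, a minimal sketch with tableschema (the CSV path is a placeholder):

```python
# Infer a schema (field names and types) by sampling rows from a CSV.
from tableschema import Table

table = Table("data.csv")          # placeholder local CSV
table.infer()                      # guess field types from a sample of rows
print(table.schema.descriptor)     # e.g. {'fields': [{'name': ..., 'type': ...}, ...]}

rows = table.read(keyed=True)      # rows cast to the inferred types
```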

Master-slave exposes technical debt

Using rails and postgresql.
I wrote my app without having a master-slave configuration in mind.
Now I've got master-slave set up in the app, and I'm running into some technical debt. The same process in my app writes to the db and then immediately reads from the db. The read now takes place on the read db, but the data isn't there yet. Before, this wasn't efficient, but it didn't cause any problems because both dbs were the same. Now, this is blowing up in my face.
The problem for me is that it's difficult to find all the places in the code where this problem exists. Can someone please suggest a technique to get my tests to run in such a way that reads and writes use different dbs that aren't kept in sync, so that I can figure out where my issues are?
Other solutions will also be welcomed!
I strongly recommend you rethink your master/slave configuration or whether master/slave is even right for your application.
It's not "tech debt" to build a system that assumes data written to persistent store can be read back immediately. It's normal and correct. While you might reasonably be able to avoid the pattern
write A, ..., look up A.key
with various simple cache schemes, trying to code around e.g.
write A, ..., complex query that *might* fetch A
requires you to retain a copy of A and determine whether it would satisfy the WHERE clause of the query in separate code, simply because you can't rely on the query results. Unless your system is very small and simple, trying to do this system-wide will produce a super-complex, fragile, expensive, and ugly code base. I strongly recommend you don't try it.
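To make the contrast concrete, here is a framework-agnostic sketch of the "simple cache" idea for the key-lookup case only; every name in it is illustrative, and it deliberately does nothing for complex queries:

```python
# Remember rows this process just wrote so an immediate read-back by key
# doesn't depend on the (possibly lagging) replica. Complex queries still
# go to the replica and still see stale data, which is the point above.
class ReadYourWritesCache:
    def __init__(self, primary, replica):
        self.primary = primary        # connection taking writes
        self.replica = replica        # read-only, replication-lagged copy
        self.recent = {}              # key -> row written by this process

    def write(self, key, row):
        self.primary.insert(key, row)
        self.recent[key] = row        # keep a copy for read-after-write

    def read_by_key(self, key):
        if key in self.recent:        # our own fresh write: serve from memory
            return self.recent[key]
        return self.replica.select(key)
```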
The usual purpose of a master/slave persistent store organization is to off-line read traffic that's not time-dependent on writes. For example, if your system mines data to produce summaries accessible to users, you'd offline the metric computation and have it mine the slave. This prevents mining queries from drawing resources away from user request handling. The small delay between write on master and copy to slave is no problem.
If your app is struggling because there's too much load on persistent store, you probably want partitioned data (sometimes called sharding), not master/slave. Partitioning can expose you to a different kind of problem: no cross-partition transactions. But this is usually easier to work through than what you're attempting.
After studying this area, I agree with Gene that master-slave should only be used for reads of data that was written a significant time before the read.
My ORIGINAL concept was that it's better to use a functional programming style, whereby the process retains all the information in its parameters and doesn't go back to the database. The downside of this approach is that the human mind has a hard time with functional programming, and in a massive program it makes sense not to insist on this added complication.
If you want to write a functional method or process, that is great and very efficient, but there shouldn't be anything in the code that insists on it.

How to use neo4j effectively for serious, repeatable analysis over time

New to Neo4j and love the browser for exploratory work. But I'm unsure of how to best use it to achieve, for lack of a better term, real work. Consider a sample project involving:
- Importing 4 different CSV files
- Creating appropriate relationships between nodes
- Doing a variety of complex queries to derive data that I'll export for statistical analysis using another program.
I need to be able to replicate the project in the future, as well as adding new data, calculating different derived data, etc. I also need to be able to share the code so others can extend/verify it.
For non-relational data, I'd use something like R, Stata or SAS. While each allows interactive exploration like the Neo4j browser, I'd never use that for serious analysis. Instead, I'd save a file or files of commands that I could modify and rerun whenever I needed to.
Neo4j's browser doesn't seem to support any of this functionality. Unless I am missing something, it doesn't even allow one to save a "session" along the lines of an IPython/Jupyter notebook. I know that there is a neo4j-shell, but especially since they have dropped it from the standard desktop installation (and gotten rid of the console), I feel like I must be doing something wrong--or at least contrary to the designers' intent--if I can't do serious work in the browser. Clearly, lots of people are.
Can anyone point me in the right direction? How does one best develop an extensive, replicable project over time with neo4j? Thank you.
You can take your pick of several officially-supported language drivers to integrate Neo4j into basically any other project structure, including Jupyter notebooks. I'm not sure what exactly you mean by "serious work", or where you got the idea that people do lots of it in the browser, but you are definitely able to save the results of a query from the browser in a variety of formats (pictures of the bubbles, result rows in a CSV, JSON response) if you prefer to work that way, or you can pipe data very efficiently into another language and manage it there. I don't see why they would re-create presentation and/or project management tools when there are already so many good ones out there.
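As a rough illustration of the driver route (the connection details, file names and Cypher here are placeholders, not a recommended pipeline):

```python
# A scripted, re-runnable workflow with the official Python driver ("neo4j"
# package): import CSVs, derive data with Cypher, export for analysis elsewhere.
import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD = """
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
MERGE (p:Person {id: row.id}) SET p.name = row.name
"""
DERIVE = "MATCH (p:Person)-[:KNOWS]->(q) RETURN p.name AS name, count(q) AS degree"

with driver.session() as session:
    session.run(LOAD)                       # step 1: repeatable import
    records = session.run(DERIVE).data()    # step 2: derived data as dicts

with open("degrees.csv", "w", newline="") as f:   # step 3: export for R/Stata/SAS
    writer = csv.DictWriter(f, fieldnames=["name", "degree"])
    writer.writeheader()
    writer.writerows(records)

driver.close()
```

Keeping scripts like this under version control gives you the replicable, shareable project you would otherwise get from a do-file or notebook.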

What are the dangers in re-using variables in a data source?

Not quite sure if this is on topic, so when in doubt feel free to close.
We have a client who is missing tracking data for a large segment of his visitors in his Report Suite. However, the complete set of data is available in a data warehouse. We are now investigating whether it is possible to import it as a data source. I only have experience with enriching data via classifications; however, the goal here is to create views (sessions etc.) for a past timeframe from scratch.
According to the documentation this should be possible. However there is one caveat specifically mentioned in the FAQ:
"Adobe recommends you select new, unused variables to import data
using Data Sources. If you are uncertain about the configuration of
your data file, or want to better understand the risks of re-using
variables, contact Customer Care.“
I take that to mean that I should not import data to props, eVars, events, etc. that have been used when data has been collected via the tracker, which would pretty much defeat our purpose (basically we want to merge the data from the data warehouse with existing data). Since I have to go through some intermediaries to reach Customer Care, and this takes a long time, I wonder if somebody here can explain what the dangers of re-using variables are (and maybe even whether there is still a way to do this).
DISCLAIMER: I'm not familiar with Adobe Analytics, but the problem here is pretty universal. If someone with actual experience/knowledge specific to the product comes along, pay more attention to them than me :)
As a rule, variable reuse in any system runs the risk of data corruption. I'm not familiar with Adobe Analytics, but a brief read through some blogs implies that this is what they're worried about in terms of variable reuse: if you have a variable that is being used in one section, and you import data into it in another section that is in the same scope, you overwrite the data that the other section was using.
Now, that same blog states that provided you have your data structure set up in a specific way, it can allow you to reuse variables/properties without issue and in fact encourages it, hence the statement in your quote "If you are uncertain about the configuration of your data file...". They're probably warning you that if you know what you're doing and know that there won't be any overwriting, fine, go ahead and reuse, but if you don't, or you aren't sure whether something else might be using the original content, then it's unsafe.
Regarding your specific case, you want to merge the two pieces of data together, not overwrite, so reusing your existing variables would overwrite the existing data. It sounds like you will need to import into a second (new) set of variables, and then compare/merge between them within the system, rather than trying to import and merge in one go.
We have since received an answer from Adobe Customer Care.
According to them, the issue is that hits created via data imports are indistinguishable from hits created via server calls, so they cannot be removed or corrected after they have been imported. They recommend using different variables so that the original data remains recognizable and salvageable in case faulty data is imported.
This was already mentioned in the online documentation, but apparently Adobe thinks this is important enough to issue the extra warning.

Is there any free OWL reasoner that can reason without loading all data into memory?

I use Jena and TDB to store RDF, and I want to do some inference on it. But the RDF data is big, and Jena's OWL reasoner has to load all the data into memory.
So I want to find a reasoner that can reason without loading all the data into memory. Is there any?
Not really. DL reasoning is computationally difficult even at low scale. With lots of data, that's just not going to work with the existing approaches. Doing it over secondary storage is still an open research problem afaik.
However, the various profiles of OWL exist to address this issue. They all have different computational complexities, which are all 'easier' than DL, making them much more amenable to reasoning at scale. In particular, QL is designed for query-time reasoning, which in my experience tends to have a very small memory footprint, and RL can be implemented with a standard rule reasoner.
So if you don't need to use DL, then I'd go with a tool that supports one of the profiles and you should get pretty good mileage out of that.
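Purely as an illustration of the profile idea (this is Python with rdflib and the owlrl package rather than Jena, and the tiny example below still runs in memory; at scale the same RL rule set can be run by a rule engine over a persistent store):

```python
# Materialise the OWL 2 RL closure with a rule-based engine instead of a
# full DL reasoner. The file name is a placeholder.
from rdflib import Graph
from owlrl import DeductiveClosure, OWLRL_Semantics

g = Graph()
g.parse("data_plus_ontology.ttl", format="turtle")

DeductiveClosure(OWLRL_Semantics).expand(g)   # adds inferred triples in place
print(len(g), "triples after RL materialisation")
```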
For reference, you might find this document about the computational complexities of the various OWL dialects interesting.
If you are prepared to take a subset of OWL, there are things you can do in a stream processing fashion without loading all your RDF data in memory and which will materialise all the inferred triples.
As an example, have a look at RIOT's infer command:
http://incubator.apache.org/jena/documentation/io/riot.html#inference
Source code here:
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/tags/jena-arq-2.9.0-incubating/src/main/java/riotcmd/infer.java
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/tags/jena-arq-2.9.0-incubating/src/main/java/org/openjena/riot/pipeline/inf/InferenceProcessorRDFS.java
https://svn.apache.org/repos/asf/incubator/jena/Jena2/ARQ/tags/jena-arq-2.9.0-incubating/src/main/java/org/openjena/riot/pipeline/inf/InferenceSetupRDFS.java
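The gist of the approach, sketched very roughly in Python (only the rdfs:subClassOf rule, a naive line-based reading of N-Triples, and placeholder file names): keep the small vocabulary in memory, stream the big data file, and emit inferred triples as you go.

```python
from collections import defaultdict
from rdflib import Graph, RDF, RDFS, URIRef

# 1. Load the (small) vocabulary and index the subclass hierarchy.
vocab = Graph().parse("ontology.ttl", format="turtle")
supers = defaultdict(set)
for sub, _, sup in vocab.triples((None, RDFS.subClassOf, None)):
    supers[sub].add(sup)

def all_supers(cls, seen=None):
    """Transitive closure of rdfs:subClassOf for one class."""
    seen = set() if seen is None else seen
    for sup in supers.get(cls, ()):
        if sup not in seen:
            seen.add(sup)
            all_supers(sup, seen)
    return seen

RDF_TYPE = f"<{RDF.type}>"

# 2. Stream the (large) data file; never build a full in-memory model.
with open("data.nt") as data, open("inferred.nt", "w") as out:
    for line in data:
        out.write(line)
        parts = line.split(None, 3)
        # crude: only handles plain "<s> <p> <o> ." lines
        if len(parts) >= 3 and parts[1] == RDF_TYPE and parts[2].startswith("<"):
            subj, obj = parts[0], URIRef(parts[2].strip("<>"))
            for sup in all_supers(obj):
                out.write(f"{subj} {RDF_TYPE} <{sup}> .\n")
```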
It is trivial to take RIOT's infer and run it in parallel with something like MapReduce, example is here:
https://github.com/castagna/tdbloader4/blob/f5363fa49d16a04a362898c1a5084ade620ee81b/src/main/java/org/apache/jena/tdbloader4/InferDriver.java
Another different approach which uses MapReduce to apply the RDFS and OWL ter Horst rules and materialize all the derived statements is here:
http://www.few.vu.nl/~jui200/webpie.html
Perhaps, you can look at the parts of OWL you are interested in and check if you can do it in a streaming fashion. If so, you can take RIOT's infer and extend it, adding the parts of OWL you are interested in. That would be a nice contribution to Apache Jena (get back in touch on the jena-dev mailing list if you want to do that).
WebPIE is a clever and interesting project, but as you can see, a bit more complex and it's a research project (with all that this implies from a long-term support and maintenance point of view). However, if it is OWL ter Horst you want/need, WebPIE would do.
You could even put in the effort, fork WebPIE, and contribute it to an open source project, if others are interested in using it.
You might be interested to look also at Ymris (but this is currently sleeping... zzzzz):
https://svn.apache.org/repos/asf/incubator/jena/Import/Jena-SVN/Ymris/trunk/
You may want to try GRAKN.AI, they perform reasoning in real time on persisted data in distributed systems.

Resources