
Can Apache Jena support soft deletion?

If what you mean by soft deletion is that triples remain in the model but are not returned by API calls or SPARQL queries, then no, there is no support for that in Jena. What I would do in this circumstance is keep a separate model to store the "deleted" triples in, so that you can add them back during an 'undelete' operation. The only thing to be careful of in this context is blank nodes (b-nodes). Also, note that the only thing you can delete from a Jena model is a triple: if you are thinking in terms of the resources in your model, then to delete a resource you need to remove all of the triples that mention it.
If that's not what you mean by soft deletion, you'll need to say more.

Related

How to update parsed data in rdflib ConjunctiveGraphs?

I am merging RDF data from various remote sources using ConjunctiveGraph.parse().
Now I would like to have a way to update the data of individual sources, without affecting the other ones and the relations between them.
The triples from the various sources might overlap, so it has to be ensured that only the triples coming from a specific source get deleted before the update.
Each graph in a ConjunctiveGraph has its own ID - whether you explicitly set it or not. Just update the particular graph you want and export them individually.
If you want to do something more complex than this, such as keeping track of where the new data you've created came from (perhaps in the default graph, the unnamed graph you get automatically), you're going to need some other method of tracking triples. Look up "reification" for how to annotate triples with more information.

nosql dynamic fields in rails

I'm writing a Rails application that has a user document with about 20 different attributes. Each time an attribute is updated, I need to store the change in a transactions document recording who made the change, which attribute was changed, and the old and new values.
Does it make sense to have a separate document to store transactions? Or should I use a NoSQL DB like CouchDB, which supports versioning by default, so I don't have to worry about creating a transactions document at all?
If I do decide to create a transaction document, then the key of the document will be dynamic.
When I need to pull history, can I pull out all versions of a document and figure out the changes dynamically?
I would not store all transactions for a given user in a single document. The document will become very large and may start to take up a lot of memory when you have to bring it into memory. It might also be difficult to query individual transactions (i.e. find all transactions for a given user that modified the name attribute).
The versioning in CouchDB and similar NoSQL databases is a little misleading, and I fell into the same trap you just did. Versioning here simply means optimistic concurrency: if you want to update a document, you have to supply the version number you read it with, to be sure nothing gets overwritten. So if you fetch a document and someone else changes it in the meantime, your version number is out of date and you need to read it again and re-apply your changes before submitting it to the database. Some NoSQL stores let you ignore this versioning, while others (like CouchDB) enforce it.
Back to the topic: versioning won't do what you want. What you are really looking for is a log store that is written often and read seldom (I assume you won't read the history that often). Cassandra is a good fit if you need high write throughput, but any other NoSQL or SQL database can do the job as well, depending on your performance requirements.
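To make the read-modify-write cycle behind that optimistic concurrency concrete, here is a rough C# sketch; the tiny in-memory store below is purely hypothetical and only mimics the revision check a CouchDB-style database performs on every update (it is not CouchDB's actual API):
using System;
using System.Collections.Generic;
// Purely hypothetical in-memory store; it only mimics the revision check a
// CouchDB-style database performs on every update.
class VersionedStore
{
    private readonly Dictionary<string, KeyValuePair<int, string>> docs =
        new Dictionary<string, KeyValuePair<int, string>>();
    // Returns the current (version, body) pair; version 0 means "not stored yet".
    public KeyValuePair<int, string> Get(string id)
    {
        KeyValuePair<int, string> doc;
        return docs.TryGetValue(id, out doc) ? doc : new KeyValuePair<int, string>(0, null);
    }
    // The write only succeeds if the caller presents the version it last read.
    public bool TryPut(string id, string newBody, int expectedVersion)
    {
        if (Get(id).Key != expectedVersion)
            return false;                              // someone else updated the document first
        docs[id] = new KeyValuePair<int, string>(expectedVersion + 1, newBody);
        return true;
    }
}
class Program
{
    static void Main()
    {
        var store = new VersionedStore();
        // Read-modify-write loop: retry with a fresh read whenever the version
        // we submitted turns out to be stale.
        bool saved = false;
        while (!saved)
        {
            var current = store.Get("user:42");
            saved = store.TryPut("user:42", "{\"name\":\"new value\"}", current.Key);
        }
        Console.WriteLine("saved");
    }
}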

Entity, dealing with a large number of records (> 35 million)

We have a rather large set of related tables with over 35 million related records each. I need to create a couple of WCF methods that query the database with some parameters (data ranges, type codes, etc.) and return the related result sets (from 10 to 10,000 records).
The company is standardized on EF 4.0 but is open to 4.X. I might be able to make an argument to migrate to 5.0, but that's less likely.
What's the best approach for dealing with such a large number of records using Entity Framework? Should I create a set of stored procs and call them from Entity, or is there something I can do within Entity itself?
I do not have any control over the databases so I cannot split the tables or create some materialized views or partitioned tables.
Any input/idea/suggestion is greatly appreciated.
At my work I faced a similar situation. We had a database with many tables, most of which contained around 7-10 million records each. We used Entity Framework to display the data, but the pages displayed very slowly (90 to 100 seconds), and even sorting the grid took time. I was given the task of seeing whether this could be optimized, and after profiling it (with ANTS Profiler) I got it under 7 seconds.
So the answer is yes, Entity Framework can handle millions of records, but some care must be taken:
Understand that the call to the database is made only when the actual records are required; all the preceding operations just build up the query (SQL). So try to fetch only a slice of the data rather than requesting a large number of records, and trim the fetch size as much as possible.
Yes, you not only should but must use stored procedures: import them into your model and create function imports for them. You can also call them directly via ExecuteStoreCommand() and ExecuteStoreQuery<>(). The same goes for functions and views, although EF has a really odd way of calling scalar functions ("SELECT dbo.blah(@id)").
EF is slower when it has to populate an entity with a deep hierarchy, so be extremely careful with deeply hierarchical entities.
When you are requesting records that you will not modify, tell EF not to watch for property changes (AutoDetectChanges); record retrieval is much faster that way.
Database indexing is always good, but with EF it becomes very important: the columns you use for retrieval and sorting should be properly indexed.
When your model is large, the VS2010/VS2012 model designer becomes very hard to work with, so break your model into medium-sized models. Be aware of the limitation that entities from different models cannot be shared, even though they may point to the same table in the database.
When you have to change the same entity in different places, pass the same entity around and send the changes only once, rather than having each place fetch a fresh copy, make its changes, and store it (a real performance-gain tip).
When you need the information from only one or two columns, try not to fetch the full entity: either execute your SQL directly or project into a small "mini entity" (see the sketch after this list). You may also want to cache some frequently used data in your application.
Transactions are slow, so be careful with them.
If you keep these things in mind, EF should give you performance close to that of plain ADO.NET, if not identical.
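To make the "trim the fetch" and stored-procedure tips concrete, here is the minimal sketch referenced in the list above. It assumes a generated EF 4 ObjectContext called MyEntities with a Users entity set and a dbo.GetUserSummaries procedure; those names, and the UserSummary class, are assumptions made for the example.
using System.Data.SqlClient;    // SqlParameter
using System.Linq;
// Lightweight, non-entity class that carries only the two columns we need.
public class UserSummary
{
    public int Id { get; set; }
    public string Name { get; set; }
}
public static class LargeTableQueries
{
    public static void Run()
    {
        // MyEntities, Users, IsActive and dbo.GetUserSummaries are assumed
        // names for the generated context, entity set and stored procedure.
        using (var context = new MyEntities())
        {
            // Trim the fetch: project only the columns you need and cap the
            // result size instead of materializing full entities.
            var summaries = context.Users
                .Where(u => u.IsActive)
                .OrderBy(u => u.Id)
                .Select(u => new UserSummary { Id = u.Id, Name = u.Name })
                .Take(10000)
                .ToList();
            // Or push the heavy lifting into a stored procedure and map the
            // result set onto the same lightweight type.
            var fromProc = context.ExecuteStoreQuery<UserSummary>(
                "EXEC dbo.GetUserSummaries @typeCode",
                new SqlParameter("@typeCode", 42)).ToList();
        }
    }
}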
My experience with EF 4.1, code first: if you only need to read the records (i.e. you won't write them back), you will gain a performance boost by turning off change tracking for your context:
yourDbContext.Configuration.AutoDetectChangesEnabled = false;
Do this before loading any entities. If you need to update the loaded records, you can always call
yourDbContext.ChangeTracker.DetectChanges();
before calling SaveChanges().
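Putting those two snippets together, the flow looks roughly like this (MyDbContext, Orders and the property names are placeholders, and the usual using System.Linq; is assumed):
using (var db = new MyDbContext())   // MyDbContext and Orders are placeholder names
{
    // Skip the automatic DetectChanges scans while loading many rows read-only.
    db.Configuration.AutoDetectChangesEnabled = false;
    var orders = db.Orders.Where(o => o.TypeCode == 7).ToList();
    // If you do end up modifying the loaded entities, detect the changes
    // explicitly before saving.
    orders.First().Comment = "reviewed";
    db.ChangeTracker.DetectChanges();
    db.SaveChanges();
}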
The moment I hear statements like "the company is standardized on EF4 or EF5, or whatever", cold shivers run down my spine.
It is the equivalent of a car rental saying "We have standardized on a single car model for our entire fleet".
Or a carpenter saying "I have standardized on chisels as my entire toolkit. I will not have saws, drills etc..."
There is something called the right tool for the right job.
This statement only highlights that the person in charge of making key software architecture decisions has no clue about software architecture.
If you are dealing with over 100K records and the data models are complex (i.e. non-trivial), maybe EF6 is not the best option.
EF6 is based on the concepts of dynamic reflection and has similar design patterns to Castle Project Active Record.
Do you need to load all 100K records into memory and perform operations on them? If yes, ask yourself whether you really need to, and whether executing a stored procedure across the 100K records wouldn't achieve the same thing. Do some analysis and see what the actual data usage pattern is. Maybe the user performs a search that returns 100K records but only navigates through the first 200. Think of a Google search: hardly anyone goes past page 3 of the millions of results.
If the answer is still yes, that you need to load all 100K records into memory and operate on them, then maybe you should consider something else, like a custom-built write-through cache with lightweight objects, perhaps lazily loading pointers for nested objects, and so on. One place I use something like this is large product catalogs for e-commerce sites, where very large numbers of searches get executed against the catalog. The reason is to provide custom behavior such as early-exit search, regex wildcard search using pre-compiled regexes, or custom hashtable indexes into the product catalog.
There is no one-size-fits-all answer to this question. It all depends on the data usage scenarios and how the application works with the data. Consider gorilla vs. shark: who would win? It all depends on the environment and the context.
Maybe EF6 is perfect for one piece that would benefit from dynamic reflection, while NetTiers is better for another that needs static reflection and an extensible ORM, and low-level ADO is perhaps best for the extreme high-performance pieces.

How to start building a searchable garbage collector in Delphi (2009-2010)

I'm looking for a way to control all business objects I create in my applications written in Delphi.
As an article on Embarcadero's EDN (http://edn.embarcadero.com/article/28217) states, there are basically three ways to do this. I'm mostly interested in the last one, using interfaces. That way, when a business object is no longer referenced anywhere in the application, it will be disposed of, memory-wise (I'll come back to this part later).
When creating a new business object, it would be wise to ask that object manager whether I have already fetched it earlier in the program, thus avoiding the need to refetch it from the database. I already have the business object in memory, so why not use that one? So I'll need the list of objects in memory to be searchable (and fast).
The code provided there uses an "array of TObject" to store the collected objects, which doesn't make searching through the list very performant once you reach a certain number of objects. I would have to change that to either a TObjectList or some sort of binary search tree. What would be the best choice here? I already found some useful code (I think) at http://www.ibrtses.com/delphi/binarytree.html. Didn't the JCL have stuff on binary trees?
How would I handle "business objects" and "lists of business objects" in that tree? Would a business object that is part of a list be referenced twice in the tree?
Concerning the disposal of an object: I also want to set some sort of TTL (time to live) on each business object, forcing a refetch after a certain amount of time.
Should the reference counter fall to 0, I still want to keep the object around for a certain amount of time, in case the program wants it again within the TTL. That means I'll need some sort of threaded monitor looping over the object list (or tree) to watch for objects that are due to be deleted.
I also came across the Boehm Garbage Collector DLL (http://codecentral.embarcadero.com/Download.aspx?id=21646).
So in short, would it be wise to base my "object manager" on the source code provided in the EDN article? What kind of list should I store my objects in? How should I handle lists of objects within my list? And should I still keep my objects in memory for a while and have them disposed of by a threaded monitor?
Am I correct in my reasoning? Any suggestions, ideas or remarks before I start coding? Maybe some new ideas to incorporate into my code?
Btw, I'd be happy to share the result, for others to benefit, once some brilliant minds gave it a thought.
Thanks.
If you are using Interfaces for reference counting, and then stick those in a collection of some sort, then you will always have a reference to them. If your objective is "garbage collection" then you only need one or the other, but you can certainly use both if necessary.
What it sounds like you really want is a business object cache. For that you will want to use one of the new generic TDictionary collections. What you might want to do is have a TDictionary of TDictionary collections, one TDictionary for each of your object types. You could key your main TDictionary on an enumeration, or maybe even on the type of the object itself (I haven't tried that, but it might work.) If you are using GUIDs for your unique identifiers then you can put them all in a single TDictionary.
Implement each of your business objects with an interface. You don't need to use Smart Pointers since you are designing your business objects and can descend them from TInterfacedObject. Then only reference it by its interface, so it can be reference counted.
If you want to expire your cache then you will need some sort of timestamp on your objects that gets updated each time an object is retrieved from the cache. Then when the cache grows over some specific size you can prune everything older than a certain timestamp. Of course that requires walking the entire cache.
Since you are combining interfaces and a collection, if you have a reference to an object (via its interface) and it gets pruned during cache cleanup, the object will remain alive until that reference goes away. This gives you additional safety. Of course, if you are still using the reference, that means you have kept it for a long time without retrieving it from the cache; in that case you may want to update the timestamp when you read or write the properties too. A lot of that depends on how you will be using the business objects.
As far as refetching goes, you only want to do that if an object retrieved from the cache is older than the refetch limit. That way, if it gets pruned before you use it again, you are not wasting database trips.
You might consider just having a last modified time in each table. Then when you get an object from the cache you just check the time in memory against the time in the database. If the object has been changed since it was last retrieved, you can update it.
I would limit updating objects to when they are being retrieved from the cache. That way you are less likely to modify an object while it is in use. If an object changes while you are in the middle of reading data from it, that can produce some really odd behavior. There are a few ways to handle that, depending on how you use things.
Word of warning about using interfaces, you should not have both object and interfaces references to the same object. Doing so can cause trouble with the reference counting and result in objects being freed while you still have an object reference.
I am sure there will be some feedback on this, so pick what sounds like the best solution for you.
Of course now that I have written all of this I will suggest you look at some sort of business object framework out there. RemObjects has a nice framework, and I am sure there are others.
You might want to start by looking at smart pointers. Barry Kelly has an implementation for D2009.
For your business objects, I would use a GUID as the key field, or an integer that is unique across the database. As you load objects into memory, you can store them in a dictionary using the GUID as the key and a container object as the value.
The container object holds the business object, the TTL, and so on.
Before loading an object, check the dictionary to see if it is already there. If it is, check the TTL and either use it, or reload it and store it again.
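The question is about Delphi, but as a rough sketch of that "check the dictionary, honour the TTL, otherwise reload" flow, something along these lines would do (written in C# purely for illustration; every name in it is made up for the example):
using System;
using System.Collections.Generic;
// Container that pairs the cached business object with the time it was loaded.
class CacheEntry<T>
{
    public T Value;
    public DateTime LoadedAtUtc;
}
class BusinessObjectCache<T>
{
    private readonly Dictionary<Guid, CacheEntry<T>> entries = new Dictionary<Guid, CacheEntry<T>>();
    private readonly TimeSpan ttl;
    private readonly Func<Guid, T> loadFromDatabase;   // callback that refetches on a miss or expiry
    public BusinessObjectCache(TimeSpan ttl, Func<Guid, T> loadFromDatabase)
    {
        this.ttl = ttl;
        this.loadFromDatabase = loadFromDatabase;
    }
    public T Get(Guid id)
    {
        CacheEntry<T> entry;
        if (entries.TryGetValue(id, out entry) && DateTime.UtcNow - entry.LoadedAtUtc < ttl)
            return entry.Value;                        // still fresh: reuse the in-memory object
        var value = loadFromDatabase(id);              // missing or expired: refetch and cache it
        entries[id] = new CacheEntry<T> { Value = value, LoadedAtUtc = DateTime.UtcNow };
        return value;
    }
}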
For very fast "by name" lookup in your container object, I suggest you look not to trees, but to hash tables. The EZ-DSL (Easy Data Structures) library for Delphi includes the EHash.pas unit, which I use for my hashing containers. If you are interested in that, I will forward it to you.
I tend to think of "lifetime" oriented "containers" that delete all the objects they own when they close down.
I also think you might consider making the objects themselves track their "usefulness" by having an "FLastUsed: Cardinal" data field. You can assign FLastUsed := GetTickCount and then have the system obey the limits you set up: maximum memory or instances actively in use, and how old an object may be (not used at all in X milliseconds) before it gets "demoted" to level 2, level 3, and so on.
I am thinking that for business objects you have a memory cost (keeping objects around when they are really only a cache) and a coherency constraint ("flush my changes to the database before you can destroy me") that make traditional "garbage collection" ideas necessary, but not sufficient, for the whole problem of business-object lifetime management.
For a higher level point of view, I recommend two books: "Patterns of Enterprise Application Architecture" by Martin Fowler and Eric Evans book about Domain-Driven Design, "Domain-Driven Design: Tackling Complexity in the Heart of Software" (http://dddcommunity.org/books#DDD). Martin Fowler explains all patterns for object management for business applications, including object repositories and different object / database mapping strategies. Eric Evans book "...provides a broad framework for making design decisions...".
There are already some (open source) O/R mapper libraries for Delphi with active communities; maybe their source code can also provide some technical guidelines.
You are looking for a way to build a container that can store business objects and find already-instantiated ones at runtime?
Maybe you could take a look at http://www.danieleteti.it/?p=199 (Inversion of Control and Dependency Injection patterns).

Should is_paranoid be built into Rails?

Or, put differently, is there any reason not to use it on all of my models?
Some background: is_paranoid is a gem that makes calls to destroy set a deleted_at timestamp rather than deleting the row (and calls to find exclude rows with non-null deleted_ats).
I've found this feature so useful that I'm starting to include it in every model -- hard deleting records is just too scary. Is there any reason this is a bad thing? Should this feature be turned on by default in Rails?
Ruby is not for cowards who are scared of their own code!
In most cases you really want to delete the record completely. Consider a table that contains relationships between two other models. This is an obvious case when you would not like to use deleted_at.
Another thing is that this approach to database design is kind of Ruby-ish. You will have to handle all this deleted_at business whenever you write queries against your tables that are more complex than simple finds. And you surely will, once your application's database grows large enough that you have to replace nice, shiny Ruby code with hand-tuned SQL. You may then want to discard this column, but (oops) you have already relied on deleted_at logic somewhere, so you'll have to rewrite larger pieces of your app. Gotcha.
And lastly, it actually seems natural for things to disappear upon deletion. The whole point of modelling is that the models express in machine-readable terms what is going on. By default, you delete a record and it is gone forever. The only cases where deleted_at feels natural are when a record is meant to be restored later, or when a similar record must not be confused with the original one (a Users table is the most likely place you'd want that). But in most models it's just paranoia.
What I'm trying to say is that the ability to restore deleted records should be an explicitly expressed intent, because it's not what people normally expect, and because there are cases where using it implicitly is error-prone rather than merely adding a small overhead (unlike maintaining a created_at column).
Of course, there are a number of cases where you would like to be able to revert the deletion of records (especially when accidental deletion of valuable data would be expensive). But to make use of that, you'll have to modify your application, add forms and so on, so adding one more line to your model class won't be a problem. And there are certainly other ways you could implement storing deleted data.
So IMHO it's an unnecessary feature for every model; it should be turned on only when needed, and only when this way of adding safety is applicable to the particular model. And that means not by default.
(This post was influenced by railsninja's valuable remarks.)
@Pavel Shved
I'm sorry, but what? Ruby is not for cowards scared of their code? That could be one of the most ridiculous things I have ever heard. Sure, in a join table you want to delete records, but what about the join model of a has_many :through? Maybe not.
In business applications it often makes good sense not to hard-delete things. Users make mistakes, a lot.
A lot of your response, Pavel, is kind of dribble. There is no shame in using SQL where you need to, and I'm at a loss as to how using deleted_at causes this massive refactor.
@Horace: I don't think is_paranoid should be in core; not everyone needs it. It's a fantastic gem though. I use it in my work apps and it works great. I'm also fairly certain it hasn't forced me to resort to SQL where I otherwise wouldn't need to, and I can't see a big refactor in my future due to its presence. Use it when you need it, but it should stay a gem.
