Mahout Taste: in-memory DataModel that supports setPreference / removePreference

I could not find an in-memory DataModel that supports setPreference / removePreference / refresh. Instead, they all "recreate" a new GenericDataModel any time one adds or removes preferences. Am I missing something? Should I build my own?

You aren't missing anything; they generally don't support this operation. I suppose the overall idea is that changing data means recomputing a lot of other things and so should happen via bulk reloads of some underlying data source.
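Not Mahout's Java API, but a minimal Swift sketch of the pattern the answer describes: mutations are accepted, but all derived state is rebuilt from the full preference set, i.e. effectively a bulk reload. All type and property names here are hypothetical.

```swift
import Foundation

// Illustrative only: an in-memory model whose derived data is recomputed
// from a bulk snapshot whenever preferences change, rather than patched in place.
struct Preference {
    let userID: Int64
    let itemID: Int64
    let value: Double
}

final class RebuildingDataModel {
    private(set) var preferences: [Preference]
    private(set) var itemsPerUser: [Int64: [Int64]] = [:]   // one example of derived state

    init(preferences: [Preference]) {
        self.preferences = preferences
        rebuild()
    }

    func setPreference(_ preference: Preference) {
        preferences.append(preference)
        rebuild()   // the whole derived structure is recomputed: a "bulk reload"
    }

    private func rebuild() {
        itemsPerUser = Dictionary(grouping: preferences, by: \.userID)
            .mapValues { $0.map(\.itemID) }
    }
}
```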

Related

Syncing of memory and database objects upon changes in objects in memory

I am currently implementing a web application in .NET Core (C#) using Entity Framework. While working on the project I encountered quite a few challenges, but I will start with the ones I think are most important. My questions are as follows:
Instead of frequently loading data from the database, I keep a set of static objects that mirror the data in the database. However, it is tedious and error-prone to ensure that any changes, i.e. adding/deleting/modifying objects, are saved to the database in real time. Is there any good example or advice I can refer to in order to improve my approach?
Another thing is that the values of some objects' properties change on the fly according to the values of other objects' properties - something like a spreadsheet, where a cell's value changes automatically when the cell its formula refers to changes. I do not have a solution for this yet and would appreciate any example I can refer to. This will add another layer of complexity to syncing the in-memory objects with the database.
At the moment, I am unsure if there is any better approach. Appreciate if anyone can help. Thanks!
Basically, you're facing a problem called eventual consistency: something changes and two or more systems need to be made aware of it. The problem here is that both changes need to be applied in order to consider the operation successful. If either one fails, you need to know.
In your case, I would use the Azure Service Bus. You can create queues and put messages on a queue, and an Azure Function would handle these queue messages. You would create two queues: one for database updates, and one for the in-memory update (changing this to a cache service may be something to think of). The advantage of these queues is that you can easily drop messages on them from anywhere. Because you mentioned the object is going to evolve, you may need to update these objects either in the database or in memory (cache).
Once you've done that, I'd create a topic, with two subscriptions. One forwarding messages to Queue 1, and the other to Queue 2. This will solve your primary problem. In case an object changes, just send it to the topic. Both changes (database and memory) will be executed automagically.
The only problem you have now is that you mentioned you wanted to update the database in real time. With this scenario, you're going to have to give that up.
Also, make sure you have proper alerts in place for the queues, so that if a message is missed or a function doesn't handle it properly, you'll receive an alert and can check and correct the error.
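This is not the Azure Service Bus SDK; it's a language-agnostic sketch, written in Swift, of the topic-with-two-subscriptions fan-out described above. All type names (ObjectChanged, DatabaseUpdater, CacheUpdater, ChangeTopic) are hypothetical.

```swift
import Foundation

// The change message that would be dropped onto the topic.
struct ObjectChanged {
    let objectId: Int
    let payload: [String: String]
}

// Each subscription has its own handler, mirroring the two queues.
protocol ChangeSubscriber {
    func handle(_ message: ObjectChanged)
}

struct DatabaseUpdater: ChangeSubscriber {
    func handle(_ message: ObjectChanged) {
        // Persist the change; in the real setup failures surface via the
        // queue's dead-letter/alerting mechanism rather than being swallowed.
        print("DB: upsert object \(message.objectId)")
    }
}

struct CacheUpdater: ChangeSubscriber {
    func handle(_ message: ObjectChanged) {
        print("Cache: refresh object \(message.objectId)")
    }
}

// The "topic": publishing once fans the message out to every subscription.
struct ChangeTopic {
    let subscribers: [ChangeSubscriber]
    func publish(_ message: ObjectChanged) {
        subscribers.forEach { $0.handle(message) }
    }
}

let topic = ChangeTopic(subscribers: [DatabaseUpdater(), CacheUpdater()])
topic.publish(ObjectChanged(objectId: 42, payload: ["name": "updated"]))
```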
I totally agree with #nineedm's answer, but there are also other solutions.
If you introduce a cache, you will always face the cache invalidation problem - you have to mark the cache as invalid when the data changes. Sometimes this is easy, depending on the nature of the cached data and how often it changes.
If you have just a single application, MemoryCache can be enough with properly specified expiration options.
If there is a cluster, you have to look at distributed cache solutions, for example Redis. There is an MS article about that: Distributed caching in ASP.NET Core.
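The .NET MemoryCache and Redis clients expose expiration natively; as a rough illustration of the "in-memory cache with expiration options" idea for the single-application case, here is a minimal Swift sketch. The ExpiringCache type and its TTL-based eviction are assumptions, not any library's API.

```swift
import Foundation

// A minimal expiring in-memory cache: entries carry an expiry date and
// are lazily evicted when read after they have gone stale.
final class ExpiringCache<Key: Hashable, Value> {
    private struct Entry {
        let value: Value
        let expiresAt: Date
    }
    private var storage: [Key: Entry] = [:]
    private let lock = NSLock()

    func set(_ value: Value, for key: Key, ttl: TimeInterval) {
        lock.lock(); defer { lock.unlock() }
        storage[key] = Entry(value: value, expiresAt: Date().addingTimeInterval(ttl))
    }

    func value(for key: Key) -> Value? {
        lock.lock(); defer { lock.unlock() }
        guard let entry = storage[key] else { return nil }
        guard entry.expiresAt > Date() else {
            storage[key] = nil   // lazily evict stale entries on read
            return nil
        }
        return entry.value
    }
}

let cache = ExpiringCache<String, String>()
cache.set("cached row", for: "user:1", ttl: 60)   // valid for one minute
print(cache.value(for: "user:1") ?? "miss")
```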

Incremental Saving

I have an app with a lot of data (including NSMutableArrays, NSNumbers, and various custom classes) that currently uses the NSCoding protocol. However, I would like to implement an incremental saving system, to save time during the saving process. The loading time is not important.
Is there any existing container that checks its members for "dirty" and only updates those values when writing to file; or better yet, a protocol that can be implemented to do the same; or any other simple, available way of doing this?
For a large amount of data it's better to move the data model to Core Data. Otherwise, you may want to save changes after specific events, or - a worse solution - use an NSTimer to save all the data periodically.
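For the dirty-tracking container the question asks about, here is a rough Swift sketch of the idea: each top-level member is archived to its own file, and only the members flagged dirty are rewritten on save. The IncrementalStore name and the one-file-per-key layout are assumptions for illustration.

```swift
import Foundation

// Sketch: keep a dirty set alongside the values and only re-archive what changed.
final class IncrementalStore {
    private var values: [String: NSCoding] = [:]
    private var dirtyKeys: Set<String> = []
    private let directory: URL

    init(directory: URL) { self.directory = directory }

    func set(_ value: NSCoding, forKey key: String) {
        values[key] = value
        dirtyKeys.insert(key)          // remember that this member changed
    }

    func save() throws {
        for key in dirtyKeys {
            guard let value = values[key] else { continue }
            let data = try NSKeyedArchiver.archivedData(withRootObject: value,
                                                        requiringSecureCoding: false)
            try data.write(to: directory.appendingPathComponent(key))
        }
        dirtyKeys.removeAll()          // everything on disk is now current
    }
}
```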

Core Data vs NSFileManager

The problem:
I have been using my own cache system for some time now, built on NSFileManager. Normally the data I receive is JSON and I just save the dictionary directly into the cache (in the Documents folder). When I need it back I just go get it. Sometimes, when I feel it's better, I also keep an NSDictionary in the root folder whose keys/values map to the path of a given resource. For example:
A resource about the weather in Geneva on 17/02/2013 would have a key called GE_17_02_2013 whose value is the path to the NSDictionary with the information.
Normally I don't need to do any complex queries. But from what I have been reading, when you have a lot of data you should stick with Core Data. In my case I normally have a lot of data, but I have never actually felt the application slow down or suffer in terms of performance. So my questions are:
In this case, where I sometimes (the weather thing was just an example) need to remove all the data (a Twitter feed, for example) and replace it with a completely new stream of data, is Core Data worth it? I think removing all the data and inserting (populating) it again is heavier than just storing the NSDictionary and replacing the old one.
Sometimes it would involve images, text files, etc., and NSFileManager handles that perfectly, so what advantages could Core Data bring in these cases?
P.S.: I just saw this post, where this kind of question is asked and numbers show which one is actually faster. Still, I would also like an empirical answer.
Core Data is worth using in every scenario you described. In fact, if an app stores more than preferences, you should probably use Core Data. Here are some reasons, among which, you'll find answers to your own problems:
is definitely faster than the filesystem, even if you wipe everything out and write it again as you describe (so you don't benefit too much from caching). This is basically because you can coalesce your operations and only access the store when needed. So if you read, write, and read, you can save only once; the rest is done in memory, which is, needless to say, very fast.
everything is versioned and you can migrate from one version to another easily (while keeping the content the user has on the device)
80% of your model operations come for free. For example, when something changes, you can override the willSave managed object method and notify your controllers (see the sketch after this list).
using cascade makes it trivial to delete even very complex object structures
while it is a bad idea to keep images in the database, you can still keep them on the filesystem and have Core Data delete them automatically when the managed object that represents them is deleted
is flexible, in fact it is so flexible that you could migrate your project from using the local filesystem to using a server with very few modifications, by writing a custom data store.
the Core Data designer basically creates the model objects for you. You don't need to create your own model classes (which you would have to do if using the filesystem)
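As referenced in the list above, a minimal Swift sketch of the willSave override. The Note entity and its attributes are hypothetical; the pattern of guarding on changedValues() to avoid re-entering willSave is the standard one.

```swift
import CoreData

// "Note" is a hypothetical entity; the point is the willSave hook.
final class Note: NSManagedObject {
    @NSManaged var title: String?
    @NSManaged var modifiedAt: Date?

    override func willSave() {
        super.willSave()
        // Setting an attribute here re-invokes willSave, so only touch the
        // timestamp if it hasn't already been changed in this pass.
        if changedValues()["modifiedAt"] == nil {
            modifiedAt = Date()
        }
        // Let interested controllers know this object is about to persist.
        NotificationCenter.default.post(name: Notification.Name("NoteWillSave"),
                                        object: self)
    }
}
```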
In this case ... is Core Data worth it?
Yes, to the extent that you need something more centrally managed than trying to draw up your own file-system schema. Core Data, and its lower-level cousin SQL, are still the best choice for persistence that we have right now. Besides, the performance hit of using NSKeyed(Un)Archiver to keep serializing/deserializing a dictionary over and over again becomes increasingly noticeable with larger datasets.
I think removing all the data, and inserting (populating) it, is heavier than just storing the NSDictionary and replacing the old one.
Absolutely, yes. But you don't have to think about cache turnover like that. If you imagine Core Data as a static model, you can use a cache layer to ferry data in and out of the store. Need that resource about the weather? Check the cache layer. If it's not in there, have the cache run a fetch request. Need to turn over the whole cache? Have the cache empty itself, then run a request to mark every entity with some kind of flag showing it is invalid. The expensive deletion you're worrying about can be done by a background process once you see that all your new data has been safely stored in the cache.
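A small Swift sketch of the cache-layer flow just described: check memory first, hit Core Data only on a miss, and "turn over" the cache by dropping in-memory entries and flagging stored ones. The WeatherReport entity, its key attribute, and the isValid flag are hypothetical names.

```swift
import CoreData

final class WeatherCache {
    private var cache: [String: NSManagedObject] = [:]
    private let context: NSManagedObjectContext

    init(context: NSManagedObjectContext) { self.context = context }

    func report(forKey key: String) throws -> NSManagedObject? {
        if let cached = cache[key] { return cached }           // cache hit
        let request = NSFetchRequest<NSManagedObject>(entityName: "WeatherReport")
        request.predicate = NSPredicate(format: "key == %@", key)
        request.fetchLimit = 1
        guard let found = try context.fetch(request).first else { return nil }
        cache[key] = found                                     // cache miss: remember it
        return found
    }

    // "Turn over the whole cache": drop in-memory entries and flag stored ones.
    func invalidateAll() throws {
        cache.removeAll()
        let request = NSFetchRequest<NSManagedObject>(entityName: "WeatherReport")
        for object in try context.fetch(request) {
            object.setValue(false, forKey: "isValid")          // hypothetical flag attribute
        }
        try context.save()
    }
}
```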
Sometimes it would involve images, text files, etc., and NSFileManager does it perfectly, so what advantages could Core Data bring in these cases?
Unfortunately, not many. For blobs of data (which is essentially what these become in Core Data), storage and fetches to and from Core Data can quickly get costly. They can also take up noticeably more space on disk if they aren't compressed (which further decreases performance). If you need a faster alternative, use a store more suited to the task, like Tokyo Cabinet or LevelDB, and use the entities in the Core Data store as a kind of stand-in that would, say, contain the key to the resource in one of those key-value stores.
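Tying this to the earlier point about keeping blobs outside the store: a minimal Swift sketch where the managed object holds only a key/filename and removes the file when it is deleted, via prepareForDeletion. The ImageRecord entity and fileName attribute are hypothetical.

```swift
import CoreData

// Sketch of the "stand-in entity" idea: the blob lives on the filesystem,
// the managed object only points at it, and deletion cleans up the file.
final class ImageRecord: NSManagedObject {
    @NSManaged var fileName: String?

    var fileURL: URL? {
        guard let name = fileName else { return nil }
        let docs = FileManager.default.urls(for: .documentDirectory,
                                            in: .userDomainMask)[0]
        return docs.appendingPathComponent(name)
    }

    override func prepareForDeletion() {
        super.prepareForDeletion()
        // When the stand-in is deleted, remove the blob it points to.
        if let url = fileURL {
            try? FileManager.default.removeItem(at: url)
        }
    }
}
```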

How to revert changes with procedural memory?

Is it possible to store all changes to a set by using some means of logical paths - recording the changes as they occur - such that one may revert the changes by essentially "stepping back"? I assume that something would need to map the changes as they occur, and the process of reverting them would thus ultimately be linear.
Apologies for any incoherence; this isn't applicable to any particular language. Rather, it's a problem of memory - i.e. can a set (e.g. some store of user input) of finite size that's changed continuously (at any given time, for any amount of time - there's no limit to how much it can be changed) be mapped procedurally, such that new - future - changes are assumed to be the consequence of prior changes (in a second, mirror store that can be used to revert the state of the set all the way to its initial state)?
You might want to look at some functional data structures. Functional languages, like Erlang, make it easy to roll back to an earlier state, since changes are always made on new data structures instead of mutating existing ones. While this feature can be used repeatedly internally, Erlang programming typically uses it abundantly at the top level of a "process", so that on any kind of failure it aborts both the processing and all the changes in their entirety simply by throwing an exception (in a non-functional language using mutable data structures, you'd be able to throw an exception to abort, but restoring the originals would be your program's job, not the runtime's). This is one reason Erlang has a solid reputation.
Some of this functional style of programming is usefully applied to non-functional languages, in particular, use of immutable data structures, such as immutable sets, lists, or trees.
Regarding immutable sets, for example, one might design a functionally oriented data structure where modifications always generate a new set given an existing set and some changes (a change set consisting of additions and removals). You'd leave the old set hanging around for reference (by whomever); languages with automatic garbage collection reclaim old ones when they're no longer referenced.
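A small Swift sketch of this idea, also capturing the id/lineage point made below: applying a change set never mutates the old version, so every earlier state stays reachable, and each version records which version it was derived from. The VersionedSet type is invented for illustration.

```swift
import Foundation

struct VersionedSet<Element: Hashable> {
    let id: UUID
    let parentID: UUID?          // lineage: which version this was derived from
    let elements: Set<Element>

    static func initial(_ elements: Set<Element> = []) -> VersionedSet {
        VersionedSet(id: UUID(), parentID: nil, elements: elements)
    }

    // Produce a new version; the receiver is left untouched.
    func applying(adding: Set<Element> = [], removing: Set<Element> = []) -> VersionedSet {
        VersionedSet(id: UUID(),
                     parentID: id,
                     elements: elements.union(adding).subtracting(removing))
    }
}

let v0 = VersionedSet<String>.initial(["a", "b"])
let v1 = v0.applying(adding: ["c"], removing: ["a"])
// v0 is untouched, so "reverting" is just going back to the earlier value.
print(v0.elements.sorted(), v1.elements.sorted())   // ["a", "b"] ["b", "c"]
```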
You can put an id or tag into your set data structure; this way you can do some introspection to see which data structure id someone has a hold of. You can also capture the id of the base off of which each new version was generated; this gives you some history or lineage.
If desired, you can also capture a reference to the entire old data structure in the new one, or you can maintain a global list of all of the sets as they are generated. If you do, however, you'll have to take on more responsibility for storage management, as an automatic collector will probably not find any unused (unreferenced) garbage to collect without some additional help.
Database designs do some of this in their transaction controllers. For the purposes of your question, you can think of a database as a glorified set. You might look into MVCC (Multi-Version Concurrency Control) as one example that is reasonably well written up in the literature. This technique keeps old snapshot versions of data structures around (temporarily), meaning that mutations always appear in new versions of the data. An old snapshot is maintained until no active transaction references it; then it is discarded.
When two concurrently running transactions both modify the database, they each get a new version based off the same current and latest data set. (The transaction controller knows exactly which version each transaction is based off of, though the transaction's client doesn't see the version information.) Assuming both concurrent transactions choose to commit their changes, the versioning control in the transaction controller recognizes that the second committer is trying to commit a change set that is not a logical successor to the first (since both change sets, as postulated above, were based on the same earlier version). If possible, the transaction controller will merge the changes as if the 2nd committer were really working off the other, newer version committed by the first committer. (There are varying definitions of when this is possible; MVCC says it is possible when there are no write conflicts, which is a less-than-perfect answer but fast and scalable.) If it is not possible, it will abort the 2nd committer's transaction and inform the 2nd committer (who then has the opportunity, should they like, to retry their transaction starting from the newer base).
Under the covers, the various snapshot versions in flight for concurrent transactions will probably share the bulk of the data (with some transaction-specific change sets that are consulted first) in order to make the snapshots cheap. There is usually no API provided to access older versions, so in this domain the transaction controller knows that as transactions retire, the original snapshot versions they were using can also be (reference counted and) retired.
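A toy Swift sketch of the commit rule just described, much simplified: a transaction records the version it read, and a commit against a stale base version is rejected. Real MVCC only aborts on write conflicts (and may merge otherwise); this sketch rejects any commit against a stale version, which is stricter. All names are invented.

```swift
import Foundation

final class VersionedStore {
    private(set) var version = 0
    private(set) var data: [String: Int] = [:]

    // Readers work off an immutable snapshot plus the version it reflects.
    func beginSnapshot() -> (version: Int, data: [String: Int]) {
        (version, data)
    }

    // Commit only if nothing else committed since the snapshot was taken.
    func commit(baseVersion: Int, changes: [String: Int]) -> Bool {
        guard baseVersion == version else { return false }   // stale base: abort, caller retries
        for (key, value) in changes { data[key] = value }
        version += 1
        return true
    }
}

let store = VersionedStore()
let tx1 = store.beginSnapshot()
let tx2 = store.beginSnapshot()
print(store.commit(baseVersion: tx1.version, changes: ["x": 1]))  // true
print(store.commit(baseVersion: tx2.version, changes: ["x": 2]))  // false: must retry
```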
Another area where this is done is append-only files. Logging is a way of recording changes; some databases are based 100% on log-oriented designs.
BerkeleyDB has a nice log structure. Though used mostly for recovery, it contains all the history, so you can recreate the database from the log (up to the point you purge the log, at which point you should also archive the database). Again, someone has to decide when to start a new log file and when to purge old log files, which you'd do to conserve space.
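A compact Swift sketch of the append-only log idea, which also answers the original "stepping back" question directly: every change is recorded, and any earlier state can be rebuilt by replaying a prefix of the log. The Change and ChangeLog types are invented for illustration.

```swift
import Foundation

enum Change {
    case insert(String)
    case remove(String)
}

struct ChangeLog {
    private(set) var entries: [Change] = []

    mutating func append(_ change: Change) { entries.append(change) }

    // Replay the first `count` entries to reconstruct that point in history.
    func replay(upTo count: Int? = nil) -> Set<String> {
        var state: Set<String> = []
        for change in entries.prefix(count ?? entries.count) {
            switch change {
            case .insert(let value): state.insert(value)
            case .remove(let value): state.remove(value)
            }
        }
        return state
    }
}

var log = ChangeLog()
log.append(.insert("a"))
log.append(.insert("b"))
log.append(.remove("a"))
print(log.replay().sorted())          // ["b"]
print(log.replay(upTo: 2).sorted())   // stepping back: ["a", "b"]
```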
These database techniques can be applied in memory as well. (Nothing is free, though, of course ;)
Anyway, yes, there are fields where this is done.
Immutable data structures help preserve history, by simply keeping old copies; changes always go to new copies. (And efficiency techniques can make this not as bad as it sounds.)
Ids can help you understand lineage without necessarily holding onto all the old copies.
If you do want to hold onto all the old copies, you have to look at your domain design to understand when/how/if old data structures can possibly be accessed, with an eye toward how to eventually reclaim them. You'll most likely have to get involved in defining how they get released, if ever, or how they get archived for posterity, though at the cost of slower access later.

Optimal way of syncing Core Data with server-side data?

I have what I would presume is a very common situation, but as I'm new to iOS programming, I'm not sure of the optimal way to code it.
Synopsis:
I have data on a server which can be retrieved by the iPhone app via a REST service. On the server side, the data is objects with a foreign key (an integer id number).
I'm storing the data retrieved via REST in Core Data. The managed objects have an "objId" attribute so that I can uniquely identify the managed objects in the rest of my code.
My app must always reflect the server data.
On subsequent requests made to the server:
some objects may not be returned because they have been deleted on the server - in which case I need to delete the corresponding objects from Core Data, so that I'm reflecting the state of the server correctly.
some objects have attributes which have changed, therefore the corresponding managed objects need updating with the new data.
my solution - and question to you
To get things going in my app, I went with the easiest solution: deleting all objects in Core Data, then adding in all the new objects created from the latest server-side data.
I don't think this is the best way to approach it :) As I progress with my app, I now want to hook up my table view with NSFetchedResultsController, and have realised that my approach of deleting everything and re-adding it is not going to work any more.
What is the tried and trusted way of syncing Core Data with server side data?
Do I need to make a fetch request for each object id I get back from the server, and then update the object with the new data?
And then go through all of the objects in core data and see which ones have not been updated, and delete those?
Is that the best way to do it? It just seems a little expensive to do a fetch for each object in Core Data, that's all.
Pseudo code is fine for any answers :)
thanks in advance!
Well, consider your download. First, you should be doing this in a background thread (if not, there are lots of SO posts that talk about how to do that).
I would suggest that you implement what makes sense first, and then, after you can get valid performance data from running Instruments, consider performance optimization. Of course, use some common sense on "easy" performance stuff (your design can take care of the big ones easily enough).
Anyway, get your data from the online resource, and then, for each object fetched, use the "unique object id" to fetch the object from core data. You know there is only one object with that ID, so you can set fetchLimit to 1 on your fetch request. You can also configure your "object id" attribute to be an INDEX in the database. This way, you get the fastest search from the underlying database, and it knows to stop looking once it finds your one object. This should be pretty snappy.
Now you have your object. Change any attributes necessary. Save, rinse, and repeat.
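A Swift sketch of that per-object step: fetch by the unique id with fetchLimit set to 1, insert if nothing is found, update attributes, and stamp the object so the deletion pass below can find stale records. The Item entity and the objId, name, and lastUpdated attributes are hypothetical names.

```swift
import CoreData

func apply(serverObject: [String: Any], in context: NSManagedObjectContext) throws {
    guard let objId = serverObject["id"] as? Int64 else { return }

    let request = NSFetchRequest<NSManagedObject>(entityName: "Item")
    request.predicate = NSPredicate(format: "objId == %lld", objId)
    request.fetchLimit = 1   // there is only one object with this id

    // Update the existing object, or insert it if the server sent a new one.
    let item = try context.fetch(request).first
        ?? NSEntityDescription.insertNewObject(forEntityName: "Item", into: context)

    item.setValue(objId, forKey: "objId")
    item.setValue(serverObject["name"] as? String, forKey: "name")
    item.setValue(Date(), forKey: "lastUpdated")   // timestamp used later for deletions

    try context.save()   // or save once per batch instead of per object
}
```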
Furthermore, for several reasons, you may want to know when objects were last updated. I'd suggest adding a timestamp to each object that is set to the current time every time the object is changed. This will also help in deleting objects: since your online database does not tell you which objects are deleted, you must have some way to know that an item is "old and no longer needed."
An easy way to do this is to remember the time you started your update. After processing all objects from the download, you have a way to find all the objects that were deleted from the online database: any object with a "last update" timestamp before the time you began the update should be removed (since it was not added or modified in the last update). You can also index the database on this field, which will make finding those objects faster - though unless your database is huge, I'd wait to see what Instruments has to say about this one.
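Continuing the same hypothetical Item/lastUpdated names, a sketch of that cleanup pass: anything not touched since the sync started was not in the server response, so it is treated as deleted server-side.

```swift
import CoreData

func deleteObjects(notUpdatedSince syncStart: Date,
                   in context: NSManagedObjectContext) throws {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Item")
    request.predicate = NSPredicate(format: "lastUpdated < %@", syncStart as NSDate)

    // Everything matching this predicate was absent from the latest download.
    for stale in try context.fetch(request) {
        context.delete(stale)
    }
    try context.save()
}
```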
