With a new version of Spring Data Neo4j I can't use Neo4jHelper.cleanDb(db);
So, what is the most effective way to completly clear Embedded Neo4j database in my application?
I have implemented my own util method for this purpose, but this method is slow:
public static void cleanDb(Neo4jTemplate template) {
template.query("MATCH (n) OPTIONAL MATCH (n)-[r]-() DELETE n,r", null);
}
How to properly clear/delete database ?
UPDATED
This is the similar question How to reset / clear / delete neo4j database? but I don't know how to programmatically shutdown Embedded Neo4j and how to start it after deleting.
I use Spring Data Neo4j and based on the user request I'd like to clear/delete existing database and recreate it with a new data. How to start up new embedded database after suggested invocation of shutdown method ?
USE CASE:
On the working application I have configured embedded database:
GraphDatabaseService graphDb = new GraphDatabaseFactory()
.newEmbeddedDatabaseBuilder(environment.getProperty(NEO4J_EMBEDDED_DATABASE_PATH_PROPERTY))
.setConfig(GraphDatabaseSettings.node_keys_indexable, "name,description")
.setConfig(GraphDatabaseSettings.node_auto_indexing, "true")
.newGraphDatabase();
Also, I pre populate this database with 1000000 nodes. On the user request I need to clear this database and populate it with a new data. How to correctly and quick clear existing database ?
Can I call Neo4j database API for new node creation after database.shutdown() or do I need to initialize new database before it?
See the other answer on the related question. Inside of java, you can shut down an embedded database with the GraphDatabaseService#shutdown() method.
From there, there are a pile of different ways you can delete the underlying directory, see this other answer.
So the general answer can still be the same:
Shutdown the database using the neo4j java API
Delete the database contents off of the disk
In our Grails application, we use one database per tenant. Each database contains data of all domain objects. We switch the databases based on the request context using pattern like DomainClass.getDS().find().
The transactions do not work out of the box since the request does not know what transaction manager to use. The #Transactional and withTransaction do nothing.
I have implemented my own version of withTransaction():
public static def withTransaction(Closure callable) {
new TransactionTemplate(getTM()).execute(callable as TransactionCallback)
}
The getTM() returns transaction manager based of the request context, for example transactionManager_db0.
This seems to work in my sandbox. Will this also work:
For parallel requests?
If the database has several replicas?
For hierarchical transactions?
I believe in TDD, but with the exception of the last bullet, I have hard times to provide tests to verify the code.
I have an hierarchical structure with millions of records.
I'm doing a recursive scan on the DB in order to update some of the connections and some of the data.
the problem is that I get an outofmemory exception since the entire DB is eventually loaded to the context (lazy). data that I no longer need stays in the context without any way of removing it.
I also can't use Using(context...) since I need the context alive because I'm doing a recursive scan.
Please take the recursion as a fact.
Thanks
This sort of an operation is really not handled well nor does it scale well using entities. I tend to resort to stored procedures for batch ops.
If you do want to remove/dump objects from context, I believe this post has some info (solution to your problem at the bottom).
just ran into the same problem. I've used NHibernate before I used EF as ORM tool and it had exactly the same problem. These frameworks just keep the objects in memory as long as the context is alive, which has two consequences:
serious performance slowdown: the framework does comparisons between the objects in memory (e.g. to see if an object exists or not). You will notice a gradual degradation of performance when processing many records
you will eventually run out of memory.
If possible I always try to do large batch operation on the database using pure SQL (as the post above states clearly), but in this case that wasn't an option. So to solve this, what NHibernate has is a 'Clear' method on the session, which throws away all object in memory that refer to database records (new ones, added ones, corrupt ones...)
I tried to mimic this method in entity framework as follows (using the post described above):
public partial class MyEntities
{
public IEnumerable<ObjectStateEntry> GetAllObjectStateEntries()
{
return ObjectStateManager.GetObjectStateEntries(EntityState.Added |
EntityState.Deleted |
EntityState.Modified |
EntityState.Unchanged);
}
public void ClearEntities()
{
foreach (var objectStateEntry in GetAllObjectStateEntries())
{
Detach(objectStateEntry.Entity);
}
}
}
The GetAllObjectStateEntries() method is taken separately because it's useful for other things. This goes into a partial class with the same name as your Entities class (the one EF generates, MyEntities in this example), so it is available on your entities instance.
I call this clear method now every 1000 records I process and my application that used to run for about 70 minutes (only about 400k entities to process, not even millions) does it in 25mins now. Memory used to peak to 300MB, now it stays around 50MB
I'm wondering what the most efficient way of updating a single value in a domain class from a row in the database. Lets say the domain class has 20+ fields
def itemId = 1
def item = Item.get(itemId)
itemId.itemUsage += 1
Item.executeUpdate("update Item set itemUsage=(:itemUsage) where id =(:itemId)", [usageCount: itemId.itemUsage, itemId: itemId.id])
vs
def item = Item.get(itemId)
itemId.itemUsage += 1
itemId.save(flush:true)
executeUpdate is more efficient if the size and number of the un-updated fields is large (and this is subjective). It's how I often delete instances too, running 'delete from Foo where id=123' since it seems wasteful to me to load the instance fully just to call delete() on it.
If you have large strings in your domain class and use the get() and save() approach then you serialize all of that data from the database to the web server twice unnecessarily when all you need to change is one field.
The effect on the 2nd-level cache needs to be considered if you're using it (and if you edit instances a lot you probably shouldn't). With executeUpdate it will flush all instances previously loaded with get() but if you update with get + save if flushes just that one instance. This gets worse if you're clustered since after executeUpdate you'd clear all of the various cluster node caches vs flushing the one instance on all nodes.
Your best bet is to benchmark both approaches. If you're not overloading the database then you may be prematurely optimizing and using the standard approach might be best to keep things simple while you solve other problems.
If you use get/save, you'll get the maximum advantage of the hibernate cache. executeUpdate might force more selects and updates.
The way executeUpdate interacts with the hibernate cache makes a difference here. The hibernate cache gets invalidated on executeUpdate. The next access of that Item after the executeUpdate would have to go to the database (and possibly more, I think hibernate might invalidate all Items in the cache).
Your best bet is to turn on debug logging for 'org.hibernate' in your Config.groovy and examine the SQL calls.
I think they are equal. The both issue 2 sql calls.
More efficient would be just a single update
Item.executeUpdate("update Item set itemUsage=itemUsage+1 where id =(:itemId)", [ itemId: itemId.id])
You can use the dynamicUpdate mapping attribute in your Item class:
http://grails.org/doc/latest/ref/Database%20Mapping/dynamicUpdate.html
With this option enabled, your second way to update a single field using Gorm will be as efficient as the first one.
I'm currently in the middle of a reasonably large question / answer based application (kind of like stackoverflow / answerbag.com)
We're using SQL (Azure) and nHibernate for data access and MVC for the UI app.
So far, the schema is roughly along the lines of the stackoverflow db in the sense that we have a single Post table (contains both questions / answers)
Probably going to use something along the lines of the following repository interface:
public interface IPostRepository
{
void PutPost(Post post);
void PutPosts(IEnumerable<Post> posts);
void ChangePostStatus(string postID, PostStatus status);
void DeleteArtefact(string postId, string artefactKey);
void AddArtefact(string postId, string artefactKey);
void AddTag(string postId, string tagValue);
void RemoveTag(string postId, string tagValue);
void MarkPostAsAccepted(string id);
void UnmarkPostAsAccepted(string id);
IQueryable<Post> FindAll();
IQueryable<Post> FindPostsByStatus(PostStatus postStatus);
IQueryable<Post> FindPostsByPostType(PostType postType);
IQueryable<Post> FindPostsByStatusAndPostType(PostStatus postStatus, PostType postType);
IQueryable<Post> FindPostsByNumberOfReplies(int numberOfReplies);
IQueryable<Post> FindPostsByTag(string tag);
}
My question is:
Where / how would i fit solr into this for better querying of these "Posts"
(I'll be using solrnet for the actual communication with Solr)
Ideally, I'd be using the SQL db as merely a persistant store-
The bulk of the above IQueryable operations would move into some kind of SolrFinder class (or something like that)
The Body property is the one that causes the problems currently - it's fairly large, and slows down queries on sql.
My main problem is, for example, if someone "updates" a post - adds a new tag, for example, then that whole post will need re-indexing.
Obviously, doing this will require a query like this:
"SELECT * FROM POST WHERE ID = xyz"
This will of course, be very slow.
Solrnet has an nHibernate facility- but i believe this will be the same result as above?
I thought of a way around this, which I'd like your views on:
Adding the ID to a queue (amazon sqs or something - i like the ease of use with this)
Having a service (or bunch of services) somewhere that do the above mentioned query, construct the document, and re-add it to solr.
Another problem I'm having with my design:
Where should the "re-indexing" method(s) be called from?
The MVC controller? or should i have a "PostService" type class, that wraps the instance of IPostRepository?
Any pointers are greatly received on this one!
On the e-commerce site that I work for, we use Solr to provide fast faceting and searching of the product catalog. (In non-Solr geek terms, this means the "ATI Cards (34), NVIDIA (23), Intel (5)" style of navigation links that you can use to drill-down through product catalogs on sites like Zappos, Amazon, NewEgg, and Lowe's.)
This is because Solr is designed to do this kind of thing fast and well, and trying to do this kind of thing efficiently in a traditional relational database is, well, not going to happen, unless you want to start adding and removing indexes on the fly and go full EAV, which is just cough Magento cough stupid. So our SQL Server database is the "authoritative" data store, and the Solr indexes are read-only "projections" of that data.
You're with me so far because it sounds like you are in a similar situation. The next step is determining whether or not it is OK that the data in the Solr index may be slightly stale. You've probably accepted the fact that it will be somewhat stale, but the next decisions are
How stale is too stale?
When do I value speed or querying features over staleness?
For example, I have what I call the "Worker", which is a Windows service that uses Quartz.NET to execute C# IJob implementations periodically. Every 3 hours, one of these jobs that gets executed is the RefreshSolrIndexesJob, and all that job does is ping an HttpWebRequest over to http://solr.example.com/dataimport?command=full-import. This is because we use Solr's built-in DataImportHandler to actually suck in the data from the SQL database; the job just has to "touch" that URL periodically to make the sync work. Because the DataImportHandler commits the changes periodically, this is all effectively running in the background, transparent to the users of the Web site.
This does mean that information in the product catalog can be up to 3 hours stale. A user might click a link for "Medium In Stock (3)" on the catalog page (since this kind of faceted data is generated by querying SOLR) but then see on the product detail page that no mediums are in stock (since on this page, the quantity information is one of the few things not cached and queried directly against the database). This is annoying, but generally rare in our particularly scenario (we are a reasonably small business and not that high traffic), and it will be fixed up in 3 hours anyway when we rebuild the whole index again from scratch, so we have accepted this as a reasonable trade-off.
If you can accept this degree of "staleness", then this background worker process is a good way to go. You could take the "rebuild the whole thing every few hours" approach, or your repository could insert the ID into a table, say, dbo.IdentitiesOfStuffThatNeedsUpdatingInSolr, and then a background process can periodically scan through that table and update only those documents in Solr if rebuilding the entire index from scratch periodically is not reasonable given the size or complexity of your data set.
A third approach is to have your repository spawn a background thread that updates the Solr index in regards to that current document more or less at the same time, so the data is only stale for a few seconds:
class MyRepository
{
void Save(Post post)
{
// the following method runs on the current thread
SaveThePostInTheSqlDatabaseSynchronously(post);
// the following method spawns a new thread, task,
// queueuserworkitem, whatevever floats our boat this week,
// and so returns immediately
UpdateTheDocumentInTheSolrIndexAsynchronously(post);
}
}
But if this explodes for some reason, you might miss updates in Solr, so it's still a good idea to have Solr do a periodic "blow it all away and refresh", or have a reaper background Worker-type service that checks for out-of-date data in Solr everyone once in a blue moon.
As for querying this data from Solr, there are a few approaches you could take. One is to hide the fact that Solr exists entirely via the methods of the Repository. I personally don't recommend this because chances are your Solr schema is going to be shamelessly tailored to the UI that will be accessing that data; we've already made the decision to use Solr to provide easy faceting, sorting, and fast display of information, so we might as well use it to its fullest extent. This means making it explicit in code when we mean to access Solr and when we mean to access the up-to-date, non-cached database object.
In my case, I end up using NHibernate to do the CRUD access (loading an ItemGroup, futzing with its pricing rules, and then saving it back), forgoing the repository pattern because I don't typically see its value when NHibernate and its mappings are already abstracting the database. (This is a personal choice.)
But when querying on the data, I know pretty well if I'm using it for catalog-oriented purposes (I care about speed and querying) or for displaying in a table on a back-end administrative application (I care about currency). For querying on the Web site, I have an interface called ICatalogSearchQuery. It has a Search() method that accepts a SearchRequest where I define some parameters--selected facets, search terms, page number, number of items per page, etc.--and gives back a SearchResult--remaining facets, number of results, the results on this page, etc. Pretty boring stuff.
Where it gets interesting is that the implementation of that ICatalogSearchQuery is using a list of ICatalogSearchStrategys underneath. The default strategy, the SolrCatalogSearchStrategy, hits SOLR directly via a plain old-fashioned HttpWebRequest and parsing the XML in the HttpWebResponse (which is much easier to use, IMHO, than some of the SOLR client libraries, though they may have gotten better since I last looked at them over a year ago). If that strategy throws an exception or vomits for some reason, then the DatabaseCatalogSearchStrategy hits the SQL database directly--although it ignores some parameters of the SearchRequest, like faceting or advanced text searching, since that is inefficient to do there and is the whole reason we are using Solr in the first place. The idea is that usually SOLR is answering my search requests quickly in full-featured glory, but if something blows up and SOLR goes down, then the catalog pages of the site can still function in "reduced-functionality mode" by hitting the database with a limited feature set directly. (Since we have made explicit in code that this is a search, that strategy can take some liberties in ignoring some of the search parameters without worrying about affecting clients too severely.)
Key takeaway: What is important is that the decision to perform a query against a possibly-stale data store versus the authoritative data store has been made explicit--if I want fast, possibly stale data with advanced search features, I use ICatalogSearchQuery. If I want slow, up-to-date data with the insert/update/delete capability, I use NHibernate's named queries (or a repository in your case). And if I make a change in the SQL database, I know that the out-of-process Worker service will update Solr eventually, making things eventually consistent. (And if something was really important, I could broadcast an event or ping the SOLR store directly, telling it to update, possibly in a background thread if I had to.)
Hope that gives you some insight.
We use solr to query a large product database.
Around 1 million products, and 30 stores.
What we did is we used triggers on the product table and stock tables on our Sql server.
Each time a row is changed it flags the product to be reindexed. And we have a windows service that grabs these products and post them to Solr every 10 seconds. (With a limit of 100 products per batch).
It's super efficient, almost real time info for the stock.
If you have a big text field (your 'body' field), then yes, re-index in background. The solutions you mentioned (queue or periodic background service) will do.
MVC controllers should be oblivious of this process.
I noticed you have IQueryables in your repository interface. SolrNet does not currently have a LINQ provider. Anyway, if those operations are all you're going to do with Solr (i.e. no faceting), you might want to consider using Lucene.Net instead, which does have a LINQ provider.