BatchInserters.batchDatabase sometimes silently fails to persist node properties - neo4j

I use BatchInserters.batchDatabase to create an embedded Neo4j 2.1.5 database. When I only put a small amount of data in it, everything works fine.
But if I increase the amount of data, Neo4j fails to persist the latest properties set with setProperty. I can read those properties back with getProperty before I call shutdown, but when I load the database again with new GraphDatabaseFactory().newEmbeddedDatabase, those properties are lost.
The strange thing is that Neo4j doesn't report any error or throw an exception, so I have no clue what is going wrong or where. Java should have enough memory to handle both the small database (2.66 MiB, 3,000 nodes, 3,000 relationships) and the big one (26.32 MiB, 197,267 nodes, 390,659 relationships).
It's hard for me to extract a running example to show the problem, but I can if it helps. Here are the main steps I take:
def createDataBase(rules: AllRules) {
  // empty the database folder
  deleteFileOrDirectory(new File(mainProjectPathNeo4j))

  // create an index on some properties
  db = new GraphDatabaseFactory().newEmbeddedDatabase(mainProjectPathNeo4j)
  engine = new ExecutionEngine(db)
  createIndex()
  db.shutdown()

  // fill the database
  db = BatchInserters.batchDatabase(mainProjectPathNeo4j)
  //createBatchIndex
  try {
    // every function loads some data
    loadAllModulesBatch(rules)
    loadAllLinkModulesBatch(rules)
    loadFormalModulesBatch(rules)
    loadInLinksBatch()
    loadHILBatch()
    createStandardLinkModules(rules)
    createStandardLinkSets(rules)
    // validateModel shows the problem
    validateModel(rules)
  } catch {
    // I want to see if my environment (BIRT) is catching any exceptions
    case _ => val a = 7
  } finally {
    db.shutdown()
  }
}
validateModel updates some properties of nodes created earlier:
def validateModule(srcM: GenericModule) {
  srcM.node.setProperty("isValidated", true)
  assert(srcM.node == Neo4jScalaDataSource.testNode)
  assert(srcM.node eq Neo4jScalaDataSource.testNode)
  assert(srcM.node.getProperty("isValidated").asInstanceOf[Boolean])
}
When I finally use Cypher to get some data back, the properties set by validateModel are missing:
class Neo4jScalaDataSet extends ScriptedDataSetEventAdapter {

  override def beforeOpen(...) {
    result = Neo4jScalaDataSource.engine.profile(
      """
      MATCH (fm:FormalModule {isValidated: true}) RETURN fm.fullName as fullName, fm.uid as uid
      """);
    iter = result.iterator()
  }

  override def fetch(...) = {
    if (iter.hasNext()) {
      for (e <- iter.next().entrySet()) {
        row.setColumnValue(e.getKey(), e.getValue())
      }
      count += 1;
      row.setColumnValue("count", count)
      return true
    } else {
      logger.log(Level.INFO, result.executionPlanDescription().toString())
      return super.fetch(dataSet, row)
    }
  }
}

batchDatabase indeed causes this problem.
I have switched to BatchInserters.inserter and now everything works just fine.
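For reference, a minimal sketch of the plain BatchInserter API on Neo4j 2.1.x; the store path, label, and property names below are illustrative placeholders, not the original project's:

import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchInserterSketch {
    public static void main(String[] args) {
        // Unlike batchDatabase, this exposes the raw batch API directly:
        // no GraphDatabaseService wrapper and no transactions.
        BatchInserter inserter = BatchInserters.inserter("target/batch-db");
        try {
            Map<String, Object> props = new HashMap<String, Object>();
            props.put("fullName", "example");
            long nodeId = inserter.createNode(props, DynamicLabel.label("FormalModule"));
            // Later property updates go through the inserter, keyed by node id.
            inserter.setNodeProperty(nodeId, "isValidated", true);
        } finally {
            // shutdown() flushes everything to disk; afterwards the store can be
            // opened normally with new GraphDatabaseFactory().newEmbeddedDatabase(...).
            inserter.shutdown();
        }
    }
}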


How do I get a continuation token for a bulk INSERT on Azure Cosmos DB?

I want to upload a CSV file that represents 10k documents to be added to my Cosmos DB collection in a manner that's fast and atomic. I have a stored procedure like the following pseudo-code:
function createDocsFromCSV(csv_text) {
    function parse(txt) { /* ... parsing code here ... */ }
    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var docs_to_create = parse(csv_text);
    for (var ii = 0; ii < docs_to_create.length; ii++) {
        var accepted = collection.createDocument(collection.getSelfLink(),
            docs_to_create[ii],
            function(err, doc_created) {
                if (err) throw new Error('Error: ' + err.message);
            });
        if (!accepted) {
            throw new Error('Timed out creating document ' + ii);
        }
    }
}
When I run it, the stored procedure creates about 1200 documents before timing out (and therefore rolling back and not creating any documents).
Previously I had success updating (instead of creating) thousands of documents in a stored procedure using continuation tokens and this answer as guidance: https://stackoverflow.com/a/34761098/277504. But after searching documentation (e.g. https://azure.github.io/azure-documentdb-js-server/Collection.html) I don't see a way to get continuation tokens from creating documents like I do for querying documents.
Is there a way to take advantage of stored procedures for bulk document creation?
It's important to note that stored procedures have bounded execution: all operations must complete within the server-specified request timeout. If an operation does not complete within that time limit, the transaction is automatically rolled back.
To simplify handling the time limit, all CRUD (Create, Read, Update, Delete) operations return a Boolean value indicating whether the operation will complete. This Boolean can be used as a signal to wrap up execution and to implement a continuation-based model that resumes where the previous run left off (this is illustrated in the code sample below). For more details, please refer to the docs.
The bulk-insert stored procedure below implements the continuation model by returning the number of documents successfully created.
pseudo-code:
function createDocsFromCSV(csv_text, count) {
    function parse(txt) { /* ... parsing code here ... */ }
    var collection = getContext().getCollection();
    var response = getContext().getResponse();
    var docs_to_create = parse(csv_text);
    for (var ii = count; ii < docs_to_create.length; ii++) {
        var accepted = collection.createDocument(collection.getSelfLink(),
            docs_to_create[ii],
            function(err, doc_created) {
                if (err) throw new Error('Error: ' + err.message);
            });
        if (!accepted) {
            // Out of time: report how far we got so the client can resume from here.
            response.setBody(ii);
            return;
        }
    }
    // All remaining documents were created.
    response.setBody(docs_to_create.length);
}
Then you check the returned count on the client side and re-run the stored procedure with that count as the parameter to create the remaining documents, repeating until the returned count reaches the number of documents in the CSV.
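On the client side, the resume loop then looks roughly like this. This is only a sketch: executeSproc is a hypothetical helper standing in for whatever SDK call you use to invoke the stored procedure and read back the integer it set as the response body.

public class BulkInsertClient {
    // Hypothetical helper: invoke createDocsFromCSV with the CSV text and the
    // resume offset, and return the count the sproc wrote to its response body.
    static int executeSproc(String name, String csvText, int count) {
        throw new UnsupportedOperationException("wire this up to your SDK");
    }

    public static void main(String[] args) {
        String csvText = "...";   // the CSV payload
        int totalDocs = 10000;    // documents represented by the CSV
        int created = 0;
        // Re-run the sproc, resuming at the returned count, until everything is in.
        while (created < totalDocs) {
            created = executeSproc("createDocsFromCSV", csvText, created);
        }
    }
}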
Hope it helps you.

HashSet handling to avoid getting stuck in a loop during iteration

I'm working on an image mining project, and I used a HashSet instead of an array to avoid adding duplicate URLs while gathering them. I reached the point in the code where I iterate the HashSet that contains the main URLs; within the iteration, I download the page behind each main URL, add the URLs found there to the HashSet, and go on. During the iteration I should exclude every scanned URL, and also exclude (remove) every URL that ends with jpg, until the HashSet's count reaches 0. The problem is that I face an endless loop in this iteration, where I may get a URL (let's call it X):
1- I scan the page of URL X
2- get all URLs of page X (by applying filters)
3- add the URLs to the HashSet using UnionWith
4- remove the scanned URL X
The problem comes when one of those URLs, say Y, brings X back again when it is scanned.
Shall I use a Dictionary with the value as a "scanned" flag? I will try and post the result here; sorry, it came to my mind only after I posted the question.
I managed to solve it for one URL, but it seems other URLs still generate loops, so how do I handle the HashSet to avoid duplicates even after removing the links? I hope my point is clear.
while (URL_Can.Count != 0)
{
    tempURL = URL_Can.First();
    if (tempURL.EndsWith("jpg"))
    {
        URL_CanToSave.Add(tempURL);
        URL_Can.Remove(tempURL);
    }
    else
    {
        if (ExtractUrlsfromLink(client, tempURL, filterlink1).Contains(toAvoidLoopinLinks))
        {
            URL_Can.Remove(tempURL);
            URL_Can.Remove(toAvoidLoopinLinks);
        }
        else
        {
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink1));
            URL_Can.UnionWith(ExtractUrlsfromLink(client, tempURL, filterlink2));
            URL_Can.Remove(tempURL);
            richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
        }
    }
    toAvoidLoopinLinks = tempURL;
}
Thanks all. I managed to solve this issue by using a Dictionary instead of a HashSet, with the key holding the URL and the value holding an int: 1 if the URL has been scanned, 0 if it has not been processed yet. My code is below.
I used another Dictionary, URL_CanToSave, to hold the URLs that end with jpg (my target). The while loop keeps going until all the URLs of the website run out, based on the filter string values you use to parse the URLs.
To break the loop earlier, you can specify the number of image URLs to collect in URL_CanToSave.
return Task.Factory.StartNew(() =>
{
    try
    {
        string tempURL;
        int i = 0;
        // The Dictionary value marks each URL: 1 means scanned, 0 means not yet.
        // Iterate until all keys are scanned, or break in the middle based on how
        // many image URLs were collected in the other Dictionary.
        while (URL_Can.Values.Where(value => value.Equals(0)).Any())
        {
            // take one key and put it in a temp variable
            tempURL = URL_Can.ElementAt(i).Key;
            // check if it ends with the target file extension, in this case an image
            if (tempURL.EndsWith("jpg"))
            {
                URL_CanToSave.Add(tempURL, 0);
                URL_Can.Remove(tempURL);
            }
            // if it is not an image, download the page behind the URL and keep analyzing
            else
            {
                // if the URL has not been scanned before
                if (URL_Can[tempURL] != 1)
                {
                    // Add2Dic adds entries to a Dictionary without re-adding existing
                    // keys (this solves the main problem!). ExtractUrlsfromLink returns
                    // a Dictionary of all links found by downloading the document behind
                    // the URL and analyzing it; add or remove filter strings to steer it.
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink1), URL_Can, false);
                    Add2Dic(ExtractUrlsfromLink(client, tempURL, filterlink2), URL_Can, false);
                    URL_Can[tempURL] = 1; // mark it as a scanned link
                    richTextBox2.PerformSafely(() => richTextBox2.AppendText(tempURL + "\n"));
                }
            }
            statusStrip1.PerformSafely(() => toolStripProgressBar1.PerformStep());
            // the other trick: wrap the index around so the iteration keeps going
            // until all gathered links have been scanned
            i++; if (i >= URL_Can.Count) { i = 0; }
            if (URL_CanToSave.Count >= 150) { break; }
        }
        richTextBox2.PerformSafely(() => richTextBox2.Clear());
        textBox1.PerformSafely(() => textBox1.Text = URL_Can.Count.ToString());
        return ProcessCompleted = true;
    }
    catch (Exception aih)
    {
        MessageBox.Show(aih.Message);
        return ProcessCompleted = false;
    }
})
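The same idea in a minimal, self-contained form (a Java sketch; extractUrls is a hypothetical stand-in for downloading and parsing a page): every URL ever seen stays in the map, and its flag is flipped instead of the entry being removed, so a rediscovered URL can never re-enter the frontier.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CrawlSketch {
    // URL -> scanned flag; a rediscovered URL is already a key, so it is never re-queued.
    private final Map<String, Boolean> frontier = new LinkedHashMap<String, Boolean>();
    private final List<String> images = new ArrayList<String>();

    // Hypothetical stand-in for downloading a page and extracting its links.
    private List<String> extractUrls(String url) {
        return new ArrayList<String>();
    }

    public void crawl(String seed, int maxImages) {
        frontier.put(seed, Boolean.FALSE);
        while (frontier.containsValue(Boolean.FALSE) && images.size() < maxImages) {
            // Copy the keys so the map can be modified while we iterate.
            for (String url : new ArrayList<String>(frontier.keySet())) {
                if (frontier.get(url)) {
                    continue; // already scanned
                }
                frontier.put(url, Boolean.TRUE); // mark before expanding
                if (url.endsWith("jpg")) {
                    images.add(url);
                    continue;
                }
                for (String found : extractUrls(url)) {
                    if (!frontier.containsKey(found)) {
                        frontier.put(found, Boolean.FALSE); // only ever added once
                    }
                }
            }
        }
    }
}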

Grails, Promise API and two open sessions

I am trying to clear out a collection and update it at the same time. The collection has children, and finding its current items and deleting them asynchronously would save me a lot of time.
Step 1. Find all the items in the collection.
Step 2. Once I know what the items are, fork a process to delete them.
def memberRedbackCriteria = MemberRedback.createCriteria()
// #1 Find all the items in the collection.
def oldList = memberRedbackCriteria.list {
    fetchMode("memberCategories", FetchMode.EAGER)
}
// #2 Delete them.
Promise deleteOld = task {
    oldList.each { MemberRedback rbMember ->
        rbMember.memberCategories.clear()
        rbMember.delete()
    }
}
The error message is: Illegal attempt to associate a collection with two open sessions
I am guessing that because I find the items and then fork, a new session is created: the collection is built in one session, and a different session is used to delete the items.
I need to collect the items in the current thread, otherwise I am not sure what the state would be.
Note that using one async task for all the deletions effectively runs all the delete operations in series on a single thread. Assuming your database can handle multiple connections and concurrent modification of a table, you could parallelize the deletions by using a PromiseList, as in the following (note: untested code).
def deletePromises = new PromiseList()
redbackIds.each { Long rbId ->
    deletePromises << MemberRedback.async.task {
        withTransaction {
            def memberRedbackCriteria = createCriteria()
            MemberRedback memberRedback = memberRedbackCriteria.get {
                idEq(rbId)
                fetchMode("memberCategories", FetchMode.EAGER)
            }
            memberRedback.memberCategories.clear()
            memberRedback.delete()
        }
    }
}
deletePromises.onComplete { List results ->
    // do something with the results, if you want
}
deletePromises.onError { Throwable err ->
    // do something with the error
}
Found a solution. Put the ids into a list and collect them as part of the async closure.
Note also that you cannot reuse the criteria as per http://jira.grails.org/browse/GRAILS-1967
// #1 find the ids
def redbackIds = MemberRedback.executeQuery(
    'select mr.id from MemberRedback mr', [])
// #2 Delete them.
Promise deleteOld = task {
    redbackIds.each { Long rbId ->
        def memberRedbackCriteria = MemberRedback.createCriteria()
        MemberRedback memberRedback = memberRedbackCriteria.get {
            idEq(rbId)
            fetchMode("memberCategories", FetchMode.EAGER)
        }
        memberRedback.memberCategories.clear()
        memberRedback.delete()
    }
}
deleteOld.onError { Throwable err ->
    println "deleteAllRedbackMembers: an error occurred: ${err.message}"
}

neo4j 2.0 findNodesByLabelAndProperty not working

I'm currently trying Neo4j 2.0.0 M3 and seeing some strange behaviour. In my unit tests everything works as expected (using a newImpermanentDatabase), but in the real setup I get no results from graphDatabaseService.findNodesByLabelAndProperty.
Here is the code in question:
ResourceIterator<Node> iterator = graphDB
    .findNodesByLabelAndProperty(Labels.User, "EMAIL_ADDRESS", emailAddress)
    .iterator();
try {
    if (iterator.hasNext()) { // => returns false
        return iterator.next();
    }
} finally {
    iterator.close();
}
return null;
This returns no results. However, when I run the following code, I see that my node is there (the MATCH!!!!!!!!! is printed), and I also have an index set up via the schema (although, if I read the API correctly, the index is not required for this call, just important for performance):
ResourceIterator<Node> iterator1 = GlobalGraphOperations.at(graphDB)
    .getAllNodesWithLabel(Labels.User).iterator();
while (iterator1.hasNext()) {
    Node result = iterator1.next();
    UserDao.printoutNode(emailAddress, result);
}
And UserDao.printoutNode:
public static void printoutNode(String emailAddress, Node next) {
    System.out.print(next);
    ResourceIterator<Label> iterator1 = next.getLabels().iterator();
    System.out.print("(");
    while (iterator1.hasNext()) {
        System.out.print(iterator1.next().name());
    }
    System.out.print("): ");
    for (String key : next.getPropertyKeys()) {
        System.out.print(key + ": " + next.getProperty(key).toString() + "; ");
        if (emailAddress.equals(next.getProperty(key).toString())) {
            System.out.print("MATCH!!!!!!!!!");
        }
    }
    System.out.println();
}
I have already debugged through the code. What I found so far is that the call goes via InternalAbstractGraphDatabase.map2Nodes to DelegatingIndexProxy.getDelegate and ends up in the IndexReader.Empty class, which returns IteratorUtil.EMPTY_ITERATOR; hence iterator.hasNext() returns false.
Any ideas what I am doing wrong?
Found it:
I only included neo4j-kernel:2.0.0-M03 in the classpath. The moment I added neo4j-cypher:2.0.0-M03 all was working well.
Hope this answer helps save some time for other users.
#Neo4j: would be nice if an exception would be thrown instead of just returning nothing.
#Ricardo: I wanted to but I was not allowed yet as my reputation wasn't good enough as a new SO user.
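For completeness, a minimal sketch of the same lookup with the schema index set up programmatically, assuming Neo4j 2.0.x; the store path and email value are placeholders, and DynamicLabel.label("User") stands in for the question's Labels.User enum:

import java.util.concurrent.TimeUnit;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class LookupSketch {
    public static void main(String[] args) {
        GraphDatabaseService graphDB =
            new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db");
        Label user = DynamicLabel.label("User"); // stands in for Labels.User

        // Schema changes must run in their own transaction
        // (and this assumes the index does not already exist).
        try (Transaction tx = graphDB.beginTx()) {
            graphDB.schema().indexFor(user).on("EMAIL_ADDRESS").create();
            tx.success();
        }
        try (Transaction tx = graphDB.beginTx()) {
            graphDB.schema().awaitIndexesOnline(10, TimeUnit.SECONDS);
            tx.success();
        }
        // In 2.0, reads need a transaction as well.
        try (Transaction tx = graphDB.beginTx()) {
            for (Node n : graphDB.findNodesByLabelAndProperty(
                    user, "EMAIL_ADDRESS", "someone@example.com")) {
                System.out.println(n);
            }
            tx.success();
        }
        graphDB.shutdown();
    }
}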

ehcache is not honoring maxElementsInMemory

I have a fairly simple cache configuration:
<cache name="MyCache"
maxElementsInMemory="200000"
eternal="false"
timeToIdleSeconds="43200"
timeToLiveSeconds="43200"
overflowToDisk="false"
diskPersistent="false"
memoryStoreEvictionPolicy="LRU"
/>
I create my cache in the following way:
private Ehcache myCache =
CacheManager.getInstance().getEhcache("MyCache");
I use my cache like this:
public MyResponse processRequest(MyRequest request) {
    Element element = myCache.get(request);
    if (element != null) {
        return (MyResponse) element.getValue();
    } else {
        MyResponse response = remoteService.process(request);
        myCache.put(new Element(request, response));
        return response;
    }
}
Every 10,000 calls to processRequest() method, I log stats about my cache like this:
logger.debug("Cache name: " + myCache.getName());
logger.debug("Max elements in memory: " + myCache.getMaxElementsInMemory());
logger.debug("Memory store size: " + myCache.getMemoryStoreSize());
logger.debug("Hit count: " + myCache.getHitCount());
logger.debug("Miss count: " + myCache.getMissCountNotFound());
logger.debug("Miss count (because expired): " + myCache.getMissCountExpired());
I see a good number of hits, which tells me that it's working.
However, what I'm seeing is that after a couple of hours, getMemoryStoreSize() starts to exceed getMaxElementsInMemory(). Eventually it gets bigger and bigger and renders the JVM unstable, because GC starts doing full GCs nonstop to reclaim memory (and I have a pretty large cap set). When I profiled the heap, it pointed to the LRU's SpoolingLinkedHashMap taking most of the space.
I do have a lot of requests hitting this cache, and my theory is that ehcache's LRU algorithm is perhaps not keeping up with evicting elements when the cache is full. I tried the LFU policy and it also let the memory store go over maxElements.
I then started looking at the ehcache code to see if I could prove my theory (inside LruMemoryStore$SpoolingLinkedHashMap):
private boolean removeLeastRecentlyUsedElement(Element element) throws CacheException {
    // check for expiry and remove before going to the trouble of spooling it
    if (element.isExpired()) {
        notifyExpiry(element);
        return true;
    }
    if (isFull()) {
        evict(element);
        return true;
    } else {
        return false;
    }
}
From here it looks OK, so I then looked at the evict() method:
protected final void evict(Element element) throws CacheException {
    boolean spooled = false;
    if (cache.isOverflowToDisk()) {
        if (!element.isSerializable()) {
            if (LOG.isDebugEnabled()) {
                LOG.debug(new StringBuffer("Object with key ").append(element.getObjectKey())
                    .append(" is not Serializable and cannot be overflowed to disk"));
            }
        } else {
            spoolToDisk(element);
            spooled = true;
        }
    }
    if (!spooled) {
        cache.getCacheEventNotificationService().notifyElementEvicted(element, false);
    }
}
This looks like it doesn't actually evict (despite the name) but rather relies on the caller to do it. So I looked at the implementation of the put() method, and I don't see it calling evict(). I'm clearly missing something here and would appreciate some help.
Thanks!
Your configuration looks fine to me. You only need to use the right key for caching.
Do not put the complete request object in as your cache key; put some unique value from the request object instead. For example:
MyResponse response = remoteService.process(request);
myCache.put(new Element(request.getCustomerID(), response));
return response;
This should work for you. The reason your caching is not working is that each request is a new object, so a lookup never finds a matching entry in the cache, and every call adds yet another element.
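Alternatively, if you do want to keep the request object itself as the key, it needs value semantics. A sketch, assuming MyRequest is identified by its customer ID as in the example above:

public class MyRequest {
    private final String customerId;

    public MyRequest(String customerId) {
        this.customerId = customerId;
    }

    public String getCustomerID() {
        return customerId;
    }

    // Without equals/hashCode, every MyRequest instance is a distinct cache key,
    // so lookups never hit and the store only ever grows.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MyRequest)) return false;
        return customerId.equals(((MyRequest) o).customerId);
    }

    @Override
    public int hashCode() {
        return customerId.hashCode();
    }
}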
The maxElementsInMemory attribute is deprecated; use maxEntriesLocalHeap instead.
