Spark job-server release memory

I've set up a spark job-server (see https://github.com/spark-jobserver/spark-jobserver/tree/jobserver-0.6.2-spark-1.6.1) in standalone mode.
I've created a default context to use. Currently I have two kinds of jobs on this context:
Synchronization with another server:
Dump the data from the other server's DB;
Perform some joins and reduce the data, generating a new DataFrame;
Save the resulting DataFrame to a Parquet file;
Load this Parquet file as a temp table and cache it;
Queries: perform SQL queries on the cached table.
The only object that I persist is the final table that will be cached.
What I don't understand is why, when I run the synchronization, all of the assigned memory is used and never released, whereas if I load the Parquet file directly (on a fresh start of the server, using the Parquet file generated previously), only a fraction of the memory is used.
Am I missing something? Is there a way to free up unused memory?
Thank you

You can free up memory by unpersisting the cached table: yourTable.unpersist()
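For illustration, here is a minimal sketch of that cache/uncache pattern in PySpark (shown standalone rather than inside a job-server job; the table name and Parquet path are made up, and the same cacheTable/uncacheTable/unpersist calls exist on the Scala SQLContext):
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="sync-job")        # hypothetical app name
sqlContext = SQLContext(sc)

# load the previously generated Parquet file and cache it as a temp table
df = sqlContext.read.parquet("/data/sync_result.parquet")   # hypothetical path
df.registerTempTable("sync_result")
sqlContext.cacheTable("sync_result")

# ... the query jobs run SQL against the cached table ...
sqlContext.sql("SELECT COUNT(*) FROM sync_result").show()

# when the table is no longer needed, release the cached blocks
sqlContext.uncacheTable("sync_result")
# or, if you still hold a reference to the DataFrame itself:
df.unpersist()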

Related

Elasticsearch can't sync 2GB database from neo4j with GraphAware

The neo4j2elasticsearch sync works on my machine when the database is only 250KB, but now the database is around 2GB and it won't sync anymore. I'm wondering whether it's because of these parameters in the config file:
#optional, size of the in-memory queue that queues up operations to be synchronised to Elasticsearch, defaults to 10000
com.graphaware.module.ES.queueSize=10000
#optional, size of the batch size to use during re-initialization, defaults to 1000
com.graphaware.module.ES.reindexBatchSize=2000
I'm wondering what the unit of the in-memory queue size 10000 is, and how to estimate what value to set based on my database size.
Here is the debug file:
neo4j debug.log failure loading
The database is re-initialized, but there are only empty neo4j-index-relationship/neo4j-index-node indices in the Elasticsearch database.
Just for information, here is the debug file for successful 250KB database loading:
neo4j debug.log successful loading
It seems like the "Re-indexing nodes..." step is missing from the 2GB database loading procedure.
The log doesn't have a line saying it will re-index. Did you configure:
com.graphaware.module.ES.initializeUntil=
to a timestamp that warrants re-indexing on startup? Otherwise, it will only index new data. It is explained at the bottom of https://github.com/graphaware/neo4j-to-elasticsearch :
...in order to trigger (re-)indexing, i.e. sending every node that
should be indexed to Elasticsearch upon Neo4j restart, you have to
manually intervene...
So try creating a new node and check whether synchronization works for new data, to rule out this situation (the most common one).
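For reference, that is a single extra property in the same config file. The value below is only a hypothetical millisecond timestamp; per the linked README it should be chosen so that it is still ahead of the current time at the moment of the restart, so that (re-)indexing is triggered:
#optional, set to a millisecond timestamp slightly in the future to trigger (re-)indexing on the next restart (see the project README)
com.graphaware.module.ES.initializeUntil=1609459200000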

Batch import in Neo4j

I am using the BatchInserter to insert data from a CSV file, which works fine when the DB is completely empty (no files in the data directory). But when the database already contains data, the BatchInserter cannot insert anything and throws the exception below, even with the DB service stopped. I tried several approaches and failed. I need to know whether there is a way to import data from a CSV into an existing DB, as the CSV is prone to change.
Exception in thread "main" java.lang.IllegalStateException: Misaligned file size 68 for DynamicArrayStore[fileName:neostore.nodestore.db.labels, blockSize:60], expected version length 25
at org.neo4j.kernel.impl.store.AbstractDynamicStore.verifyFileSizeAndTruncate(AbstractDynamicStore.java:265)
at org.neo4j.kernel.impl.store.CommonAbstractStore.loadStorage(CommonAbstractStore.java:217)
at org.neo4j.kernel.impl.store.CommonAbstractStore.<init>(CommonAbstractStore.java:118)
at org.neo4j.kernel.impl.store.AbstractDynamicStore.<init>(AbstractDynamicStore.java:92)
at org.neo4j.kernel.impl.store.DynamicArrayStore.<init>(DynamicArrayStore.java:64)
at org.neo4j.kernel.impl.store.StoreFactory.newNodeStore(StoreFactory.java:327)
at org.neo4j.kernel.impl.store.StoreFactory.newNodeStore(StoreFactory.java:316)
at org.neo4j.kernel.impl.store.StoreFactory.newNeoStore(StoreFactory.java:160)
at org.neo4j.unsafe.batchinsert.BatchInserterImpl.<init>(BatchInserterImpl.java:258)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:94)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:88)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:63)
at org.neo4j.unsafe.batchinsert.BatchInserters.inserter(BatchInserters.java:51)
at net.xalgo.neo4j.batchinserter.DrugDatabaseInserter.start(DrugDatabaseInserter.java:30)
at net.xalgo.neo4j.batchinserter.BatchInserterApp.main(BatchInserterApp.java:56)
The Batch Inserter only works with initial data import (when the database is empty). It avoids transactions and other checks to increase performance and therefore cannot be used with an existing database.
For importing data into an already existing Neo4j database you can use Cypher's LOAD CSV clause.
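For illustration, a minimal LOAD CSV statement of the kind meant here (the file name, label and properties are made up; MERGE keeps the import repeatable against a database that already has data):
LOAD CSV WITH HEADERS FROM 'file:///drugs.csv' AS row
MERGE (d:Drug {id: row.id})
SET d.name = row.name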
In the end I had to drop the BatchInserter and use Spring Data's Neo4jTemplate with direct Cypher in a non-transactional way.

Connecting Ruby(Rails) to Nodejs through a pipe

I have a Rails app that needs to make use of a JavaScript library on the server. Up until now I have been running system commands from Rails to Node.js whenever this is necessary. However, I have a particularly computationally intensive task that has made it necessary to cache data to speed it up. I also have to pass large inputs to the Node program, and as a result I've hit the buffer size limit of its input. I am currently just sending the data to separate Node processes multiple times, in chunks small enough to fit in the buffer, but this is causing performance problems because I no longer get to take advantage of caching across as many runs. I would like to use a pipe to do this, but my pipe hits the buffer limit as well, and I don't know how to empty it. So far I have...
# ruby file
output = []
node_pipe = IO.popen("nodejs /home/user/node_program.js", "w+")
10_000.times do |time|
  node_pipe.write("a lot of stuff")
  # here I would like to read the contents and push them to the output array, but still be
  # able to write to the same process in the next loop to take advantage of the cache
end
// node_program.js
var input = process.stdin;
var cache = {};
input.resume();
input.on('data', function(chunk){
  // library_function and other_library_function come from the server-side library mentioned above
  cache[chunk] = library_function(chunk);
  console.log(String(other_library_function(chunk)));
});
Any suggestions?

EmbeddedReadOnlyGraphDatabase complaining about locked database

Exception in thread "main" java.lang.IllegalStateException: Database locked.
at org.neo4j.kernel.InternalAbstractGraphDatabase.create(InternalAbstractGraphDatabase.java:289)
at org.neo4j.kernel.InternalAbstractGraphDatabase.run(InternalAbstractGraphDatabase.java:227)
at org.neo4j.kernel.EmbeddedReadOnlyGraphDatabase.<init>(EmbeddedReadOnlyGraphDatabase.java:81)
at org.neo4j.kernel.EmbeddedReadOnlyGraphDatabase.<init>(EmbeddedReadOnlyGraphDatabase.java:72)
at org.neo4j.kernel.EmbeddedReadOnlyGraphDatabase.<init>(EmbeddedReadOnlyGraphDatabase.java:54)
at QueryNodeReadOnly.main(QueryNodeReadOnly.java:55)
This is with Neo4j version 1.8.2. I've written a program that opens the db in read-only mode, runs a query, and then sleeps for a while before exiting.
Here is the relevant code:
graphDb = new EmbeddedReadOnlyGraphDatabase( dbname); // Line 55 - the exception.
......
......
......
......
......
if(sleepVal > 0)
Thread.sleep(sleepVal);
I reckon I should not be getting this error. There are only two processes that open the db, both in read-only mode. In fact, it should work even if I open the db while another process has it open for writing.
We disallow two database instances from accessing the same files on disk at the same time - even in read-only mode.
The reason is that while we do not allow you to modify the database in read-only mode, Lucene will still write to disk when servicing your read requests, and having two instances access those same index files leads to race conditions and index corruption.
Why do you want two instances accessing the same files at the same time anyway? What is your use case?
You can't make multiple connections to an embedded database. Maybe you should consider using the REST server.
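As a rough sketch of that second suggestion, read-only queries can go through the server's HTTP API instead of a second embedded instance. The endpoint below is the legacy Cypher endpoint of 1.8-era servers; the address and the query itself are assumptions:
import json
import requests   # assumes the requests library is installed

# hypothetical server address; /db/data/cypher is the legacy Cypher endpoint
url = "http://localhost:7474/db/data/cypher"
payload = {"query": "START n=node(*) RETURN count(n)", "params": {}}

response = requests.post(url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
print(response.json()["data"])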

Suds is not reusing cached WSDLs and XSDs, although I expect it to

I'm pretty sure suds is not caching my WSDLs and XSDs like I expect it to. Here's how I know that cached objects are not being used:
It takes about 30 seconds to create a client: client = Client(url)
The logger entries show consistent digestion of the XSD and WSDL files during the entire 30 seconds
Wireshark is showing consistent TCP traffic to the server storing the XSD and WSDL files during the entire 30 seconds
I see the files in the cache being updated each time I run my program
I have a small program that creates a suds client, sends a single request, gets the response, then ends. My expectation is that each time I run the program, it should fetch the WSDL and XSD files from the file cache, not from the URLs. Here's why I think that:
client.options.cache.duration is set to ('days', 1)
client.options.cache.location is set to c:\docume~1\mlin\locals~1\temp\suds and I see the cache files being generated and re-generated each time I run the program
For a moment I thought that maybe the cache is not meant to be reused between runs of a program, but in that case a file cache wouldn't make sense, because an in-memory cache would do just fine.
Am I misunderstanding how suds caching is supposed to work?
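For context, the setup being described is roughly the following; the WSDL URL and operation name are placeholders, and the explicit ObjectCache just makes the duration/location options visible:
from suds.client import Client
from suds.cache import ObjectCache

# hypothetical WSDL URL; cache directory taken from the description above
cache = ObjectCache(location=r'c:\docume~1\mlin\locals~1\temp\suds', days=1)
client = Client('http://example.com/service?wsdl', cache=cache)

result = client.service.SomeOperation()   # hypothetical operation name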
The problem is in the suds library itself. In cache.py, although ObjectCache.get() is always getting a valid file pointer, it's hitting an exception (EOFError) doing pickle.load(fp). When that happens, the file is just downloaded again.
Here's the sequence of events:
DocumentReader.open():
Trying http://172.28.50.249/wsdl/billingServices/v3.0/RequestScrubAddress.wsdl
Loading ObjectCache 51012453-document
Loading pickled object...
Exception raised:
Got None from cache
Downloading... Done
Saving FileCache 51012453-document... Done
So it doesn't really matter that the new cache file was saved, because the same thing happens the next time I run. This happens for ALL of the WSDL and XSD files.
I fixed that problem by opening the cache file in binary mode when reading and writing. Specifically, the changes I made were in cache.py:
1) In FileCache.put(), change this line:
f = self.open(fn, 'w')
to
f = self.open(fn, 'wb')
2) In FileCache.getf(), change this line:
return self.open(fn)
to
return self.open(fn, 'rb')
I don't know the codebase well enough to know if these changes are safe, but it is now pulling the objects from the file cache, the service is still running successfully, and loading the client went from 16 seconds down to 2.5 seconds. Much better if you ask me.
Hopefully this fix, or something similar, can be merged back into the suds mainline. I've already sent this to the suds mailing list (fedora-suds-list at redhat dot com).
