Mahout, PostgreSQL and connection pool

Using PostgreSQL as the data source in Mahout 0.9, I keep getting this warning:
WARN org.apache.mahout.cf.taste.impl.model.jdbc.AbstractJDBCDataModel - You are not using ConnectionPoolDataSource. Make sure your DataSource pools connections to the database itself, or database performance will be severely reduced.
And it's true - each request keeps opening new connections.
Is there any way to use PGConnectionPoolDataSource for PostgreSQLBooleanPrefJDBCDataModel?
(Currently I get a "no constructor for arguments in Java::OrgApacheMahoutCfTasteImplModelJdbc::PostgreSQLBooleanPrefJDBCDataModel" error.)

In fact, you likely don't need a connection pool at all; the right solution is to use the in-memory ReloadFromJDBCDataModel wrapper, which, as a side effect, reduces the number of database connections to one.
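A minimal sketch of that approach against the Mahout 0.9 Taste API, assuming a plain PGSimpleDataSource and Mahout's default taste_preferences table/column names; the connection details are placeholders:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.jdbc.PostgreSQLBooleanPrefJDBCDataModel;
import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.postgresql.ds.PGSimpleDataSource;

public class InMemoryModelFactory {
    public static DataModel build() throws TasteException {
        // A plain (non-pooling) data source is fine here: only one connection is used.
        PGSimpleDataSource ds = new PGSimpleDataSource();
        ds.setServerName("localhost");
        ds.setDatabaseName("recommender");
        ds.setUser("mahout");
        ds.setPassword("secret");

        // JDBC-backed model using Mahout's default taste_preferences layout.
        PostgreSQLBooleanPrefJDBCDataModel jdbcModel =
                new PostgreSQLBooleanPrefJDBCDataModel(ds);

        // Loads all preferences into memory once; after that, the pooling warning no longer matters.
        return new ReloadFromJDBCDataModel(jdbcModel);
    }
}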

Related

Neo4j TransactionMemoryLimit

I am running Neo4j (v4.1.5) community edition on a server node with 64GB RAM.
I set the heap size configuration as follows:
dbms.memory.heap.initial_size=31G
dbms.memory.heap.max_size=31G
During the ingestion via bolt, I got the following error:
{code: Neo.TransientError.General.TransactionMemoryLimit} {message:
Can't allocate extra 512 bytes due to exceeding memory limit;
used=2147483648, max=2147483648}
What I don't understand is that the max in the error message shows 2GB, while I've set the initial and max heap size to 31GB. Can someone help me understand how memory setting works in Neo4j?
It turned out that the default transaction state memory allocation for this version was OFF_HEAP, meaning that all transactions were executed off-heap with a 2 GB maximum. Adding the following setting in Neo4j resolved the issue:
dbms.tx_state.memory_allocation=ON_HEAP
I'm not sure why OFF_HEAP is the default when the Neo4j manual recommends the ON_HEAP setting:
When executing a transaction, Neo4j holds not yet committed data, the result, and intermediate states of the queries in memory. The size needed for this is very dependent on the nature of the usage of Neo4j. For example, long-running queries, or very complicated queries, are likely to require more memory. Some parts of the transactions can optionally be placed off-heap, but for the best performance, it is recommended to keep the default with everything on-heap.
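For reference, the relevant neo4j.conf lines end up as below. The commented-out alternative of raising the off-heap cap instead of switching the allocation is my assumption (setting name based on the 4.x defaults), not part of the fix above:

dbms.memory.heap.initial_size=31G
dbms.memory.heap.max_size=31G
# keep transaction state on the heap, as the manual recommends
dbms.tx_state.memory_allocation=ON_HEAP
# assumed alternative: stay OFF_HEAP but raise the 2 GB default cap
# dbms.tx_state.max_off_heap_memory=8G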

While creating a very large number of relationships neo4j keeps saying server is disconnected

I have a large number of relationships to create using Cypher and I keep getting the following error: Connection to server lost. Reconnecting..
I increased memory. While the query ran, Neo4j was using ~37GB of memory, with the rest going to the OS cache/buffer. Disk space and CPU usage seemed to be OK. The server keeps saying Connection to server lost. Reconnecting..
EXPLAIN
MATCH (r:Room),
(t:Thread)
WHERE EXISTS (r.unique_room_id) AND EXISTS (t.unique_room_id) AND r.unique_room_id=t.unique_room_id
CREATE (r)-[:PUBLISHED]->(t);
The expected result is to create millions of relationships. Any suggestions? Thank you! (The original question included a screenshot of the query plan showing how this query is executed.)
Make sure your server has correctly configured heap and page-cache.
You can use the apoc library to batch this operation:
Presuming you have an index on :Thread(unique_room_id) (in Neo4j 3.5: CREATE INDEX ON :Thread(unique_room_id)):
call apoc.periodic.iterate('
MATCH (r:Room)
WITH r, r.unique_room_id AS unique_room_id
MATCH (t:Thread) WHERE t.unique_room_id = unique_room_id
RETURN r, t
','CREATE (r)-[:PUBLISHED]->(t)', {batchSize:100000});
see: http://neo4j-contrib.github.io/neo4j-apoc-procedures/3.5/cypher-execution/commit-batching/

How to paginate using Orient 3.0 streams

According to the streaming example at http://orientdb.com/docs/3.0.x/java/Java-Query-API.html, we can use the Orient result set streaming API as follows:
ODatabaseDocument db;
...
String statement = "SELECT FROM V WHERE name = ? and surname = ?";
OResultSet rs = db.query(statement, "John", "Smith");
rs.stream().forEach(x -> System.out.println(x.getProperty("age")));
rs.close();
This is fine but too trivial - what if we need to keep the rs/stream around? We can't very well close the result set, because we want to reuse the stream on a subsequent user request in, say, a web application (in scenarios such as paging).
But on the question of keeping the streams "alive", the Orient user guide says:
OResultSet is implemented as a paginated structure, that holds some
iterators open during the iteration. This is true both in remote and
in embedded usage.
You should always invoke OResultSet.close() at the end of the
execution, to free resources.
OResultSet instances are automatically closed when you close the
ODatabase that returned them.
It is important to always close result sets, even when they are
converted to streams (after the stream is consumed).
Are there any best practices around this? As far as I can tell, we would need to:
1) Keep the Orient database connection open until the user "paging" session is done (which could be, say, 5-10 minutes). Only when the user says "done" can we close the result set and close the database connection. The Orient database connection (and whatever stream it generated) thus becomes "private" to a single application user. Moreover, since every user request can be activated on a different thread, that database connection would need to be made active on the current thread before using it.
2) Use the Java Stream API to navigate through arbitrary subsets of the "arbitrarily" large resultset. How would memory usage be handled by the underlying Orient db stream implementation? What determines the memory usage for using a "single rs/stream" and keeping it around for a while? What happens when we have thousands of open rs/streams especially if each user has their own "private" rs/stream they're looking at?
3) If a given Orient database connection can only be used on a single thread at a time (an Orient requirement), how do we handle multiple users with their own custom long-lived rs/streams/connections? Does this mean that if we have 1000 clients using their own private rs/stream (that they hang on to for, say, 5 minutes), then we have to keep 1000 database connections open (i.e. one per user/rs)? What are the limits around this? This style is obviously quite different from the more typical execute-query/close-rs pattern for quick user interaction that is stateless from one request to the next (naive paging that re-executes the query every time for a given range, which can get expensive).
P.S. I realize that once we get a Java stream, then we pretty much start just using the Java API itself - so I suppose that JOOQ streaming usage (for example) would be pretty similar to Orient streaming usage once you start getting into the Stream interfaces - I'm not familiar with the Java Streams API, but I suppose How to paginate a list of objects in Java 8? is a good place to start?
My conclusion is that streaming works well when scrolling through a large result set without consuming a large amount of memory or having to keep re-executing offset/limit queries (similar to forward only scrolling over JDBC resultsets). A typical use case is an export scenario.
For forward and backward paging, in Orient at least, you likely need to index one or more properties and run range queries against them - make sure the index is an SB-tree index, since that is the type that supports range queries.
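As a rough illustration of that range-query style (a sketch only - the Person class, the name property, and the page size are assumptions, and it presumes an SB-tree index on name):

import com.orientechnologies.orient.core.db.document.ODatabaseDocument;
import com.orientechnologies.orient.core.sql.executor.OResult;
import com.orientechnologies.orient.core.sql.executor.OResultSet;
import java.util.ArrayList;
import java.util.List;

public class KeysetPager {
    // Returns one page of results strictly after lastSeenName, in index order.
    public static List<OResult> nextPage(ODatabaseDocument db, String lastSeenName, int pageSize) {
        String sql = "SELECT FROM Person WHERE name > ? ORDER BY name LIMIT " + pageSize;
        List<OResult> page = new ArrayList<>();
        try (OResultSet rs = db.query(sql, lastSeenName)) {
            while (rs.hasNext()) {
                page.add(rs.next());
            }
        } // the result set is closed right away; the client only remembers lastSeenName
        return page;
    }
}

For the first page you can pass an empty string (or drop the WHERE clause); for backward paging, invert both the comparison and the sort order. Each request re-runs a short index range scan, so no result set or connection has to stay open between requests.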
FYI, Solr has a cursor mechanism which works pretty well for forward pagination on sorted results - and if you keep some simple state markers on the client you can also go back to results already encountered. Jumping to random pages is not supported by Solr cursors, but you can always re-sort/filter on some other criteria to move "useful" results to the top of the result set instead of deep paging (https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html).

Neo4j In memory configurations, multithreading, and slow writes

How do I improve performance when writing to Neo4j? I currently have Neo4j set up on a server and am running it in embedded mode. I believe my configuration stores all the content of my graph database in memory, based upon configurations I've found online:
neostore.nodestore.db.mapped_memory=0
neostore.relationship.db.mapped_memory=0
neostore.propertystore.db.mapped_memory=0
neostore.propertystore.db.strings.mapped_memory=0
neostore.propertystore.db.arrays.mapped_memory=0
neostore.propertystore.db.index.keys.mapped_memory=0
neostore.propertystore.db.index.mapped_memory=0
node_auto_indexing=true
node_keys_indexable=type,id
cache_type=strong
use_memory_mapped_buffers=false
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
Please let me know if this is incorrect. The problem I am encountering is that when I try to persist information to the graph database, the times are not very quick compared to our MySQL times for the same thing (e.g. adding 250 items takes about 3s, while in MySQL it takes 1s). I read online that having multiple indexes can slow down persisting data, so I am looking into that now to see if it is the culprit. But I just wanted to make sure that my configuration is in line for running the graph database in memory.
Second question on this topic: if my configuration is good and my database is indeed in memory, is there a way to optimize persisting data in case this isn't the silver bullet? If we run one thread against our test that executes this functionality, as opposed to 10 threads, the execution times seem to climb (e.g. thread 1 finishes in 1s, thread 2 in 2s, thread 3 in 3s, etc.). Is there some special multithreading configuration that I am missing to improve performance when multiple threads are hitting the database at the same time?
Neo4j version: 1.9.1-enterprise
My JVM configs are:
-Xms25G -Xmx25G -XX:+UseNUMA -XX:+UseSerialGC
My machine specs:
File system type: ext3
Your cache arguments are invalid.
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
These can only be used with the GCR cache. Setting the cache isn't going to put everything in memory for you at startup; you will have to write code to do that yourself. Something like this:
// graphDb is your GraphDatabaseService instance
GlobalGraphOperations ggo = GlobalGraphOperations.at(graphDb);
for (Node n : ggo.getAllNodes()) {
    // touch every property so it is pulled into the cache
    for (String propertyKey : n.getPropertyKeys()) {
        n.getProperty(propertyKey);
    }
    // iterate the relationships so they are cached as well
    for (Relationship relationship : n.getRelationships()) {
        // nothing to do; iterating is enough to load them
    }
}
Beware of the strong cache: if you have a lot of nodes/relationships, your cache will eventually become large, and performing GC against it will cause long pauses in your system.
My recommendation would be to use the memory-mapped files, as these are handled by the OS and live outside of heap space. They don't provide nearly the speed of the object cache, but they will provide a speed-up if you have to read from the Neo store.
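As a rough, illustrative sketch only - the sizes are placeholders that should be tuned to the sizes of your store files on disk, and note that the relationship store key is spelled neostore.relationshipstore.db.mapped_memory:

use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=1G
neostore.relationshipstore.db.mapped_memory=3G
neostore.propertystore.db.mapped_memory=2G
neostore.propertystore.db.strings.mapped_memory=1G
neostore.propertystore.db.arrays.mapped_memory=500M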

ActiveRecord bulk data, memory grows forever

I am using ActiveRecord to bulk migrate some data from a table in one database to a different table in another database. About 4 million rows.
I am using find_each to fetch in batches. Then I do a little bit of logic to each record fetched, and write it to a different db. I have tried both directly writing one-by-one, and using the nice activerecord-import gem to batch write.
However, in either case, my Ruby process's memory usage grows quite a bit throughout the life of the export/import. I would think that since I'm using find_each and getting batches of 1000, there should only be 1000 records in memory at a time... but no, each record I fetch seems to keep consuming memory forever, until the process is over.
Any ideas? Is ActiveRecord caching something somewhere that I can turn off?
Update, 17 Jan 2012:
I think I'm going to give up on this. I have tried:
* Making sure everything is wrapped in an ActiveRecord::Base.uncached do block
* Adding ActiveRecord::IdentityMap.enabled = false (I think that should turn off the identity map for the current thread, although it's not clearly documented, and I think the identity map isn't on by default in current Rails anyhow)
Neither of those seems to have much effect; memory is still leaking.
I then added a periodic explicit:
GC.start
That seems to slow down the rate of memory leak, but the memory leak still happens (eventually exhausting all memory and bombing).
So I think I'm giving up, and deciding it is not currently possible to use AR to read millions of rows from one db and insert them into another. Perhaps there is a memory leak in MySQL-specific code being used (that's my db), or somewhere else in AR, or who knows.
I would suggest queueing each unit of work into a Resque queue. I have found that Ruby has some quirks when iterating over large arrays like these.
Have one main thread that queues up the work by ID, then have multiple Resque workers hitting that queue to get the work done.
I have used this method on approx 300k records, so it would most likely scale to millions.
Change line #86 to bulk_queue = [], since bulk_queue.clear only sets the length of the array to 0, making it impossible for the GC to reclaim its contents.
