Neo4j import runtime issue - neo4j

I am trying to import a design with 230M nodes, 300M relationships, and 1.5B properties overall.
It takes nearly 5.5 hours to import the design, and I am wondering how to improve the runtime.
Looking at the messages from the Neo4j import, quite a bit of time is spent in the Relationship --> Relationship stage; I am not sure what it does there.
Any suggestions to improve the load time?
My run command is:
/home/neo4j-enterprise-3.3.2/bin/neo4j-admin import --nodes "./instances." --relationships:SIGN_OF "./sign." --relationships:RIN_OF "./rin.*" --id-type=INTEGER --database graph.db
My initial and max heap sizes are set to 32G.
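For reference, a minimal sketch of how the import tool's heap can be raised, assuming the stock neo4j-admin wrapper script, which reads the HEAP_SIZE environment variable; the extra flags mentioned in the comments are only available in newer releases and should be verified against neo4j-admin import --help:

# Sketch: give the import JVM a bigger heap without touching neo4j.conf.
# The neo4j-admin wrapper script honours HEAP_SIZE.
export HEAP_SIZE=32g   # or higher; the machine reports ~504 GB of RAM

# Re-run the same import command as above. Newer Neo4j releases also expose
# flags such as --max-memory / --high-io for the import tool; check
# `neo4j-admin import --help` for your version before relying on them.
/home/neo4j-enterprise-3.3.2/bin/neo4j-admin import \
    --nodes "./instances." \
    --relationships:SIGN_OF "./sign." \
    --relationships:RIN_OF "./rin.*" \
    --id-type=INTEGER \
    --database graph.db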
Instance header:
NodeId:ID,:IGNORE,:Label,Din:int,LGit:int,RGit:int,Signed:int,Cens,Type,:IGNORE,Val:Float
Signal Header:
:IGNORE,:IGNORE,LGit:int,RGit:int,Signed:int,:IGNORE,:START_ID,:END_ID
Rin header:
:START_ID,:END_ID,:IGNORE
Neo4j import output:
Available resources:
Total machine memory: 504.70 GB
Free machine memory: 88.71 GB
Max heap memory : 26.67 GB
Processors: 16
Configured max memory: 55.84 GB
Nodes, started 2018-04-09 17:52:36.028+0000
[>:|NODE:1.75 GB--------------|PROPERTIES(3)=====|LABEL |*v:87.95 MB/s(4)=====================] 234M ∆ 819K
Done in 4m 13s 984ms
Prepare node index, started 2018-04-09 17:56:50.351+0000
[*DETECT:2.62 GB------------------------------------------------------------------------------] 234M ∆71.2M30000
Done in 33s 546ms
Relationships, started 2018-04-09 17:57:23.935+0000
[>||PREPARE-----------------------------------|||*v:20.96 MB/s(16)============================] 303M ∆ 256K
Done in 7m 7s 922ms
Node Degrees, started 2018-04-09 18:04:37.914+0000
[*>(16)=============================================================================|CALCULATE] 303M ∆1.97M
Done in 1m 30s 566ms
Relationship --> Relationship 1-2/2, started 2018-04-09 18:06:08.951+0000
[*>------------------------------------------------------------------------------------------|] 303M ∆ 144K
Done in 2h 8m 4s 36ms
RelationshipGroup 1-2/2, started 2018-04-09 20:14:13.059+0000
[>:4.44 MB/s----------|*v:2.22 MB/s(2)========================================================] 186K ∆9.81K
Done in 2s 105ms
Node --> Relationship, started 2018-04-09 20:14:15.178+0000
[*>------------------------------------------------------------------------------------------|] 234M ∆76.4K
Done in 27m 53s 408ms
Relationship --> Relationship 1-2/2, started 2018-04-09 20:42:08.654+0000
[*>------------------------------------------------------------------------------------------|] 303M ∆36.0K
Done in 2h 33m 24s 201ms
Count groups, started 2018-04-09 23:15:33.152+0000
[*>(16)=======================================================================================] 186K ∆59.8K
Done in 3s 898ms
Gather, started 2018-04-09 23:15:41.513+0000
[>(6)===|*CACHE-------------------------------------------------------------------------------] 186K ∆ 186K
Done in 322ms
Write, started 2018-04-09 23:15:41.859+0000
[>:1.30 |*v:1.11 MB/s(16)=====================================================================] 186K ∆21.2K
Done in 4s 161ms
Node --> Group, started 2018-04-09 23:15:46.117+0000
[*FIRST---------------------------------------------------------------------------------------] 148K ∆1.09K
Done in 4m 8s 747ms
Node counts, started 2018-04-09 23:19:55.032+0000
[*>(16)===========================================================|COUNT:1.79 GB--------------] 234M ∆4.63M
Done in 3m 33s 201ms
Relationship counts, started 2018-04-09 23:23:28.254+0000
[*>(16)===================================================================|COUNT--------------] 303M ∆ 450K
Done in 1m 29s 457ms
IMPORT DONE in 5h 32m 23s 509ms.
Imported:
234425118 nodes
303627293 relationships
1496022710 properties
Peak memory usage: 2.69 GB

Related

Stardog on VM Linux Ubuntu - memory capacity

We are experiencing performance problems with Stardog requests (at least about 500,000 ms to get an answer). We followed the Debian-based systems installation described in the Stardog documentation and have a stardog service installed in our Ubuntu VM.
Azure machine: Standard D4s v3 (4 virtual processors, 16 GB memory)
Total VM memory = 16 GiB
We tested several JVM settings:
-Xms4g -Xmx4g -XX:MaxDirectMemorySize=8g
-Xms8g -Xmx8g -XX:MaxDirectMemorySize=8g
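Incidentally, the systemctl status output quoted below shows the JVM flags as -Xmx8g -Xms8g XX:MaxD (truncated), apparently without the leading dash on the MaxDirectMemorySize flag; if that is not just display truncation, the direct-memory limit is not actually being applied. For reference, a minimal sketch of how these settings are usually fed to the server, assuming a systemd-managed install that honours the documented STARDOG_SERVER_JAVA_ARGS variable (the override mechanism below is an assumption; adapt it to however your unit is set up):

# Sketch: the Stardog server JVM picks its flags up from STARDOG_SERVER_JAVA_ARGS.
# Add it to the unit's environment via a systemd override, then restart.
sudo systemctl edit stardog
#   [Service]
#   Environment="STARDOG_SERVER_JAVA_ARGS=-Xms8g -Xmx8g -XX:MaxDirectMemorySize=8g"
sudo systemctl daemon-reload
sudo systemctl restart stardog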
We also tried to upgrade the VM to a bigger machine, without success:
Azure: Standard D8s v3 - 8 virtual processors, 32 GB memory
Running systemctl status stardog on the machine with 32 GiB of memory,
we get:
stardog.service - Stardog Knowledge Graph
Loaded: loaded (/etc/systemd/system/stardog.service; enabled; vendor prese>
Active: active (running) since Tue 2023-01-17 15:41:40 UTC; 1min 35s ago
Docs: https://www.stardog.com/
Process: 797 ExecStart=/opt/stardog/stardog-server.sh start (code=exited, s>
Main PID: 969 (java)
Tasks: 76 (limit: 38516)
Memory: 1.9G
CGroup: /system.slice/stardog.service
└─969 java -Dstardog.home=/var/opt/stardog/ -Xmx8g -Xms8g XX:MaxD
stardog-admin server status:
Access Log Enabled : true
Access Log Type : text
Audit Log Enabled : true
Audit Log Type : text
Backup Storage Directory : .backup
CPU Load : 1.88 %
Connection Timeout : 10m
Export Storage Directory : .exports
Memory Heap : 305M (Max: 8.0G)
Memory Mode : DEFAULT{Starrocks.block_cache=20, Starrocks.dict_block_cache=10, Native.starrocks=70, Heap.dict_value=50, Starrocks.txn_block_cache=5, Heap.dict_index=50, Starrocks.untracked_memory=20, Starrocks.memtable=40, Starrocks.buffer_pool=5, Native.query=30}
Memory Query Blocks : 0B (Max: 5.7G)
Memory RSS : 4.3G
Named Graph Security : false
Platform Arch : amd64
Platform OS : Linux 5.15.0-1031-azure, Java 1.8.0_352
Query All Graphs : false
Query Timeout : 1h
Security Disabled : false
Stardog Home : /var/opt/stardog
Stardog Version : 8.1.1
Strict Parsing : true
Uptime : 2 hours 18 minutes 51 seconds
Knowing that only the Stardog server is installed in this VM, with an 8G JVM heap and 20G of direct memory for Java, is it normal to see 1.9G of memory used when no query is in progress,
and 4.1G while a query is running?
"databases.xxxx.queries.latency": {
"count": 7,
"max": 471.44218324400003,
"mean": 0.049260736982859085,
"min": 0.031328932000000004,
"p50": 0.048930366,
"p75": 0.048930366,
"p95": 0.048930366,
"p98": 0.048930366,
"p99": 0.048930366,
"p999": 0.048930366,
"stddev": 0.3961819852037625,
"m15_rate": 0.0016325388459502614,
"m1_rate": 0.0000015369791915358426,
"m5_rate": 0.0006317127755974434,
"mean_rate": 0.0032760240366080024,
"duration_units": "seconds",
"rate_units": "calls/second"
Of all your queries, the slowest took about 8 minutes to complete while the others completed very quickly. Best to identify that slow query and profile it.
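As a starting point, a minimal sketch of how to inspect the plan of the suspect query from the CLI (the exact subcommand is an assumption; verify with stardog help query):

# Sketch: print the query plan for the slow query so the expensive operators
# can be identified. Replace the database name and query file with your own.
stardog query explain myDatabase slow-query.sparql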

In EKS, Worker pods going offline abruptly with 'hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection'

Our Environment:
Jenkins version - Jenkins 2.319.1
Jenkins Master image : jenkins/jenkins:2.319.1-lts-alpine
Jenkins worker image: jenkins/inbound-agent:4.11-1-alpine
Installed plugins:
Kubernetes - 1.30.6
Kubernetes Client API - 5.4.1
Kubernetes Credentials Plugin - 0.9.0
JAVA version on master: openjdk 11.0.13
JAVA version on Agent/worker : openjdk 11.0.14
Hi team,
We are facing an issue in Jenkins where the agent disconnects (or goes offline) from the master while a job is still running on the agent/worker. We get the error below (highlighted) and have tried the things listed further down, but the issue is still not fully resolved. Jenkins is deployed on EKS.
Error:
5334535:2022-11-02 14:07:54.573+0000 [id=140290] INFO hudson.slaves.NodeProvisioner#update: worker-7j4x4 provisioning successfully completed. We have now 2 computer(s)
5334695:2022-11-02 14:07:54.675+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Created Pod: kubernetes done-jenkins/worker-7j4x4
5334828:2022-11-02 14:07:56.619+0000 [id=140291] INFO o.c.j.p.k.KubernetesLauncher#launch: Pod is running: kubernetes done-jenkins/worker-7j4x4
5334964-2022-11-02 14:07:58.650+0000 [id=140309] INFO h.TcpSlaveAgentListener$ConnectionHandler#run: Accepted JNLP4-connect connection #97 from /100.122.254.111:42648
5335123-2022-11-02 14:09:19.733+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5335275-2022-11-02 14:09:19.733+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5335409-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2608, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5335965-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 1 nodes assigned to this Jenkins instance, which we will check
5336139-2022-11-02 14:09:19.734+0000 [id=140536] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5336279-2022-11-02 14:09:19.734+0000 [id=140536] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5336438-groovy.lang.MissingPropertyException: No such property: envVar for class: groovy.lang.Binding
5336532- at groovy.lang.Binding.getVariable(Binding.java:63)
5336585- at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onGetProperty(SandboxInterceptor.java:271)
–
5394279-2022-11-02 15:09:19.733+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5394431-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5394565-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2620, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
5395121-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#loadNodeMap: We currently have 3 nodes assigned to this Jenkins instance, which we will check
5395295-2022-11-02 15:09:19.734+0000 [id=141899] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog check has been completed
5395435-2022-11-02 15:09:19.734+0000 [id=141899] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished DockerContainerWatchdog Asynchronous Periodic Work. 1 ms
5395594-2022-11-02 15:11:59.502+0000 [id=140320] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-254-111.eu-central-1.compute.internal/100.122.254.111:42648.
5395817-java.util.concurrent.TimeoutException: Ping started at 1667401679501 hasn't completed by 1667401919502
5395920- at hudson.remoting.PingThread.ping(PingThread.java:134)
5395977- at hudson.remoting.PingThread.run(PingThread.java:90)
5396032:2022-11-02 15:11:59.503+0000 [id=141914] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5049 for worker-7j4x4 terminated: java.nio.channels.ClosedChannelException
5396231-2022-11-02 15:12:35.579+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started Periodic background build discarder
5396368-2022-11-02 15:12:36.257+0000 [id=141933] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Finished Periodic background build discarder. 678 ms
5396514-2022-11-02 15:14:15.582+0000 [id=141422] INFO hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from ip-100-122-237-38.eu-central-1.compute.internal/100.122.237.38:55038.
5396735-java.util.concurrent.TimeoutException: Ping started at 1667401815582 hasn't completed by 1667402055582
5396838- at hudson.remoting.PingThread.ping(PingThread.java:134)
5396895- at hudson.remoting.PingThread.run(PingThread.java:90)
5396950-2022-11-02 15:14:15.584+0000 [id=141915] INFO j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting 5050 for worker-fjf1p terminated: java.nio.channels.ClosedChannelException
****5397149-2022-11-02 15:14:19.733+0000 [id=141950] INFO hudson.model.AsyncPeriodicWork#lambda$doRun$1: Started DockerContainerWatchdog Asynchronous Periodic Work
5397301-2022-11-02 15:14:19.733+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog#execute: Docker Container Watchdog has been triggered
5397435-2022-11-02 15:14:19.734+0000 [id=141950] INFO c.n.j.p.d.DockerContainerWatchdog$Statistics#writeStatisticsToLog: Watchdog Statistics: Number of overall executions: 2621, Executions with processing timeout: 0, Containers removed gracefully: 0, Containers removed with force: 0, Containers removal failed: 0, Nodes removed successfully: 0, Nodes removal failed: 0, Container removal average duration (gracefully): 0 ms, Container removal average duration (force): 0 ms, Average overall runtime of watchdog: 0 ms, Average runtime of container retrieval: 0 ms
Any suggestions or resolutions, please?
Things we have tried:
Increased idleMinutes to 180 from default
Verified that resources are sufficient as per graphana dashboard
Changed podRetention to onFailure from Never
Changed podRetention to Always from Never
Increased readTimeout
Increased connectTimeout
Increased slaveConnectTimeoutStr
Disabled the ping thread from the UI by disabling the "Response Time" checkbox under preventive node monitoring (see also the sketch after this list)
Increased activeDeadlineSeconds
Verified same java version on master and agent
Updated kubernetes and kubernetes API client plugins
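One more knob worth noting: the timeout in the stack traces (the ping "hasn't completed" after roughly 240,000 ms) matches the default pingTimeoutSeconds of hudson.slaves.ChannelPinger, which is tuned via system properties on the controller JVM rather than the "Response Time" node monitor. A minimal sketch, assuming the controller runs as a Deployment named jenkins in the done-jenkins namespace (both names are assumptions) and picks its JVM flags up from JAVA_OPTS:

# Sketch: raise the JNLP ping interval/timeout on the controller JVM.
# Merge these -D flags with any JAVA_OPTS you already set.
kubectl -n done-jenkins set env deployment/jenkins \
    JAVA_OPTS="-Dhudson.slaves.ChannelPinger.pingIntervalSeconds=600 -Dhudson.slaves.ChannelPinger.pingTimeoutSeconds=480"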
The expectation is that the worker/agent disconnects once its job has run successfully and terminates after the configured idleMinutes, but a few times it terminates while a job is still running on the agent.

Unused Passenger process stays alive and consumes server resources for a Rails 4 app

We have a Rails app that runs using Apache -> Passenger. At least once a week, our alerts that monitor server CPU and RAM start getting triggered on one or more of our app servers, and the root cause is that one or more of the Passenger processes are taking up a large chunk of the server CPU and RAM without actually serving any requests.
For example, when I run passenger-status on the server that triggers these alerts, I see this:
Version : 5.3.1
Date : 2022-06-03 22:00:13 +0000
Instance: (Apache/2.4.51 (Amazon) OpenSSL/1.0.2k-fips Phusion_Passenger/5.3.1)
----------- General information -----------
Max pool size : 12
App groups : 1
Processes : 9
Requests in top-level queue : 0
----------- Application groups -----------
Requests in queue: 0
* PID: 16915 Sessions: 1 Processed: 3636 Uptime: 3h 2m 30s
CPU: 5% Memory : 1764M Last used: 0s ago
* PID: 11275 Sessions: 0 Processed: 34 Uptime: 55m 24s
CPU: 45% Memory : 5720M Last used: 35m 43s ago
...
See how the second process hasn't been used for more than 35 minutes but is still taking up so much of the server's resources?
The only solution so far has been to manually kill the PID, which resolves the issue, but is there a way to automate this check?
I also realize that the Passenger version is old and can be upgraded (which I will get done soon), but I have seen this issue in multiple versions prior to the current one, so I am not sure whether an upgrade by itself is guaranteed to resolve it.
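Until the root cause is found, one common way to automate the manual kill is a small cron watchdog that parses the passenger-status output (same text format as shown above) and kills workers that are both long idle and oversized; Passenger's PassengerMaxRequests directive is a gentler option if the bloat grows with request count. A rough sketch, with the thresholds and the parsing being assumptions based only on the output format above:

#!/usr/bin/env bash
# Rough watchdog sketch: kill Passenger workers that have been idle for a long
# time yet still hold a lot of memory. Parsing assumes the passenger-status
# text layout shown above; test carefully before putting it in cron.
MEM_LIMIT_MB=3000     # kill if resident memory exceeds this many MB...
IDLE_LIMIT_MIN=30     # ...and the worker was last used more than this many minutes ago

passenger-status | awk -v mem="$MEM_LIMIT_MB" -v idle="$IDLE_LIMIT_MIN" '
  /\* PID:/ { pid = $3 }                        # "* PID: 11275  Sessions: 0 ..."
  /Last used:/ {
    mem_mb = $5; sub(/M$/, "", mem_mb)          # "Memory : 5720M" -> 5720
    idle_min = $8; sub(/m$/, "", idle_min)      # "35m 43s ago"    -> 35
    if ($8 ~ /h$/)       idle_min = 999         # idle for hours: definitely stale
    else if ($8 !~ /m$/) idle_min = 0           # only seconds idle: leave it alone
    if (mem_mb + 0 > mem && idle_min + 0 > idle) print pid
  }' | xargs -r kill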

XGBoost model failed due to "Closing connection _sid_af1c at exit"

We use an XGBoost model for regression prediction, with grid search for hyperparameter tuning.
We run this model on a 90 GB H2O cluster. The process had been running for over 1.2 years, but it suddenly stopped with "Closing connection _sid_af1c at exit".
The training data set has 800,000 rows; due to this error we decreased it to 500,000, but the same error occurred.
ntrees: 300, 400
depth: 8, 10
variables: 382
I have attached the H2O memory log and our application error log below. Could you please help us fix this issue?
----------------------------------------H2o Log [Start]----------------------
**We start H2O as a 2-node cluster, but the H2O log was created on only one node.**
INFO water.default: ----- H2O started -----
INFO water.default: Build git branch: master
INFO water.default: Build git hash: 0588cccd72a7dc1274a83c30c4ae4161b92d9911
INFO water.default: Build git describe: jenkins-master-5236-4-g0588ccc
INFO water.default: Build project version: 3.33.0.5237
INFO water.default: Build age: 1 year, 3 months and 17 days
INFO water.default: Built by: 'jenkins'
INFO water.default: Built on: '2020-10-27 19:21:29'
WARN water.default:
WARN water.default: *** Your H2O version is too old! Please download the latest version from http://h2o.ai/download/ ***
WARN water.default:
INFO water.default: Found H2O Core extensions: [XGBoost, KrbStandalone]
INFO water.default: Processed H2O arguments: [-flatfile, /usr/local/h2o/flatfile.txt, -port, 54321]
INFO water.default: Java availableProcessors: 20
INFO water.default: Java heap totalMemory: 962.5 MB
INFO water.default: Java heap maxMemory: 42.67 GB
INFO water.default: Java version: Java 1.8.0_262 (from Oracle Corporation)
INFO water.default: JVM launch parameters: [-Xmx48g]
INFO water.default: JVM process id: 83043#masterb.xxxxx.com
INFO water.default: OS version: Linux 3.10.0-1127.10.1.el7.x86_64 (amd64)
INFO water.default: Machine physical memory: 62.74 GB
INFO water.default: Machine locale: en_US
INFO water.default: X-h2o-cluster-id: 1644769990156
INFO water.default: User name: 'root'
INFO water.default: IPv6 stack selected: false
INFO water.default: Possible IP Address: ens192 (ens192), xxxxxxxxxxxxxxxxxxxx
INFO water.default: Possible IP Address: ens192 (ens192), xxxxxxxxxxx
INFO water.default: Possible IP Address: lo (lo), 0:0:0:0:0:0:0:1%lo
INFO water.default: Possible IP Address: lo (lo), 127.0.0.1
INFO water.default: H2O node running in unencrypted mode.
INFO water.default: Internal communication uses port: 54322
INFO water.default: Listening for HTTP and REST traffic on http://xxxxxxxxxxxx:54321/
INFO water.default: H2O cloud name: 'root' on /xxxxxxxxxxxx:54321, discovery address /xxxxxxxxxxxx:57653
INFO water.default: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
INFO water.default: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 root#xxxxxxxxxxxx'
INFO water.default: 2. Point your browser to http://localhost:55555
INFO water.default: Log dir: '/tmp/h2o-root/h2ologs'
INFO water.default: Cur dir: '/usr/local/h2o/h2o-3.33.0.5237'
INFO water.default: Subsystem for distributed import from HTTP/HTTPS successfully initialized
INFO water.default: HDFS subsystem successfully initialized
INFO water.default: S3 subsystem successfully initialized
INFO water.default: GCS subsystem successfully initialized
INFO water.default: Flow dir: '/root/h2oflows'
INFO water.default: Cloud of size 1 formed [/xxxxxxxxxxxx:54321]
INFO water.default: Registered parsers: [GUESS, ARFF, XLS, SVMLight, AVRO, PARQUET, CSV]
INFO water.default: XGBoost extension initialized
INFO water.default: KrbStandalone extension initialized
INFO water.default: Registered 2 core extensions in: 2632ms
INFO water.default: Registered H2O core extensions: [XGBoost, KrbStandalone]
INFO hex.tree.xgboost.XGBoostExtension: Found XGBoost backend with library: xgboost4j_gpu
INFO hex.tree.xgboost.XGBoostExtension: XGBoost supported backends: [WITH_GPU, WITH_OMP]
INFO water.default: Registered: 217 REST APIs in: 353ms
INFO water.default: Registered REST API extensions: [Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4]
INFO water.default: Registered: 291 schemas in 112ms
INFO water.default: H2O started in 4612ms
INFO water.default:
INFO water.default: Open H2O Flow in your web browser: http://xxxxxxxxxxxx:54321
INFO water.default:
INFO water.default: Cloud of size 2 formed [mastera.xxxxxxxxxxxx.com/xxxxxxxxxxxx:54321, masterb.xxxxxxxxxxxx.com/xxxxxxxxxxxx:54321]
INFO water.default: Locking cloud to new members, because water.rapids.Session$1
INFO hex.tree.xgboost.task.XGBoostUpdater: Initial Booster created, size=448
ERROR water.default: Got IO error when sending a batch of bytes:
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:468)
at water.H2ONode$SmallMessagesSendThread.sendBuffer(H2ONode.java:605)
at water.H2ONode$SmallMessagesSendThread.run(H2ONode.java:588)
----------------------------------------H2o Log [End]--------------------------------
----------------------------------------Application Log [Start]----------------------
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
Warning: Your H2O cluster version is too old (1 year, 3 months and 17 days)! Please download and install the latest version from http://h2o.ai/download/
-------------------------- ------------------------------------------------------------------
H2O_cluster_uptime: 19 mins 49 secs
H2O_cluster_timezone: Asia/Colombo
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.33.0.5237
H2O_cluster_version_age: 1 year, 3 months and 17 days !!!
H2O_cluster_name: root
H2O_cluster_total_nodes: 2
H2O_cluster_free_memory: 84.1 Gb
H2O_cluster_total_cores: 40
H2O_cluster_allowed_cores: 40
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.0 final
-------------------------- ------------------------------------------------------------------
-------------------------- ------------------------------------------------------------------
H2O_cluster_uptime: 19 mins 49 secs
H2O_cluster_timezone: Asia/Colombo
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.33.0.5237
H2O_cluster_version_age: 1 year, 3 months and 17 days !!!
H2O_cluster_name: root
H2O_cluster_total_nodes: 2
H2O_cluster_free_memory: 84.1 Gb
H2O_cluster_total_cores: 40
H2O_cluster_allowed_cores: 40
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.0 final
-------------------------- ------------------------------------------------------------------
release memory here...
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
Warning: Your H2O cluster version is too old (1 year, 3 months and 17 days)! Please download and install the latest version from http://h2o.ai/download/
-------------------------- ------------------------------------------------------------------
H2O_cluster_uptime: 19 mins 49 secs
H2O_cluster_timezone: Asia/Colombo
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.33.0.5237
H2O_cluster_version_age: 1 year, 3 months and 17 days !!!
H2O_cluster_name: root
H2O_cluster_total_nodes: 2
H2O_cluster_free_memory: 84.1 Gb
H2O_cluster_total_cores: 40
H2O_cluster_allowed_cores: 40
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.0 final
-------------------------- ------------------------------------------------------------------
Parse progress: |█████████████████████████████████████████████████████████| 100%
xgboost Grid Build progress: |████████Closing connection _sid_af1c at exit
H2O session _sid_af1c was not closed properly.
Closing connection _sid_9313 at exit
H2O session _sid_9313 was not closed properly.
----------------------------------------Application Log [End]----------------------
This typically means one of the nodes crashed. It can happen for many different reasons; memory is the most common one.
I see your machine has about 64 GB of physical memory and H2O is getting 48 GB of that. XGBoost runs in native memory, not in JVM memory. For XGBoost we recommend splitting the physical memory roughly 50-50 between H2O and XGBoost.
You are also running a development version of H2O (3.33); I suggest upgrading to the latest stable release.
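As a concrete illustration of that split, a minimal sketch, assuming the nodes are started directly with java -jar h2o.jar (which the -flatfile/-port arguments in the log suggest):

# Sketch: run this on every node listed in flatfile.txt, replacing the current
# -Xmx48g launch. Roughly half of the ~64 GB box goes to the JVM; the rest is
# left for native XGBoost memory and the OS.
java -Xms28g -Xmx28g -jar h2o.jar \
    -flatfile /usr/local/h2o/flatfile.txt \
    -port 54321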

How to improve Nginx, Rails, Passenger memory usage?

I currently have a Rails app set up on a DigitalOcean VPS (1 GB RAM) through Cloud 66. The problem is that the VPS's memory fills up with Passenger processes.
The output of passenger-status:
# passenger-status
Version : 4.0.45
Date : 2014-09-23 09:04:37 +0000
Instance: 1762
----------- General information -----------
Max pool size : 2
Processes : 2
Requests in top-level queue : 0
----------- Application groups -----------
/var/deploy/cityspotters/web_head/current#default:
App root: /var/deploy/cityspotters/web_head/current
Requests in queue: 0
* PID: 7675 Sessions: 0 Processed: 599 Uptime: 39m 35s
CPU: 1% Memory : 151M Last used: 1m 10s ago
* PID: 7686 Sessions: 0 Processed: 477 Uptime: 39m 34s
CPU: 1% Memory : 115M Last used: 10s ago
The max_pool_size seems to be configured correctly.
The output of passenger-memory-stats:
# passenger-memory-stats
Version: 4.0.45
Date : 2014-09-23 09:10:41 +0000
------------- Apache processes -------------
*** WARNING: The Apache executable cannot be found.
Please set the APXS2 environment variable to your 'apxs2' executable's filename, or set the HTTPD environment variable to your 'httpd' or 'apache2' executable's filename.
--------- Nginx processes ---------
PID PPID VMSize Private Name
-----------------------------------
1762 1 51.8 MB 0.4 MB nginx: master process /opt/nginx/sbin/nginx
7616 1762 53.0 MB 1.8 MB nginx: worker process
### Processes: 2
### Total private dirty RSS: 2.22 MB
----- Passenger processes -----
PID VMSize Private Name
-------------------------------
7597 218.3 MB 0.3 MB PassengerWatchdog
7600 565.7 MB 1.1 MB PassengerHelperAgent
7606 230.8 MB 1.0 MB PassengerLoggingAgent
7675 652.0 MB 151.7 MB Passenger RackApp: /var/deploy/cityspotters/web_head/current
7686 652.1 MB 116.7 MB Passenger RackApp: /var/deploy/cityspotters/web_head/current
### Processes: 5
### Total private dirty RSS: 270.82 MB
.. 2 Passenger RackApp processes, OK.
But when I use the htop command (screenshot not included here), there seem to be a lot of Passenger RackApp processes. We're also running Sidekiq with the default configuration.
New Relic Server also reports high memory usage (screenshot not included here).
I tried tuning Passenger settings, adding a load balancer and another server, but I honestly don't know what to do from here. How can I find out what's causing so much memory usage?
Update: I had to restart nginx because of some changes, and that seemed to free quite a lot of memory.
Press Shift-H to hide threads in htop; those aren't processes but threads within a process. The key column is RSS: you have two Passenger processes at 209 MB and 215 MB and one Sidekiq process at 154 MB.
Short answer: this is completely normal memory usage for a Rails app. 1 GB is simply a little small if you want multiple processes of each. I'd cut Passenger down to one process.
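If you go that route, a minimal sketch of the change, assuming the pool size is set in nginx.conf under /opt/nginx (the nginx binary path is taken from the passenger-memory-stats output above; the config path is an assumption):

# Sketch: cap Passenger at one application process and reload Nginx.
# Assumes passenger_max_pool_size is already present in nginx.conf (it is currently 2).
sudo sed -i 's/passenger_max_pool_size .*/passenger_max_pool_size 1;/' /opt/nginx/conf/nginx.conf
sudo /opt/nginx/sbin/nginx -s reload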
Does your application create child processes? If so, it's likely that those extra "Passenger RackApp" processes are not actually created by Phusion Passenger but by your own app. Double-check whether your app spawns child processes and whether you clean them up correctly, and whether any libraries you use do the same.
I see that you're using Sidekiq and you've configured 25 Sidekiq processes. Those are also eating a lot of memory: a Sidekiq process eats just as much memory as a Passenger RackApp process, because both of them load your entire application (including Rails) into memory. Try reducing the number of Sidekiq processes.
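If, as the other answer notes, those 25 htop entries are actually threads of a single Sidekiq process, the corresponding knob is Sidekiq's concurrency setting rather than a process count. A minimal sketch (this overwrites config/sidekiq.yml; merge with your existing options instead if you have any):

# Sketch: lower Sidekiq's thread count from the default of 25.
cat > config/sidekiq.yml <<'YAML'
:concurrency: 5
YAML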
