I have a hazelcast cluster with two machines.
The only object in the cluster is a map. Analysing the log files, I noticed that the health monitor reports a slow increase in memory consumption even though no new entries are being added to the map (see the sample log entries below).
Any ideas of what may be causing the memory increase?
<p>2015-09-16 10:45:49 INFO HealthMonitor:? - [10.11.173.129]:5903
[dev] [3.2.1] memory.used=97.6M, memory.free=30.4M,
memory.total=128.0M, memory.max=128.0M, memory.used/total=76.27%,
memory.used/max=76.27%, load.process=0.00%, load.system=1.00%,
load.systemAverage=3.00%, thread.count=96, thread.peakCount=107,
event.q.size=0, executor.q.async.size=0, executor.q.client.size=0,
executor.q.operation.size=0, executor.q.query.size=0,
executor.q.scheduled.size=0, executor.q.io.size=0,
executor.q.system.size=0, executor.q.operation.size=0,
executor.q.priorityOperation.size=0, executor.q.response.size=0,
operations.remote.size=1, operations.running.size=0, proxy.count=2,
clientEndpoint.count=0, connection.active.count=2,
connection.count=2</p>
<p>2015-09-16 10:46:02 INFO InternalPartitionService:? - [10.11.173.129]:5903 [dev] [3.2.1] Remaining migration tasks in queue = 51
2015-09-16 10:46:12 DEBUG TeleavisoIvrLoader:71 - Checking for new files...
2015-09-16 10:46:13 INFO InternalPartitionService:? - [10.11.173.129]:5903 [dev] [3.2.1] All migration tasks has been completed, queues are empty.
2015-09-16 10:46:19 INFO HealthMonitor:? - [10.11.173.129]:5903 [dev] [3.2.1]
memory.used=103.9M, memory.free=24.1M, memory.total=128.0M,
memory.max=128.0M, memory.used/total=81.21%, memory.used/max=81.21%,
load.process=0.00%, load.system=1.00%, load.systemAverage=2.00%,
thread.count=73, thread.peakCount=107, event.q.size=0,
executor.q.async.size=0, executor.q.client.size=0,
executor.q.operation.size=0, executor.q.query.size=0,
executor.q.scheduled.size=0, executor.q.io.size=0,
executor.q.system.size=0, executor.q.operation.size=0,
executor.q.priorityOperation.size=0, executor.q.response.size=0,
operations.remote.size=0, operations.running.size=0, proxy.count=2,
clientEndpoint.count=0, connection.active.count=2,
connection.count=2</p>
<p>2015-09-16 10:46:49 INFO HealthMonitor:? - [10.11.173.129]:5903
[dev] [3.2.1] memory.used=105.1M, memory.free=22.9M,
memory.total=128.0M, memory.max=128.0M, memory.used/total=82.11%,
memory.used/max=82.11%, load.process=0.00%, load.system=1.00%,
load.systemAverage=1.00%, thread.count=73, thread.peakCount=107,
event.q.size=0, executor.q.async.size=0, executor.q.client.size=0,
executor.q.operation.size=0, executor.q.query.size=0,
executor.q.scheduled.size=0, executor.q.io.size=0,
executor.q.system.size=0, executor.q.operation.size=0,
executor.q.priorityOperation.size=0, executor.q.response.size=0,
operations.remote.size=0, operations.running.size=0, proxy.count=2,
clientEndpoint.count=0, connection.active.count=2,
connection.count=2</p>
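To follow the trend without reading each entry by eye, the `memory.used` figures can be extracted from the HealthMonitor entries. This is only a minimal sketch, assuming the entries look like the samples above; the function name `memory_used_series` is made up for illustration.

```python
import re

# A HealthMonitor entry: date, time, "INFO HealthMonitor", then (possibly a few
# wrapped lines later) the memory.used=<number>M field.
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+INFO\s+HealthMonitor"
    r".*?memory\.used=([\d.]+)M",
    re.DOTALL,
)

def memory_used_series(log_text):
    """Return (timestamp, used_megabytes) pairs, one per HealthMonitor entry."""
    return [(f"{day} {time}", float(used)) for day, time, used in ENTRY.findall(log_text)]

# Two shortened entries copied from the log sample above.
sample = """2015-09-16 10:45:49 INFO HealthMonitor:? - [10.11.173.129]:5903
[dev] [3.2.1] memory.used=97.6M, memory.free=30.4M,
2015-09-16 10:46:19 INFO HealthMonitor:? - [10.11.173.129]:5903 [dev] [3.2.1]
memory.used=103.9M, memory.free=24.1M, memory.total=128.0M,
"""

for timestamp, used in memory_used_series(sample):
    print(timestamp, used)
```

Running this over a longer stretch of the log would show whether the used figure only ever climbs or also drops back from time to time.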
I have set up a simple standalone Spark cluster using Docker.
I have two Docker containers with Spark installed; one runs as the master and the other as a worker.
These two containers share a custom bridge network.
I have published the master container's web UI and submit ports and can successfully open the master web UI.
The worker container's web UI port is also published, and I can open it from my web browser.
The problem happens when I try to run pyspark.
From my machine (not inside any Docker container), I run the following Python script, but it never completes.
from pyspark import SparkContext, SparkConf

# Connect to the standalone master through the port published on the host.
sc = SparkContext("spark://localhost:10077", "test")
rdd = sc.parallelize([1, 2, 3])
for a in rdd.collect():
    print(a)
I checked the master and worker web UIs, and it seems that the worker node is continuously exiting and recreating an executor.
The executor's error looks like this:
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:424)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:413)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:444)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
... 4 more
Caused by: java.io.IOException: Failed to connect to devmachine/172.17.0.1:40314
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: devmachine/172.17.0.1:40314
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
I think the main cause is that the worker cannot connect back to the driver (devmachine) on the given port.
However, I have configured the master and worker containers to be able to resolve the "devmachine" IP address and have confirmed that the resolution works (I tested it separately by running a simple Flask app on devmachine and making an HTTP request from the worker container; it worked).
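The Flask test shows that the name resolves and that one port is reachable, but the executor connects to a different port (40314 in the trace above), and "Connection refused" usually means nothing was accepting connections on that particular address and port. A minimal sketch for probing that exact port from inside the worker container follows; the helper name `can_connect` is made up for illustration.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False

# Example call, to be run inside the worker container while the driver is up.
# The host name and port are taken from the stack trace above; unless the
# driver port is pinned, it is different on every run:
#   can_connect("devmachine", 40314)
```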
From what I understand, I am submitting my pyspark job in "client" mode (rather than cluster mode), so the driver runs on my current machine (devmachine). Since executors in client mode communicate with the driver directly, that explains why the worker is trying to connect to devmachine.
Any pointers on what I am doing wrong?
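One thing that may be worth trying is to pin the address and ports that the driver advertises, using the Spark properties `spark.driver.host`, `spark.driver.bindAddress`, `spark.driver.port` and `spark.blockManager.port`. The sketch below only builds the settings; the helper name and the port numbers 40000 and 40001 are arbitrary placeholders, and whether this resolves the error above has not been verified.

```python
def driver_network_conf(driver_host, driver_port, block_manager_port):
    """Build Spark settings that pin the driver's advertised address and ports."""
    return {
        # Address the executors are told to connect back to.
        "spark.driver.host": driver_host,
        # Local interface the driver listens on.
        "spark.driver.bindAddress": "0.0.0.0",
        # Fixed ports instead of the random defaults, so they can be opened explicitly.
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }

# Placeholder values: "devmachine" comes from the question, the ports are invented.
settings = driver_network_conf("devmachine", 40000, 40001)
for key, value in settings.items():
    print(f"{key}={value}")

# With pyspark installed, the same pairs could be applied like this:
#   conf = SparkConf().setMaster("spark://localhost:10077").setAppName("test")
#   conf.setAll(list(settings.items()))
#   sc = SparkContext(conf=conf)
```

The chosen host name must resolve, from inside the worker container, to an address on which the driver process is actually listening.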
Update 22.07.21
Okay, I eventually found a workaround that makes this simple Spark job work.
Since the original problem seemed to be caused, for some unknown reason, by a network communication problem between the driver (host machine) and the worker Docker container, I thought that submitting the Spark job from another Docker container attached to the same bridge network as the master and worker containers might solve it.
I prepared a Docker image on top of the image used for the master/worker, in which I only installed python3, pip, and pyspark, since the base image lacked these packages.
After that, I launched a container from this new image and made sure that it was connected to the custom bridge network used by the master and worker nodes.
I submitted the Python script (after changing the master URL inside the Python file to spark://master:7077) with the following command:
$ spark-submit t1.py
and it worked well!
Here was the output:
22/07/21 07:25:49 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/07/21 07:25:49 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/07/21 07:25:49 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ad4336e4-af0c-4100-852f-6b51f8803f94
22/07/21 07:25:49 INFO MemoryStore: MemoryStore started with capacity 434.4 MiB
22/07/21 07:25:49 INFO SparkEnv: Registering OutputCommitCoordinator
22/07/21 07:25:49 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/07/21 07:25:49 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://master:7077...
22/07/21 07:25:49 INFO TransportClientFactory: Successfully created connection to master/172.18.0.2:7077 after 27 ms (0 ms spent in bootstraps)
22/07/21 07:25:50 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20220721072549-0005
22/07/21 07:25:50 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220721072549-0005/0 on worker-20220721060838-172.18.0.3-45161 (172.18.0.3:45161) with 8 core(s)
22/07/21 07:25:50 INFO StandaloneSchedulerBackend: Granted executor ID app-20220721072549-0005/0 on hostPort 172.18.0.3:45161 with 8 core(s), 1024.0 MiB RAM
22/07/21 07:25:50 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37799.
22/07/21 07:25:50 INFO NettyBlockTransferService: Server created on 46dd0a966c6f:37799
22/07/21 07:25:50 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/07/21 07:25:50 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO BlockManagerMasterEndpoint: Registering block manager 46dd0a966c6f:37799 with 434.4 MiB RAM, BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 46dd0a966c6f, 37799, None)
22/07/21 07:25:50 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220721072549-0005/0 is now RUNNING
22/07/21 07:25:50 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
22/07/21 07:25:50 INFO SparkContext: Starting job: collect at /root/t1.py:10
22/07/21 07:25:50 INFO DAGScheduler: Got job 0 (collect at /root/t1.py:10) with 2 output partitions
22/07/21 07:25:50 INFO DAGScheduler: Final stage: ResultStage 0 (collect at /root/t1.py:10)
22/07/21 07:25:50 INFO DAGScheduler: Parents of final stage: List()
22/07/21 07:25:50 INFO DAGScheduler: Missing parents: List()
22/07/21 07:25:50 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274), which has no missing parents
22/07/21 07:25:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.9 KiB, free 434.4 MiB)
22/07/21 07:25:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1757.0 B, free 434.4 MiB)
22/07/21 07:25:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 46dd0a966c6f:37799 (size: 1757.0 B, free: 434.4 MiB)
22/07/21 07:25:50 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1513
22/07/21 07:25:50 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274) (first 15 tasks are for partitions Vector(0, 1))
22/07/21 07:25:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks resource profile 0
22/07/21 07:25:52 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.18.0.3:59260) with ID 0, ResourceProfileId 0
22/07/21 07:25:52 INFO BlockManagerMasterEndpoint: Registering block manager 172.18.0.3:33016 with 434.4 MiB RAM, BlockManagerId(0, 172.18.0.3, 33016, None)
22/07/21 07:25:52 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (172.18.0.3, executor 0, partition 0, PROCESS_LOCAL, 4464 bytes) taskResourceAssignments Map()
22/07/21 07:25:52 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1) (172.18.0.3, executor 0, partition 1, PROCESS_LOCAL, 4491 bytes) taskResourceAssignments Map()
22/07/21 07:25:52 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.18.0.3:33016 (size: 1757.0 B, free: 434.4 MiB)
22/07/21 07:25:52 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 456 ms on 172.18.0.3 (executor 0) (1/2)
22/07/21 07:25:52 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 446 ms on 172.18.0.3 (executor 0) (2/2)
22/07/21 07:25:52 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/07/21 07:25:52 INFO DAGScheduler: ResultStage 0 (collect at /root/t1.py:10) finished in 2.357 s
22/07/21 07:25:52 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
22/07/21 07:25:52 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
22/07/21 07:25:52 INFO DAGScheduler: Job 0 finished: collect at /root/t1.py:10, took 2.401307 s
1
2
3
22/07/21 07:25:53 INFO SparkContext: Invoking stop() from shutdown hook
22/07/21 07:25:53 INFO SparkUI: Stopped Spark web UI at http://46dd0a966c6f:4040
22/07/21 07:25:53 INFO StandaloneSchedulerBackend: Shutting down all executors
22/07/21 07:25:53 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
22/07/21 07:25:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/07/21 07:25:53 INFO MemoryStore: MemoryStore cleared
22/07/21 07:25:53 INFO BlockManager: BlockManager stopped
22/07/21 07:25:53 INFO BlockManagerMaster: BlockManagerMaster stopped
22/07/21 07:25:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/07/21 07:25:53 INFO SparkContext: Successfully stopped SparkContext
22/07/21 07:25:53 INFO ShutdownHookManager: Shutdown hook called
22/07/21 07:25:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-aca43d84-54ab-45a6-87ec-25a644286af0
22/07/21 07:25:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-aca43d84-54ab-45a6-87ec-25a644286af0/pyspark-c7c9f3aa-5810-405b-a375-28e8b3344f68
22/07/21 07:25:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-bfcd5ac6-e629-4fdb-8621-fc76d17ffed3
and here is the worker node's output for this job to compare with the original error log.
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1894)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:424)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:413)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:444)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:985)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:984)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:442)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
... 4 more
Caused by: java.io.IOException: Failed to connect to cl-aicrdev03/172.17.0.1:44731
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:288)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: cl-aicrdev03/172.17.0.1:44731
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:710)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)
It seems that it had no problem connecting to the driver this time.
But this still raises the question:
Why does this simple spark-submit work when the driver is a Docker container sharing the same bridge network with the master and worker,
while it doesn't work when the driver is outside of Docker, even though I have made sure (I think) that the worker container can reach the driver outside?
I have been trying to understand an issue I've had when running the roribio16/alpine-sqs Docker image on one of my machines. Whenever I run the image without specifying any other settings (docker run roribio16/alpine-sqs), the elasticmq process keeps crashing and being respawned:
[xxxx#yyyy ~]$ docker run roribio16/alpine-sqs
2021-05-29 15:48:41,216 INFO Included extra file "/etc/supervisor/conf.d/elasticmq.conf" during parsing
2021-05-29 15:48:41,216 INFO Included extra file "/etc/supervisor/conf.d/insight.conf" during parsing
2021-05-29 15:48:41,216 INFO Included extra file "/etc/supervisor/conf.d/sqs-init.conf" during parsing
2021-05-29 15:48:41,216 INFO Set uid to user 0 succeeded
2021-05-29 15:48:41,222 INFO RPC interface 'supervisor' initialized
2021-05-29 15:48:41,222 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2021-05-29 15:48:41,222 INFO supervisord started with pid 1
2021-05-29 15:48:42,225 INFO spawned: 'sqs-init' with pid 9
2021-05-29 15:48:42,229 INFO spawned: 'elasticmq' with pid 10
2021-05-29 15:48:42,230 INFO spawned: 'insight' with pid 11
cp: can't stat '/opt/custom/*.conf': No such file or directory
> sqs-insight#0.3.0 start /opt/sqs-insight
> node index.js
15:48:42.605 [main] INFO org.elasticmq.server.Main$ - Starting ElasticMQ server (0.15.0) ...
Loading config file from "/opt/sqs-insight/lib/../config/config_local.json"
15:48:42.929 [elasticmq-akka.actor.default-dispatcher-2] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
Unable to load queues for undefined
Config contains 0 queues.
library initialization failed - unable to allocate file descriptor table - out of memory
listening on port 9325
2021-05-29 15:48:43,233 INFO success: sqs-init entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:43,233 INFO success: elasticmq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:43,234 INFO success: insight entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:43,234 INFO exited: sqs-init (exit status 0; expected)
2021-05-29 15:48:44,318 INFO exited: elasticmq (terminated by SIGABRT (core dumped); not expected)
2021-05-29 15:48:45,322 INFO spawned: 'elasticmq' with pid 67
15:48:45.743 [main] INFO org.elasticmq.server.Main$ - Starting ElasticMQ server (0.15.0) ...
15:48:46.044 [elasticmq-akka.actor.default-dispatcher-2] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
library initialization failed - unable to allocate file descriptor table - out of memory
2021-05-29 15:48:47,223 INFO success: elasticmq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:47,389 INFO exited: elasticmq (terminated by SIGABRT (core dumped); not expected)
2021-05-29 15:48:48,393 INFO spawned: 'elasticmq' with pid 89
15:48:48.766 [main] INFO org.elasticmq.server.Main$ - Starting ElasticMQ server (0.15.0) ...
15:48:49.066 [elasticmq-akka.actor.default-dispatcher-3] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
library initialization failed - unable to allocate file descriptor table - out of memory
^C
2021-05-29 15:48:49,559 INFO success: elasticmq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:49,559 WARN received SIGINT indicating exit request
2021-05-29 15:48:49,559 INFO waiting for insight, elasticmq to die
2021-05-29 15:48:49,566 INFO stopped: insight (terminated by SIGTERM)
2021-05-29 15:48:50,431 INFO stopped: elasticmq (terminated by SIGABRT (core dumped))
After a bit of googling I found this post, where somebody had the same issue with a different image and got it running by setting ulimits on the docker run command. That also worked for me (docker run --ulimit nofile=122880:122880 roribio16/alpine-sqs).
I checked the ulimits set inside the container when I didn't use this configuration
docker exec -it ca bash
$ ulimit -a
and found that the nofile setting was ridiculously high. I assume this is what causes the container to run out of memory, for example if too many files are opened simultaneously. I don't have a particularly good understanding of how this works, though, so I would appreciate any clarification on that topic as well.
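For comparing the two machines, the same limit can also be read programmatically with Python's standard `resource` module. This is a minimal sketch, assuming Python is available wherever it is run; the function name `nofile_limits` is made up for illustration. The wording of the error ("unable to allocate file descriptor table") suggests that the table is sized from this limit rather than from the number of files actually open, but that is an assumption, not something confirmed here.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = nofile_limits()
for name, value in (("soft", soft), ("hard", hard)):
    # RLIM_INFINITY is the marker the resource module uses for "no limit".
    shown = "unlimited" if value == resource.RLIM_INFINITY else value
    print(f"nofile {name} limit: {shown}")
```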
Anyway, the point of that ramble is that I want to find out where the default Docker container ulimits are set, as I don't understand why they are so high on the machine I am using. I have another machine that does not have this problem.
I can find lots of ways to change the default limits, but there does not seem to be much information about where these limits get set in the first place. According to the Docker documentation, if custom values are not set, the ulimits should be inherited from my system, but as far as I can tell my system's nofile settings are much lower than what I'm seeing in the container.
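A shell's `ulimit` only shows the limits of that shell, so it may differ from the limits of the long-running process that actually starts the containers. On Linux, the limits of any running process are listed in `/proc/<pid>/limits`. The sketch below parses the "Max open files" row of that file; the function name is made up, the sample values are invented, and which process the container inherits from (for example the Docker daemon) is an assumption to verify.

```python
import re

def max_open_files(limits_text):
    """Extract the (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns are separated by runs of spaces: name, soft, hard, units.
            soft, hard = re.split(r"\s{2,}", line.strip())[1:3]
            return soft, hard
    return None

# Invented sample in the layout of /proc/<pid>/limits.
sample = (
    "Limit                     Soft Limit           Hard Limit           Units     \n"
    "Max cpu time              unlimited            unlimited            seconds   \n"
    "Max open files            1024                 524288               files     \n"
)
print(max_open_files(sample))
```

For a real process, the text would come from `open(f"/proc/{pid}/limits").read()`, where `pid` is the process ID of the daemon being inspected.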
(Both machines run manjaro linux however the one that doesn't have this issue is XFCE and the one that does is KDE).
I've been trying to restart neo4j after adding new data on an EC2 instance. I stopped the neo4j instance, then I called systemctl start neo4j, but when I call cypher-shell it says Connection refused, and connection to the browser port doesn't work anymore.
In the beginning I assumed it was a heap space problem, since looking at the debug.log it said there was a memory issue. I adjusted the heap space and cache settings in neo4j.conf as recommended by neo4j-admin memrec, but still neo4j won't start.
Then I assumed it was because my APOC package was outdated. My neo4j version is 3.5.6, but APOC was 3.5.0.3. I downloaded the latest 3.5.0.4 version, but still neo4j won't start.
At last I tried chmod 777 on every file in the data/database and plugin directories and the directories themselves, but still neo4j won't start.
What's strange is that when I try neo4j console after any of these attempts, both cypher-shell and the neo4j browser port work just fine. However, I would obviously prefer to be able to launch neo4j with systemctl.
Right now the only hint of error I can find in debug.log is the following:
2019-06-19 21:19:55.508+0000 INFO [o.n.i.d.DiagnosticsManager] Storage summary:
2019-06-19 21:19:55.508+0000 INFO [o.n.i.d.DiagnosticsManager] Total size of store: 3.07 GB
2019-06-19 21:19:55.509+0000 INFO [o.n.i.d.DiagnosticsManager] Total size of mapped files: 3.07 GB
2019-06-19 21:19:55.509+0000 INFO [o.n.i.d.DiagnosticsManager] --- STARTED diagnostics for KernelDiagnostics:StoreFiles END ---
2019-06-19 21:19:55.509+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Fulfilling of requirement 'Database available' makes database available.
2019-06-19 21:19:55.509+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database is ready.
2019-06-19 21:19:55.568+0000 INFO [o.n.k.i.DatabaseHealth] Database health set to OK
2019-06-19 21:19:56.198+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.s3.S3URLConnection` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: com/amazonaws/ClientConfiguration
2019-06-19 21:19:56.199+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.s3.S3Aws` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: com/amazonaws/auth/AWSCredentials
2019-06-19 21:19:56.200+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.s3.S3Aws$1` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: com/amazonaws/services/s3/model/S3ObjectInputStream
2019-06-19 21:19:56.207+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.hdfs.HDFSUtils$1` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: org/apache/hadoop/fs/FSDataInputStream
2019-06-19 21:19:56.208+0000 WARN [o.n.k.i.p.Procedures] Failed to load `apoc.util.hdfs.HDFSUtils` from plugin jar `/var/lib/neo4j/plugins/apoc-3.5.0.4-all.jar`: org/apache/hadoop/fs/FSDataOutputStream
...
...
...
2019-06-19 21:20:00.678+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutting down database.
2019-06-19 21:20:00.679+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-06-19 21:20:00.679+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database is unavailable.
2019-06-19 21:20:00.684+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Database shutdown" # txId: 1 checkpoint started...
2019-06-19 21:20:00.704+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Database shutdown" # txId: 1 checkpoint completed in 20ms
2019-06-19 21:20:00.705+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 0
2019-06-19 21:20:00.725+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics START ---
2019-06-19 21:20:00.725+0000 INFO [o.n.i.d.DiagnosticsManager] --- STOPPING diagnostics END ---
2019-06-19 21:20:00.725+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Shutdown started
2019-06-19 21:20:05.875+0000 INFO [o.n.g.f.m.e.CommunityEditionModule] No locking implementation specified, defaulting to 'community'
2019-06-19 21:20:06.080+0000 INFO [o.n.g.f.GraphDatabaseFacadeFactory] Creating database.
2019-06-19 21:20:06.154+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Requirement `Database available` makes database unavailable.
2019-06-19 21:20:06.156+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] Database is unavailable.
2019-06-19 21:20:06.183+0000 INFO [o.n.i.d.DiagnosticsManager] --- INITIALIZED diagnostics START ---
I think the warning isn't an issue, since it's just a warning and not an error or exception. Also it seems that the database just shuts down automatically, and then restarts, creating an infinite loop. This loop does not happen when I call neo4j console (all the warnings still exist in the logs). All my ports are default.
Any clue why this is happening? I've never encountered this error when I previously launched neo4j on this instance.
If it works with neo4j console but not with systemctl, you should check the permissions and ownership of the Neo4j folders.
I'm pretty sure that is where the problem lies: systemctl probably doesn't run Neo4j as the same user as you do.
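A quick way to test that suspicion is to list everything under the Neo4j directories that is not owned by the account the service runs as. This is a minimal sketch; the function name is made up, and both the service account name `neo4j` and the directory `/var/lib/neo4j` (taken from the plugin paths in the log) are assumptions to adapt.

```python
import os
import tempfile

def not_owned_by(root, uid):
    """List paths under root (root included) whose owner is not the given numeric uid."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk visits every directory as dirpath, so directories are covered too.
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return mismatched

# Self-contained demonstration on a throwaway directory created by this process.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.db"), "w").close()
    print(not_owned_by(tmp, os.geteuid()))

# Against a real installation (assumed account name and path):
#   import pwd
#   not_owned_by("/var/lib/neo4j", pwd.getpwnam("neo4j").pw_uid)
```

Files created while running neo4j console under a different account, such as root, would show up in this list.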
I am running neo4j on an EC2 instance. But for some reason it randomly shuts down from time to time. Is there a way to check the shutdown logs? And is there a way to automatically restart the server? I couldn't locate the log folder. But here's what my messages.log file looks like. This section covers the timeframe when the server went down (before 2015-04-13 05:39:59.084+0000) and when I manually restarted the server (at 2015-04-13 05:39:59.084+0000). You can see that there is no record of server issue or shutdown. Time frame before 2015-03-05 08:18:47.084+0000 contains info of the previous server restart.
2015-03-05 08:18:44.180+0000 INFO [o.n.s.m.Neo4jBrowserModule]: Mounted Neo4j Browser at [/browser]
2015-03-05 08:18:44.253+0000 INFO [o.n.s.w.Jetty9WebServer]: Mounting static content at [/webadmin] from [webadmin-html]
2015-03-05 08:18:44.311+0000 INFO [o.n.s.w.Jetty9WebServer]: Mounting static content at [/browser] from [browser]
2015-03-05 08:18:47.084+0000 INFO [o.n.s.CommunityNeoServer]: Server started on: http://0.0.0.0:7474/
2015-03-05 08:18:47.084+0000 INFO [o.n.s.CommunityNeoServer]: Remote interface ready and available at [http://0.0.0.0:7474/]
2015-03-05 08:18:47.084+0000 INFO [o.n.k.i.DiagnosticsManager]: --- SERVER STARTED END ---
2015-04-13 05:39:59.084+0000 INFO [o.n.s.CommunityNeoServer]: Setting startup timeout to: 120000ms based on -1
2015-04-13 05:39:59.265+0000 INFO [o.n.k.InternalAbstractGraphDatabase]: No locking implementation specified, defaulting to 'community'
2015-04-13 05:39:59.383+0000 INFO [o.n.k.i.DiagnosticsManager]: --- INITIALIZED diagnostics START ---
2015-04-13 05:39:59.384+0000 INFO [o.n.k.i.DiagnosticsManager]: Neo4j Kernel properties:
2015-04-13 05:39:59.389+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.propertystore.db.mapped_memory=78M
2015-04-13 05:39:59.389+0000 INFO [o.n.k.i.DiagnosticsManager]: neostore.nodestore.db.mapped_memory=21M
I'm playing with Neo4j high availability clustering. Whilst the documentation indicates a cluster requires at least 3 nodes, or 2 with an arbitrator, I'm wondering what the implications of running with only 2 nodes are?
If I set up a 3 node cluster and remove a node, I have no issues adding data. Likewise, if I set up the cluster with only 2 nodes I can still add data and functionality doesn't seem to be restricted. What limitations should I expect? For example, the following is the trace of a slave started in a 2 node cluster. Data can be added to the master with no issues, and can be queried.
2013-11-06 10:34:50.403+0000 INFO [Cluster] Attempting to join cluster of [127.0.0.1:5001, 127.0.0.1:5002]
2013-11-06 10:34:54.473+0000 INFO [Cluster] Joined cluster:Name:neo4j.ha Nodes:{1=cluster://127.0.0.1:5001, 2=cluster://127.0.0.1:5002} Roles:{coordinator=1}
2013-11-06 10:34:54.477+0000 INFO [Cluster] Instance 2 (this server) joined the cluster
2013-11-06 10:34:54.512+0000 INFO [Cluster] Instance 1 was elected as coordinator
2013-11-06 10:34:54.530+0000 INFO [Cluster] Instance 1 is available as master at ha://localhost:6363?serverId=1
2013-11-06 10:34:54.531+0000 INFO [Cluster] Instance 1 is available as backup at backup://localhost:6366
2013-11-06 10:34:54.537+0000 INFO [Cluster] ServerId 2, moving to slave for master ha://localhost:6363?serverId=1
2013-11-06 10:34:54.564+0000 INFO [Cluster] Checking store consistency with master
2013-11-06 10:34:54.620+0000 INFO [Cluster] The store does not represent the same database as master. Will remove and fetch a new one from master
2013-11-06 10:34:54.646+0000 INFO [Cluster] ServerId 2, moving to slave for master ha://localhost:6363?serverId=1
2013-11-06 10:34:54.658+0000 INFO [Cluster] Copying store from master
2013-11-06 10:34:54.687+0000 INFO [Cluster] Copying index/lucene-store.db
2013-11-06 10:34:54.688+0000 INFO [Cluster] Copied index/lucene-store.db
2013-11-06 10:34:54.688+0000 INFO [Cluster] Copying neostore.nodestore.db
2013-11-06 10:34:54.689+0000 INFO [Cluster] Copied neostore.nodestore.db
2013-11-06 10:34:54.689+0000 INFO [Cluster] Copying neostore.propertystore.db
2013-11-06 10:34:54.689+0000 INFO [Cluster] Copied neostore.propertystore.db
2013-11-06 10:34:54.689+0000 INFO [Cluster] Copying neostore.propertystore.db.arrays
2013-11-06 10:34:54.690+0000 INFO [Cluster] Copied neostore.propertystore.db.arrays
2013-11-06 10:34:54.690+0000 INFO [Cluster] Copying neostore.propertystore.db.index
2013-11-06 10:34:54.690+0000 INFO [Cluster] Copied neostore.propertystore.db.index
2013-11-06 10:34:54.690+0000 INFO [Cluster] Copying neostore.propertystore.db.index.keys
2013-11-06 10:34:54.691+0000 INFO [Cluster] Copied neostore.propertystore.db.index.keys
2013-11-06 10:34:54.691+0000 INFO [Cluster] Copying neostore.propertystore.db.strings
2013-11-06 10:34:54.691+0000 INFO [Cluster] Copied neostore.propertystore.db.strings
2013-11-06 10:34:54.691+0000 INFO [Cluster] Copying neostore.relationshipstore.db
2013-11-06 10:34:54.692+0000 INFO [Cluster] Copied neostore.relationshipstore.db
2013-11-06 10:34:54.692+0000 INFO [Cluster] Copying neostore.relationshiptypestore.db
2013-11-06 10:34:54.692+0000 INFO [Cluster] Copied neostore.relationshiptypestore.db
2013-11-06 10:34:54.692+0000 INFO [Cluster] Copying neostore.relationshiptypestore.db.names
2013-11-06 10:34:54.693+0000 INFO [Cluster] Copied neostore.relationshiptypestore.db.names
2013-11-06 10:34:54.693+0000 INFO [Cluster] Copying nioneo_logical.log.v0
2013-11-06 10:34:54.693+0000 INFO [Cluster] Copied nioneo_logical.log.v0
2013-11-06 10:34:54.693+0000 INFO [Cluster] Copying neostore
2013-11-06 10:34:54.694+0000 INFO [Cluster] Copied neostore
2013-11-06 10:34:54.694+0000 INFO [Cluster] Done, copied 12 files
2013-11-06 10:34:55.101+0000 INFO [Cluster] Finished copying store from master
2013-11-06 10:34:55.117+0000 INFO [Cluster] Checking store consistency with master
2013-11-06 10:34:55.123+0000 INFO [Cluster] Store is consistent
2013-11-06 10:34:55.124+0000 INFO [Cluster] Catching up with master
2013-11-06 10:34:55.125+0000 INFO [Cluster] Now consistent with master
2013-11-06 10:34:55.172+0000 INFO [Cluster] ServerId 2, successfully moved to slave for master ha://localhost:6363?serverId=1
2013-11-06 10:34:55.207+0000 INFO [Cluster] Instance 2 (this server) is available as slave at ha://localhost:6364?serverId=2
2013-11-06 10:34:55.261+0000 INFO [API] Successfully started database
2013-11-06 10:34:55.265+0000 INFO [Cluster] Database available for write transactions
2013-11-06 10:34:55.318+0000 INFO [API] Starting HTTP on port :8574 with 40 threads available
2013-11-06 10:34:55.614+0000 INFO [API] Enabling HTTPS on port :8575
2013-11-06 10:34:56.256+0000 INFO [API] Mounted REST API at: /db/manage/
2013-11-06 10:34:56.261+0000 INFO [API] Mounted discovery module at [/]
2013-11-06 10:34:56.341+0000 INFO [API] Loaded server plugin "CypherPlugin"
2013-11-06 10:34:56.344+0000 INFO [API] Loaded server plugin "GremlinPlugin"
2013-11-06 10:34:56.347+0000 INFO [API] Mounted REST API at [/db/data/]
2013-11-06 10:34:56.355+0000 INFO [API] Mounted management API at [/db/manage/]
2013-11-06 10:34:56.435+0000 INFO [API] Mounted webadmin at [/webadmin]
2013-11-06 10:34:56.477+0000 INFO [API] Mounting static content at [/webadmin] from [webadmin-html]
2013-11-06 10:34:57.923+0000 INFO [API] Remote interface ready and available at [http://localhost:8574/]
2013-11-06 10:35:52.829+0000 INFO [API] Available console sessions: SHELL: class org.neo4j.server.webadmin.console.ShellSessionCreator
CYPHER: class org.neo4j.server.webadmin.console.CypherSessionCreator
GREMLIN: class org.neo4j.server.webadmin.console.GremlinSessionCreator
Thanks
There are no implications in terms of Neo4j server functionality.
But in terms of high availability it is better to have more than 2 servers in the cluster.
If there is a network failure between the 2 nodes and they are running but can't see each other, they will both promote themselves to master.
This may result in problems reforming the cluster when the network recovers.
Adding a 3rd node ensures that only one of the 3 nodes can ever be master.
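The arithmetic behind the third node can be sketched as a strict-majority check. This is a generic illustration of majority voting, not Neo4j's actual election code; the function names has_quorum and sides_with_quorum are invented for the example.

```python
def has_quorum(visible_members, cluster_size):
    """True if a side that can see visible_members instances holds a strict majority."""
    return visible_members > cluster_size // 2


def sides_with_quorum(partition_sizes):
    """Given the sizes of each side of a network split, return the sides holding a majority."""
    cluster_size = sum(partition_sizes)
    return [size for size in partition_sizes if has_quorum(size, cluster_size)]


# 2 nodes split 1 + 1: neither side has a majority, so nothing breaks the tie.
print(sides_with_quorum([1, 1]))
# 3 nodes split 2 + 1: exactly one side has a majority and can safely hold the master.
print(sides_with_quorum([2, 1]))
```

With an odd number of voting members, any split leaves at most one side with a majority, which is why a third instance (or an arbitrator that only votes) removes the ambiguity.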