Dask 'distributed.comm.tcp - INFO - Connection closed before handshake completed' - dask

I deployed a Dask service to my K8s cluster with one scheduler, three workers, and one client connecting to the scheduler. When I connect to the scheduler (kubectl attach <my-scheduler-pod>), I get constant stdout messages consisting of:
distributed.comm.tcp - INFO - Connection closed before handshake completed
This shows up as four messages every six seconds or so, and the four messages arrive close together. As far as I can tell, this isn't adversely affecting anything -- my service is running -- but neither the message itself nor its constant repetition seems like a good thing.
What, if anything, should I do about this?

This happens because the Dask scheduler is running a different Dask version from the client. You can check with:
from distributed.versions import get_versions
get_versions()
I replicated the same issue locally. My Dask scheduler is running dask 2021.01.0, whereas my client is using 2021.03.0.
Dask scheduler:
{'host': {'python': '3.8.0.final.0', 'python-bits': 64, 'OS': 'Linux', 'OS-release': '4.14.209-160.339.amzn2.x86_64', 'machine': 'x86_64', 'processor': '', 'byteorder': 'little', 'LC_ALL': 'C.UTF-8', 'LANG': 'C.UTF-8'}, 'packages': {'python': '3.8.0.final.0', 'dask': '2021.01.0', 'distributed': '2021.01.0', 'msgpack': '1.0.0', 'cloudpickle': '1.6.0', 'tornado': '6.1', 'toolz': '0.11.1', 'numpy': '1.18.1', 'lz4': '3.1.1', 'blosc': '1.9.2'}}
Client:
{'host': {'python': '3.7.10.final.0', 'python-bits': 64, 'OS': 'Linux', 'OS-release': '4.14.214-160.339.amzn2.x86_64', 'machine': 'x86_64', 'processor': '', 'byteorder': 'little', 'LC_ALL': 'C.UTF-8', 'LANG': 'C.UTF-8'}, 'packages': {'python': '3.7.10.final.0', 'dask': '2021.03.0', 'distributed': '2021.03.0', 'msgpack': '1.0.2', 'cloudpickle': '1.6.0', 'tornado': '6.1', 'toolz': '0.11.1', 'numpy': None, 'lz4': None, 'blosc': None}}
Make sure both are running the same version.
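If you want the mismatch reported explicitly, the client can collect and compare the versions seen by itself, the scheduler, and the workers. A minimal sketch (the scheduler address below is a placeholder, not taken from the original deployment):
from dask.distributed import Client

# Placeholder address; use your scheduler's service address.
client = Client('tcp://my-scheduler:8786')

# Gather package versions from the client, the scheduler, and all workers.
# With check=True, a mismatch raises an error instead of only logging a warning.
versions = client.get_versions(check=True)
print(versions['scheduler']['packages']['dask'])
print(versions['client']['packages']['dask'])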

Related

Dask Gateway, set worker resources

I am trying to set the resources for workers as per the docs here, but on a setup that uses Dask Gateway. Specifically, I'd like to be able to follow the answer to this question, but using Dask Gateway.
I haven't been able to find a reference to worker resources in the ClusterConfig options, and I tried the following (as per this answer), which doesn't seem to work:
def set_resources(dask_worker):
    dask_worker.set_resources(task_limit=1)
    return dask_worker.available_resources, dask_worker.total_resources

client.run(set_resources)
# output from a 1 worker cluster
> {'tls://255.0.91.211:39302': ({}, {})}
# checking info known by scheduler
cluster.scheduler_info
> {'type': 'Scheduler',
   'id': 'Scheduler-410438c9-6b3a-494d-974a-52d9e9fss121',
   'address': 'tls://255.0.44.161:8786',
   'services': {'dashboard': 8787, 'gateway': 8788},
   'started': 1632434883.9022279,
   'workers': {'tls://255.0.92.232:39305': {'type': 'Worker',
     'id': 'dask-worker-f95c163cf41647c6a6d85da9efa9919b-wvnf6',
     'host': '255.0.91.211',
     'resources': {},  #### still {} empty dict
     'local_directory': '/home/jovyan/dask-worker-space/worker-ir8tpkz_',
     'name': 'dask-worker-f95c157cf41647c6a6d85da9efa9919b-wvnf6',
     'nthreads': 4,
     'memory_limit': 6952476672,
     'services': {'dashboard': 8787},
     'nanny': 'tls://255.0.92.232:40499'}}}
How can this be done, either when the cluster is created using the config.yaml of the helm chart (ideally, a field in the cluster options that a user can change!) for Dask Gateway, or after the workers are already up and running?
I've found that a way to specify this, at least on Kubernetes, is through KubeClusterConfig.worker_extra_container_config. This is my YAML snippet for a working configuration (specifically, this is in my config for the daskhub helm deploy):
dask-gateway:
  gateway:
    backend:
      worker:
        extraContainerConfig:
          env:
            - name: DASK_DISTRIBUTED__WORKER__RESOURCES__TASKSLOTS
              value: "1"
An option to set worker resources isn't exposed in the cluster options, and isn't explicitly exposed in KubeClusterConfig either. The specific format for the environment variable is described here. Resource environment variables need to be set before the dask worker process is started; I found that it doesn't work when I set them via KubeClusterConfig.environment.
Using this, I am able to run multithreaded numpy (np.dot) using MKL in a dask worker container that has been given 4 cores. I see 400% CPU usage and only one task assigned to each worker.
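As a quick sanity check (not from the original answer) that the environment variable was picked up, you can ask each worker for its resources from the client and then request that resource when submitting work. A rough sketch, assuming the TASKSLOTS resource name from the snippet above and a client already connected to the Gateway cluster (e.g. client = cluster.get_client()):
import numpy as np

# Confirm each worker now reports the TASKSLOTS resource set via the env var above.
print(client.run(lambda dask_worker: dask_worker.total_resources))
# expected to look something like {'tls://255.0.92.232:39305': {'TASKSLOTS': 1.0}}

def heavy_numpy_task(n):
    # multithreaded MKL work, similar in spirit to the np.dot test mentioned above
    a = np.random.rand(n, n)
    return float(np.dot(a, a).sum())

# Requesting one TASKSLOTS unit per task means the scheduler will run at most
# one of these tasks per worker at a time.
futures = client.map(heavy_numpy_task, [2000 + i for i in range(8)], resources={'TASKSLOTS': 1})
results = client.gather(futures)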

Deploying smart contract using truffle on private blockchain node on docker

I am facing problems deploying a smart contract on my private blockchain network. I created my blockchain network on three VMs (miners) using puppeth on a fourth VM (controller) by following the steps in this blog: https://medium.com/@collin.cusce/using-puppeth-to-manually-create-an-ethereum-proof-of-authority-clique-network-on-aws-ae0d7c906cce
Afterwards, I installed Truffle on one of the miner VMs and I initialized Truffle using the command:
truffle init
Then I wrote a simple hello-world smart contract, compiled it, and deployed it on the Truffle development blockchain, and it worked. However, when I try to deploy it to my private blockchain, I can't connect to the network.
The admin.nodeInfo command in the geth console returns the following output:
docker exec -it 954cd3955065 geth attach ipc:/root/.ethereum/geth.ipc
Welcome to the Geth JavaScript console!
instance: Geth/v1.9.25-unstable-ead81461-20201123/linux-amd64/go1.15.5
coinbase: 0xe8cc4bea2cfdfd14cddefe1141bedd109576b9a9
at block: 78558 (Tue Dec 01 2020 22:01:02 GMT+0000 (UTC))
datadir: /root/.ethereum
modules: admin:1.0 clique:1.0 debug:1.0 eth:1.0 miner:1.0 net:1.0 personal:1.0 rpc:1.0 txpool:1.0 web3:1.0
To exit, press ctrl-d
> admin.nodeInfo
{
enode: "enode://7206ca3c62f6db47e1230dcf14a765d4c9b4870a66470dbb21fcc5ed2fab2167d6bcc47eec8044c42037b3e6e0017aeb8ddfc3580471da54a6c7274a0c1fe46b@10.100.2.32:30303",
enr: "enr:-Je4QGXlVAESp8r2s1uHBJxoDLWQo8IvZsbe5sX2YRBb0un9Gdlt8nfDKQBR_j0lDPtaoCCuis4cJJlqtEHfa4tLO2EIg2V0aMfGhG5b-B6AgmlkgnY0gmlwhApkAiCJc2VjcDI1NmsxoQNyBso8YvbbR-EjDc8Up2XUybSHCmZHDbsh_MXtL6shZ4N0Y3CCdl-DdWRwgnZf",
id: "027a351994ac1b127df56180b6210310cc0164f17f1b12c167cb167c4ffaa122",
ip: "10.100.2.32",
listenAddr: "[::]:30303",
name: "Geth/v1.9.25-unstable-ead81461-20201123/linux-amd64/go1.15.5",
ports: {
discovery: 30303,
listener: 30303
},
protocols: {
eth: {
config: {
byzantiumBlock: 0,
chainId: 1515,
clique: {...},
constantinopleBlock: 0,
eip150Block: 0,
eip150Hash: "0x0000000000000000000000000000000000000000000000000000000000000000",
eip155Block: 0,
eip158Block: 0,
homesteadBlock: 0,
istanbulBlock: 0,
petersburgBlock: 0
},
difficulty: 98201,
genesis: "0x17f752387c901db617cf0594ecd2cb9811dfcd666318c2e0e7cb0239471da979",
head: "0xf8a37d0390558746901faa55463c127c553f02cf2d23ce0cb469fcd470c810f9",
network: 1515
}
}
}
I tried adding the network configuration in truffle-config.js like this:
devnet2: {
  host: "localhost",
  port: "30303", // port where the node is
  network_id: "*",
  from: 0x91cd7b879fefff34259d577a56d290b3315bf9b3 // Treats this network as if it was a public net. (default: false)
}
Then, when deploying using the command truffle deploy --network devnet2, I always get this error:
Compiling your contracts...
===========================
> Everything is up to date, there is nothing to compile.
/usr/local/lib/node_modules/truffle/build/webpack:/packages/provider/index.js:56
throw new Error(errorMessage);
^
Error: There was a timeout while attempting to connect to the network.
Check to see that your provider is valid.
If you have a slow internet connection, try configuring a longer timeout in your Truffle config. Use the networks[networkName].networkCheckTimeout property to do this.
at Timeout.setTimeout (/usr/local/lib/node_modules/truffle/build/webpack:/packages/provider/index.js:56:1)
at ontimeout (timers.js:436:11)
at tryOnTimeout (timers.js:300:5)
at listOnTimeout (timers.js:263:5)
at Timer.processTimers (timers.js:223:10)
I tried extending the timeout limit but it didn't work. I also tried using Web3 providers (HTTPProvider and IPCProvider) but without any luck (I can give more details if needed).
Any help is much appreciated because I spent a lot of time on it without getting anywhere. Unfortunately, I couldn't find anything on deploying smart contracts to a node that is running in Docker. If needed, I can gladly give more details about what I did.
I managed to run smart contracts on a private network, though not using Docker. A few things come to mind. Did you run a miner on your network? You will need a miner running so that the contract gets migrated. Did you make sure that the gas limit is met when running the contract? The miners will wait for the max gas limit to be reached before processing any request.
Did you already deploy the contract? In the migration scripts you either create a new migration script by bumping the version, or use the reset flag to run all migration scripts again.
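Not part of the answers above, but a quick way to sanity-check some of these points from Python is web3.py (method names as in recent web3.py releases), assuming the geth node also exposes an HTTP JSON-RPC endpoint and that this port is published from the Docker container; the URL below is a placeholder, and the JSON-RPC port is separate from the p2p port 30303 shown in admin.nodeInfo:
from web3 import Web3

# Placeholder endpoint; point this at the node's published JSON-RPC port.
w3 = Web3(Web3.HTTPProvider('http://localhost:8545'))

print('connected:', w3.is_connected())       # can tooling reach the node at all?
print('mining:', w3.eth.mining)              # is a sealer/miner actually running?
print('latest block:', w3.eth.block_number)  # is the chain advancing?
print('gas limit:', w3.eth.get_block('latest')['gasLimit'])  # enough gas for the deployment?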

NVIDIA Jetson Nano with Realsense 435i via Isaac - Camera not found

I posted about this over on the Isaac forums, but I'm listing it here for visibility as well. I am trying to get the Isaac RealSense examples working on a Jetson Nano with my 435i (firmware downgraded to 5.11.15 per the Isaac documentation), but so far I've been unable to. I've got a Nano flashed with JetPack 4.3 and have installed all dependencies on both the desktop and the Nano. realsense-viewer works fine, so I know the camera is functioning properly and is being detected by the Nano. However, when I run ./apps/samples/realsense_camera/realsense_camera it throws an error:
ERROR engine/alice/components/Codelet.cpp#229: Component 'camera/realsense' of type 'isaac::RealsenseCamera' reported FAILURE:
No device connected, please connect a RealSense device
ERROR engine/alice/backend/event_manager.cpp#42: Stopping node 'camera' because it reached status 'FAILURE'
I've attached the log of this output as well. I get the same error running locally on my desktop, but that's running through WSL so I was willing to write that off. Any suggestions would be greatly appreciated!
2020-06-15 17:18:20.620 INFO engine/alice/tools/websight.cpp#166: Loading websight...
2020-06-15 17:18:20.621 WARN engine/alice/backend/application_json_loader.cpp#174: This application does not have an explicit scheduler configuration. One will be autogenerated to the best of the system's abilities if possible.
2020-06-15 17:18:20.622 INFO engine/alice/backend/redis_backend.cpp#40: Successfully connected to Redis server.
2020-06-15 17:18:20.623 WARN engine/alice/backend/backend.cpp#201: This application does not have an execution group configuration. One will be autogenerated to the best of the systems abilities if possible.
2020-06-15 17:18:20.623 WARN engine/gems/scheduler/scheduler.cpp#337: No default execution groups specified. Attempting to create scheduler configuration for 4 remaining cores. This may be non optimal for the system and application.
2020-06-15 17:18:20.623 INFO engine/gems/scheduler/scheduler.cpp#290: Scheduler execution groups are:
2020-06-15 17:18:20.623 INFO engine/gems/scheduler/scheduler.cpp#299: __BlockerGroup__: Cores = [3], Workers = No
2020-06-15 17:18:20.623 INFO engine/gems/scheduler/scheduler.cpp#299: __WorkerGroup__: Cores = [0, 1, 2], Workers = Yes
2020-06-15 17:18:20.660 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/realsense/librealsense_module.so': Now has 45 components total
2020-06-15 17:18:20.679 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/rgbd_processing/librgbd_processing_module.so': Now has 51 components total
2020-06-15 17:18:20.696 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/sight/libsight_module.so': Now has 54 components total
2020-06-15 17:18:20.720 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/viewers/libviewers_module.so': Now has 83 components total
2020-06-15 17:18:20.720 DEBUG engine/alice/application.cpp#348: Loaded 83 components: isaac::RealsenseCamera, isaac::alice::BufferAllocatorReport, isaac::alice::ChannelMonitor, isaac::alice::CheckJetsonPerformanceModel, isaac::alice::CheckOperatingSystem, isaac::alice::Config, isaac::alice::ConfigBridge, isaac::alice::ConfigLoader, isaac::alice::Failsafe, isaac::alice::FailsafeHeartbeat, isaac::alice::InteractiveMarkersBridge, isaac::alice::JsonToProto, isaac::alice::LifecycleReport, isaac::alice::MessageLedger, isaac::alice::MessagePassingReport, isaac::alice::NodeStatistics, isaac::alice::Pose, isaac::alice::Pose2Comparer, isaac::alice::PoseFromFile, isaac::alice::PoseInitializer, isaac::alice::PoseMessageInjector, isaac::alice::PoseToFile, isaac::alice::PoseToMessage, isaac::alice::PoseTree, isaac::alice::PoseTreeJsonBridge, isaac::alice::PoseTreeRelink, isaac::alice::ProtoToJson, isaac::alice::PyCodelet, isaac::alice::Random, isaac::alice::Recorder, isaac::alice::RecorderBridge, isaac::alice::Replay, isaac::alice::ReplayBridge, isaac::alice::Scheduling, isaac::alice::Sight, isaac::alice::SightChannelStatus, isaac::alice::Subgraph, isaac::alice::Subprocess, isaac::alice::TcpPublisher, isaac::alice::TcpSubscriber, isaac::alice::Throttle, isaac::alice::TimeOffset, isaac::alice::TimeSynchronizer, isaac::alice::UdpPublisher, isaac::alice::UdpSubscriber, isaac::map::Map, isaac::map::ObstacleAtlas, isaac::map::OccupancyGridMapLayer, isaac::map::PolygonMapLayer, isaac::map::WaypointMapLayer, isaac::navigation::DistanceMap, isaac::navigation::NavigationMap, isaac::navigation::RangeScanModelClassic, isaac::navigation::RangeScanModelFlatloc, isaac::rgbd_processing::DepthEdges, isaac::rgbd_processing::DepthImageFlattening, isaac::rgbd_processing::DepthImageToPointCloud, isaac::rgbd_processing::DepthNormals, isaac::rgbd_processing::DepthPoints, isaac::rgbd_processing::FreespaceFromDepth, isaac::sight::AliceSight, isaac::sight::SightWidget, isaac::sight::WebsightServer, isaac::viewers::BinaryMapViewer, isaac::viewers::ColorCameraViewer, isaac::viewers::DepthCameraViewer, isaac::viewers::Detections3Viewer, isaac::viewers::DetectionsViewer, isaac::viewers::FiducialsViewer, isaac::viewers::FlatscanViewer, isaac::viewers::GoalViewer, isaac::viewers::ImageKeypointViewer, isaac::viewers::LidarViewer, isaac::viewers::MosaicViewer, isaac::viewers::ObjectViewer, isaac::viewers::OccupancyMapViewer, isaac::viewers::PointCloudViewer, isaac::viewers::PoseTrailViewer, isaac::viewers::SegmentationCameraViewer, isaac::viewers::SegmentationViewer, isaac::viewers::SkeletonViewer, isaac::viewers::TensorViewer, isaac::viewers::TrajectoryListViewer,
2020-06-15 17:18:20.723 WARN engine/alice/application.cpp#164: The function Application::findComponentByName is deprecated. Please use `getNodeComponentOrNull` instead. Note that the new method requires a node name instead of a component name. (argument: 'websight/isaac.sight.AliceSight')
2020-06-15 17:18:20.723 INFO engine/alice/application.cpp#255: Starting application 'realsense_camera' (instance UUID: 'e24992d0-af66-11ea-8bcf-c957460c567e') ...
2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#476: Launching 0 pre-start job(s)
2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#485: Replaying 0 pre-start event(s)
2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#476: Launching 0 pre-start job(s)
2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#485: Replaying 0 pre-start event(s)
2020-06-15 17:18:20.723 INFO engine/alice/backend/asio_backend.cpp#33: Starting ASIO service
2020-06-15 17:18:20.727 INFO packages/sight/WebsightServer.cpp#216: Sight webserver is loaded
2020-06-15 17:18:20.727 INFO packages/sight/WebsightServer.cpp#217: Please open Chrome Browser and navigate to http://<ip address>:3000
2020-06-15 17:18:20.727 WARN engine/alice/backend/codelet_canister.cpp#225: Codelet 'websight/isaac.sight.AliceSight' was not added to scheduler because no tick method is specified.
2020-06-15 17:18:20.728 WARN engine/alice/components/Codelet.cpp#53: Function deprecated. Set tick_period to the desired tick paramater
2020-06-15 17:18:20.728 WARN engine/alice/backend/codelet_canister.cpp#225: Codelet '_check_operating_system/isaac.alice.CheckOperatingSystem' was not added to scheduler because no tick method is specified.
2020-06-15 17:18:20.728 WARN engine/alice/components/Codelet.cpp#53: Function deprecated. Set tick_period to the desired tick paramater
2020-06-15 17:18:20.730 WARN engine/alice/components/Codelet.cpp#53: Function deprecated. Set tick_period to the desired tick paramater
2020-06-15 17:18:20.741 ERROR engine/alice/components/Codelet.cpp#229: Component 'camera/realsense' of type 'isaac::RealsenseCamera' reported FAILURE:
No device connected, please connect a RealSense device
2020-06-15 17:18:20.741 ERROR engine/alice/backend/event_manager.cpp#42: Stopping node 'camera' because it reached status 'FAILURE'
2020-06-15 17:18:20.743 WARN engine/alice/backend/codelet_canister.cpp#225: Codelet 'camera/realsense' was not added to scheduler because no tick method is specified.
2020-06-15 17:18:21.278 INFO packages/sight/WebsightServer.cpp#113: Server connected / 1
2020-06-15 17:18:30.723 INFO engine/alice/backend/allocator_backend.cpp#57: Optimized memory CPU allocator.
2020-06-15 17:18:30.724 INFO engine/alice/backend/allocator_backend.cpp#66: Optimized memory CUDA allocator.

Spark executor sends result to a random port though all the ports are explicitly set up

I am trying to run a Spark job with PySpark through a Jupyter notebook running in Docker. The workers are located on separate machines in the same network. I am performing a take operation on an RDD:
data.take(number_of_elements)
When number_of_elements is 2000 everything works fine. When it is 20000 an exception occurs. From my point of view it breaks when the size of the result exceeds 2GB (or so it seems to me). The idea about 2GB comes from the fact that Spark can send results smaller than 2GB in one block, and when the result is bigger than 2GB another mechanism kicks in and something breaks there (see here). Here is the exception from the executor log:
19/11/05 10:27:14 INFO CodeGenerator: Code generated in 205.7623 ms
19/11/05 10:27:40 INFO PythonRunner: Times: total = 25421, boot = 3, init = 1751, finish = 23667
19/11/05 10:27:42 INFO MemoryStore: Block taskresult_4 stored as bytes in memory (estimated size 927.7 MB, free 6.4 GB)
19/11/05 10:27:42 INFO Executor: Finished task 0.0 in stage 3.0 (TID 4). 972788748 bytes result sent via BlockManager)
19/11/05 10:27:49 ERROR TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1585998572000, chunkIndex=0}, buffer=org.apache.spark.storage.BlockManagerManagedBuffer@4399ad49} to /10.0.0.9:56222; closing connection
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.spark.util.io.ChunkedByteBufferFileRegion.transferTo(ChunkedByteBufferFileRegion.scala:64)
at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
at io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
As we can see from the log, the executor tries to send the result to 10.0.0.9:56222. It fails because the port is not opened in Docker Compose. 10.0.0.9 is the IP address of the master node, but port 56222 is random, even though I explicitly set up all the ports I could find in the documentation to disable random port selection:
spark = SparkSession.builder\
    .master('spark://spark.cyber.com:7077')\
    .appName('My App')\
    .config('spark.task.maxFailures', '16')\
    .config('spark.driver.port', '20002')\
    .config('spark.driver.host', 'spark.cyber.com')\
    .config('spark.driver.bindAddress', '0.0.0.0')\
    .config('spark.blockManager.port', '6060')\
    .config('spark.driver.blockManager.port', '6060')\
    .config('spark.shuffle.service.port', '7070')\
    .config('spark.driver.maxResultSize', '14g')\
    .getOrCreate()
I mapped these ports with docker compose:
version: "3"
services:
  jupyter:
    image: jupyter/pyspark-notebook:latest
    ports:
      - "4040-4050:4040-4050"
      - "6060:6060"
      - "7070:7070"
      - "8888:8888"
      - "20000-20010:20000-20010"
You should probably configure your Spark driver memory to follow your Docker container memory settings.
I added
.config('spark.driver.memory', '14g')
as @ML_TN proposed, and everything works now.
From my point of view it is strange that the memory setting affects the ports that Spark uses.
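For completeness, here is a sketch of the session builder from the question with the driver-memory setting added; all values are taken from the post itself:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('spark://spark.cyber.com:7077')
    .appName('My App')
    .config('spark.task.maxFailures', '16')
    .config('spark.driver.port', '20002')
    .config('spark.driver.host', 'spark.cyber.com')
    .config('spark.driver.bindAddress', '0.0.0.0')
    .config('spark.blockManager.port', '6060')
    .config('spark.driver.blockManager.port', '6060')
    .config('spark.shuffle.service.port', '7070')
    .config('spark.driver.maxResultSize', '14g')
    # the setting that resolved the timeouts above
    .config('spark.driver.memory', '14g')
    .getOrCreate()
)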

How to debug/fix random occurring Redis::TimeoutError?

I have a Rails app running that uses Redis quite a lot. However, I'm seeing quite a few Redis::TimeoutError exceptions occurring here and there, with no pattern to the circumstances. They occur both in the web app and in the background jobs (which are processed using Sidekiq), not often but from time to time.
Now I have no idea how to track down the root cause of this and hence no idea how to fix it.
Here is a little background on my setup:
The Redis instance is running on a separate physical server, which is connected to both my web server and my background server over a private local 1 Gbit network. All servers are running Ubuntu 12.04. The Redis version is 2.6.10. I'm connecting from my Rails app (which is 3.2) using an initializer like so:
require 'redis'
require 'redis/objects'
REDIS = Redis.new(:url => APP_CONFIG['REDIS_URL'])
Redis.current = REDIS
This is the output of redis-cli INFO:
# Server
redis_version:2.6.10
redis_git_sha1:00000000
redis_git_dirty:0
redis_mode:standalone
os:Linux 3.2.0-38-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:28475
run_id:d89bbb1b81d3169c4228cf23c0988ae437d496a1
tcp_port:6379
uptime_in_seconds:14913365
uptime_in_days:172
lru_clock:1507056
# Clients
connected_clients:233
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:19
# Memory
used_memory:801637360
used_memory_human:764.50M
used_memory_rss:594706432
used_memory_peak:4295394784
used_memory_peak_human:4.00G
used_memory_lua:31744
mem_fragmentation_ratio:0.74
mem_allocator:jemalloc-3.3.0
# Persistence
loading:0
rdb_changes_since_last_save:23166
rdb_bgsave_in_progress:0
rdb_last_save_time:1378219310
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:4
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
# Stats
total_connections_received:932395
total_commands_processed:3088408103
instantaneous_ops_per_sec:837
rejected_connections:0
expired_keys:31428
evicted_keys:3007
keyspace_hits:124093049
keyspace_misses:53060192
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:17651
# Replication
role:master
connected_slaves:1
slave0:192.168.0.2,6379,online
# CPU
used_cpu_sys:54000.21
used_cpu_user:73692.52
used_cpu_sys_children:36229.79
used_cpu_user_children:420655.84
# Keyspace
db0:keys=1498962,expires=1310
In my redis config I have the following set:
daemonize yes
pidfile /var/run/redis/redis-server.pid
timeout 0
loglevel notice
databases 1
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
maxclients 1000
maxmemory 4GB
maxmemory-policy volatile-lru
appendonly no
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
That could come from many issues:
because you use the SAVE command (it is set up in your conf), generating a lot of I/O and hammering the server, especially if you use EBS volumes on Amazon;
because you have a Redis slave (same as before, doing SAVE before mirroring);
because you use KEYS *, which is very slow when there are a lot of keys.
Try the SLOWLOG command on the Redis server to see if there are any slow queries.
Write some logs when the TimeoutError happens, to see whether the offending Redis command shows up in the slow log.
Also adjust the timeout setting on the client side.
It might be a problem on the client side if the server performs normally. Each Redis client instance (not the server) also has a timeout setting, and the default setting is very short, something like a few milliseconds. So if the server does not respond within that time, a Redis::TimeoutError will be raised by the client.
The first thing you can try is to set a longer timeout value and see if things get better.
redis_url = 'redis://user:password@host:port/'
redis = Redis.connect(:url => redis_url, :timeout => 0.7)
Even with a longer timeout setting there is no guarantee that timeouts won't happen, but then it would be a problem with the design of your system.
Are you rolling your own code to connect to Redis, or just letting Sidekiq handle it? I think you should design your connection code to reconnect if the connection has been lost. You can rescue Redis::BaseConnectionError and reconnect.
