Flink Statefun netty DisconnectedException

Flink Statefun netty DisconnectedException - docker

I am trying to run a flink statefun (version 3.2.0) application on my local machine using docker, with a single task manager and a single job manager. The application is a pipeline of multiple services that communicate to each other via sending messages through Kafka to HTTP function endpoints using aiohttp with gunicorn. At the beginning of the pipeline is a service that pulls results from Amazon S3 and sends them to the rest of pipeline, at a rate of about 8000-10000 requests per minute.
When I run it, it at first runs successfully, but looking at the docker logs for the flink worker (task manager) container, I repeatedly see these warnings:
2022-03-18 17:35:43,315 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(analytics-transformer, dispatch, 77ce0dcb-347c-4c03-bc32-f7ebb734b930), batchSize=1, totalSizeInBytes=1323, numberOfStates=0)
org.apache.flink.statefun.flink.core.nettyclient.exceptions.DisconnectedException: Disconnected
18:25:27,594 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(web, statefun, 82936819-b3d9-4a24-b4eb-81a189d6306c), batchSize=1, totalSizeInBytes=1434, numberOfStates=0)
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
Eventually I see this warning as well:
2022-03-18 18:06:44,848 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(web, statefun, f004409f-77be-433c-8ab1-ae5f9dad605c), batchSize=1, totalSizeInBytes=1172, numberOfStates=0)
java.lang.IllegalStateException: FixedChannelPool was closed
And after some time the flink master fails due to request timeout and has to restart:
org.apache.flink.statefun.flink.core.functions.StatefulFunctionInvocationException: An error occurred when attempting to invoke function FunctionType(analytics-transformer, dispatch).
at org.apache.flink.statefun.flink.core.functions.StatefulFunction.receive(StatefulFunction.java:50) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.ReusableContext.apply(ReusableContext.java:74) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.LocalFunctionGroup.processNextEnvelope(LocalFunctionGroup.java:60) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.Reductions.processEnvelopes(Reductions.java:164) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.AsyncSink.drainOnOperatorThread(AsyncSink.java:119) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:338) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:324) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:201) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) ~[flink-dist_2.12-1.14.3.jar:1.14.3]
at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.lang.IllegalStateException: Failure forwarding a message to a remote function Address(analytics-transformer, dispatch, 77d07eb3-f499-4265-a456-b0f75d738830)
at org.apache.flink.statefun.flink.core.reqreply.RequestReplyFunction.onAsyncResult(RequestReplyFunction.java:170) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.reqreply.RequestReplyFunction.invoke(RequestReplyFunction.java:124) ~[statefun-flink-core.jar:3.2.0]
at org.apache.flink.statefun.flink.core.functions.StatefulFunction.receive(StatefulFunction.java:48) ~[statefun-flink-core.jar:3.2.0]
... 16 more
Caused by: org.apache.flink.statefun.flink.core.nettyclient.exceptions.RequestTimeoutException
I am guessing this is a load issue due to the number of incoming requests, where the worker is unable to handle them all. This is what I have configured for each of the HTTP function endpoints in the module.yaml:
spec:
functions: <function>
urlPathTemplate: <url>
transport:
type: io.statefun.transports.v1/async
call: 15min
connect: 15min
pool_ttl: 45s
pool_size: 1024
payload_max_bytes: 33554432
I find that decreasing the pool size to a small value like 20 reduces the number of warnings, but then later on I see this warning a lot:
2022-03-18 15:44:52,566 WARN org.apache.flink.statefun.flink.core.nettyclient.NettyRequest [] - Exception caught while trying to deliver a message: (attempt #0)ToFunctionRequestSummary(address=Address(analytics-transformer, dispatch, 7facef98-b659-442e-846f-4e4d45559555), batchSize=1, totalSizeInBytes=739, numberOfStates=0)
org.apache.flink.shaded.netty4.io.netty.channel.pool.FixedChannelPool$AcquireTimeoutException: Acquire operation took longer then configured maximum time
which I'm assuming means that the connections were not able to form before the timeout due to the small pool size compared to the large number of requests.
Here is the flink-conf.yaml:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is the base for the Apache Flink configuration
statefun.flink-job-name: Statefun Application
#==============================================================================
# Configurations strictly required by Stateful Functions. Do not change.
#==============================================================================
classloader.parent-first-patterns.additional: org.apache.flink.statefun;org.apache.kafka;com.google.protobuf
#==============================================================================
# Fault tolerance, checkpointing and recovery.
# For more related configuration options, please see: https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#fault-tolerance
#==============================================================================
# Uncomment the below to enable checkpointing for your application
#execution.checkpointing.mode: EXACTLY_ONCE
#execution.checkpointing.interval: 5sec
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 1sec
state.backend.local-recovery: true
state.backend: rocksdb
state.backend.rocksdb.timer-service.factory: ROCKSDB
state.backend.rocksdb.localdir: /local/state/rocksdb
state.backend.rocksdb.memory.partitioned-index-filters: true
state.backend.rocksdb.checkpoint.transfer.thread.num: 8
state.backend.rocksdb.thread.num: 4
state.checkpoints.dir: file:///checkpoint-dir
state.backend.incremental: true
taskmanager.state.local.root-dirs: file:///local/state/recovery
#==============================================================================
# Recommended memory configurations. Users may change according to their needs.
#==============================================================================
jobmanager.memory.process.size: 1g
taskmanager.memory.process.size: 4g
#==============================================================================
# Support easy upgrades as the module.yaml file updates
#==============================================================================
pipeline.auto-generate-uids: false
execution.savepoint.ignore-unclaimed-state: true
statefun.async.max-per-task: 163840
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.interval: 5sec
I have also tried increasing the number of gunicorn workers and setting taskmanager.network.netty.server.numThreads to 100 in the flink-conf.yaml, but this does not seem to fix the issue.

Related

Apache Flume agent not starting but not showing error

I'm attempting to run an Apache Flume agent from an AWS EC2 cluster but when I start the agent, it neither starts nor throws an obvious error.
I'm just starting with the simple example from Apache's documentation.
When I run:
ubuntu#ip-172-31-41-5:~/Flume$ ./bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console
The console output is the following:
Info: Sourcing environment configuration script /home/ubuntu/Flume/conf/flume-env.sh
Info: Including HBASE libraries found via (/home/ubuntu/hbase-2.4.4/bin/hbase) for HBASE access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx20m -Dflume.root.logger=DEBUG,console -Dflume.root.logger=DEBUG,console -cp '/home/ubuntu/Flume/conf:/home/ubuntu/Flume/lib/*:/home/ubuntu/hbase-2.4.4/conf:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/home/ubuntu/hbase-2.4.4:/home/ubuntu/hbase-2.4.4/lib/shaded-clients/hbase-shaded-client-2.4.4.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/audience-annotations-0.5.0.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/commons-logging-1.2.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/log4j-1.2.17.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/slf4j-api-1.7.30.jar:/home/ubuntu/hbase-2.4.4/conf:/lib/*' -Djava.library.path=:/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib org.apache.flume.node.Application --conf-file example.conf --name a1 ./bin/flume-ng agent --conf-file example.conf --name a1
The agent doesn't throw an error but never gets further than this. I have also tried some variations including --conf-file conf/example.conf
Flume and Java appear to be installed correctly:
Flume
ubuntu#ip-172-31-41-5:~/Flume$ ./bin/flume-ng version
Source code repository: https://git.apache.org/repos/asf/flume.git
Revision: 1a15927e594fd0d05a59d804b90a9c31ec93f5e1
Compiled by rgoers on Sun Oct 16 14:44:15 MST 2022
From source with checksum bbbca682177262aac3a89defde369a37
Java
ubuntu#ip-172-31-41-5:~/Flume$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu222.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu222.04, mixed mode, sharing)
Example.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Flume.env
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# If this file is placed at FLUME_CONF_DIR/flume-env.sh, it will be sourced
# during Flume startup.
# Enviroment variables can be set here.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
# Let Flume write raw event data and configuration information to its log files for debugging
# purposes. Enabling these flags is not recommended in production,
# as it may result in logging sensitive user information or encryption secrets.
# export JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "
# Note that the Flume conf directory is always included in the classpath.
#FLUME_CLASSPATH=""
The only clue that I have is is in flume.log which shows the following error. I've even copied example.conf into the main Flume directory but it doesn't seem to make a difference.
03 Dec 2022 21:10:03,538 ERROR [main] (org.apache.flume.node.Application.main:506) - A fatal error occurred while running. Exception follows.org.apache.flume.conf.ConfigurationException: Unable to read file /home/ubuntu/Flume/example.confat org.apache.flume.node.FileConfigurationSource.<init>(FileConfigurationSource.java:52) ~[flume-ng-node-1.11.0.jar:1.11.0]at org.apache.flume.node.FileConfigurationSourceFactory.createConfigurationSource(FileConfigurationSourceFactory.java:40) ~[flume-ng-node-1.11.0.jar:1.11.0]at org.apache.flume.node.ConfigurationSourceFactory.getConfigurationSource(ConfigurationSourceFactory.java:39) ~[flume-ng-node-1.11.0.jar:1.11.0]at org.apache.flume.node.Application.main(Application.java:476) ~[flume-ng-node-1.11.0.jar:1.11.0]Caused by: java.nio.file.NoSuchFileException: /home/ubuntu/Flume/example.confat sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:1.8.0_352]at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_352]at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_352]at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214) ~[?:1.8.0_352]at java.nio.file.Files.newByteChannel(Files.java:361) ~[?:1.8.0_352]at java.nio.file.Files.newByteChannel(Files.java:407) ~[?:1.8.0_352]at java.nio.file.Files.readAllBytes(Files.java:3152) ~[?:1.8.0_352]at org.apache.flume.node.FileConfigurationSource.<init>(FileConfigurationSource.java:49) ~[flume-ng-node-1.11.0.jar:1.11.0]... 3 more

NVIDIA Jetson Nano with Realsense 435i via Isaac - Camera not found

I posted about this over on the Isaac forums, but listing it here for visibility as well. I am trying to get the Isaac Realsense examples working on a Jetson Nano with my 435i (firmware downgraded to 5.11.15 per the Isaac documentation), but I've been unable to so far. I've got a Nano flashed with Jetpack4.3 and have installed all dependencies on both the desktop and the Nano. The realsense-viewer works fine, so I know the camera is functioning properly and is being detected by the Nano. However, when I run ./apps/samples/realsense_camera/realsense_camera it throws an error:
ERROR engine/alice/components/Codelet.cpp#229: Component 'camera/realsense' of type 'isaac::RealsenseCamera' reported FAILURE:
No device connected, please connect a RealSense device
ERROR engine/alice/backend/event_manager.cpp#42: Stopping node 'camera' because it reached status 'FAILURE'
I've attached the log of this output as well. I get the same error running locally on my desktop, but that's running through WSL so I was willing to write that off. Any suggestions would be greatly appreciated!
0m2020-06-15 17:18:20.620 INFO engine/alice/tools/websight.cpp#166: Loading websight...0m
33m2020-06-15 17:18:20.621 WARN engine/alice/backend/application_json_loader.cpp#174: This application does not have an explicit scheduler configuration. One will be autogenerated to the best of the system's abilities if possible.0m
0m2020-06-15 17:18:20.622 INFO engine/alice/backend/redis_backend.cpp#40: Successfully connected to Redis server.
0m
33m2020-06-15 17:18:20.623 WARN engine/alice/backend/backend.cpp#201: This application does not have an execution group configuration. One will be autogenerated to the best of the systems abilities if possible.0m
33m2020-06-15 17:18:20.623 WARN engine/gems/scheduler/scheduler.cpp#337: No default execution groups specified. Attempting to create scheduler configuration for 4 remaining cores. This may be non optimal for the system and application.0m
0m2020-06-15 17:18:20.623 INFO engine/gems/scheduler/scheduler.cpp#290: Scheduler execution groups are:0m
0m2020-06-15 17:18:20.623 INFO engine/gems/scheduler/scheduler.cpp#299: __BlockerGroup__: Cores = [3], Workers = No0m
0m2020-06-15 17:18:20.623 INFO engine/gems/scheduler/scheduler.cpp#299: __WorkerGroup__: Cores = [0, 1, 2], Workers = Yes0m
0m2020-06-15 17:18:20.660 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/realsense/librealsense_module.so': Now has 45 components total0m
0m2020-06-15 17:18:20.679 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/rgbd_processing/librgbd_processing_module.so': Now has 51 components total0m
0m2020-06-15 17:18:20.696 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/sight/libsight_module.so': Now has 54 components total0m
0m2020-06-15 17:18:20.720 INFO engine/alice/backend/modules.cpp#226: Loaded module 'packages/viewers/libviewers_module.so': Now has 83 components total0m
90m2020-06-15 17:18:20.720 DEBUG engine/alice/application.cpp#348: Loaded 83 components: isaac::RealsenseCamera, isaac::alice::BufferAllocatorReport, isaac::alice::ChannelMonitor, isaac::alice::CheckJetsonPerformanceModel, isaac::alice::CheckOperatingSystem, isaac::alice::Config, isaac::alice::ConfigBridge, isaac::alice::ConfigLoader, isaac::alice::Failsafe, isaac::alice::FailsafeHeartbeat, isaac::alice::InteractiveMarkersBridge, isaac::alice::JsonToProto, isaac::alice::LifecycleReport, isaac::alice::MessageLedger, isaac::alice::MessagePassingReport, isaac::alice::NodeStatistics, isaac::alice::Pose, isaac::alice::Pose2Comparer, isaac::alice::PoseFromFile, isaac::alice::PoseInitializer, isaac::alice::PoseMessageInjector, isaac::alice::PoseToFile, isaac::alice::PoseToMessage, isaac::alice::PoseTree, isaac::alice::PoseTreeJsonBridge, isaac::alice::PoseTreeRelink, isaac::alice::ProtoToJson, isaac::alice::PyCodelet, isaac::alice::Random, isaac::alice::Recorder, isaac::alice::RecorderBridge, isaac::alice::Replay, isaac::alice::ReplayBridge, isaac::alice::Scheduling, isaac::alice::Sight, isaac::alice::SightChannelStatus, isaac::alice::Subgraph, isaac::alice::Subprocess, isaac::alice::TcpPublisher, isaac::alice::TcpSubscriber, isaac::alice::Throttle, isaac::alice::TimeOffset, isaac::alice::TimeSynchronizer, isaac::alice::UdpPublisher, isaac::alice::UdpSubscriber, isaac::map::Map, isaac::map::ObstacleAtlas, isaac::map::OccupancyGridMapLayer, isaac::map::PolygonMapLayer, isaac::map::WaypointMapLayer, isaac::navigation::DistanceMap, isaac::navigation::NavigationMap, isaac::navigation::RangeScanModelClassic, isaac::navigation::RangeScanModelFlatloc, isaac::rgbd_processing::DepthEdges, isaac::rgbd_processing::DepthImageFlattening, isaac::rgbd_processing::DepthImageToPointCloud, isaac::rgbd_processing::DepthNormals, isaac::rgbd_processing::DepthPoints, isaac::rgbd_processing::FreespaceFromDepth, isaac::sight::AliceSight, isaac::sight::SightWidget, isaac::sight::WebsightServer, isaac::viewers::BinaryMapViewer, isaac::viewers::ColorCameraViewer, isaac::viewers::DepthCameraViewer, isaac::viewers::Detections3Viewer, isaac::viewers::DetectionsViewer, isaac::viewers::FiducialsViewer, isaac::viewers::FlatscanViewer, isaac::viewers::GoalViewer, isaac::viewers::ImageKeypointViewer, isaac::viewers::LidarViewer, isaac::viewers::MosaicViewer, isaac::viewers::ObjectViewer, isaac::viewers::OccupancyMapViewer, isaac::viewers::PointCloudViewer, isaac::viewers::PoseTrailViewer, isaac::viewers::SegmentationCameraViewer, isaac::viewers::SegmentationViewer, isaac::viewers::SkeletonViewer, isaac::viewers::TensorViewer, isaac::viewers::TrajectoryListViewer, 0m
33m2020-06-15 17:18:20.723 WARN engine/alice/application.cpp#164: The function Application::findComponentByName is deprecated. Please use `getNodeComponentOrNull` instead. Note that the new method requires a node name instead of a component name. (argument: 'websight/isaac.sight.AliceSight')0m
0m2020-06-15 17:18:20.723 INFO engine/alice/application.cpp#255: Starting application 'realsense_camera' (instance UUID: 'e24992d0-af66-11ea-8bcf-c957460c567e') ...0m
90m2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#476: Launching 0 pre-start job(s)0m
90m2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#485: Replaying 0 pre-start event(s)0m
90m2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#476: Launching 0 pre-start job(s)0m
90m2020-06-15 17:18:20.723 DEBUG engine/gems/scheduler/execution_groups.cpp#485: Replaying 0 pre-start event(s)0m
0m2020-06-15 17:18:20.723 INFO engine/alice/backend/asio_backend.cpp#33: Starting ASIO service0m
0m2020-06-15 17:18:20.727 INFO packages/sight/WebsightServer.cpp#216: Sight webserver is loaded0m
0m2020-06-15 17:18:20.727 INFO packages/sight/WebsightServer.cpp#217: Please open Chrome Browser and navigate to http://<ip address>:30000m
33m2020-06-15 17:18:20.727 WARN engine/alice/backend/codelet_canister.cpp#225: Codelet 'websight/isaac.sight.AliceSight' was not added to scheduler because no tick method is specified.0m
33m2020-06-15 17:18:20.728 WARN engine/alice/components/Codelet.cpp#53: Function deprecated. Set tick_period to the desired tick paramater0m
33m2020-06-15 17:18:20.728 WARN engine/alice/backend/codelet_canister.cpp#225: Codelet '_check_operating_system/isaac.alice.CheckOperatingSystem' was not added to scheduler because no tick method is specified.0m
33m2020-06-15 17:18:20.728 WARN engine/alice/components/Codelet.cpp#53: Function deprecated. Set tick_period to the desired tick paramater0m
33m2020-06-15 17:18:20.730 WARN engine/alice/components/Codelet.cpp#53: Function deprecated. Set tick_period to the desired tick paramater0m
1;31m2020-06-15 17:18:20.741 ERROR engine/alice/components/Codelet.cpp#229: Component 'camera/realsense' of type 'isaac::RealsenseCamera' reported FAILURE:
No device connected, please connect a RealSense device
0m
1;31m2020-06-15 17:18:20.741 ERROR engine/alice/backend/event_manager.cpp#42: Stopping node 'camera' because it reached status 'FAILURE'0m
33m2020-06-15 17:18:20.743 WARN engine/alice/backend/codelet_canister.cpp#225: Codelet 'camera/realsense' was not added to scheduler because no tick method is specified.0m
0m2020-06-15 17:18:21.278 INFO packages/sight/WebsightServer.cpp#113: Server connected / 10m
0m2020-06-15 17:18:30.723 INFO engine/alice/backend/allocator_backend.cpp#57: Optimized memory CPU allocator.0m
0m2020-06-15 17:18:30.724 INFO engine/alice/backend/allocator_backend.cpp#66: Optimized memory CUDA allocator.0m

caffe powered and GPU enabled Microsoft Azure VM

I'm trying to build a VM for model training in Azure. I found this Data Science Virtual Machine for Linux (Ubuntu) VM which seems to be a suitable candidate.
Unfortunately, when I spun up the VM and installed the caffe prerequisites I wasn't able to run the tests. I'm getting the following error on make runtest (make all and make test were completed without errors):
NVIDIA: no NVIDIA devices found
Cuda number of devices: 0
Setting to use device 0
Current device id: 0
Current device name:
Note: Randomizing tests' orders with a seed of 97204 .
[==========] Running 2041 tests from 267 test cases.
[----------] Global test environment set-up.
[----------] 11 tests from AdaDeltaSolverTest/3, where TypeParam = caffe::GPUDevice<double>
[ RUN ] AdaDeltaSolverTest/3.TestAdaDeltaLeastSquaresUpdateWithHalfMomentum
NVIDIA: no NVIDIA devices found
E0715 02:24:32.097311 59355 common.cpp:114] Cannot create Cublas handle. Cublas won't be available.
NVIDIA: no NVIDIA devices found
E0715 02:24:32.103780 59355 common.cpp:121] Cannot create Curand generator. Curand won't be available.
F0715 02:24:32.103914 59355 test_gradient_based_solver.cpp:80] Check failed: error == cudaSuccess (30 vs. 0) unknown error
*** Check failure stack trace: ***
# 0x7f77a463f5cd google::LogMessage::Fail()
# 0x7f77a4641433 google::LogMessage::SendToLog()
# 0x7f77a463f15b google::LogMessage::Flush()
# 0x7f77a4641e1e google::LogMessageFatal::~LogMessageFatal()
# 0x7115e3 caffe::GradientBasedSolverTest<>::TestLeastSquaresUpdate()
# 0x7122af caffe::AdaDeltaSolverTest_TestAdaDeltaLeastSquaresUpdateWithHalfMomentum_Test<>::TestBody()
# 0x8e6023 testing::internal::HandleExceptionsInMethodIfSupported<>()
# 0x8df63a testing::Test::Run()
# 0x8df788 testing::TestInfo::Run()
# 0x8df865 testing::TestCase::Run()
# 0x8e0b3f testing::internal::UnitTestImpl::RunAllTests()
# 0x8e0e63 testing::UnitTest::Run()
# 0x466ecd main
# 0x7f77a111c830 __libc_start_main
# 0x46e589 _start
# (nil) (unknown)
Makefile:532: recipe for target 'runtest' failed
make: *** [runtest] Aborted (core dumped)
Is it possible to spin up a virtual machine in Azure suitable for GPU enabled machine learning using caffe?
All the details about the VM here

The Data Science Virtual Machine (DSVM) for Ubuntu already has Caffe installed in /opt/caffe. To use it on a GPU, create a VM with a K80 GPU by choosing the one of the NC sizes. (Be sure to choose HDD as the storage type, or the NC sizes will not appear.) Caffe will then be available out of the box.
Also note that PyCaffe is available. At a terminal:
source activate root
And python will then have PyCaffe available.

Issues with Flume HDFS sink from Twitter

I currently have this configuration in Flume :
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YPTxqtRamIZ1bnJXYwGW
TwitterAgent.sources.Twitter.consumerSecret = Wjyw9714OBzao7dktH0csuTByk4iLG9Zu4ddtI6s0ho
TwitterAgent.sources.Twitter.accessToken = 2340010790-KhWiNLt63GuZ6QZNYuPMJtaMVjLFpiMP4A2v
TwitterAgent.sources.Twitter.accessTokenSecret = x1pVVuyxfvaTbPoKvXqh2r5xUA6tf9einoByLIL8rar
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
The twitter app auth keys are correct.
And I keep getting this error in the flume log file:
ERROR org.apache.flume.SinkRunner
Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop1
at org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:446)
at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoop1
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2310)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2344)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2326)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:353)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:227)
at org.apache.flume.sink.hdfs.BucketWriter$1.call(BucketWriter.java:221)
at org.apache.flume.sink.hdfs.BucketWriter$8$1.run(BucketWriter.java:589)
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(BucketWriter.java:161)
at org.apache.flume.sink.hdfs.BucketWriter.access$800(BucketWriter.java:57)
at org.apache.flume.sink.hdfs.BucketWriter$8.call(BucketWriter.java:586)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
... 1 more
Caused by: java.net.UnknownHostException: hadoop1
... 23 more
Does any one here knows why and could explain it to me?
Thanks in advance.

According to the Exception, the problem is that the host hadoop1 is unknown.
according to the flume configuration file the path you have given is
hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
which is supposed to be accessible from the machine with the flume agent. since machine names cannot be used to access the HDFS without being in the same domain, you need to access the HDFS using the IP address as set in core-site.xml

How to debug/fix random occurring Redis::TimeoutError?

I have a rails app running which is using redis quite a lot - however - I'm seeing quite a few Redis::TimeoutError occurring here and there, from time to time. There is no pattern in the circumstances. It occurs both in the web app and in the background jobs (which is being processed using sidekiq) - not often but from time to time.
Now I have no idea how to track down the root cause of this and hence no idea how to fix it.
Here is a little background on my setup:
The redis instance is running on a separate physical server which is connected to both my web server and background server in a private local 1Gbit network. All servers are running ubuntu 12.04. The redis version is 2.6.10. I'm connecting from my rails app (which is 3.2) using an initializer like so:
require 'redis'
require 'redis/objects'
REDIS = Redis.new(:url => APP_CONFIG['REDIS_URL'])
Redis.current = REDIS
This is the output of redis-cli INFO:
# Server
redis_version:2.6.10
redis_git_sha1:00000000
redis_git_dirty:0
redis_mode:standalone
os:Linux 3.2.0-38-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:28475
run_id:d89bbb1b81d3169c4228cf23c0988ae437d496a1
tcp_port:6379
uptime_in_seconds:14913365
uptime_in_days:172
lru_clock:1507056
# Clients
connected_clients:233
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:19
# Memory
used_memory:801637360
used_memory_human:764.50M
used_memory_rss:594706432
used_memory_peak:4295394784
used_memory_peak_human:4.00G
used_memory_lua:31744
mem_fragmentation_ratio:0.74
mem_allocator:jemalloc-3.3.0
# Persistence
loading:0
rdb_changes_since_last_save:23166
rdb_bgsave_in_progress:0
rdb_last_save_time:1378219310
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:4
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
# Stats
total_connections_received:932395
total_commands_processed:3088408103
instantaneous_ops_per_sec:837
rejected_connections:0
expired_keys:31428
evicted_keys:3007
keyspace_hits:124093049
keyspace_misses:53060192
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:17651
# Replication
role:master
connected_slaves:1
slave0:192.168.0.2,6379,online
# CPU
used_cpu_sys:54000.21
used_cpu_user:73692.52
used_cpu_sys_children:36229.79
used_cpu_user_children:420655.84
# Keyspace
db0:keys=1498962,expires=1310
In my redis config I have the following set:
\ﬁdaemonize yes
pidfile /var/run/redis/redis-server.pid
timeout 0
loglevel notice
databases 1
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
maxclients 1000
maxmemory 4GB
maxmemory-policy volatile-lru
appendonly no
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60

That could come from many issues :
because you use the SAVE command (it is setup in your conf) generating a lot of I/O and hammering the server, especially if you use EBS volumes on Amazon.
because you have a Redis slave (same as before, doing SAVE before mirroring).
because you use a KEY * which is very slow on a lot of indexes.

Try "slowlog" command on the redis server to see if there are some "slow query".
Write some logs when "TimeoutError" happens, to see if the "error redis command" in the "slow log".
adjust your timeout setting on the client side。

It might be a problem on the client side if server performs normally. Each redis client instance, not the server, also has a timeout setting, and the default setting is very short - something like a few milliseconds. So if the server does not respond within that time, a Redis::TimeoutError will be raised by the client.
First thing you can try is to set a longer timeout value, and see if things get better.
redis_url = 'redis://user:password#host:port/'
redis = Redis.connect(:url => redis_url, :timeout => 0.7)
Even with longer timeout setting, there is no guarantee that timeout would not happen, but then it'd be a problem of the design of your system.

Are you rolling your own code to connect to redis or just letting sidekiq handle it? I think you should really just design your connection code to reconnect if the connection has been lost. You can rescue Redis::BaseConnectionError and reconnect.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Flink Statefun netty DisconnectedException - docker

Related

Apache Flume agent not starting but not showing error

NVIDIA Jetson Nano with Realsense 435i via Isaac - Camera not found

caffe powered and GPU enabled Microsoft Azure VM

Issues with Flume HDFS sink from Twitter

How to debug/fix random occurring Redis::TimeoutError?

Categories

Resources