ROS launch gets stuck on robot in 5G network

I am trying to set up a ROS network between a robot (Clearpath AGV) and a multi-access edge computing (MEC) device via 5G. I start roscore on the MEC and launch some nodes there. Everything works fine, but when I start any launch file on the robot side, it gets stuck as if waiting for the roscore (with and without --wait). This also happens for a simple static_transform_publisher launch. The logs of an example launch and the terminal output when I interrupt via keyboard are below.
The ROS network seems to be fully established, since:
rostopic list/echo/info works for topics I publish in both directions
rosrun works on the robot and the MEC
Various networking tools (netcat, ...) show that all registered ports are open in both directions
More details:
Working with Docker, with ROS Kinetic installed.
Containers run in host network mode
ROS_MASTER_URI is set to the MEC's IP
ROS_IP on the robot is set to its own IP
Both IPs are pingable from either machine.
Any help is appreciated!
Logs:
[roslaunch][INFO] 2020-10-05 15:35:44,123: starting in server mode
[roslaunch.parent][INFO] 2020-10-05 15:35:44,123: starting roslaunch parent run
[roslaunch][INFO] 2020-10-05 15:35:44,123: loading roscore config file /opt/ros/kinetic/etc/ros/roscore.xml
[roslaunch][INFO] 2020-10-05 15:35:44,365: Added core node of type [rosout/rosout] in namespace [/]
[roslaunch.config][INFO] 2020-10-05 15:35:44,365: loading config file /root/catkin_ws/src/realsense-ros/realsense2_camera/launch/rs_camera.launch
[roslaunch][INFO] 2020-10-05 15:35:44,391: Added node of type [nodelet/nodelet] in namespace [/camera/]
[roslaunch][INFO] 2020-10-05 15:35:44,397: Added node of type [nodelet/nodelet] in namespace [/camera/]
[roslaunch][INFO] 2020-10-05 15:35:44,397: ... selected machine [] for node of type [nodelet/nodelet]
[roslaunch][INFO] 2020-10-05 15:35:44,397: ... selected machine [] for node of type [nodelet/nodelet]
[roslaunch.pmon][INFO] 2020-10-05 15:35:44,400: start_process_monitor: creating ProcessMonitor
[roslaunch.pmon][INFO] 2020-10-05 15:35:44,400: created process monitor <ProcessMonitor(ProcessMonitor-1, initial daemon)>
[roslaunch.pmon][INFO] 2020-10-05 15:35:44,400: start_process_monitor: ProcessMonitor started
[roslaunch.parent][INFO] 2020-10-05 15:35:44,400: starting parent XML-RPC server
[roslaunch.server][INFO] 2020-10-05 15:35:44,400: starting roslaunch XML-RPC server
[roslaunch.server][INFO] 2020-10-05 15:35:44,400: waiting for roslaunch XML-RPC server to initialize
[xmlrpc][INFO] 2020-10-05 15:35:44,400: XML-RPC server binding to 0.0.0.0:0
[xmlrpc][INFO] 2020-10-05 15:35:44,401: Started XML-RPC server [http://10.0.91.105:41500/]
[xmlrpc][INFO] 2020-10-05 15:35:44,401: xml rpc node: starting XML-RPC server
[roslaunch.pmon][INFO] 2020-10-05 15:35:50,254: ProcessMonitor.shutdown <ProcessMonitor(ProcessMonitor-1, started daemon 140017957467904)>
[roslaunch.pmon][INFO] 2020-10-05 15:35:50,308: ProcessMonitor._post_run <ProcessMonitor(ProcessMonitor-1, started daemon 140017957467904)>
[roslaunch.pmon][INFO] 2020-10-05 15:35:50,308: ProcessMonitor._post_run <ProcessMonitor(ProcessMonitor-1, started daemon 140017957467904)>: remaining procs are []
[roslaunch.pmon][INFO] 2020-10-05 15:35:50,309: ProcessMonitor exit: cleaning up data structures and signals
[roslaunch.pmon][INFO] 2020-10-05 15:35:50,310: ProcessMonitor exit: pmon has shutdown
[rospy.core][INFO] 2020-10-05 15:35:50,312: signal_shutdown [atexit]
Terminal output:
root@CPR-R100-0067:~/catkin_ws# roslaunch realsense2_camera rs_camera.launch
... logging to /root/.ros/log/9cbf306c-0718-11eb-b8a5-fa163ea8af66/roslaunch-CPR-R100-0067-954.log
Checking log directory for disk usage.
This may take awhile.
Press Ctrl-C to interrupt
Done checking log file disk usage.
Usage is <1GB.
^CTraceback (most recent call last):
  File "/opt/ros/kinetic/bin/roslaunch", line 35, in <module>
    roslaunch.main()
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/roslaunch/__init__.py", line 308, in main
    p.start()
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/roslaunch/parent.py", line 268, in start
    self._start_infrastructure()
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/roslaunch/parent.py", line 226, in _start_infrastructure
    self.start_server()
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/roslaunch/parent.py", line 177, in start_server
    self.server.start()
  File "/opt/ros/kinetic/lib/python2.7/dist-packages/roslaunch/server.py", line 376, in start
    code, msg, val = ServerProxy(self.uri).get_pid()
  File "/usr/lib/python2.7/xmlrpclib.py", line 1243, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1602, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.7/xmlrpclib.py", line 1283, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1311, in single_request
    self.send_content(h, request_body)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1459, in send_content
    connection.endheaders(request_body)
  File "/usr/lib/python2.7/httplib.py", line 1082, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 909, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 871, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 848, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 566, in create_connection
    sock.connect(sa)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
KeyboardInterrupt
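The traceback ends in socket.create_connection, reached from ServerProxy(...).get_pid() in roslaunch/server.py, and the log shows the XML-RPC server advertised at http://10.0.91.105:41500/. A minimal sketch for checking, from the robot itself, whether a TCP connection to an address and port opens, is refused, or times out. It assumes Python 3 is available on the robot; the helper name probe_tcp and the 3-second timeout are illustrative and not part of ROS.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Return 'open', 'refused', 'timeout' or 'error: ...' for one TCP connect attempt."""
    try:
        # Same call that blocks at the bottom of the traceback, but bounded
        # here by an explicit timeout so it cannot hang indefinitely.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "network is unreachable".
        return "error: {}".format(exc)
```

For example, probe_tcp("10.0.91.105", 41500) uses the address from the log above; the port changes on every run, so take it from the current log while roslaunch is still hanging. A "timeout" result would be consistent with the hang, but that is an interpretation rather than something the logs confirm.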

Related

IoTEdge on K8S, Could not initialize module runtime

I'm running iotedge on kubernetes.
The K8S cluster is a local cluster setup largely using the "Kubernetes the hard way" method, with some modifications.
I did manage to get things working on one installation. However, I'm now getting this on another installation. The initial installation works fine, but after shutting down a machine to simulate a hardware failure, the pod gets recreated, but starts to show this error again. This error happens EVEN if the node shutdown is NOT the one iotedged is running on.
Environment
3 Nodes running Ubuntu 20.04 LTS
Two networks on each node, one for the internet, one for an internal network. K8S is set up using the internal, static IP address
HAProxy/Keepalived for HA without a load balancer, running on a Virtual IP address
Multus CNI for attaching pods to additional networks
CoreDNS
Troubleshooting
Confirmed that CoreDNS seems to be functioning fine, and is able to resolve internal and external addresses
Remaining nodes are able to ping pods on other nodes
Deleting the iotedged pod and allowing k8s to recreate it works, but then edgeAgent and edgeHub have errors until I delete/recreate them as well
Re-run the entire k8s installation. Initial installation works fine, but simulating machine failure continues to be problematic.
Kubernetes Versions:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
iotedged error:
<6>2021-07-09T22:00:02Z [INFO] - Starting Azure IoT Edge Security Daemon - Kubernetes mode
<6>2021-07-09T22:00:02Z [INFO] - Version - 1.1.3
<6>2021-07-09T22:00:02Z [INFO] - Using config file: /etc/iotedged/config.yaml
<6>2021-07-09T22:00:02Z [INFO] - Configuring /var/lib/iotedge as the home directory.
<6>2021-07-09T22:00:02Z [INFO] - Configuring certificates...
<6>2021-07-09T22:00:02Z [INFO] - Transparent gateway certificates not found, operating in quick start mode...
<6>2021-07-09T22:00:02Z [INFO] - Finished configuring provisioning environment variables and certificates.
<6>2021-07-09T22:00:02Z [INFO] - Initializing hsm...
<6>2021-07-09T22:00:02Z [INFO] - Finished initializing hsm.
<6>2021-07-09T22:00:02Z [INFO] - Provisioning edge device...
<6>2021-07-09T22:00:02Z [INFO] - Starting provisioning edge device via manual mode using a device connection string...
<6>2021-07-09T22:00:02Z [INFO] - Manually provisioning device "********" in hub "********.azure-devices.net"
<6>2021-07-09T22:00:02Z [INFO] - Finished provisioning edge device.
<6>2021-07-09T22:00:02Z [INFO] - Initializing the module runtime...
<6>2021-07-09T22:00:02Z [INFO] - Attempting to use config from /home/edgeletuser/.kube/config file.
<6>2021-07-09T22:00:02Z [INFO] - Using in-cluster config
<3>2021-07-09T22:00:34Z [ERR!] - The daemon could not start up successfully: Could not initialize module runtime
<3>2021-07-09T22:00:34Z [ERR!] - caused by: Could not initialize kubernetes module runtime
<3>2021-07-09T22:00:34Z [ERR!] - caused by: HTTP response error: SelfSubjectAccessReviewCreate
<3>2021-07-09T22:00:34Z [ERR!] - caused by: Hyper HTTP error
<3>2021-07-09T22:00:34Z [ERR!] - caused by: error trying to connect: Connection timed out (os error 110)
<6>2021-07-09T22:00:02Z [INFO] (/project/hsm-sys/azure-iot-hsm-c/src/hsm_log.c:log_init:41) Initialized logging
edgeHub Logs after recreating iotedged:
2021-08-18 19:05:40 Starting Edge Hub
2021-08-18 19:05:40.481 +00:00 Edge Hub Main()
<7> 2021-08-18 19:05:40.609 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClient] - Making a Http call to http://localhost:35001/ to CreateServerCertificateAsync
<7> 2021-08-18 19:05:40.912 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClient] - Error when getting an Http response from http://localhost:35001/ for CreateServerCertificateAsync
HTTP Response:
{"message":"Module not found"}
Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.GeneratedCode.IoTEdgedException`1[Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.GeneratedCode.ErrorResponse]: Not Found
at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.GeneratedCode.HttpWorkloadClient.CreateServerCertificateAsync(String api_version, String name, String genid, ServerCertificateRequest request, CancellationToken cancellationToken) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/generatedCode/HttpWorkloadClient.cs:line 624
at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter[T](Task`1 task, TimeSpan timeout) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 126
at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/WorkloadClientVersioned.cs:line 59
Unhandled exception. System.AggregateException: One or more errors occurred. (Error calling CreateServerCertificateAsync: Module not found)
---> Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadCommunicationException- Message:Error calling CreateServerCertificateAsync: Module not found, StatusCode:404, at: at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.HandleException(Exception ex, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 109
at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/WorkloadClientVersioned.cs:line 77
at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.CreateServerCertificateAsync(String hostname, DateTime expiration) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 35
at Microsoft.Azure.Devices.Edge.Util.CertificateHelper.GetServerCertificatesFromEdgelet(Uri workloadUri, String workloadApiVersion, String workloadClientApiVersion, String moduleId, String moduleGenerationId, String edgeHubHostname, DateTime expiration) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/CertificateHelper.cs:line 260
at Microsoft.Azure.Devices.Edge.Hub.Service.EdgeHubCertificates.LoadAsync(IConfigurationRoot configuration, ILogger logger) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.Service/EdgeHubCertificates.cs:line 54
at Microsoft.Azure.Devices.Edge.Hub.Service.Program.MainAsync(IConfigurationRoot configuration) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.Service/Program.cs:line 54
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at System.Threading.Tasks.Task`1.get_Result()
at Microsoft.Azure.Devices.Edge.Hub.Service.Program.Main() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.Service/Program.cs:line 33
Are you still blocked? What troubleshooting steps have you tried so far? Did you check the "Common issues and resolutions for Azure IoT Edge" page? Judging by the messages "Transparent gateway certificates not found, operating in quick start mode" and "The daemon could not start up successfully: Could not initialize module runtime", the setup does not appear to be configured properly. Try restarting the server, then check your configuration against the transparent gateway setup documentation to see whether you have missed anything.

Where can I find the default docker ulimit settings?

I have been trying to understand an issue I've had when running the roribio16/alpine-sqs docker image on one of my machines. Whenever I run the image without specifying any other settings, I get the following output:
[xxxx@yyyy ~]$ docker run roribio16/alpine-sqs
2021-05-29 15:48:41,216 INFO Included extra file "/etc/supervisor/conf.d/elasticmq.conf" during parsing
2021-05-29 15:48:41,216 INFO Included extra file "/etc/supervisor/conf.d/insight.conf" during parsing
2021-05-29 15:48:41,216 INFO Included extra file "/etc/supervisor/conf.d/sqs-init.conf" during parsing
2021-05-29 15:48:41,216 INFO Set uid to user 0 succeeded
2021-05-29 15:48:41,222 INFO RPC interface 'supervisor' initialized
2021-05-29 15:48:41,222 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2021-05-29 15:48:41,222 INFO supervisord started with pid 1
2021-05-29 15:48:42,225 INFO spawned: 'sqs-init' with pid 9
2021-05-29 15:48:42,229 INFO spawned: 'elasticmq' with pid 10
2021-05-29 15:48:42,230 INFO spawned: 'insight' with pid 11
cp: can't stat '/opt/custom/*.conf': No such file or directory
> sqs-insight@0.3.0 start /opt/sqs-insight
> node index.js
15:48:42.605 [main] INFO org.elasticmq.server.Main$ - Starting ElasticMQ server (0.15.0) ...
Loading config file from "/opt/sqs-insight/lib/../config/config_local.json"
15:48:42.929 [elasticmq-akka.actor.default-dispatcher-2] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
Unable to load queues for undefined
Config contains 0 queues.
library initialization failed - unable to allocate file descriptor table - out of memorylistening on port 9325
2021-05-29 15:48:43,233 INFO success: sqs-init entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:43,233 INFO success: elasticmq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:43,234 INFO success: insight entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:43,234 INFO exited: sqs-init (exit status 0; expected)
2021-05-29 15:48:44,318 INFO exited: elasticmq (terminated by SIGABRT (core dumped); not expected)
2021-05-29 15:48:45,322 INFO spawned: 'elasticmq' with pid 67
15:48:45.743 [main] INFO org.elasticmq.server.Main$ - Starting ElasticMQ server (0.15.0) ...
15:48:46.044 [elasticmq-akka.actor.default-dispatcher-2] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
library initialization failed - unable to allocate file descriptor table - out of memory2021-05-29 15:48:47,223 INFO success: elasticmq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:47,389 INFO exited: elasticmq (terminated by SIGABRT (core dumped); not expected)
2021-05-29 15:48:48,393 INFO spawned: 'elasticmq' with pid 89
15:48:48.766 [main] INFO org.elasticmq.server.Main$ - Starting ElasticMQ server (0.15.0) ...
15:48:49.066 [elasticmq-akka.actor.default-dispatcher-3] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
library initialization failed - unable to allocate file descriptor table - out of memory^C2021-05-29 15:48:49,559 INFO success: elasticmq entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-05-29 15:48:49,559 WARN received SIGINT indicating exit request
2021-05-29 15:48:49,559 INFO waiting for insight, elasticmq to die
2021-05-29 15:48:49,566 INFO stopped: insight (terminated by SIGTERM)
2021-05-29 15:48:50,431 INFO stopped: elasticmq (terminated by SIGABRT (core dumped))
With a bit of googling I found a post where somebody had the same issue with a different image and got it running by setting ulimits when starting the container, which also worked for me (docker run --ulimit nofile=122880:122880 roribio16/alpine-sqs).
I checked the ulimits inside the container when it was started without this option:
docker exec -it ca bash
$ ulimit -a
and found that the nofile setting was ridiculously high, which I assume is what causes the container to run out of memory if too many files are opened simultaneously. I don't have a particularly good understanding of how this works, though, so I would appreciate any clarification on that topic as well.
Anyway, the point of that ramble is that I want to find where the default docker container ulimits are set, as I don't understand why they are so high on the machine I am using. I have another machine that does not have this problem.
I can find lots of ways to change the default limits, but there does not seem to be much information about where these limits get set in the first place. According to the docker documentation, if custom values are not set, the ulimits should be inherited from my system, but as far as I can tell my system nofile settings are much lower than what I'm seeing in the container.
(Both machines run Manjaro Linux; however, the one that doesn't have this issue uses XFCE and the one that does uses KDE.)
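To compare the limits side by side, one option is to read the nofile limit of a process directly rather than through a shell. A minimal sketch using Python's standard resource module, to be run once on the host and once inside the container. It assumes Python 3 is available in both places, which may not hold for this image; the helper names are illustrative.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(value):
    # RLIM_INFINITY is how the resource module reports "unlimited".
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("nofile soft={} hard={}".format(describe(soft), describe(hard)))
```

Note that a login shell on the host reports the limits of that shell session, which are not necessarily the limits the docker daemon process itself runs with.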

Jenkins windows slave offline

I have Jenkins 2.164.3 on a CentOS 7 server.
I have a Windows Server 2003 slave with Java version 1.8.0.
I have 3 x linux slaves working successfully.
The windows service on the slave is installed and running.
The Windows slave is set up with the launch method "Let Jenkins control this Windows slave as a Windows service".
This Jenkins server is a new server that replaced an older jenkins server (debian wheezy from turnkey linux ~3 years ago). This windows slave used to connect to that old server. To remove the connection on this slave to the old server, I did the following:
1. sc delete
2. deleted the files in folder c:\jenkins
3. rebooted server
4. from new jenkins server, launched slave which copied files to c:\jenkins folder and installed service.
On my new Jenkins server, I set up the Windows slave, and when I connect, the log shows the following:
[2019-05-27 12:24:07] [windows-slaves] Connecting to 192.168.1.152
Checking if Java exists
java -version returned 1.8.0.
[2019-05-27 12:24:16] [windows-slaves] Copying jenkins-slave.xml
[2019-05-27 12:24:16] [windows-slaves] Copying slave.jar
[2019-05-27 12:24:16] [windows-slaves] Starting the service
[2019-05-27 12:24:16] [windows-slaves] Waiting for the service to become ready
ERROR: [2019-05-27 12:24:52] [windows-slaves] The service did not respond. Perhaps it failed to launch?
[2019-05-27 12:36:00] [windows-slaves] Connecting to 192.168.1.152
Checking if Java exists
java -version returned 1.8.0.
[2019-05-27 12:36:08] [windows-slaves] Copying jenkins-slave.xml
[2019-05-27 12:36:08] [windows-slaves] Copying slave.jar
[2019-05-27 12:36:08] [windows-slaves] Starting the service
ERROR: Unexpected error in launching an agent. This is probably a bug in Jenkins
org.jinterop.dcom.common.JIException: Service Already Running
at org.jvnet.hudson.wmi.Win32Service$Implementation.start(Win32Service.java:149)
Caused: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.kohsuke.jinterop.JInteropInvocationHandler.invoke(JInteropInvocationHandler.java:140)
Caused: java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy90.start(Unknown Source)
at hudson.os.windows.ManagedWindowsServiceLauncher.launch(ManagedWindowsServiceLauncher.java:342)
at hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:294)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The windows slave is Windows Server 2003, the service is installed and running.
In the log file C:\Jenkins\jenkins-slave.wrapper.log, it has the following:
2019-05-27 12:19:32,644 INFO - Starting ServiceWrapper in the service mode
2019-05-27 12:19:32,659 INFO - Starting javaw.exe -Xrs -jar "C:\Jenkins\slave.jar" -tcp "C:\Jenkins\port.txt"
2019-05-27 12:19:32,675 INFO - Extension loaded: killOnStartup
2019-05-27 12:19:32,675 DEBUG - Checking the potentially runaway process with PID=1408
2019-05-27 12:19:32,675 DEBUG - No runaway process with PID=1408. The process has been already stopped.
2019-05-27 12:19:32,675 INFO - Starting javaw.exe -Xrs -jar "C:\Jenkins\slave.jar" -tcp "C:\Jenkins\port.txt"
2019-05-27 12:19:32,691 INFO - Started process 4084
2019-05-27 12:19:32,691 DEBUG - Forwarding logs of the process System.Diagnostics.Process (javaw) to winsw.SizeBasedRollingLogAppender
2019-05-27 12:19:32,691 INFO - Recording PID of the started process:4084. PID file destination is C:\Jenkins\jenkins_agent.pid
2019-05-27 12:23:56,529 INFO - Stopping jenkinsslave-C__Jenkins
2019-05-27 12:23:56,529 DEBUG - ProcessKill 4084
2019-05-27 12:23:56,561 INFO - Stopping process 4084
2019-05-27 12:23:56,561 INFO - Send SIGINT 4084
2019-05-27 12:23:56,561 WARN - SIGINT to 4084 failed - Killing as fallback
2019-05-27 12:23:56,561 INFO - Finished jenkinsslave-C__Jenkins
2019-05-27 12:23:56,561 DEBUG - Completed. Exit code is 0
2019-05-27 12:24:16,374 INFO - Starting ServiceWrapper in the service mode
2019-05-27 12:24:16,390 INFO - Starting javaw.exe -Xrs -jar "C:\Jenkins\slave.jar" -tcp "C:\Jenkins\port.txt"
2019-05-27 12:24:16,405 INFO - Extension loaded: killOnStartup
2019-05-27 12:24:16,405 DEBUG - Checking the potentially runaway process with PID=4084
2019-05-27 12:24:16,405 DEBUG - No runaway process with PID=4084. The process has been already stopped.
2019-05-27 12:24:16,405 INFO - Starting javaw.exe -Xrs -jar "C:\Jenkins\slave.jar" -tcp "C:\Jenkins\port.txt"
2019-05-27 12:24:16,421 INFO - Started process 364
2019-05-27 12:24:16,421 DEBUG - Forwarding logs of the process System.Diagnostics.Process (javaw) to winsw.SizeBasedRollingLogAppender
2019-05-27 12:24:16,421 INFO - Recording PID of the started process:364. PID file destination is C:\Jenkins\jenkins_agent.pid
The error on the Jenkins server suggests the service is not running, yet on the Windows slave machine the service is running. What is the problem and how do I fix it?
Thanks.
Very old question, but if you end up here because you are getting this error, find the jenkins_agent.pid file and delete it. It should be in the same folder as the rest of your Jenkins slave files. The service should start normally again after that.
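The deletion step can be sketched as a small script. It assumes Python 3 is installed on the agent machine (on an old Windows Server it may be simpler to delete the file by hand) and that the service has been stopped first; the function name remove_pid_file is illustrative, and C:\Jenkins is the agent folder from the question.

```python
from pathlib import Path


def remove_pid_file(agent_dir, name="jenkins_agent.pid"):
    """Delete the PID file in agent_dir if present; return True if one was removed."""
    pid_file = Path(agent_dir) / name
    if not pid_file.is_file():
        return False
    # Print the recorded PID before deleting, so there is a trace of what was removed.
    recorded = pid_file.read_text().strip()
    print("removing {} (recorded PID: {})".format(pid_file, recorded))
    pid_file.unlink()
    return True
```

For example, remove_pid_file(r"C:\Jenkins") followed by starting the service again.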
I know this question is old and you've long moved on but maybe this will help someone. I ran into a similar problem with a Windows slave, specifically I was seeing it go through a cycle of restarts much like you were:
2019-05-27 12:24:16,405 DEBUG - Checking the potentially runaway process with PID=4084
2019-05-27 12:24:16,405 DEBUG - No runaway process with PID=4084. The process has been already stopped.
2019-05-27 12:24:16,405 INFO - Starting javaw.exe -Xrs -jar "C:\Jenkins\slave.jar" -tcp "C:\Jenkins\port.txt"
2019-05-27 12:24:16,421 INFO - Started process 364
To solve the problem I checked the following:
Check whether the Windows service is running, and restart it.
In addition to checking C:\<path to jenkins>\jenkins-slave.wrapper.log, also have a look at C:\<path to jenkins>\jenkins-slave.err.log.
The err log is where I found my problem: a certificate error ("unable to find valid certificate").
Edit the C:\<path to jenkins>\jenkins-slave.xml file and fix whatever startup parameter is causing the problem. Make sure to check the Java path and version.
In my certificate error case, I needed to add -noCertificateCheck to my arguments so I could move on.
Another potential pitfall may be that the executable setting in the jenkins-slave.xml config file no longer points to a valid java.exe.
This may happen after a Java update.

Apache nifi is not starting up

I am trying to start Apache NiFi version 1.2.0 on a Windows 8 machine. It used to start properly, but after I restarted the system NiFi does not start at all. When I check the status, I keep getting "Apache NiFi not running".
Below are the logs from the nifi.bootstrap.log file:
2017-07-05 15:41:57,105 WARN [NiFi Bootstrap Command Listener]
org.apache.nifi.bootstrap.RunNiFi Failed to set permissions so that only the
owner can read pid file E:\softwares\nifi-1.2.0\bin\..\run\nifi.pid; this
may allows others to have access to the key needed to communicate with NiFi.
Permissions should be changed so that only the owner can read this file
2017-07-05 15:41:57,142 WARN [NiFi Bootstrap Command Listener]
org.apache.nifi.bootstrap.RunNiFi Failed to set permissions so that only the
owner can read status file E:\softwares\nifi-1.2.0\bin\..\run\nifi.status;
this may allows others to have access to the key needed to communicate with
NiFi. Permissions should be changed so that only the owner can read this
file
2017-07-05 15:41:57,168 INFO [NiFi Bootstrap Command Listener]
org.apache.nifi.bootstrap.RunNiFi Apache NiFi now running and listening for
Bootstrap requests on port 50765
2017-07-05 15:43:12,077 ERROR [NiFi logging handler] org.apache.nifi.StdErr
Failed to start web server: Unable to start Flow Controller.
2017-07-05 15:43:12,078 ERROR [NiFi logging handler] org.apache.nifi.StdErr
Shutting down...
2017-07-05 15:43:14,501 INFO [main] org.apache.nifi.bootstrap.RunNiFi NiFi
never started. Will not restart NiFi
Stack trace from nifi.app.log:
2017-07-05 15:43:12,077 WARN [main] org.apache.nifi.web.server.JettyServer Failed to start web server... shutting down.
org.apache.nifi.web.NiFiCoreException: Unable to start Flow Controller.
at org.apache.nifi.web.contextlistener.ApplicationStartupContextListener.contextInitialized(ApplicationStartupContextListener.java:88)
at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:876)
at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:532)
at org.eclipse.jetty.server.handler.ContextHandler.startContext(ContextHandler.java:839)
at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:344)
at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1480)
at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1442)
at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:799)
at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:261)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:540)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:113)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.server.handler.gzip.GzipHandler.doStart(GzipHandler.java:290)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.server.Server.start(Server.java:452)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.server.Server.doStart(Server.java:419)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:695)
at org.apache.nifi.NiFi.<init>(NiFi.java:160)
at org.apache.nifi.NiFi.main(NiFi.java:267)
Caused by: java.io.IOException: Expected to read a Sentinel Byte of '1' but got a value of '0' instead
at org.apache.nifi.repository.schema.SchemaRecordReader.readRecord(SchemaRecordReader.java:65)
at org.apache.nifi.controller.repository.SchemaRepositoryRecordSerde.deserializeRecord(SchemaRepositoryRecordSerde.java:115)
at org.apache.nifi.controller.repository.SchemaRepositoryRecordSerde.deserializeEdit(SchemaRepositoryRecordSerde.java:109)
at org.apache.nifi.controller.repository.SchemaRepositoryRecordSerde.deserializeEdit(SchemaRepositoryRecordSerde.java:46)
at org.wali.MinimalLockingWriteAheadLog$Partition.recoverNextTransaction(MinimalLockingWriteAheadLog.java:1096)
at org.wali.MinimalLockingWriteAheadLog.recoverFromEdits(MinimalLockingWriteAheadLog.java:459)
at org.wali.MinimalLockingWriteAheadLog.recoverRecords(MinimalLockingWriteAheadLog.java:301)
at org.apache.nifi.controller.repository.WriteAheadFlowFileRepository.loadFlowFiles(WriteAheadFlowFileRepository.java:381)
at org.apache.nifi.controller.FlowController.initializeFlow(FlowController.java:712)
at org.apache.nifi.controller.StandardFlowService.initializeController(StandardFlowService.java:953)
at org.apache.nifi.controller.StandardFlowService.load(StandardFlowService.java:534)
at org.apache.nifi.web.contextlistener.ApplicationStartupContextListener.contextInitialized(ApplicationStartupContextListener.java:72)
... 28 common frames omitted
Thanks in advance
After googling the error "Caused by: java.io.IOException: Expected to read a Sentinel Byte of '1' but got a value of '0' instead", I found that it indicates a partial write to the repositories.
Here are a couple of things you can check or try to bring your dataflow back online:
Check that your disks are not full.
Did you launch NiFi as the same user? Did you run it with administrator privileges?
You can back up or move your repositories and try to start NiFi with empty repositories. Your dataflows will still be there, but any file that was being processed when you shut down will be gone.
Could you please try that?
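The "move the repositories aside" step could look like the sketch below. It assumes NiFi is stopped, that Python 3 is available, and that the repositories live under the NiFi home directory with the default names flowfile_repository, content_repository and provenance_repository; if nifi.properties points them elsewhere, adjust the names. The function name and backup folder name are illustrative.

```python
import shutil
import time
from pathlib import Path

# Default repository directory names (assumption: nifi.properties was not changed).
DEFAULT_REPOS = ("flowfile_repository", "content_repository", "provenance_repository")


def move_repositories_aside(nifi_home, repos=DEFAULT_REPOS):
    """Move each existing repository directory into a timestamped backup folder."""
    home = Path(nifi_home)
    backup = home / "repo_backup_{}".format(time.strftime("%Y%m%d_%H%M%S"))
    moved = []
    for name in repos:
        src = home / name
        if src.is_dir():
            # Create the backup folder lazily, only if there is something to move.
            backup.mkdir(exist_ok=True)
            shutil.move(str(src), str(backup / name))
            moved.append(name)
    return backup, moved
```

For example, move_repositories_aside(r"E:\softwares\nifi-1.2.0") uses the install path from the bootstrap log. Moving rather than deleting keeps the old repositories available in case they are needed later.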
I think the issue is an incompatible Java version; use Java 8.
If you haven't set JAVA_HOME, set it in the environment variables to a path like "C:/program files/jdk1.8".
There is a Jira issue about running NiFi with Java 9, which is not resolved yet:
https://issues.apache.org/jira/browse/NIFI-4419

Running a bitcoin node on regtest network fails

I'm trying to run a bitcoin network on regtest with this version of bitcoin node so I can test out bitpay's insight-ui block explorer.
Running on regtest I get this repeating error
Assertion failed: (psocket), function Shutdown, file zmq/zmqpublishnotifier.cpp, line 92.
[2017-05-19T00:42:44.515Z] warn: Bitcoin process unexpectedly exited with code: null
[2017-05-19T00:42:44.515Z] warn: Restarting bitcoin child process in 5000ms
[2017-05-19T00:42:49.516Z] info: Using bitcoin config file: /Users/harshagoli/BTCT/bitcoin.conf
[2017-05-19T00:42:49.517Z] warn: Stopping existing spawned bitcoin process with pid: 12690
[2017-05-19T00:42:49.517Z] warn: Unclean bitcoin process shutdown, process not found with pid: 12690
[2017-05-19T00:42:49.517Z] info: Starting bitcoin process
Which eventually becomes
[2017-05-19T00:42:54.133Z] error: RPCError: Bitcoin JSON-RPC: Request Error: connect ECONNREFUSED 127.0.0.1:8332
at Bitcoin._wrapRPCError (/Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:449:13)
at /Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:781:28
at ClientRequest.<anonymous> (/Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/bitcoind-rpc/lib/index.js:116:7)
at emitOne (events.js:77:13)
at ClientRequest.emit (events.js:169:7)
at Socket.socketErrorListener (_http_client.js:269:9)
at emitOne (events.js:77:13)
at Socket.emit (events.js:169:7)
at emitErrorNT (net.js:1269:8)
at nextTickCallbackWith2Args (node.js:458:9)
[2017-05-19T00:42:54.133Z] info: Beginning shutdown
[2017-05-19T00:42:54.133Z] info: Stopping insight-ui (not started)
[2017-05-19T00:42:54.134Z] info: Stopping insight-api (not started)
[2017-05-19T00:42:54.134Z] info: Stopping web (not started)
[2017-05-19T00:42:54.135Z] info: Stopping bitcoind
After which I have the recurring error
[2017-05-19T00:42:54.221Z] error: Error: Stopping while trying to spawn bitcoind.
at /Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:905:25
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:676:51
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:726:13
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:52:16
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:264:21
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:44:16
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:723:17
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:167:37
at /Users/harshagoli/mynode/node_modules/bitcore-node/node_modules/async/lib/async.js:652:25
at /Users/harshagoli/mynode/node_modules/bitcore-node/lib/services/bitcoind.js:887:16
Thoughts on how I can get this up and running with a block to look at so I can use the block explorer?
Okay, I figured it out. Some other bitcoind processes had zombied out and were still listening on the port this process was trying to access. I ran this command to kill the other processes:
killall -9 bitcoind
Also, to create more blocks on regtest (while in your node directory), use this command:
./node_modules/bitcore-node/bin/bitcoin-0.12.1/bin/bitcoin-cli -datadir=/Users/harshagoli/mynode/data -regtest generate 150
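Before starting the node again, it can help to confirm that nothing is still holding the RPC port. A minimal sketch, assuming Python 3 is available, that tries to bind the port locally; port 8332 is the one from the ECONNREFUSED message above, and the function name port_is_free is illustrative.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            # Binding can also fail for other reasons, e.g. a privileged port.
            return False
    return True
```

For example, port_is_free(8332) returning False after killall would suggest that a process is still bound to the port.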

Resources