IoTEdge on K8S, Could not initialize module runtime - azure-iot-edge

I'm running iotedge on kubernetes.
The K8S cluster is a local cluster setup largely using the "Kubernetes the hard way" method, with some modifications.
I did manage to get things working on one installation; however, I'm now seeing this error on another. The initial installation works fine, but after shutting down a machine to simulate a hardware failure, the pod gets recreated and starts showing this error again. It happens even when the node that was shut down is NOT the one iotedged is running on.
Environment
3 Nodes running Ubuntu 20.04 LTS
Two networks on each node, one for the internet, one for an internal network. K8S is set up using the internal, static IP address
HAProxy/Keepalived for HA without a load balancer, running on a Virtual IP address
Multus CNI for attaching pods to additional networks
CoreDNS
Troubleshooting
Confirmed that CoreDNS seems to be functioning fine, and is able to resolve internal and external addresses
Remaining nodes are able to ping pods on other nodes
Deleting the iotedged pod and allowing k8s to recreate it works, but then edgeAgent and edgeHub have errors until I delete/recreate them as well
Re-ran the entire k8s installation. The initial installation works fine, but simulating a machine failure continues to be problematic (one more connectivity check is sketched below).
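One more check worth adding here, since the SelfSubjectAccessReviewCreate failure in the log below is a connection timeout rather than a permissions error (a sketch; the token path and service name are the standard in-cluster ones):
# From inside the iotedged pod, test reachability of the API server:
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -sk -H "Authorization: Bearer $TOKEN" https://kubernetes.default.svc/healthz
# A hang/timeout (rather than a 401/403) means the pod cannot reach the API server at all.
# From a surviving node, verify the kubernetes service endpoints after the shutdown:
kubectl get endpoints kubernetes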
Kubernetes Versions:
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:25:06Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
iotedged error:
<6>2021-07-09T22:00:02Z [INFO] - Starting Azure IoT Edge Security Daemon - Kubernetes mode
<6>2021-07-09T22:00:02Z [INFO] - Version - 1.1.3
<6>2021-07-09T22:00:02Z [INFO] - Using config file: /etc/iotedged/config.yaml
<6>2021-07-09T22:00:02Z [INFO] - Configuring /var/lib/iotedge as the home directory.
<6>2021-07-09T22:00:02Z [INFO] - Configuring certificates...
<6>2021-07-09T22:00:02Z [INFO] - Transparent gateway certificates not found, operating in quick start mode...
<6>2021-07-09T22:00:02Z [INFO] - Finished configuring provisioning environment variables and certificates.
<6>2021-07-09T22:00:02Z [INFO] - Initializing hsm...
<6>2021-07-09T22:00:02Z [INFO] - Finished initializing hsm.
<6>2021-07-09T22:00:02Z [INFO] - Provisioning edge device...
<6>2021-07-09T22:00:02Z [INFO] - Starting provisioning edge device via manual mode using a device connection string...
<6>2021-07-09T22:00:02Z [INFO] - Manually provisioning device "********" in hub "********.azure-devices.net"
<6>2021-07-09T22:00:02Z [INFO] - Finished provisioning edge device.
<6>2021-07-09T22:00:02Z [INFO] - Initializing the module runtime...
<6>2021-07-09T22:00:02Z [INFO] - Attempting to use config from /home/edgeletuser/.kube/config file.
<6>2021-07-09T22:00:02Z [INFO] - Using in-cluster config
<3>2021-07-09T22:00:34Z [ERR!] - The daemon could not start up successfully: Could not initialize module runtime
<3>2021-07-09T22:00:34Z [ERR!] - caused by: Could not initialize kubernetes module runtime
<3>2021-07-09T22:00:34Z [ERR!] - caused by: HTTP response error: SelfSubjectAccessReviewCreate
<3>2021-07-09T22:00:34Z [ERR!] - caused by: Hyper HTTP error
<3>2021-07-09T22:00:34Z [ERR!] - caused by: error trying to connect: Connection timed out (os error 110)
<6>2021-07-09T22:00:02Z [INFO] (/project/hsm-sys/azure-iot-hsm-c/src/hsm_log.c:log_init:41) Initialized logging
edgeHub Logs after recreating iotedged:
2021-08-18 19:05:40 Starting Edge Hub
2021-08-18 19:05:40.481 +00:00 Edge Hub Main()
<7> 2021-08-18 19:05:40.609 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClient] - Making a Http call to http://localhost:35001/ to CreateServerCertificateAsync
<7> 2021-08-18 19:05:40.912 +00:00 [DBG] [Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClient] - Error when getting an Http response from http://localhost:35001/ for CreateServerCertificateAsync
HTTP Response:
{"message":"Module not found"}
Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.GeneratedCode.IoTEdgedException`1[Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.GeneratedCode.ErrorResponse]: Not Found
at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.GeneratedCode.HttpWorkloadClient.CreateServerCertificateAsync(String api_version, String name, String genid, ServerCertificateRequest request, CancellationToken cancellationToken) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/generatedCode/HttpWorkloadClient.cs:line 624
at Microsoft.Azure.Devices.Edge.Util.TaskEx.TimeoutAfter[T](Task`1 task, TimeSpan timeout) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/TaskEx.cs:line 126
at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/WorkloadClientVersioned.cs:line 59
Unhandled exception. System.AggregateException: One or more errors occurred. (Error calling CreateServerCertificateAsync: Module not found)
---> Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadCommunicationException- Message:Error calling CreateServerCertificateAsync: Module not found, StatusCode:404, at: at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.HandleException(Exception ex, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 109
at Microsoft.Azure.Devices.Edge.Util.Edged.WorkloadClientVersioned.Execute[T](Func`1 func, String operation) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/WorkloadClientVersioned.cs:line 77
at Microsoft.Azure.Devices.Edge.Util.Edged.Version_2019_01_30.WorkloadClient.CreateServerCertificateAsync(String hostname, DateTime expiration) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/edged/version_2019_01_30/WorkloadClient.cs:line 35
at Microsoft.Azure.Devices.Edge.Util.CertificateHelper.GetServerCertificatesFromEdgelet(Uri workloadUri, String workloadApiVersion, String workloadClientApiVersion, String moduleId, String moduleGenerationId, String edgeHubHostname, DateTime expiration) in /home/vsts/work/1/s/edge-util/src/Microsoft.Azure.Devices.Edge.Util/CertificateHelper.cs:line 260
at Microsoft.Azure.Devices.Edge.Hub.Service.EdgeHubCertificates.LoadAsync(IConfigurationRoot configuration, ILogger logger) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.Service/EdgeHubCertificates.cs:line 54
at Microsoft.Azure.Devices.Edge.Hub.Service.Program.MainAsync(IConfigurationRoot configuration) in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.Service/Program.cs:line 54
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at System.Threading.Tasks.Task`1.get_Result()
at Microsoft.Azure.Devices.Edge.Hub.Service.Program.Main() in /home/vsts/work/1/s/edge-hub/src/Microsoft.Azure.Devices.Edge.Hub.Service/Program.cs:line 33

Are you still blocked? What troubleshooting steps have you tried so far? Did you check the Common issues and resolutions for Azure IoT Edge? Judging from the error messages Transparent gateway certificates not found, operating in quick start mode and The daemon could not start up successfully: Could not initialize module runtime, it looks like the setup is not configured properly. Try restarting the server and check the transparent gateway setup. Please refer to the transparent gateway setup documentation and check if you have missed anything.
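For reference, the transparent gateway material this answer points to boils down to the certificates section of the config file the daemon logged (/etc/iotedged/config.yaml). A sketch of what it looks like when configured; the paths are placeholders for your own CA files:
certificates:
  device_ca_cert: "/etc/iotedge/certs/device-ca.cert.pem"
  device_ca_pk: "/etc/iotedge/certs/device-ca.key.pem"
  trusted_ca_certs: "/etc/iotedge/certs/trusted-ca.cert.pem"
When this section is missing or commented out, iotedged falls back to the "quick start mode" behavior seen in the log above. Note, though, that quick start mode itself is not fatal; the fatal error here is the module-runtime timeout.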

Related

Ktor needs 1 hour (forever) to boot up

I have a Ktor app. It works fine when I run it in development mode. I package it in a Docker image by copying over what the Gradle application plugin provided; that also works fine on my local machine (8 cores). But now the strange part: when I do exactly the same thing on a rented V-Server, also running Ubuntu 20.04 like my local system, Ktor is incredibly slow.
docker-compose logs server:
server | 2021-08-24 08:00:23.337 [main] INFO ktor.application - Autoreload is disabled because the development mode is off.
server | 2021-08-24 08:25:35.048 [main] INFO ktor.application - Autoreload is disabled because the development mode is off.
server | 2021-08-24 09:18:48.246 [main] INFO c.e.e.s.TemplateStore - Starting to parse Sentences
server | 2021-08-24 09:18:48.345 [main] INFO c.e.e.s.TemplateStore - Finished parsing sentences
server | 2021-08-24 09:18:48.346 [main] INFO ktor.application - Responding at http://0.0.0.0:8080
server | 2021-08-24 09:18:48.347 [main] INFO ktor.application - Application started in 3193.32 seconds.
Application started in 3193.32 seconds
The source code can be found at https://github.com/1-alex98/whatisthat . It has a docker-compose.yml defining the whole Docker container being started.
Local system: 32 GB RAM + 8 cores. V-Server: 4 GB RAM + 2 cores (htop shows plenty of resources are free).
I am looking for ideas on what in the world could cause this behavior, or ways to debug it.
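One way to see where the JVM is stuck is a thread dump (a sketch, assuming a JDK is available inside the container; "server" is the service name from the compose file):
docker-compose exec server jstack 1
That is how a dump like the one in the update below is produced.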
Update:
Seems to read a file forever:
"main" #1 prio=5 os_prio=0 cpu=652.14ms elapsed=173.92s tid=0x00007f01d4016000 nid=0xe runnable [0x00007f01dace6000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(java.base@11.0.12/Native Method)
at java.io.FileInputStream.read(java.base@11.0.12/FileInputStream.java:279)
at java.io.FilterInputStream.read(java.base@11.0.12/FilterInputStream.java:133)
at sun.security.provider.NativePRNG$RandomIO.readFully(java.base@11.0.12/NativePRNG.java:424)
at sun.security.provider.NativePRNG$RandomIO.ensureBufferValid(java.base@11.0.12/NativePRNG.java:526)
at sun.security.provider.NativePRNG$RandomIO.implNextBytes(java.base@11.0.12/NativePRNG.java:545)
- locked <0x00000000c7571158> (a java.lang.Object)
at sun.security.provider.NativePRNG$Blocking.engineNextBytes(java.base@11.0.12/NativePRNG.java:268)
at java.security.SecureRandom.nextBytes(java.base@11.0.12/SecureRandom.java:751)
at kotlin.random.AbstractPlatformRandom.nextBytes(PlatformRandom.kt:47)
at kotlin.random.Random.nextBytes(Random.kt:260)
at com.example.routes.websocket.WebsocketRoutingKt.<clinit>(WebsocketRouting.kt:40)
at com.example.plugins.RoutingKt$routing$1.invoke(Routing.kt:13)
at com.example.plugins.RoutingKt$routing$1.invoke(Routing.kt:11)
at io.ktor.routing.Routing$Feature.install(Routing.kt:106)
at io.ktor.routing.Routing$Feature.install(Routing.kt:88)
at io.ktor.application.ApplicationFeatureKt.install(ApplicationFeature.kt:68)
at io.ktor.routing.RoutingKt.routing(Routing.kt:129)
at com.example.plugins.RoutingKt.routing(Routing.kt:11)
at com.example.ApplicationKt$main$1.invoke(Application.kt:18)
at com.example.ApplicationKt$main$1.invoke(Application.kt:14)
at io.ktor.server.engine.internal.CallableUtilsKt.executeModuleFunction(CallableUtils.kt:50)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$launchModuleByName$1.invoke(ApplicationEngineEnvironmentReloading.kt:317)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$launchModuleByName$1.invoke(ApplicationEngineEnvironmentReloading.kt:316)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.avoidingDoubleStartupFor(ApplicationEngineEnvironmentReloading.kt:341)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.launchModuleByName(ApplicationEngineEnvironmentReloading.kt:316)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.access$launchModuleByName(ApplicationEngineEnvironmentReloading.kt:30)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$instantiateAndConfigureApplication$1.invoke(ApplicationEngineEnvironmentReloading.kt:304)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$instantiateAndConfigureApplication$1.invoke(ApplicationEngineEnvironmentReloading.kt:295)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.avoidingDoubleStartup(ApplicationEngineEnvironmentReloading.kt:323)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.instantiateAndConfigureApplication(ApplicationEngineEnvironmentReloading.kt:295)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.createApplication(ApplicationEngineEnvironmentReloading.kt:136)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.start(ApplicationEngineEnvironmentReloading.kt:268)
at io.ktor.server.netty.NettyApplicationEngine.start(NettyApplicationEngine.kt:174)
at com.example.ApplicationKt.main(Application.kt:21)
at com.example.ApplicationKt.main(Application.kt)
It is a fresh rented server, but I guess something is wrong with it.
docker-compose being slow and my program not starting turned out to be due to insufficient (poor-quality) input to /dev/urandom. Installing https://github.com/smuellerDD/jitterentropy-rngd resolved the problem.
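For anyone hitting the same symptom: the stack trace above shows NativePRNG$Blocking, i.e. SecureRandom waiting on the kernel's blocking entropy pool. Two quick checks/workarounds (a sketch; the jar name is a placeholder):
cat /proc/sys/kernel/random/entropy_avail    # values near zero mean SecureRandom can block for minutes
java -Djava.security.egd=file:/dev/./urandom -jar app.jar    # force the JVM onto the non-blocking pool
An entropy daemon such as jitterentropy-rngd (as above) or haveged fixes it system-wide instead of per-JVM.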

Nebula Graph fails on CentOS 6.5

Nebula Graph fails on CentOS 6.5, the error message is as follows:
# storage log
Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
# meta log
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0415 22:32:38.944437 15532 AsyncServerSocket.cpp:762] failed to set SO_REUSEPORT on async server socket Protocol not available
E0415 22:32:38.945001 15510 ThriftServer.cpp:440] Got an exception while setting up the server: 92failed to bind to async server socket: [::]:0: Protocol not available
E0415 22:32:38.945057 15510 RaftexService.cpp:90] Setup the Raftex Service failed, error: 92failed to bind to async server socket: [::]:0: Protocol not available
E0415 22:32:38.949586 15463 NebulaStore.cpp:47] Start the raft service failed
E0415 22:32:38.949597 15463 MetaDaemon.cpp:88] Nebula store init failed
E0415 22:32:38.949796 15463 MetaDaemon.cpp:215] Init kv failed!
Nebula service status is as follows:
[root@redhat6 scripts]# ./nebula.service status all
[WARN] The maximum files allowed to open might be too few: 1024
[INFO] nebula-metad: Exited
[INFO] nebula-graphd: Exited
[INFO] nebula-storaged: Running as 15547, Listening on 44500
Reason for the error: the CentOS 6.5 kernel version is 2.6.32, which is less than 3.9, and SO_REUSEPORT is only supported on Linux 3.9 and above.
Upgrading the system to CentOS 7.5 solves the problem.
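To confirm this before upgrading, check the running kernel on the affected machine:
uname -r    # SO_REUSEPORT requires Linux 3.9 or newer; CentOS 6.5 ships 2.6.32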

Helm's Tiller container gets x509: certificate signed by unknown authority

I'm running Kubernetes (version 1.5.2) on AWS. I have installed helm using
helm init --node-selectors="nodeType=master"
forcing it to run on the master.
When I try to run helm list I get the following error: Error: Get https://192.0.0.1:443/api/v1/namespaces/kube-system/configmaps?labelSelector=OWNER%3DTILLER: x509: certificate signed by unknown authority
The logs from the tiller container (it seems the issue is between tiller and the Kubernetes API):
E0219 08:15:12.546100 1 config.go:330] Expected to load root CA config from /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, but got err: open /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory
E0219 08:15:12.547957 1 config.go:330] Expected to load root CA config from /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, but got err: open /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory
[main] 2018/02/19 08:15:12 Starting Tiller v2.7.0 (tls=false)
[main] 2018/02/19 08:15:12 GRPC listening on :44134
[main] 2018/02/19 08:15:12 Probes listening on :44135
[main] 2018/02/19 08:15:12 Storage driver is ConfigMap
[main] 2018/02/19 08:15:12 Max history per release is 0
[storage] 2018/02/19 08:20:47 listing all releases with filter
[storage/driver] 2018/02/19 08:20:47 list: failed to list: Get https://192.0.0.1:443/api/v1/namespaces/kube-system/configmaps?labelSelector=OWNER%3DTILLER: x509: certificate signed by unknown authority
Is there a way to configure tiller to ignore the untrusted certificate?
It looks like your Kubernetes cluster isn't properly configured. Usually there is a CA certificate for every pod in /var/run/secrets/kubernetes.io/serviceaccount/ca.crt that allows pods to communicate with the API server.
The first two lines in your log show that no such file could be found:
Expected to load root CA config from /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, but got err: open /var/run/secrets/kubernetes.io/serviceaccount/ca.crt: no such file or directory.
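To confirm, check whether the service-account secret is actually mounted into the Tiller pod (a sketch; the pod name is a placeholder, take it from the first command):
kubectl -n kube-system get pods    # find the tiller-deploy pod name
kubectl -n kube-system exec tiller-deploy-xxxxx -- ls /var/run/secrets/kubernetes.io/serviceaccount/
kubectl -n kube-system get serviceaccount default -o yaml    # its secrets list should not be empty
If the file really is missing, the usual suspects on a self-managed cluster are the ServiceAccount admission plugin being disabled, or the --service-account-key-file / --service-account-private-key-file flags missing on the API server / controller manager.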

Apache nifi is not starting up

I am trying to start Apache NiFi version 1.2.0 on a Windows 8 machine. It used to start properly, but after I restarted the system NiFi is not starting at all. When I check the status, it keeps reporting "Apache NiFi not running".
Below are logs from the nifi.bootstrap.log file:
2017-07-05 15:41:57,105 WARN [NiFi Bootstrap Command Listener] org.apache.nifi.bootstrap.RunNiFi Failed to set permissions so that only the owner can read pid file E:\softwares\nifi-1.2.0\bin\..\run\nifi.pid; this may allows others to have access to the key needed to communicate with NiFi. Permissions should be changed so that only the owner can read this file
2017-07-05 15:41:57,142 WARN [NiFi Bootstrap Command Listener] org.apache.nifi.bootstrap.RunNiFi Failed to set permissions so that only the owner can read status file E:\softwares\nifi-1.2.0\bin\..\run\nifi.status; this may allows others to have access to the key needed to communicate with NiFi. Permissions should be changed so that only the owner can read this file
2017-07-05 15:41:57,168 INFO [NiFi Bootstrap Command Listener] org.apache.nifi.bootstrap.RunNiFi Apache NiFi now running and listening for Bootstrap requests on port 50765
2017-07-05 15:43:12,077 ERROR [NiFi logging handler] org.apache.nifi.StdErr Failed to start web server: Unable to start Flow Controller.
2017-07-05 15:43:12,078 ERROR [NiFi logging handler] org.apache.nifi.StdErr Shutting down...
2017-07-05 15:43:14,501 INFO [main] org.apache.nifi.bootstrap.RunNiFi NiFi never started. Will not restart NiFi
Stack trace from nifi.app.log:
2017-07-05 15:43:12,077 WARN [main] org.apache.nifi.web.server.JettyServer Failed to start web server... shutting down.
org.apache.nifi.web.NiFiCoreException: Unable to start Flow Controller.
at org.apache.nifi.web.contextlistener.ApplicationStartupContextListener.contextInitialized(ApplicationStartupContextListener.java:88)
at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:876)
at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:532)
at org.eclipse.jetty.server.handler.ContextHandler.startContext(ContextHandler.java:839)
at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:344)
at org.eclipse.jetty.webapp.WebAppContext.startWebapp(WebAppContext.java:1480)
at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1442)
at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:799)
at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:261)
at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:540)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:113)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.server.handler.gzip.GzipHandler.doStart(GzipHandler.java:290)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:131)
at org.eclipse.jetty.server.Server.start(Server.java:452)
at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:105)
at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:113)
at org.eclipse.jetty.server.Server.doStart(Server.java:419)
at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68)
at org.apache.nifi.web.server.JettyServer.start(JettyServer.java:695)
at org.apache.nifi.NiFi.<init>(NiFi.java:160)
at org.apache.nifi.NiFi.main(NiFi.java:267)
Caused by: java.io.IOException: Expected to read a Sentinel Byte of '1' but got a value of '0' instead
at org.apache.nifi.repository.schema.SchemaRecordReader.readRecord(SchemaRecordReader.java:65)
at org.apache.nifi.controller.repository.SchemaRepositoryRecordSerde.deserializeRecord(SchemaRepositoryRecordSerde.java:115)
at org.apache.nifi.controller.repository.SchemaRepositoryRecordSerde.deserializeEdit(SchemaRepositoryRecordSerde.java:109)
at org.apache.nifi.controller.repository.SchemaRepositoryRecordSerde.deserializeEdit(SchemaRepositoryRecordSerde.java:46)
at org.wali.MinimalLockingWriteAheadLog$Partition.recoverNextTransaction(MinimalLockingWriteAheadLog.java:1096)
at org.wali.MinimalLockingWriteAheadLog.recoverFromEdits(MinimalLockingWriteAheadLog.java:459)
at org.wali.MinimalLockingWriteAheadLog.recoverRecords(MinimalLockingWriteAheadLog.java:301)
at org.apache.nifi.controller.repository.WriteAheadFlowFileRepository.loadFlowFiles(WriteAheadFlowFileRepository.java:381)
at org.apache.nifi.controller.FlowController.initializeFlow(FlowController.java:712)
at org.apache.nifi.controller.StandardFlowService.initializeController(StandardFlowService.java:953)
at org.apache.nifi.controller.StandardFlowService.load(StandardFlowService.java:534)
at org.apache.nifi.web.contextlistener.ApplicationStartupContextListener.contextInitialized(ApplicationStartupContextListener.java:72)
... 28 common frames omitted
Thanks in advance
After Googling this error "Caused by: java.io.IOException: Expected to read a Sentinel Byte of '1' but got a value of '0' instead", I found that it indicates a partial write to the repositories.
Here are a couple of things you can check/try to bring your dataflow back online:
Check that your disks are not full.
Did you launch NiFi with the same user? Did you run it with administrator privileges?
You can back up/move your repositories and try to start NiFi with empty repositories; you will still have your dataflows, but any file that was processing when you shut down will be gone (a sketch of the commands follows below).
Could you please try that?
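If you try the empty-repositories route, the directories are the standard ones from nifi.properties under the NiFi home; a sketch using the install path from the logs (adjust to your own):
cd E:\softwares\nifi-1.2.0
move flowfile_repository flowfile_repository.bak
move content_repository content_repository.bak
move provenance_repository provenance_repository.bak
move database_repository database_repository.bak
NiFi recreates empty repositories on the next start; your flow definition (conf\flow.xml.gz) is untouched, but any in-flight FlowFiles are lost.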
I think the issue is an incompatible Java version; use Java 8.
If you haven't set JAVA_HOME, set it in the environment variables with a path like "C:/Program Files/jdk1.8".
There is a Jira ticket about NiFi run with Java 9, and the issue is not resolved yet:
https://issues.apache.org/jira/browse/NIFI-4419
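For reference, on Windows JAVA_HOME can be set from an elevated command prompt (the JDK path is an example; use your actual install directory):
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_202" /M
Reopen the prompt before starting NiFi so the new variable is picked up.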

Failed to bind to: spark-master, using a remote cluster with two workers

I managed to get everything working with the local master and two remote workers. Now, I want to connect to a remote master that has the same remote workers. I have tried different combinations of settings within /etc/hosts and other recommendations from the Internet, but nothing worked.
The Main class is:
public static void main(String[] args) {
    ScalaInterface sInterface = new ScalaInterface(CHUNK_SIZE,
            "awsAccessKeyId",
            "awsSecretAccessKey");

    SparkConf conf = new SparkConf().setAppName("POC_JAVA_AND_SPARK")
            .setMaster("spark://spark-master:7077");

    org.apache.spark.SparkContext sc = new org.apache.spark.SparkContext(conf);

    sInterface.enableS3Connection(sc);
    org.apache.spark.rdd.RDD<Tuple2<Path, Text>> fileAndLine = (RDD<Tuple2<Path, Text>>) sInterface.getMappedRDD(sc, "s3n://somebucket/");
    org.apache.spark.rdd.RDD<String> pInfo = (RDD<String>) sInterface.mapPartitionsWithIndex(fileAndLine);

    JavaRDD<String> pInfoJ = pInfo.toJavaRDD();
    List<String> result = pInfoJ.collect();

    String miscInfo = sInterface.getMiscInfo(sc, pInfo);
    System.out.println(miscInfo);
}
It fails at:
List<String> result = pInfoJ.collect();
The error I am getting is:
1354 [sparkDriver-akka.actor.default-dispatcher-3] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
1354 [main] WARN org.apache.spark.util.Utils - Service 'sparkDriver' could not bind on port 0. Attempting port 1.
1355 [main] DEBUG org.apache.spark.util.AkkaUtils - In createActorSystem, requireCookie is: off
1363 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1364 [sparkDriver-akka.actor.default-dispatcher-3] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
1364 [sparkDriver-akka.actor.default-dispatcher-5] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1367 [sparkDriver-akka.actor.default-dispatcher-4] INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
1370 [sparkDriver-akka.actor.default-dispatcher-6] INFO Remoting - Starting remoting
1380 [sparkDriver-akka.actor.default-dispatcher-4] ERROR akka.remote.transport.netty.NettyTransport - failed to bind to spark-master/192.168.0.191:0, shutting down Netty transport
Exception in thread "main" 1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
1382 [sparkDriver-akka.actor.default-dispatcher-6] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
java.net.BindException: Failed to bind to: spark-master/192.168.0.191:0: Service 'sparkDriver' failed after 16 retries!
at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:393)
at akka.remote.transport.netty.NettyTransport$$anonfun$listen$1.apply(NettyTransport.scala:389)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:206)
at scala.util.Try$.apply(Try.scala:161)
at scala.util.Success.map(Try.scala:206)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
1383 [sparkDriver-akka.actor.default-dispatcher-7] INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
1385 [delete Spark temp dirs] DEBUG org.apache.spark.util.Utils - Shutdown hook called
Thank you kindly for your help!
Setting the environment variable SPARK_LOCAL_IP=127.0.0.1 solved this for me.
I had this problem when my /etc/hosts file was mapping the wrong IP address to my local hostname.
The BindException in your logs complains about the IP address 192.168.0.191. I assume that is what your hostname resolves to, and that it's not the actual IP address your network interface is using. It should work fine once you fix that.
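For example, find the real address and point /etc/hosts at it (the address below is hypothetical):
ip addr show    # shows the address your network interface actually uses
Then the /etc/hosts entry should look like:
192.168.0.25   spark-master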
I had Spark working on my EC2 instance. I started a new web server, and to meet its requirement I had to change the hostname to the EC2 public DNS name, i.e.
hostname ec2-54-xxx-xxx-xxx.compute-1.amazonaws.com
After that my Spark could not work and showed errors as below:
16/09/20 21:02:22 WARN Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.
16/09/20 21:02:22 ERROR SparkContext: Error initializing SparkContext.
I solved it by setting SPARK_LOCAL_IP as below:
export SPARK_LOCAL_IP="localhost"
then just launched the Spark shell as below:
$SPARK_HOME/bin/spark-shell
Possibly your master is running on a non-default port. Can you post your submit command?
Have a look at https://spark.apache.org/docs/latest/spark-standalone.html#connecting-an-application-to-the-cluster
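For reference, the standalone master's web UI (http://<master-host>:8080 by default) shows the exact spark://host:port URL to use, and a submit against it looks like this (class and jar names are placeholders):
$SPARK_HOME/bin/spark-submit --master spark://spark-master:7077 --class com.example.Main app.jar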
