PreLoginEvent(cancelled=false, cancelReasonComponents=null) - timeout

I'm having this problem with my auth plugins, such as PixelLogin and DynamicBungeeAuth. I have tried three or four plugins, but I get the same error with all of them.
11:33:13 [WARNING] Plugin listener com.pedrojm96.pixellogin.bungee.PixelBungeeListener took 494ms to process event PreLoginEvent(cancelled=false, cancelReasonComponents=null, connection=[cvgamer,/ip] <-> InitialHandler)!
11:33:13 [WARNING] Event PreLoginEvent(cancelled=false, cancelReasonComponents=null, connection=[cvgamer,/ip] <-> InitialHandler) took 497ms to process!
11:33:43 [WARNING] [cvgamer,/ip] <-> InitialHandler - read timed out
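The warnings show the listener spending ~500 ms inside PreLoginEvent, and the read timeout follows 30 seconds later, which suggests the login is never being completed. For context, PreLoginEvent is a BungeeCord AsyncEvent, so a listener can hold the event open without blocking the pipeline; a minimal sketch (invented class name, not taken from any of these plugins):

import net.md_5.bungee.api.ProxyServer;
import net.md_5.bungee.api.event.PreLoginEvent;
import net.md_5.bungee.api.plugin.Listener;
import net.md_5.bungee.api.plugin.Plugin;
import net.md_5.bungee.event.EventHandler;

public class NonBlockingPreLogin implements Listener {
    private final Plugin plugin;

    public NonBlockingPreLogin(Plugin plugin) {
        this.plugin = plugin;
    }

    @EventHandler
    public void onPreLogin(PreLoginEvent event) {
        event.registerIntent(plugin); // keep the event open instead of blocking it
        ProxyServer.getInstance().getScheduler().runAsync(plugin, () -> {
            try {
                // slow work here, e.g. an auth-database lookup (hypothetical)
            } finally {
                event.completeIntent(plugin); // always release, or the login hangs until it times out
            }
        });
    }
}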

Related

Q: Apache Geode - Unable to reconnect a node after OS patching: "15 seconds have elapsed while waiting for replies"

I have a cluster consisting of 4 nodes in total, 3 servers and 1 management node, all working properly.
At the beginning of the month we planned to patch the OS, and we started with the first server node using this procedure:
Stop service
OS patching
Server restart
Start service
The service on the first patched node, named "serverA", fails to restart with this error:
Log entries from the cluster join:
serverA:
| INFO | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: shared=true ordered=false failed to connect to peer 10.237.110.195( Server serverB:9993):1024 because: java.net.ConnectException: Connection timed out (Connection timed out)
| WARN | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: Attempting reconnect to peer 10.237.110.195( Server serverB:9993):1024
ServerMgmt:
| WARN | pool-3-thread-1 | tributed.internal.ReplyProcessor21 | --> 15 seconds have elapsed while waiting for replies: <CreateRegionProcessor$CreateRegionReplyProcessor 44180 waiting for 1 replies from [10.237.110.194( Server serverA:632):1024]> on 10.237.110.225( Management:6033):1024 whose current membership list is: [[10.237.110.196( Server serverC:16805):1024, 10.237.110.225( Management:6033):1024, 10.237.110.195( Server serverB:9993):1024, 10.237.110.194( Server serverA:632):1024]]
The connection between the systems was verified with tcpdump; UDP on port 1024 is working fine.
We have tried redeploying the service and have made numerous attempts, but we always get the same error during startup.
Any suggestions? Thank you.
Marco.
For this error message to appear, serverA was probably able to send UDP messages to serverB but is failing to create a TCP connection. It's hard to say why, though: a firewall issue, some TCP configuration issue, ...?
Check whether serverB has anything interesting in its logs. Since you are using tcpdump, you should be watching for that TCP connection to serverB:9993, since it looks like that is what failed.
There is no firewall between the systems. We analyzed the network connection again during startup of node A, and we can see that communication can be established between all systems. What we detected, however, is that on port 2323, which is configured as the locator, node A sends packets to the B and C nodes but only receives packets back from the C node, not from the B node. This is, for us, another sign that the B node has an issue. Is there a way to check our assumption from the B node? (A probe sketch follows the IP list below.)
A node ip .194
B node ip .195
C node ip .196
Management ip .225
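A minimal, hypothetical way to check the TCP side independently of Geode (host and port copied from the log above; the actual peer-to-peer port may differ): run a plain socket probe from node A toward serverB.

import java.net.InetSocketAddress;
import java.net.Socket;

public class TcpProbe {
    public static void main(String[] args) throws Exception {
        String host = "10.237.110.195"; // serverB, from the log above
        int port = 1024;                // port shown in the log; adjust if needed
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000); // 5-second timeout
            System.out.println("TCP connect to " + host + ":" + port + " succeeded");
        }
    }
}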

Ktor needs 1 hour (forever) to boot up

I have a Ktor app. It works fine when I run it in development mode. I package it in a Docker image by copying over what the Gradle application plugin produced. That also works fine on my local machine (8 cores). But now the strange part: when I do exactly the same thing on a rented V-Server, also running Ubuntu 20.04 like my local system, Ktor is incredibly slow.
docker-compose logs server:
server | 2021-08-24 08:00:23.337 [main] INFO ktor.application - Autoreload is disabled because the development mode is off.
server | 2021-08-24 08:25:35.048 [main] INFO ktor.application - Autoreload is disabled because the development mode is off.
server | 2021-08-24 09:18:48.246 [main] INFO c.e.e.s.TemplateStore - Starting to parse Sentences
server | 2021-08-24 09:18:48.345 [main] INFO c.e.e.s.TemplateStore - Finished parsing sentences
server | 2021-08-24 09:18:48.346 [main] INFO ktor.application - Responding at http://0.0.0.0:8080
server | 2021-08-24 09:18:48.347 [main] INFO ktor.application - Application started in 3193.32 seconds.
Application started in 3193.32 seconds
The source code can be found at https://github.com/1-alex98/whatisthat. It has a docker-compose.yml defining the whole Docker container that is started.
Local system: 32 GB RAM + 8 cores. V-Server: 4 GB RAM + 2 cores (htop shows plenty of resources are free).
I am looking for ideas on what in the world could cause this behavior, or for ways to debug it.
Update:
It seems to be reading from a file forever:
"main" #1 prio=5 os_prio=0 cpu=652.14ms elapsed=173.92s tid=0x00007f01d4016000 nid=0xe runnable [0x00007f01dace6000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(java.base#11.0.12/Native Method)
at java.io.FileInputStream.read(java.base#11.0.12/FileInputStream.java:279)
at java.io.FilterInputStream.read(java.base#11.0.12/FilterInputStream.java:133)
at sun.security.provider.NativePRNG$RandomIO.readFully(java.base#11.0.12/NativePRNG.java:424)
at sun.security.provider.NativePRNG$RandomIO.ensureBufferValid(java.base#11.0.12/NativePRNG.java:526)
at sun.security.provider.NativePRNG$RandomIO.implNextBytes(java.base#11.0.12/NativePRNG.java:545)
- locked <0x00000000c7571158> (a java.lang.Object)
at sun.security.provider.NativePRNG$Blocking.engineNextBytes(java.base#11.0.12/NativePRNG.java:268)
at java.security.SecureRandom.nextBytes(java.base#11.0.12/SecureRandom.java:751)
at kotlin.random.AbstractPlatformRandom.nextBytes(PlatformRandom.kt:47)
at kotlin.random.Random.nextBytes(Random.kt:260)
at com.example.routes.websocket.WebsocketRoutingKt.<clinit>(WebsocketRouting.kt:40)
at com.example.plugins.RoutingKt$routing$1.invoke(Routing.kt:13)
at com.example.plugins.RoutingKt$routing$1.invoke(Routing.kt:11)
at io.ktor.routing.Routing$Feature.install(Routing.kt:106)
at io.ktor.routing.Routing$Feature.install(Routing.kt:88)
at io.ktor.application.ApplicationFeatureKt.install(ApplicationFeature.kt:68)
at io.ktor.routing.RoutingKt.routing(Routing.kt:129)
at com.example.plugins.RoutingKt.routing(Routing.kt:11)
at com.example.ApplicationKt$main$1.invoke(Application.kt:18)
at com.example.ApplicationKt$main$1.invoke(Application.kt:14)
at io.ktor.server.engine.internal.CallableUtilsKt.executeModuleFunction(CallableUtils.kt:50)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$launchModuleByName$1.invoke(ApplicationEngineEnvironmentReloading.kt:317)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$launchModuleByName$1.invoke(ApplicationEngineEnvironmentReloading.kt:316)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.avoidingDoubleStartupFor(ApplicationEngineEnvironmentReloading.kt:341)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.launchModuleByName(ApplicationEngineEnvironmentReloading.kt:316)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.access$launchModuleByName(ApplicationEngineEnvironmentReloading.kt:30)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$instantiateAndConfigureApplication$1.invoke(ApplicationEngineEnvironmentReloading.kt:304)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading$instantiateAndConfigureApplication$1.invoke(ApplicationEngineEnvironmentReloading.kt:295)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.avoidingDoubleStartup(ApplicationEngineEnvironmentReloading.kt:323)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.instantiateAndConfigureApplication(ApplicationEngineEnvironmentReloading.kt:295)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.createApplication(ApplicationEngineEnvironmentReloading.kt:136)
at io.ktor.server.engine.ApplicationEngineEnvironmentReloading.start(ApplicationEngineEnvironmentReloading.kt:268)
at io.ktor.server.netty.NettyApplicationEngine.start(NettyApplicationEngine.kt:174)
at com.example.ApplicationKt.main(Application.kt:21)
at com.example.ApplicationKt.main(Application.kt)
It is a freshly rented server, but I guess something was wrong with it.
docker-compose being slow and my program not starting seemed to be due to insufficient (not good enough) input to /dev/urandom. Installing https://github.com/smuellerDD/jitterentropy-rngd resolved the problem.
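For reference, the blocked frames in the dump above sit in sun.security.provider.NativePRNG, i.e. SecureRandom waiting on the kernel entropy pool. A small diagnostic sketch (an assumption-laden illustration, not the fix the author used; the algorithm names exist on Linux OpenJDK) that times the non-blocking vs. blocking PRNG:

import java.security.SecureRandom;

public class EntropyCheck {
    public static void main(String[] args) throws Exception {
        for (String algo : new String[] {"NativePRNGNonBlocking", "NativePRNGBlocking"}) {
            SecureRandom rng = SecureRandom.getInstance(algo);
            byte[] buf = new byte[32];
            long start = System.nanoTime();
            rng.nextBytes(buf); // the blocking variant stalls on an entropy-starved VM
            System.out.printf("%s: %.1f ms%n", algo, (System.nanoTime() - start) / 1e6);
        }
    }
}

Another widely used workaround is starting the JVM with -Djava.security.egd=file:/dev/./urandom, though an entropy daemon like jitterentropy-rngd fixes the problem system-wide.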

Why is "java.nio.channels.ClosedByInterruptExceptio" called when caling multiple groupBy with pyspark?

I am running a PySpark job (Python 3.5, Spark 2.1, Java 8) in yarn-client mode from an edge node with spark2-submit. The job succeeds: the resulting dataframe is written to HDFS and seems correct (we have not yet found any error in its data).
The issue is that I see a lot of ERROR messages (about 6,000), and I would like to understand what is wrong and whether they impact the final dataframe.
All the ERROR messages look like this one:
18/06/01 14:08:36 INFO codegen.CodeGenerator: Code generated in 45.712788 ms
18/06/01 14:08:37 INFO executor.Executor: Finished task 33.0 in stage 34.0 (TID 2312). 4600 bytes result sent to driver
18/06/01 14:08:37 INFO executor.Executor: Finished task 117.0 in stage 34.0 (TID 2316). 3801 bytes result sent to driver
18/06/01 14:08:40 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 2512
18/06/01 14:08:40 INFO executor.Executor: Running task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 193 blocks
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 1 ms
18/06/01 14:08:40 INFO executor.Executor: Executor is trying to kill task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /...../yarn/nm/usercache/../appcache/application_xxxx/blockmgr-xxxx/temp_shuffle_xxxxx
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:212)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:238)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The ERRORs start after quite a bit of feature engineering (select, groupBy, ...), and I see them when adding these lines:
# assuming earlier in the job: from pyspark.sql import functions as func
df = (df.groupby('x', 'y')
        .agg(func.sum('x').alias('x_sum'))
        .groupby('y')
        .agg(func.mean('y').alias('py_sum_avg')))
So I guess the data shuffle is triggered by the groupBy.
I first thought it was a memory issue, so I added much more memory and overhead memory for both the driver and the executors, without real success (this is what you can find in some other threads). There are other groupBys in the code, and it seems something goes wrong at this stage.
I have also seen that it could be related to too many open files, or to a full disk, but the ERROR messages are a bit different in those two cases.
I am quite new to PySpark, so I am looking for advice on how to debug this issue.
How can I find the reason why java.nio.channels.ClosedByInterruptException is thrown? I guess it is what triggers the ERROR in storage.DiskBlockObjectWriter; is that correct? Is it triggered by "Executor: Executor is trying to kill task 190"? If it is standard for some tasks to be killed, why does that produce ERRORs? Can I get some hints from the Spark UI (I can see that some tasks were killed)? Can I get more information from the traceback?
How can I fix these issues? Any suggestions on how to debug such things? I am not sure where to look: memory, an issue in the PySpark code, an issue with the cluster setup, or my Spark parameters.
I am working on a Hadoop data lake with Cloudera CDH 5.8.
There is a known issue with spark.speculation in Spark 2.1, which I am using.
The related upstream bug is SPARK-19293. The exception stack trace in my situation is slightly different from the one in SPARK-19293. Setting
--conf spark.speculation=false
made the ERRORs disappear in my test.
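For a spark2-submit launch in yarn-client mode like the one described above, that flag goes on the command line; a sketch with a placeholder script name:
spark2-submit --master yarn --deploy-mode client --conf spark.speculation=false my_job.py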

Kafka Streams application stopped with org.apache.kafka.common.errors.TimeoutException

I have installed Kafka 1.0.0 with the help of Docker Compose, and I am running it successfully with two brokers. I created a topic manually, with partitions, and inserted the events.
Now I am running an application with the 1.0.0 Kafka Streams library, pointed at this Kafka. After running my application for some time, the following messages appeared in the log and the application stopped running. Except for the producer's request.timeout.ms, which is 120 seconds, all other config parameters are defaults.
Before it stopped with the messages below, I observed 'Trying to rejoin the consumer group now. org.apache.kafka.streams.errors.TaskMigratedException:' and 'Caused by: org.apache.kafka.clients.consumer.CommitFailedException:' messages in the log a couple of times.
What could be the possible reason? Please help me.
Messages before stopping:
2017-12-07 06:17:03,122 WARN o.a.k.c.p.i.Sender [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] [Producer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] Got error produce response with sample id 14099 on topic-partition abc-0, retrying (9 attempts left). Error: NETWORK_EXCEPTION
2017-12-07 06:18:02,675 ERROR o.a.k.s.p.i.RecordCollectorImpl [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] task [2_0] Error sending record (key 5a12c9ade532af0412fc7bcc.5a12c9ade532af0412fc7bca value com.sample.kafka.streams.SampleEvent#4a56c681 timestamp 1512363589768) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 9 record(s) for abc-0: 189836 ms has passed since last append; No more records will be sent and no more offsets will be recorded for this task.
2017-12-07 06:18:02,927 INFO o.a.k.c.c.i.AbstractCoordinator [sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1] [Consumer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-consumer, groupId=sample-app-0.0.1] Discovered coordinator 1.1.1.1:32775 (id: 2147482645 rack: null)
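Not a diagnosis, but since only the producer's request.timeout.ms was changed: Kafka Streams forwards producer-level settings through the producer. prefix, so tuning the internal producer (the one logging the NETWORK_EXCEPTION above) would look roughly like this sketch, with broker addresses as placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsProducerTuning {
    public static Properties build() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sample-app-0.0.1"); // from the log above
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholders
        // Settings handed to the embedded producer:
        props.put(StreamsConfig.producerPrefix(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG), 120000);
        props.put(StreamsConfig.producerPrefix(ProducerConfig.RETRIES_CONFIG), 10);
        return props;
    }
}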

Dataflow job errors: "The resource 'projects/<removed>/zones/us-central1-a/disks/<removed>-harness-0' is not ready"

One of our pipelines failed this morning with an error we've never seen before. In addition, we had to manually delete the one VM that had been spun up, in order to cancel/stop the job.
Has anything changed in the Dataflow service that could cause this error?
0 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - PipelineOptions.filesToStage was not specified. Defaulting to files from the classpath: will stage 49 files. Enable logging at DEBUG level to see which files will be staged.
2243 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Successfully created cloud dataflow service pipeline
2282 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Last loaded table was found. It will be processed for denormalization: Clicks_06_2015
2282 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Last loaded table was found. It will be processed for denormalization: ActiveViews_06_2015
2282 [main] INFO com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Last loaded table was found. It will be processed for denormalization: Impressions_06_2015
2435 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.advertisers2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
2615 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.lineitems2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
2616 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.creative2name2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
2616 [main] WARN com.google.cloud.dataflow.sdk.Pipeline - Transform <removed>:<removed>.adunit2site2 does not have a stable unique name. In the future, this will prevent reloading streaming pipelines
3236 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - Executing pipeline on the Dataflow Service, which will have billing implications related to Google Compute Engine usage and other Google Cloud Services.
3241 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading 49 files from PipelineOptions.filesToStage to staging location to prepare for execution.
41834 [main] INFO com.google.cloud.dataflow.sdk.util.PackageUtil - Uploading PipelineOptions.filesToStage complete: 10 files newly uploaded, 39 files cached
Dataflow SDK version: 0.4.150602
51003 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - To access the Dataflow monitoring console, please navigate to https://console.developers.google.com/project/<removed>/dataflow/job/2015-06-11_16_39_02-17130055143605818331
Submitted job: 2015-06-11_16_39_02-17130055143605818331
51004 [main] INFO com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner - To cancel the job using the 'gcloud' tool, run:
> gcloud alpha dataflow jobs --project=<removed> cancel 2015-06-11_16_39_02-17130055143605818331
2015-06-11T23:39:02.506Z: Detail: (b056559940543e6a): Expanding GroupByKey operations into optimizable parts.
2015-06-11T23:39:02.509Z: Detail: (b056559940543d60): Annotating graph with Autotuner information.
2015-06-11T23:39:02.759Z: Detail: (b0565599405437a9): Fusing adjacent ParDo, Read, Write, and Flatten operations
2015-06-11T23:39:02.762Z: Detail: (b05655994054369f): Fusing consumer Impressions_06_2015-ParDoDFP-transform into Impressions_06_2015-BQ-Read
2015-06-11T23:39:02.764Z: Detail: (b056559940543595): Fusing consumer Impressions_06_2015-BQ-Write into Impressions_06_2015-ParDoDFP-transform
2015-06-11T23:39:02.766Z: Detail: (b05655994054348b): Fusing consumer ActiveViews_06_2015-ParDoDFP-transform into ActiveViews_06_2015-BQ-Read
2015-06-11T23:39:02.767Z: Detail: (b056559940543381): Fusing consumer ActiveViews_06_2015-BQ-Write into ActiveViews_06_2015-ParDoDFP-transform
2015-06-11T23:39:02.769Z: Detail: (b056559940543277): Fusing consumer Clicks_06_2015-ParDoDFP-transform into Clicks_06_2015-BQ-Read
2015-06-11T23:39:02.771Z: Detail: (b05655994054316d): Fusing consumer Clicks_06_2015-BQ-Write into Clicks_06_2015-ParDoDFP-transform
2015-06-11T23:39:02.818Z: Detail: (b056559940543987): Adding StepResource setup and teardown to workflow graph.
2015-06-11T23:39:18.614Z: Error: (5494fb7a460f58a8): Workflow failed. Causes: (20fbc2bb0e7cb0b1): One or more operations had an error: 'operation-1434065943092-518467f1f5b21-8d000d8a-d5cd5762': 'The resource 'projects/<removed>/zones/us-central1-a/disks/dfp-denormalization-job-1-06111639-3db5-harness-0' is not ready'.
2015-06-11T23:39:18.651Z: Detail: (4fb958a4957733a5): Cleaning up.
2015-06-11T23:40:36.126Z: Error: (d41cf136c17a5e79): Workflow failed. Causes: (20fbc2bb0e7cb0b1): One or more operations had an error: 'operation-1434065943092-518467f1f5b21-8d000d8a-d5cd5762': 'The resource 'projects/<removed>/zones/us-central1-a/disks/dfp-denormalization-job-1-06111639-3db5-harness-0' is not ready'.
2015-06-11T23:43:05.998Z: Warning: (c5964e114f42988b): Job 2015-06-11_16_39_02-17130055143605818331 is already finishing. Ignoring cancel request.
2015-06-11T23:48:04.715Z: Warning: (cf462c726cde3704): Job 2015-06-11_16_39_02-17130055143605818331 is already finishing. Ignoring cancel request.
2015-06-11T23:50:35.529Z: Warning: Internal Issue (4fb958a495773599): 65177287:8503
748739 [main] INFO com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner - Job finished with status FAILED
748740 [main] ERROR com.<removed>.cdf.dfp.DFPDenormalizationCloudDataFlowJob - Job "dfp-denormalization-job-1434066640362" failed. Job may be retried.
This was a temporary issue with the Google Compute Engine API that has since been resolved. When calling GCE on behalf of the user, Dataflow will attempt to work around any transient errors.

Resources