I have a Fluentd, Elasticsearch, and Graylog setup, and I'm intermittently getting the error below in the td-agent log:
[warn]: temporarily failed to flush the buffer. next_retry=2019-01-27
19:00:14 -0500 error_class="ArgumentError" error="Data too big (189382
bytes), would create more than 128 chunks!"
plugin_id="object:3fee25617fbc"
Because of this, buffer memory usage keeps increasing and td-agent fails to send messages to Graylog.
I have tried setting buffer_chunk_limit to 8m and flush_interval to 5s, roughly as in the sketch below.
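For reference, a minimal sketch of the kind of <match> section I mean; the output plugin name, tag pattern, and Graylog host are placeholders rather than my exact config, only the buffer settings are the values I changed:

<match app.**>
  @type gelf                 # placeholder: whichever output plugin you use to send to Graylog
  host graylog.example.com   # placeholder Graylog host
  port 12201
  buffer_type memory
  buffer_chunk_limit 8m      # the value I tried
  flush_interval 5s          # the value I tried
</match>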
I'm running Neo4j Community Edition 3.5.5 on an AWS instance with 8 GB of RAM.
For the first few months it ran fine and returned results in milliseconds, but these days it keeps stopping and starting again on its own. Sometimes it won't start at all for hours, even when we start it manually.
Can anyone please help me with this? I'm getting the logs below.
tail -100f /var/log/neo4j/neo4j.log
2019-07-29 13:17:52.570+0000 WARN The client is unauthorized due to authentication failure.
2019-09-04 05:33:52.328+0000 WARN The client is unauthorized due to authentication failure.
2019-10-17 15:18:14.652+0000 INFO Transaction with id 2683388 has been automatically rolled back due to transaction timeout.
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 3670016000 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/ubuntu/hs_err_pid8965.log
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
There is insufficient memory for the Java Runtime Environment to continue.
Native memory allocation (mmap) failed to map 3670016000 bytes for committing reserved memory.
An error report file with more information is saved as:
/home/ubuntu/hs_err_pid9050.log
nohup: ignoring input
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000006e5400000, 3670016000, 0) failed; error='Cannot allocate memory' (errno=12)
2019-10-17 17:14:44.651+0000 INFO Transaction with id 2689294 has been automatically rolled back due to transaction timeout.
This can happen if you are running a lot of MERGE operations without the proper indexes created; alternatively, try increasing the heap size in the config file, along the lines of the sketch below.
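A minimal sketch, assuming Neo4j 3.5 and the 8 GB instance from the question; the label, property, and sizes are placeholders to adapt to your data and machine:

// Cypher (Neo4j 3.5 syntax): index the property your MERGE looks nodes up by
CREATE INDEX ON :Person(id)

# neo4j.conf: give the heap and page cache explicit sizes instead of relying on defaults
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=2g
dbms.memory.pagecache.size=3g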
I am running a PySpark job (Python 3.5, Spark 2.1, Java 8) in yarn-client mode from an edge node with spark2-submit. The job succeeds, and the resulting dataframe is written to HDFS and seems correct (we haven't found any errors in the data so far).
The issue is that I see a lot of ERROR messages (about 6,000) and I would like to understand what is wrong and whether it affects the final dataframe.
All the ERROR messages look like this one:
18/06/01 14:08:36 INFO codegen.CodeGenerator: Code generated in 45.712788 ms
18/06/01 14:08:37 INFO executor.Executor: Finished task 33.0 in stage 34.0 (TID 2312). 4600 bytes result sent to driver
18/06/01 14:08:37 INFO executor.Executor: Finished task 117.0 in stage 34.0 (TID 2316). 3801 bytes result sent to driver
18/06/01 14:08:40 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 2512
18/06/01 14:08:40 INFO executor.Executor: Running task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Getting 28 non-empty blocks out of 193 blocks
18/06/01 14:08:40 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 1 ms
18/06/01 14:08:40 INFO executor.Executor: Executor is trying to kill task 190.1 in stage 34.0 (TID 2512)
18/06/01 14:08:40 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /...../yarn/nm/usercache/../appcache/application_xxxx/blockmgr-xxxx/temp_shuffle_xxxxx
java.nio.channels.ClosedByInterruptException
at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
at sun.nio.ch.FileChannelImpl.truncate(FileChannelImpl.java:372)
at org.apache.spark.storage.DiskBlockObjectWriter.revertPartialWritesAndClose(DiskBlockObjectWriter.scala:212)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.stop(BypassMergeSortShuffleWriter.java:238)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The ERRORs start after quite a lot of feature engineering (select, groupBy, ...), and I see them when adding these lines:
from pyspark.sql import functions as func  # import needed for the aggregations below

df = (df.groupby('x', 'y')
      .agg(func.sum('x').alias('x_sum'))
      .groupby('y')
      .agg(func.mean('y').alias('py_sum_avg')))
So I guess the data shuffle is triggered by the groupBy.
I first thought it was a memory issue, so I gave the driver and executors much more memory and overhead memory, without real success (this is the advice you can find in some other threads). The code has other groupBys as well, and it seems to be this stage that causes the problem.
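For what it's worth, this is roughly how I passed the extra memory; the numbers and the script name are placeholders, not my exact values:

spark2-submit \
  --master yarn \
  --deploy-mode client \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my_job.py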
I also saw that it could be related to too many open files or to a full disk, but the ERROR messages are a bit different in those two cases.
I am quite new to PySpark, so I am looking for advice on how to debug this kind of issue.
How can I find out why java.nio.channels.ClosedByInterruptException is thrown? I guess it is what triggers the ERROR from storage.DiskBlockObjectWriter; is that correct? Is it itself triggered by Executor: Executor is trying to kill task 190? If having some tasks killed is a normal part of the process, why does it produce ERRORs? Can I get some hints from the Spark UI (I can see that some tasks were killed)? Can I get more information from the traceback?
How can I fix these issues? Any suggestions on how to debug this kind of thing? I am not sure where to look: memory, the PySpark code itself, the cluster setup, or my Spark parameters.
I am working on a Hadoop data lake with Cloudera CDH 5.8.
There is a known issue with spark.speculation in Spark 2.1, which is the version I am using.
The related upstream bug is SPARK-19293. The exception stack trace in my case is slightly different from the one in SPARK-19293. After adding
--conf spark.speculation=false
the ERRORs were gone in my tests (see the sketch of the full submit command below).
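In my case that just meant adding the flag to the submit command, for example (the script name and other options are placeholders):

spark2-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.speculation=false \
  my_job.py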
I have installed Kafka 1.0.0 with Docker Compose and am running it successfully with two brokers. I created a topic manually with partitions and inserted the events.
Now I am running an application built on Kafka Streams 1.0.0 that points to this Kafka cluster. After running for some time, the application logged the following messages and stopped. Apart from the producer request.timeout.ms, which is set to 120 seconds, all config parameters are left at their defaults; a sketch of that override follows.
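This is roughly how that single override looks in my Streams configuration (properties style; the broker addresses are placeholders):

application.id=sample-app-0.0.1
# placeholder broker addresses
bootstrap.servers=broker1:9092,broker2:9092
# settings prefixed with "producer." are forwarded to the internal Kafka producer
producer.request.timeout.ms=120000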
Before it stopped with the messages below, I observed 'Trying to rejoin the consumer group now. org.apache.kafka.streams.errors.TaskMigratedException:' and 'Caused by: org.apache.kafka.clients.consumer.CommitFailedException:' messages in the log a couple of times.
What could be the possible reason? Please help me.
Messages before stopping:
2017-12-07 06:17:03,122 WARN o.a.k.c.p.i.Sender [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] [Producer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] Got error produce response with sample id 14099 on topic-partition abc-0, retrying (9 attempts left). Error: NETWORK_EXCEPTION
2017-12-07 06:18:02,675 ERROR o.a.k.s.p.i.RecordCollectorImpl [kafka-producer-network-thread | sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-producer] task [2_0] Error sending record (key 5a12c9ade532af0412fc7bcc.5a12c9ade532af0412fc7bca value com.sample.kafka.streams.SampleEvent#4a56c681 timestamp 1512363589768) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 9 record(s) for abc-0: 189836 ms has passed since last append; No more records will be sent and no more offsets will be recorded for this task.
2017-12-07 06:18:02,927 INFO o.a.k.c.c.i.AbstractCoordinator [sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1] [Consumer clientId=sample-app-0.0.1-7f99fa3f-4487-48dc-af3f-9296ee513452-StreamThread-1-consumer, groupId=sample-app-0.0.1] Discovered coordinator 1.1.1.1:32775 (id: 2147482645 rack: null)
I am using Dockerflow to run parallel tasks through the Google Pipelines API on Google Cloud Platform. I started a single-step task running 1389 VMs in parallel and found that 233 of the VMs were apparently doing nothing and hanging indefinitely.
I did a spot check of the serial console output and repeatedly saw the VMs running into "Getting controller config failed" errors.
When I tried logging into the VMs I received the error: "Connection Failed. We are unable to connect to the VM on port 22".
I am wondering why my VM instances are hanging, and if there is something I can do to avoid running into these issues.
I've included a snippet of the serial console output below:
startupscript: +++ readlink -f /usr/share/google-genomics/startup.sh
startupscript: ++ dirname /usr/share/google-genomics/startup.sh
startupscript: + cd /usr/share/google-genomics
startupscript: + ./controller --operation_id <id> --validation_token <token> --base_path https://genomics.googleapis.com
create controller[2905]: Getting controller config
create controller[2905]: Getting controller config failed, will retry: Get <link>: Get <service_account_token_link>: net/http: timeout awaiting response headers
create controller[2905]: Getting controller config failed, will retry: Get <link>: dial tcp 74.125.26.95:443: i/o timeout
collectd[2342]: write_gcm: Asking metadata server for auth token
collectd[2342]: write_gcm: curl_easy_perform() failed: Couldn't connect to server
collectd[2342]: write_gcm: Error -1 from wg_curl_get_or_post
collectd[2342]: write_gcm: wg_transmit_unique_segment failed.
collectd[2342]: write_gcm: wg_transmit_unique_segments failed. Flushing.
There was a temporary networking issue in us-east1-b; all three of the VMs above were in us-east1-b. These minor incidents do not appear on https://status.cloud.google.com/
Serial console output for a successful run looks like:
A Feb 21 19:05:06 ggp-5629907348021283130 startupscript: + ./controller --operation_id --validation_token --base_path https://autopush-genomics.sandbox.googleapis.com
A Feb 21 19:05:06 ggp-5629907348021283130 create controller[2689]: Getting controller config
A Feb 21 19:05:36 ggp-5629907348021283130 create controller[2689]: Getting controller config failed, will retry: Get https://genomics.googleapis.com/v1alpha2/pipelines:getControllerConfig?alt=json&operationId=&validationToken=: dial tcp 173.194.212.81:443: i/o timeout
A Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Switching to status: pulling-image
A Feb 21 19:05:43 ggp-5629907348021283130 controller[2689]: Calling SetOperationStatus(pulling-image)
A Feb 21 19:05:44 ggp-5629907348021283130 controller[2689]: SetOperationStatus(pulling-image) succeeded
The "Getting controller config failed, will retry" is fine. It succeeded upon retry. The "SetOperationStatus(pulling-image) succeeded" indicates networking is working.
In theory, you can submit any number of jobs to Pipelines API and the API will take care of queueing.
If these temporary networking hiccups become common, we may consider changing Pipelines API to somehow detect and retry.
There may have been a temporary networking issue. Can you give me some failed operation IDs (or failed VM names)?
Have you tried again since then? Can you reproduce the problem?
As I am new to Neo4j, I have been facing the following errors.
1. When I start Neo4j, it gives the following message:
WARNING: Max 1024 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.
Using additional JVM arguments: -server -XX:+DisableExplicitGC -Dorg.neo4j.server.properties=conf/neo4j-server.properties -Djava.util.logging.config.file=conf/logging.properties -Dlog4j.configuration=file:conf/log4j.properties -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
Note: I tried to edit the file /etc/security/limits.conf and added
root soft nofile 40000
root hard nofile 40000
but that did not solve it.
2. The messages.log file has multiple records like the ones below:
2014-07-16 07:07:49.688+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 111ms [total block time: 56.805s]
2014-07-16 07:09:02.778+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 103ms [total block time: 56.908s]
The problem is that sometimes the server CPU suddenly spikes and takes a few hours to come back down. Please give me a proper idea of what might be going on.
Thanks,
Az
1) Best practice on Ubuntu is not to set this in /etc/security/limits.conf directly; instead, create a file /etc/security/limits.d/neo4j.conf containing:
* soft nofile 40000
* hard nofile 40000
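A quick sanity check that the new limit is actually picked up (assuming the service runs as the neo4j user and is restarted from a fresh session):

# inspect the limits of the running Neo4j java process (<pid> is its process id)
cat /proc/<pid>/limits
# or, after logging in again as the user that runs Neo4j:
ulimit -n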
2) This is informational output telling you how much time is spent in GC. If a single pause gets too long, it is an indication that the JVM settings should be tweaked. Stop times of 100 ms are not really concerning in most cases; however, the "total block time" of almost one minute might warrant further investigation, for example by giving the JVM a larger, fixed-size heap, as sketched below.
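A minimal sketch, assuming a Neo4j 2.x install (which the 2014 logs suggest) using the standard conf/neo4j-wrapper.conf; the values are placeholders to adjust to the machine's RAM:

# conf/neo4j-wrapper.conf -- give the JVM an explicit, fixed-size heap (values in MB)
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096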