How to monitor elasticsearch using nagios - monitoring

I would like to monitor elasticsearch using nagios.
Basiclly, I want to know if elasticsearch is up.
I think I can use the elasticsearch Cluster Health API (see here)
and use the 'status' that I get back (green, yellow or red), but I still don't know how to use nagios for that matter ( nagios is on one server and elasticsearc is on another server ).
Is there another way to do that?
EDIT :
I just found that - check_http_json. I think I'll try it.

After a while - I've managed to monitor elasticsearch using the nrpe.
I wanted to use the elasticsearch Cluster Health API - but I couldn't use it from another machine - due to security issues...
So, in the monitoring server I created a new service - which the check_command is check_command check_nrpe!check_elastic. And now in the remote server, where the elasticsearch is, I've editted the nrpe.cfg file with the following:
command[check_elastic]=/usr/local/nagios/libexec/check_http -H localhost -u /_cluster/health -p 9200 -w 2 -c 3 -s green
Which is allowed, since this command is run from the remote server - so no security issues here...
It works!!!
I'll still try this check_http_json command that I posted in my qeustion - but for now, my solution is good enough.

After playing around with the suggestions in this post, I wrote a simple check_elasticsearch script. It returns the status as OK, WARNING, and CRITICAL corresponding to the "status" parameter in the cluster health response ("green", "yellow", and "red" respectively).
It also grabs all the other parameters from the health page and dumps them out in the standard Nagios format.
Enjoy!

Shameless plug: https://github.com/jersten/check-es
You can use it with ZenOSS/Nagios to monitor cluster health, data indices, and individual node heap usage.

You can use this cool Python script for monitoring your Elasticsearch cluster. This script check your IP:port for Elasticsearch status. This one and more Python script for monitoring Elasticsearch can be found here.
#!/usr/bin/python
from nagioscheck import NagiosCheck, UsageError
from nagioscheck import PerformanceMetric, Status
import urllib2
import optparse
try:
import json
except ImportError:
import simplejson as json
class ESClusterHealthCheck(NagiosCheck):
def __init__(self):
NagiosCheck.__init__(self)
self.add_option('H', 'host', 'host', 'The cluster to check')
self.add_option('P', 'port', 'port', 'The ES port - defaults to 9200')
def check(self, opts, args):
host = opts.host
port = int(opts.port or '9200')
try:
response = urllib2.urlopen(r'http://%s:%d/_cluster/health'
% (host, port))
except urllib2.HTTPError, e:
raise Status('unknown', ("API failure", None,
"API failure:\n\n%s" % str(e)))
except urllib2.URLError, e:
raise Status('critical', (e.reason))
response_body = response.read()
try:
es_cluster_health = json.loads(response_body)
except ValueError:
raise Status('unknown', ("API returned nonsense",))
cluster_status = es_cluster_health['status'].lower()
if cluster_status == 'red':
raise Status("CRITICAL", "Cluster status is currently reporting as "
"Red")
elif cluster_status == 'yellow':
raise Status("WARNING", "Cluster status is currently reporting as "
"Yellow")
else:
raise Status("OK",
"Cluster status is currently reporting as Green")
if __name__ == "__main__":
ESClusterHealthCheck().run()

I wrote this a million years ago, and it might still be useful: https://github.com/radu-gheorghe/check-es
But it really depends on what you want to monitor. The above measures:
if Elasticsearch responds to HTTP
if ingestion rate drops under the defined levels
if total number of documents drops the defined levels
But of course there's much more that might be interesting. From query time to JVM heap usage. We wrote a blog post about the most important ones here: https://sematext.com/blog/top-10-elasticsearch-metrics-to-watch/
Elasticsearch has APIs for all these, so you may be able to use a generic check_http_json to get the needed metrics. Alternatively, you may want to use something like Sematext Monitoring for Elasticsearch, which gets these metrics out of the box, then forward threshold/anomaly alerts to Nagios. (disclosure: I work for Sematext)

Related

how to set up proper printout destination for dask multiprocessing in jupyter notebook on linux

I am using dask in jupyter notebook on a linux server to run python functions on multiple CPUs. The python functions have standard print statement. I would like the output of the print to be shown in the jupyter notebook right below the cell. However, the print out were all shown in the console. Can anyone explain why this happens and how to make dask.function.print output to the notebook, or both the console and the notebook.
The following is a simplified version of the problem:
import dask
import functools
from dask import compute, delayed
iter_list=[0,1]
def iFunc(item):
print('Meme',item)
# call this function itself will print normally to the
# notebook below the cell, desired.
with dask.config.set(scheduler='processes',num_workers=2):
func1=functools.partial(iFunc)
ret=compute([delayed(func1)(item) for item in iter_list])
# surprisingly, Meme 0, Meme 1 only print out to the console,
# not the notebook, Not desired, hard to debug. Any clue?
The whole point of dask is leveraging multiple threads, processes, or nodes/machines to distribute work. The workers you create are therefore not on the same thread as your client, and may not be on the same process, or even the same machine (or like, in the same country) as your client, depending on how you set up your cluster.
If you start a LocalCluster from your jupyter notebook, whether you're using threads or processes, you should see printed output appearing as output in the cells which execute jobs on the workers:
In [1]: import dask.distributed as dd
In [2]: client = dd.Client(processes=4)
In [3]: def job():
...: print("hello from a worker!")
In [4]: client.submit(job).result()
hello from a worker!
However, if a different process is spinning up your workers, it is up to that process to decide how to handle stdout. So if you're spinning up workers using the jupyterlab terminal, stdout will appear there. If you're spinning up workers in a kubernetes pod, stdout will appear in the worker logs. Dask doesn't actively manage standard out, so it's up to you to handle this. Note that this also applies to logging - neither stdout nor logs are captured by dask. This is actually a really important design feature - many distributed systems have their own systems for managing the standard out & logging of nodes, and dask does not want to impose its own parallel/conflicting system for handling output. The main focus of dask is executing the tasks, not managing a distributed logging system.
That said, dask does have the infrastructure for passing around messages, and this is something the package could support. There is an open issue and pull request attempting to add this ability as a feature, but it looks like there are a lot of open design questions that would need to be resolved before this could be added. Many of them revolve around the issues I raised above - how to add a clean distributed logging feature without overburdening the scheduler, complicating the already complex set of configuration options, or overriding the important, existing logging systems users rely on. The dask core team seems to agree that this is a good idea, if the tough design questions can be resolved.
You certainly always have the option of returning messages. For example, the following would work:
In [10]: def job():
...: return_blob = {"diagnostics": {}, "messages": [], "return_val": None}
...: start = time.time()
...: return_blob["diagnostics"]["start"] = start
...:
...: try:
...: return_blob["messages"].append("raising error")
...: # this causes a DivideByZeroError
...: return_blob["return_val"] = 1 / 0
...: except Exception as e:
...: return_blob["diagnostics"]["error"] = e
...:
...: return_blob["diagnostics"]["end"] = time.time()
...: return return_blob
...:
In [11]: client.submit(job).result()
Out[11]:
{'diagnostics': {'start': 1644091274.738912,
'error': ZeroDivisionError('division by zero'),
'end': 1644091274.7389162},
'messages': ['raising error'],
'return_val': None}

ksqlDB server shuts down when config, offset and status topic is changed

I'm running a single ksqlDB Server on embedded mode on our Kubernetes cluster and I want to add a connector.
Adding a connector produces a Request timed out on Kafka Connect exactly similar to this blog post by Robin Moffatt.
So he suggests to change the KAFKA_OFFSET_REPLICATION_FACTOR contained in his docker-compose example.
But unfortunately in our Test environment, I don't have easy access to the existing Kafka cluster (we have admins there), so I think the fastest way to go about is to instead change the:
KSQL_CONNECT_CONFIG_STORAGE_TOPIC - change to a different topic name
KSQL_CONNECT_OFFSET_STORAGE_TOPIC
KSQL_CONNECT_STATUS_STORAGE_TOPIC
KSQL_CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR to -1 (originally this value is 1)
KSQL_CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR to -1 (originally this value is 1)
KSQL_CONNECT_STATUS_STORAGE_REPLICATION_FACTOR to -1 (originally this value is 1)
But when I change the topic names, I can see that new topics are created (using ksqlDB's SHOW TOPICS command), but it always shuts down and restarts forever, here are the logs:
[2021-07-22 01:27:19,889] INFO ProcessingLogConfig values:
ksql.logging.processing.rows.include = false
ksql.logging.processing.stream.auto.create = false
ksql.logging.processing.stream.name = KSQL_PROCESSING_LOG
ksql.logging.processing.topic.auto.create = false
ksql.logging.processing.topic.name =
ksql.logging.processing.topic.partitions = 1
ksql.logging.processing.topic.replication.factor = 1
(io.confluent.ksql.logging.processing.ProcessingLogConfig:372)
[2021-07-22 01:27:19,891] ERROR Aborting application start (io.confluent.ksql.rest.server.KsqlRestApplication:378)
io.confluent.ksql.rest.server.KsqlRestApplication$AbortApplicationStartException: Shutting down application during waitForPreconditions
at io.confluent.ksql.rest.server.KsqlRestApplication.waitForPreconditions(KsqlRestApplication.java:441)
at io.confluent.ksql.rest.server.KsqlRestApplication.startKsql(KsqlRestApplication.java:386)
at io.confluent.ksql.rest.server.KsqlRestApplication.startAsync(KsqlRestApplication.java:370)
at io.confluent.ksql.rest.server.MultiExecutable.doAction(MultiExecutable.java:68)
at io.confluent.ksql.rest.server.MultiExecutable.startAsync(MultiExecutable.java:42)
at io.confluent.ksql.rest.server.KsqlServerMain.tryStartApp(KsqlServerMain.java:89)
at io.confluent.ksql.rest.server.KsqlServerMain.main(KsqlServerMain.java:64)
[2021-07-22 01:27:19,892] INFO Server up and running (io.confluent.ksql.rest.server.KsqlServerMain:90)
[2021-07-22 01:27:19,892] INFO Server shutting down (io.confluent.ksql.rest.server.KsqlServerMain:96)
[2021-07-22 01:27:19,892] INFO ksqlDB shutdown called (io.confluent.ksql.rest.server.KsqlRestApplication:498)
[2021-07-22 01:27:34,926] INFO API server stopped (io.confluent.ksql.api.server.Server:196)
[2021-07-22 01:27:34,927] INFO ksqlDB shutdown complete (io.confluent.ksql.rest.server.KsqlRestApplication:553)
I don't have anymore details, it's just that.
When I return the config, offset and status topic names to what I had at first, the ksqlDB Server starts fine, but again I'm stuck with the problem that I can't create connectors.
I have a fear that when I attempt to delete the topics manually, ksqlDB server wont be able to start properly because it keeps on finding the original config, offset and status topics I had at first.
I have solved the issue, apparently using -1 as value for:
KSQL_CONNECT_CONFIG_REPLICATION_FACTOR
KSQL_CONNECT_OFFSET_REPLICATION_FACTOR
KSQL_CONNECT_STATUS_REPLICATION_FACTOR
doesn’t work properly, the Config topic becomes 20 partitions,
when I read in the Confluent Docs it should only be 1 partition, I think that’s why the ksqlDB Server just restarts endlessly, I just need to gather the right evidences.
Turning those values to 3 (which is our Kafka broker's default rep factor config) I think solved the issue, it was hard, because no error message/s are seen, like when it doesn’t want more than 1 partition of the created Config topic.

OpenTSDB Plotting internal stats

I'm fairly new to OpenTSDB but I managed to set it up inside a docker container and connect Grafana to it.
Now I'm looking for a way to keep track of it's health. In particular, I would like to plot some of the metrics that come from the internal stats (e.g. tsd.rpc.received).
When I try to use them as a regular metric in the Graph panel of OpenTSDB I get a "java.lang.RuntimeException: Unexpected exception".
I know I could connect the http api (/api/stats) to another tool to then send the metrics to cloudwatch or a similar app. But I was hoping for something that didn't involved adding more pieces to the solution.
In the documentation I found: "The Telnet style API also supports the "stats" command for fetching over CLI. These can easily be published right back into OpenTSDB at any interval you like."
Is this the recommended way to keep track of those internal metrics? Read from the stats api and then feed them back to OpenTSDB?
After looking different alternatives I found that the best way to get the internal stats is either using the tcollector util or injecting the output of the stats command back into opentsdb and using grafana to visualize the data.
Since in my particular case I don't want to install another component like tcollector, I'll feed the stats back to opentsdb as metrics in every node.
This is a small script I wrote to feed the stats back via the telnet api.
#!/bin/bash
while true; do
sleep 5
STATSINPUT=$(echo "stats" | nc 0 4242 -w1)
while IFS= read -r line
do
echo "Feed: $line"
INPUT="put $line"
echo $INPUT | nc 0 4242 -w0
done < <(printf '%s\n' "$STATSINPUT")
done

MySQL connection pool in python?

I'm trying to process large amount of data using Python and maintaining processing status in MySQL. However, I'm surprised there is no standard connection pool for python-mysql (like HikariCP in Java).
I initially started with PyMySQL, things were great until the program ran for first few hours. After few hours, things started to fail. I was getting lot of errors like:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 99] Cannot assign requested address)")
Moreover, lot of ports were stuck in TIME_WAIT state because I'm opening and closing connections too frequently because of lack of connection pooling
/d/p/950 ❯❯❯ netstat -nt | wc -l
84752
Per this and this, I tried to set tcp_fin_timeout and ip_local_port_range, but hardly anything improved.
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 15000 65000 > /proc/sys/net/ipv4/ip_local_port_range
Then I found out that MySQL provides mysql.connector which comes with pooling functionality. After doing all that performance actually deteriorated. More processes started to get failed. I'm using Python's multiprocessing module to simultaneously run 29 processes(multiprocessing.Pool picked this no by default) on a 24 core machine. Following was the code, of course I was using .my.cnf to pass all the credential to avoid committing them to git :
import mysql.connector
from mysql.connector import pooling
conn_pool = pooling.MySQLConnectionPool(pool_name="mypool1",
pool_size=pooling.CNX_POOL_MAXSIZE,
option_files=MYSQL_CONFIG,
option_groups=MYSQL_GROUP_NODE1,
allow_local_infile=True)
conn = conn_pool.get_connection()
Finally, reverted back to old code. Still using PyMySQL and though errors are less frequent it is still causing a major problem. I looked at SQLAlchemy and couldn't really found much of a documentation around pooling.
I'm wondering how's everyone else dealing with mysql-python connection pooling issue? I really believe there should be something out there so that I don't have to reinvent the wheel.
Any pointers are much appreciated.
DBUtils implements MySQL (and generally claims to support abritrary DB-API 2 compliant database interfaces) user-sized connection pool PooledDB, thead-mapped pool PersistentDB and SteadyDB (see functionality section). The latter should fit your case where multiprocessing.Pool creates worker processes with managed persistent database connection each. It is described as:
DBUtils.SteadyDB is a module implementing "hardened" connections to a database, based on ordinary connections made by any DB-API 2 database module. A "hardened" connection will transparently reopen upon access when it has been closed or the database connection has been lost or when it is used more often than an optional usage limit.
You can use it with PyMySQL like:
import pymysql
from DBUtils.SteadyDB import connect
db = connect(
creator = pymysql, # the rest keyword arguments belong to pymysql
user = 'guest', password = '', database = 'name',
autocommit = True, charset = 'utf8mb4',
cursorclass = pymysql.cursors.DictCursor)
Also see this related question for more examples.

"statsd" prefix on data when using Telegraf with StatsD

I am using Telegraf as a server to collect StatsD data from Python and send it to InfluxDB. However, the name values I receive in InfluxDB have the prefix "statsd_". How can I remove it?
In python I do:
ctr_name = 'foo'
client = statsd.StatsClient('MY_DOMAIN.com', 8125)
client.incr(ctr_name, 1)
And then in InfluxDB I see:
> show measurements
name: measurements
------------------
name
statsd_foo
By default, the statsd python package does not put a prefix in front of your measurements.
Below is the default
client = statsd.StatsClient('MY_DOMAIN.com', 8125, prefix=None)
I am pretty sure you set prefix="statsd" in your actual code
The problem does not seem to come from the statsd plugin configuration in Telegraf since it doesnt offer to prefix all measurements with a string constant. See here

Resources