OpenTSDB Plotting internal stats - monitoring

I'm fairly new to OpenTSDB but I managed to set it up inside a docker container and connect Grafana to it.
Now I'm looking for a way to keep track of its health. In particular, I would like to plot some of the metrics that come from the internal stats (e.g. tsd.rpc.received).
When I try to use them as a regular metric in the Graph panel of OpenTSDB I get a "java.lang.RuntimeException: Unexpected exception".
I know I could connect the HTTP API (/api/stats) to another tool and then send the metrics to CloudWatch or a similar app, but I was hoping for something that didn't involve adding more pieces to the solution.
In the documentation I found: "The Telnet style API also supports the "stats" command for fetching over CLI. These can easily be published right back into OpenTSDB at any interval you like."
Is this the recommended way to keep track of those internal metrics? Read from the stats api and then feed them back to OpenTSDB?

After looking at different alternatives, I found that the best way to get the internal stats is either to use the tcollector utility or to inject the output of the stats command back into OpenTSDB and use Grafana to visualize the data.
Since in my particular case I don't want to install another component like tcollector, I'll feed the stats back to OpenTSDB as metrics on every node.
This is a small script I wrote to feed the stats back via the telnet API.
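#!/bin/bash
while true; do
    sleep 5
    # Fetch the internal stats from the local TSD over the telnet-style API
    STATSINPUT=$(echo "stats" | nc 0 4242 -w1)
    while IFS= read -r line
    do
        echo "Feed: $line"
        # Each stats line already has the form "metric timestamp value tags"
        INPUT="put $line"
        echo "$INPUT" | nc 0 4242 -w0
    done < <(printf '%s\n' "$STATSINPUT")
done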
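The same loop can also be sketched over the HTTP API instead of telnet. This is a minimal sketch, not the method used above: it assumes /api/stats returns a JSON list of objects with metric, timestamp, value and tags fields, and that /api/put accepts a JSON array of the same shape. The function names and the base URL in the usage note are hypothetical.

```python
import json
import urllib.request


def fetch_stats(base_url, timeout=5):
    """Return the list of stat objects served by GET /api/stats."""
    with urllib.request.urlopen(base_url + "/api/stats", timeout=timeout) as resp:
        return json.load(resp)


def to_datapoints(stats):
    """Keep only the fields that /api/put is assumed to expect."""
    return [
        {
            "metric": stat["metric"],
            "timestamp": stat["timestamp"],
            "value": stat["value"],
            "tags": stat["tags"],
        }
        for stat in stats
    ]


def push_datapoints(base_url, datapoints, timeout=5):
    """POST the data points to /api/put and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/put",
        data=json.dumps(datapoints).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Calling push_datapoints(url, to_datapoints(fetch_stats(url))) every few seconds, with url set to something like "http://localhost:4242", would mirror the shell loop. Either way, the tsd.* metric names probably need to exist before they can be written, for example by enabling metric auto-creation (tsd.core.auto_create_metrics).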

Nao robot IMU data rates

I'm trying to stream data from the inertial unit in the Nao's trunk. However, the update rate is quite slow (about 1 Hz). Is there any way to improve it? For reference, I issued the following command using qicli to measure the rate:
qicli call --json ALMemory.getListData "[[\"Device/SubDeviceList/InertialSensor/AngleY/Sensor/Value\"]]"
In this example I retrieve the tilt angle of the trunk around the Y-axis (pitch).
To execute this command, I established an SSH connection to the Nao and timed it using the Linux time command. I also tried to force a faster read rate by issuing the above command in a loop with 5 milliseconds of sleep between iterations:
for i in {1..100}; do qicli call --json ALMemory.getListData "[[\"Device/SubDeviceList/InertialSensor/AngleY/Sensor/Value\"]]"; sleep 0.005; done
But even in this case the data was read at a rate of about 1 Hz.
I tried it on Nao versions 5 and 6. I also connected to both over WiFi and over a link-local connection using an Ethernet cable.
This data is updated every 10 ms, but each qicli call takes a long time to set up its connection, so the loop measures connection overhead rather than the sensor rate.
Try using the API from Python instead: create a proxy once, then call getData in a loop. Refer to the API documentation here.
As a side note, the best way to record or monitor data efficiently is to process it directly on the NAO. Connect using SSH, upload your program and run it, or use Choregraphe to create and run it directly on the robot.
# edit: adding simple script to be run directly on NAO (untested)
import time
import naoqi
mem = naoqi.ALProxy("ALMemory", "localhost", 9559)
while 1:
    val = mem.getData("Device/SubDeviceList/InertialSensor/AngleY/Sensor/Value")
    print(val)
    time.sleep(0.01)
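A fixed sleep of 10 ms adds to the time each read takes, so the loop above will run somewhat slower than 100 Hz. The sketch below schedules each read against a deadline computed from the start time instead. It is only an illustration: sample_at_fixed_rate and fake_read are hypothetical names, and fake_read stands in for the real mem.getData(...) call so the example runs off the robot.

```python
import time


def sample_at_fixed_rate(read_value, period, count):
    """Call read_value() every `period` seconds, `count` times.

    Returns a list of (elapsed_seconds, value) pairs. Each deadline is
    computed from the start time, so slow reads do not accumulate drift.
    """
    samples = []
    start = time.monotonic()
    for i in range(count):
        deadline = start + i * period
        delay = deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        samples.append((time.monotonic() - start, read_value()))
    return samples


def fake_read():
    # Stand-in for mem.getData(...); replace with the ALMemory call on the robot.
    return 0.0


samples = sample_at_fixed_rate(fake_read, period=0.01, count=5)
print(len(samples))
```

Comparing the recorded elapsed times against multiples of the period gives a direct measurement of the rate actually achieved.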

MySQL connection pool in python?

I'm trying to process a large amount of data using Python while maintaining the processing status in MySQL. However, I'm surprised there is no standard connection pool for python-mysql (like HikariCP in Java).
I initially started with PyMySQL, and things were fine for the first few hours the program ran. After a few hours, things started to fail. I was getting a lot of errors like:
pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on '127.0.0.1' ([Errno 99] Cannot assign requested address)")
Moreover, a lot of ports were stuck in the TIME_WAIT state because, without connection pooling, I'm opening and closing connections too frequently:
/d/p/950 ❯❯❯ netstat -nt | wc -l
84752
Per this and this, I tried to set tcp_fin_timeout and ip_local_port_range, but hardly anything improved.
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 15000 65000 > /proc/sys/net/ipv4/ip_local_port_range
Then I found out that MySQL provides mysql.connector, which comes with pooling functionality. After switching to it, performance actually deteriorated and more processes started to fail. I'm using Python's multiprocessing module to run 29 processes simultaneously (multiprocessing.Pool picked this number by default) on a 24-core machine. The code was as follows; I was using .my.cnf to pass all the credentials to avoid committing them to git:
import mysql.connector
from mysql.connector import pooling
conn_pool = pooling.MySQLConnectionPool(pool_name="mypool1",
                                        pool_size=pooling.CNX_POOL_MAXSIZE,
                                        option_files=MYSQL_CONFIG,
                                        option_groups=MYSQL_GROUP_NODE1,
                                        allow_local_infile=True)
conn = conn_pool.get_connection()
Finally, I reverted to the old code. I'm still using PyMySQL, and although errors are less frequent, it is still causing a major problem. I looked at SQLAlchemy and couldn't find much documentation on pooling.
I'm wondering how everyone else is dealing with the mysql-python connection pooling issue. I really believe there should be something out there so that I don't have to reinvent the wheel.
Any pointers are much appreciated.
DBUtils implements connection pooling for MySQL (and claims to support arbitrary DB-API 2 compliant database interfaces): the user-sized connection pool PooledDB, the thread-mapped pool PersistentDB, and SteadyDB (see the functionality section). The latter should fit your case, where multiprocessing.Pool creates worker processes, each with its own managed persistent database connection. It is described as:
DBUtils.SteadyDB is a module implementing "hardened" connections to a database, based on ordinary connections made by any DB-API 2 database module. A "hardened" connection will transparently reopen upon access when it has been closed or the database connection has been lost or when it is used more often than an optional usage limit.
You can use it with PyMySQL like:
import pymysql
# DBUtils 2.0+ renamed this module: from dbutils.steady_db import connect
from DBUtils.SteadyDB import connect
db = connect(
    creator = pymysql,  # the remaining keyword arguments are passed to pymysql
    user = 'guest', password = '', database = 'name',
    autocommit = True, charset = 'utf8mb4',
    cursorclass = pymysql.cursors.DictCursor)
Also see this related question for more examples.
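To make the pooling idea concrete, here is a minimal sketch of a fixed-size pool that hands out and takes back already-open connections, which is what avoids the constant open/close churn behind the TIME_WAIT build-up. It is not how DBUtils is implemented. SimplePool is a hypothetical name, and sqlite3 is used only as a stand-in DB-API 2 driver so the example runs without a MySQL server; with PyMySQL the creator would be something like lambda: pymysql.connect(...).

```python
import queue
import sqlite3
from contextlib import contextmanager


class SimplePool:
    """Fixed-size pool of DB-API 2 connections, opened once and reused."""

    def __init__(self, creator, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(creator())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Return the connection to the pool instead of closing it.
            self._idle.put(conn)

    def close_all(self):
        while not self._idle.empty():
            self._idle.get_nowait().close()


pool = SimplePool(lambda: sqlite3.connect(":memory:"), size=2)
with pool.connection() as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT 1")
    print(cursor.fetchone())  # prints (1,)
pool.close_all()
```

A pool like this is only shared between threads of one process. With multiprocessing.Pool, each worker process would need its own pool or its own persistent connection, which is why SteadyDB is suggested above.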

Aggregate-counter on an existing stream

I'm trying to create an aggregate counter for various streams I have set up. In SpringXD it would look like this: "tap:stream:MyCustomStream > aggregate-counter".
In Spring Cloud Dataflow so far I have done ":MyKafkaTopic > aggregate-counter", which seems to create a Kafka consumer and read the payload to determine a count of events on the topic. I'd like to be able to tap any stream not just a Kafka source, e.g. "MyApp1 | MyApp2" --name MyCustomStream.
The provided example "stream create --definition ":mainstream.http > counter" --name tap_at_http --deploy" essentially assumes mainstream.http is a Kafka topic (or RabbitMQ topic).
Anyone done this before?
Going by your example,
stream create foo --definition "MyApp1 | MyApp2"
If you want to TAP the foo stream at the producer (the MyApp1 level), your TAP stream would look like the following.
stream create bar --definition ":foo.MyApp1 > MyApp3"
You're just pointing to the producer in the stream where you'd like to TAP to get a copy of the same data. The format is: :<streamName>.<label/appName>. You could use "labels" instead of app names, too. For the stream in the question, the tap would be ":MyCustomStream.MyApp1 > aggregate-counter". Please review the reference guide for more details.
The provided example "stream create --definition ":mainstream.http > counter" --name tap_at_http --deploy" essentially assumes mainstream.http is a Kafka topic (or RabbitMQ topic).
In this case, mainstream is the stream name and you're TAP'ing at http source application, which equates to :mainstream.http.
This is analogous to tap:stream:foo in Spring XD. By default, Spring XD assumes the producer if there's only one in the stream. You'd have to specify it when you TAP at the processor, though.
In SCDF, we require it explicitly, which makes the definition more descriptive and keeps the DSL easy to follow.

"statsd" prefix on data when using Telegraf with StatsD

I am using Telegraf as a server to collect StatsD data from Python and send it to InfluxDB. However, the name values I receive in InfluxDB have the prefix "statsd_". How can I remove it?
In python I do:
ctr_name = 'foo'
client = statsd.StatsClient('MY_DOMAIN.com', 8125)
client.incr(ctr_name, 1)
And then in InfluxDB I see:
> show measurements
name: measurements
------------------
name
statsd_foo
By default, the statsd python package does not put a prefix in front of your measurements.
Below is the default
client = statsd.StatsClient('MY_DOMAIN.com', 8125, prefix=None)
I am pretty sure you set prefix="statsd" in your actual code.
The problem does not seem to come from the statsd plugin configuration in Telegraf, since it doesn't offer a way to prefix all measurements with a string constant. See here.
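One way to check where the prefix comes from is to look at the raw packet before Telegraf sees it. The sketch below uses only the standard library: it opens a throwaway UDP listener and prints one received line. The plain socket sender is a stand-in for statsd.StatsClient (an assumption made so the example is self-contained), and the helper names are hypothetical.

```python
import socket


def open_listener(host="127.0.0.1", port=0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_metric(sock, timeout=2.0):
    """Return one received datagram decoded as text."""
    sock.settimeout(timeout)
    data, _addr = sock.recvfrom(4096)
    return data.decode("utf-8")


listener = open_listener()
port = listener.getsockname()[1]

# Stand-in for client.incr('foo', 1): a counter in the StatsD line format.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"foo:1|c", ("127.0.0.1", port))

line = receive_metric(listener)
print(line)  # prints foo:1|c

sender.close()
listener.close()
```

Pointing the real client at this listener instead of at Telegraf would show whether the name already leaves Python with a prefix (for example statsd.foo:1|c) or whether it is added further down the pipeline.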

How to monitor elasticsearch using nagios

I would like to monitor elasticsearch using nagios.
Basically, I want to know if Elasticsearch is up.
I think I can use the Elasticsearch Cluster Health API (see here)
and use the 'status' that I get back (green, yellow or red), but I still don't know how to use Nagios for that (Nagios is on one server and Elasticsearch is on another).
Is there another way to do that?
EDIT :
I just found that - check_http_json. I think I'll try it.
After a while, I managed to monitor Elasticsearch using NRPE.
I wanted to use the Elasticsearch Cluster Health API, but I couldn't call it from another machine due to security restrictions.
So, on the monitoring server I created a new service whose check_command is check_nrpe!check_elastic. Then, on the remote server where Elasticsearch runs, I edited the nrpe.cfg file and added the following:
command[check_elastic]=/usr/local/nagios/libexec/check_http -H localhost -u /_cluster/health -p 9200 -w 2 -c 3 -s green
This is allowed, since the command is run on the remote server itself, so there are no security issues here.
It works!
I'll still try the check_http_json command that I posted in my question, but for now my solution is good enough.
After playing around with the suggestions in this post, I wrote a simple check_elasticsearch script. It returns the status as OK, WARNING, and CRITICAL corresponding to the "status" parameter in the cluster health response ("green", "yellow", and "red" respectively).
It also grabs all the other parameters from the health page and dumps them out in the standard Nagios format.
Enjoy!
Shameless plug: https://github.com/jersten/check-es
You can use it with ZenOSS/Nagios to monitor cluster health, data indices, and individual node heap usage.
You can use this Python script for monitoring your Elasticsearch cluster. The script checks the Elasticsearch status at the given IP:port. This one and more Python scripts for monitoring Elasticsearch can be found here.
#!/usr/bin/python
from nagioscheck import NagiosCheck, UsageError
from nagioscheck import PerformanceMetric, Status
import urllib2
import optparse
try:
    import json
except ImportError:
    import simplejson as json
class ESClusterHealthCheck(NagiosCheck):
    def __init__(self):
        NagiosCheck.__init__(self)
        self.add_option('H', 'host', 'host', 'The cluster to check')
        self.add_option('P', 'port', 'port', 'The ES port - defaults to 9200')
    def check(self, opts, args):
        host = opts.host
        port = int(opts.port or '9200')
        try:
            response = urllib2.urlopen(r'http://%s:%d/_cluster/health'
                                       % (host, port))
        except urllib2.HTTPError, e:
            raise Status('unknown', ("API failure", None,
                                     "API failure:\n\n%s" % str(e)))
        except urllib2.URLError, e:
            raise Status('critical', (e.reason))
        response_body = response.read()
        try:
            es_cluster_health = json.loads(response_body)
        except ValueError:
            raise Status('unknown', ("API returned nonsense",))
        cluster_status = es_cluster_health['status'].lower()
        if cluster_status == 'red':
            raise Status("CRITICAL", "Cluster status is currently reporting as "
                         "Red")
        elif cluster_status == 'yellow':
            raise Status("WARNING", "Cluster status is currently reporting as "
                         "Yellow")
        else:
            raise Status("OK",
                         "Cluster status is currently reporting as Green")
if __name__ == "__main__":
    ESClusterHealthCheck().run()
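The script above targets Python 2 and depends on the nagioscheck package. For comparison, here is a minimal Python 3 sketch of the same green/yellow/red mapping using only the standard library and the conventional Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It is not a port of the script above. The names evaluate_health, check_cluster and main are hypothetical, and the default host and port are assumptions.

```python
import json
import sys
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
STATUS_TO_CODE = {"green": OK, "yellow": WARNING, "red": CRITICAL}


def evaluate_health(health):
    """Map a parsed _cluster/health response to (exit_code, message)."""
    status = str(health.get("status", "")).lower()
    code = STATUS_TO_CODE.get(status, UNKNOWN)
    return code, "%s - cluster status is %s" % (LABELS[code], status or "missing")


def check_cluster(host="localhost", port=9200, timeout=5):
    """Query the Cluster Health API and return (exit_code, message)."""
    url = "http://%s:%d/_cluster/health" % (host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            health = json.load(resp)
    except urllib.error.HTTPError as exc:
        return UNKNOWN, "UNKNOWN - API failure: %s" % exc
    except OSError as exc:
        # URLError and socket timeouts are OSError subclasses.
        return CRITICAL, "CRITICAL - cannot reach Elasticsearch: %s" % exc
    except ValueError:
        return UNKNOWN, "UNKNOWN - API returned invalid JSON"
    return evaluate_health(health)


def main():
    code, message = check_cluster()
    print(message)
    sys.exit(code)
```

To use it as a plugin, call main() at the end of the file under an if __name__ == "__main__": guard and register the script in nrpe.cfg in the same way as the check_http command shown earlier.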
I wrote this a million years ago, and it might still be useful: https://github.com/radu-gheorghe/check-es
But it really depends on what you want to monitor. The above measures:
if Elasticsearch responds to HTTP
if ingestion rate drops under the defined levels
if the total number of documents drops under the defined levels
But of course there's much more that might be interesting. From query time to JVM heap usage. We wrote a blog post about the most important ones here: https://sematext.com/blog/top-10-elasticsearch-metrics-to-watch/
Elasticsearch has APIs for all these, so you may be able to use a generic check_http_json to get the needed metrics. Alternatively, you may want to use something like Sematext Monitoring for Elasticsearch, which gets these metrics out of the box, then forward threshold/anomaly alerts to Nagios. (disclosure: I work for Sematext)
