How to fix an NNI experiment that gets stuck after it starts?

I have created an NNI experiment with:
nnictl create --config config.yml
The content of the config.yml file is:
maxExecDuration: 12h
maxTrialNum: 500
#choice: local, remote, pai
trainingServicePlatform: local
searchSpacePath: search_space.json
#choice: true, false
useAnnotation: false
multiThread: true
logDir: "/path/of/logdir/"
tuner:
  #choice: TPE, Random, Anneal, Evolution, BatchTuner, MetisTuner, GPTuner
  #SMAC (SMAC should be installed through nnictl)
  builtinTunerName: TPE
  classArgs:
    #choice: maximize, minimize
    optimize_mode: maximize
trial:
  command: python main.py
  codeDir: .
  gpuNum: 3
localConfig:
  gpuIndices: "1,2,3"
  maxTrialNumPerGpu: 4
  useActiveGpu: true
After a few minutes, the NNI web interface gets stuck: it shows nothing when I refresh it.
I suspect the problem comes from maxTrialNumPerGpu, because I didn't have this problem before I set it.
Can anyone help me?
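For what it's worth, here is a back-of-the-envelope reading of the GPU settings above. It assumes NNI's documented semantics (gpuNum is the number of GPUs each trial requests, maxTrialNumPerGpu caps how many trials may share one GPU, useActiveGpu allows scheduling onto GPUs that already have processes); this is a sketch of the arithmetic, not NNI's actual scheduler:
# Rough capacity sketch for the config above; the field semantics are my
# reading of the NNI docs, not output from NNI itself.
gpu_indices = [1, 2, 3]   # localConfig.gpuIndices
gpu_num = 3               # trial.gpuNum: GPUs each trial requests
max_per_gpu = 4           # localConfig.maxTrialNumPerGpu

# Every trial occupies all three listed GPUs, so (with useActiveGpu: true)
# the per-GPU cap is the binding limit: up to four trials may be packed
# onto GPUs 1-3 at the same time.
max_concurrent = (len(gpu_indices) // gpu_num) * max_per_gpu
print(max_concurrent)     # -> 4
If main.py cannot actually share a GPU four ways, those stacked trials could hang and take the web UI's backend down with them, which would match the symptom.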

Related

Cypress timeout not respecting the value given on a command

Since our test agents are sometimes slow, I am trying to add some additional timeouts for some commands.
I did it by using a timeout value on the command as shown below, but it's not respecting the value given.
My understanding is that Cypress will wait 10000 ms for getting the #addstory element?
Can anyone advise whether this is the correct way, please?
Thank you so much.
cy.get('#addstory > .ng-scope').click({ timeout: 10000 })
In the cypress.json file, increase the timeout to 10 seconds or whatever timeout you want, like this: "defaultCommandTimeout": 10000, and save the file. Now close the app and open it again. Navigate to Settings > Configuration and you should be able to see the new value set for defaultCommandTimeout.
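For reference, the cypress.json entry described above would look like this (a minimal excerpt; 10000 is just the example value from above):
{
  "defaultCommandTimeout": 10000
}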
The issue was that I was adding the timeout to the click, not to getting the element. When I changed it as shown below, all was good: it waits for #addstory to be visible, as I expected, before clicking.
cy.get('#addstory > .ng-scope',{ timeout: 10000 }).click()

Closing a "local" OrientDB when using connection pools

So I basically do this.
OObjectDatabaseTx result = OObjectDatabasePool.global().acquire( "local:orientdb", "admin", "admin");
// do stuff
result.close();
The problem is that when I redeploy my webapp (without restarting the Java EE container) I get the following error:
com.orientechnologies.orient.core.exception.OStorageException: Cannot open local storage 'orientdb' with mode=rw
which I interpret to mean "Tomcat still has a filelock from the last app".
So my question is how do I cleanly exit in this scenario? I've tried:
OObjectDatabasePool.global().close()
and
new OObjectDatabaseTx("local:orientdb").close()
but neither seems to work. Any ideas? The documentation isn't exactly clear on this issue.
Set the property "storage.keepOpen" to false:
java ... -Dstorage.keepOpen=false ...
or via Java code:
OGlobalConfiguration.STORAGE_KEEP_OPEN.setValue( false );

Grails development environment pages load very slowly

So I'm getting page load times in the range of 30-45 seconds.
Some history:
This was not always the case for this project. This project is in production so I haven't really touched the code in a while. I noticed it started happening the last time I was updating the code. I don't recall anything specific that I changed that should have anything to do with the problem. I have other projects that are running with the same Grails versions with no problem.
I think it started happening in 2.2.3. I am now running 2.2.4.
I am using x64 JDK 1.7.0_25, Windows 7 x64.
I'm not sure what else to put here that would be relevant. Any assistance is appreciated!
Edit: running with -noreloading has no effect.
Edit2: I've tried deleting my .grails folder entirely, running clean, and deleting my target folder and stacktrace log.
Edit3: It does seem that the amount of time it takes is dependent on the amount of data displayed/read. Small pages take 3-4 seconds. Medium pages 10-12 seconds...
Edit4: I'm running it via IntelliJ IDEA 12.1.4 x64 (idea64.exe). I've also tried it outside of IntelliJ with the same results.
Edit5: The database is Oracle enterprise that supports the entire company. It is managed by full-time administrators. This isn't a MySQL server on my local machine.
Edit6: The application also functions normally when deployed in TEST (test war), but is still slow when run with test run-app.
Starting to get somewhere:
I downloaded JDK 1.7.21 and ran the app with that, and it started working no problem! I then ran clean, which triggered a recompile, and it stopped working... grr
Now with 1.7.21 still active, I tried -noreloading and it works!
Annnd... now it works even if I don't use -noreloading..........
I've gone back to 1.7.25.. ran clean, and it works. Sooooooo yeah... explain that.
And now it doesn't anymore.
This is under Linux but may be useful:
If you are running the code within an IDE:
ps auwx|grep java
-Dgrails.console.class=grails.build.logging.GrailsEclipseConsole -Dosgi.requiredJavaVersion=1.6 -Xms40m -Xmx768m -XX:MaxPermSize=256m -
As you can see the memory settings Xms and Xmx are quite low...
In your IDE there should be an INI file:
more STS.ini
-vm
/usr/lib/jvm/java-6-openjdk-amd64/bin/java
-startup
plugins/org.eclipse.equinox.launcher_1.3.0.v20120522-1813.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.200.v20120913-144807
-product
org.springsource.sts.ide
--launcher.defaultAction
openFile
-vmargs
-Dgrails.console.enable.interactive=false
-Dgrails.console.enable.terminal=false
-Djline.terminal=jline.UnsupportedTerminal
-Dgrails.console.class=grails.build.logging.GrailsEclipseConsole
-Dosgi.requiredJavaVersion=1.6
-Xms40m
-Xmx768m
-XX:MaxPermSize=256m
You can up these values and try restarting your IDE...
I would also suggest you run something like nmon while the code is running and monitor disk/CPU/network throughput.
You may find you are hammering your dev box, which is causing the issue.
If production is fine, I really don't see what the problem is...
Edit to add: Ahhh, forgot it was under Windows, so no nmon - but there is (not that I have tried it) http://sourceforge.net/projects/jnmonanalyser/
Edit to add again:
1. Enable DataSource.groovy debugging:
dataSource {
    pooled = true
    driverClassName = "com.mysql.jdbc.Driver"
    username = "aaa"
    password = "aaaa"
    //SQL logging - refer to Config.groovy at hibernate.sql now
    logSql = true
    ...
Config.groovy - this will stop your app from running if you have issues with, let's say, records you are trying to add in your BootStrap:
// Return error when it fails
//grails.gorm.failOnError=true
Enable log4j and use this or part of it:
// log4j configuration
log4j {
    appender.stdout = "org.apache.log4j.ConsoleAppender"
    appender.'stdout.layout' = "org.apache.log4j.PatternLayout"
    appender.'stdout.layout.ConversionPattern' = '[%r] %c{2} %m%n'
    appender.stacktraceLog = "org.apache.log4j.FileAppender"
    appender.'stacktraceLog.layout' = "org.apache.log4j.PatternLayout"
    appender.'stacktraceLog.layout.ConversionPattern' = '[%r] %c{2} %m%n'
    appender.'stacktraceLog.File' = "stacktrace.log"
    appender.'stacktraceLog.MaxFileSize' = "1MB"
    rootLogger = "error,stdout"
    logger {
        grails = "error"
        StackTrace = "error,stacktraceLog"
        org {
            codehaus.groovy.grails.web.servlet = "error"   // controllers
            codehaus.groovy.grails.web.pages = "error"     // GSP
            codehaus.groovy.grails.web.sitemesh = "error"  // layouts
            codehaus.groovy.grails."web.mapping.filter" = "error" // URL mapping
            codehaus.groovy.grails."web.mapping" = "error"        // URL mapping
            codehaus.groovy.grails.commons = "info"        // core / classloading
            codehaus.groovy.grails.plugins = "error"       // plugins
            codehaus.groovy.grails.orm.hibernate = "error" // hibernate integration
            // Hibernate should be on - if you want to catch SQL logs
            springframework = "off"
            hibernate = "on"
            //hibernate.SQL = 'debug'
            //hibernate.type = 'trace'
            //hibernate.SQL = 'info,hibernate'
            //hibernate.type = 'info,hibernate'
            //hibernate = 'info,hibernate'
            //apache.commons.digester.Digester = 'debug,javaclasses'
        }
    }
    additivity.StackTrace = false
}
Try to capture what it is doing. It is also worth running the developer tools in your browser, whether it's Firefox or Chrome, and trying to figure out which elements are taking that time - between the logs and the browser developer tools should lie your answer.
Usually you can fix this by doing
grails clean
on the Grails command line (I open it via CTRL+ALT+G in IntelliJ IDEA).
This erases all compiled files and will recompile your project from scratch (AFAIK), which usually clears up errors like that. It is not a real fix for the underlying problem, but it solves the symptom. Grails is highly experimental and unstable if you ask me; I have a lot of weird errors that usually disappear after a clean. By the way, I'm using 2.1.5 on Windows 7 x64, too.
1) Delete the stacktrace file in the target folder of your project. It can get huge. (At present mine is 48 GB.)
2) Check if there is enough space on your C: drive.
3) If you are hot-swapping code, page loads can get slow. In such cases, restart the dev server (the Grails app).
4) Sometimes, requests to the server can hang, where focusing (left- or right-clicking) on the command prompt seems to skip the pause. (Weird.)
5) Increasing the JVM permgen and heap spaces, depending on your memory, might help as well.
6) Try running the server from the command prompt rather than within an IDE.
7) Better to use methods for actions than closures.
For a system with 3GB RAM, my environment variable setting is:
JAVA_OPTS
-Xms512m -Xmx1g
The STS.ini settings:
-startup
plugins/org.eclipse.equinox.launcher_1.2.0.v20110502.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_1.1.100.v20110502
-product
com.springsource.sts.ide
--launcher.defaultAction
openFile
--launcher.XXMaxPermSize
384M
-vmargs
-Dosgi.requiredJavaVersion=1.5
-Xmn128m
-Xms1024m
-Xmx1024m
-Xss2m
-XX:PermSize=256m
-XX:MaxPermSize=512m
-XX:MaxGCPauseMillis=10
-XX:MaxHeapFreeRatio=50
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing
-XX:+UseParallelGC
8) Maybe the problem is with the JDK and Grails version combination. There seems to be an issue with OpenJDK 1.7u25 and Spring Loaded. Okay, you are not using OpenJDK, but try another version anyway. Try JDK 1.7u03.
9) Try the JVM with the -server flag, and see if it improves runtime performance.
grails run-app -server
So the reason this was happening:
JDK 1.7.25

Error "Invalid Locales set !!" when trying to install sqldemo on Informix

I am extremely new to Informix and am having some trouble trying to get sqldemo installed.
Set up so far:
openSuse 12.1 (32 bit)
Informix Growth Edition 11.70 UC6
Informix SQL Developer 7.50 UC6
Informix RDS 7.50 UC6
Informix ID 7.50 UC6
After struggling a few days and a lot of reading of http://publib.boulder.ibm.com/infocenter/idshelp/v117/index.jsp, I managed to get Informix installed and On-line.
I also opted to install the demo database instance that comes with the installation.
I am now attempting to get started with Informix 4GL by Example.
I am trying to get the sqldemo database up. I don't know if it will replace the previous instance installed with Informix, but that is a different problem.
Right now as per the document, running the following should set up the DB:
sqldemo stores2t -log
I however get an error: "Invalid Locales set !!".
I have tried looking up this error and also in the documentation.
I have tried setting the CLIENT_LOCALE and DB_LOCALE in my .profile file.
For example:
export CLIENT_LOCALE=en_US.CP1252
export DB_LOCALE=en_US.819
This has not helped.
A push in the right direction, or perhaps some other documentation I could read that would explain things better would really be appreciated.
If any other information is required from me, please do not hesitate to ask.
Update 1
Thanks so much for the response.
A couple of things firstly that I have tried since your post.
Changed the CLIENT_LOCALE and DB_LOCALE as you specified - same error - so I removed them, as you said they should not be set.
Fixed a problem in my PATH and made sure it has /usr/informix/bin - same error.
INFORMIXDIR is /usr/informix
INFORMIXSERVER is ol_informix1170 (this is from the database that was installed with the Informix install; I don't know if this must be changed, and if so, to what?)
Ran the script you mentioned; the result:
INFORMIXDIR=/usr/informix
INFORMIXSERVER=ol_informix1170
INFORMIXSQLHOSTS=/usr/informix/etc/sqlhosts
LANG=en_US.UTF-8
ONCONFIG=onconfig
I noticed I had set the language to UK, which made the locales en_gb instead of en_us, so I tried changing that in my .profile, which did not help; I also tried changing the language to US and the locales to en_us, but this made no difference.
As for what you said about the sqldemo script and the already-installed db: it is fine if that db is removed, as this is just a test VirtualBox VM for me to learn on.
Could the $INFORMIXSERVER set as ol_informix1170 be the problem?
Thank you once again for the help.
Neill
Update 2
Thanks again for the response.
A few things to note.
The dbenv results I posted are all that shows, which I assume/presume (uh-oh) means that the other environment variables are not set. Which of the environment variables you posted are absolutely necessary for it to work?
As above, where would I find the terminfo file, or does this need to be created?
As above, the SQLEXEC variable... where would I find sqlrm? From the documents I have read, I think it should be in $INFORMIXDIR/lib, but I only have an esql directory. Is this correct?
Barring that something in the first three items above is causing more problems, when trying your suggestion of DEMOPATH=en_us/0333 sqldemo stores2t -log I receive the following error:
Sorry, cannot read the mkstores3 program required to build the demonstration database. Check the /etc subdirectory of INFORMIXDIR (/usr/informix).
Checking /usr/informix/etc shows indeed that there is no mkstores3 file.
Attempting your further note of isqldemo, I get the following error:
/usr/informix/bin/isqldemo: line 58: /usr/informix/demo/sql/en_us/e01c/isqldemo: No such file or directory.
I guess this makes perfect sense as there is no e01c directory, just the 0333 directory.
Right now anything you can tell me would indeed be a consolation, because my newb-ness to Linux generally, and definitely to Informix, is a big factor. Interesting that this bug has been around for so long. I guess way more experienced folk than I figured out how to solve it on their own, or just never bothered with the sqldemo.
I guess that will teach me to read this:
INFORMIX-4GL by Example
Version 4.1
July 1991
Going to check now if any updated text exists, but I would still appreciate more help in solving this problem. Do you think reverting to a previous snapshot from before Informix was installed, and not opting for the ol_informix1170 database to be included, could be a possible solution? I wouldn't really think it would be, but what do I know.
Many many thanks for your continued time and effort.
Regards,
Neill
Update 3
So I see indeed the document I was reading is ancient. I have found an updated one (2002) which uses a different script (dbaccessdemo7).
I tried running that, have run into an error, but tomorrow is another day.
For now I am going to mark this as solved, because the bug was detected and resolved. I am not going to put more time and effort into sqldemo.
Thank you so much, and if I struggle with dbaccessdemo7, I will post a new question.
Regards,
Neill
The sqldemo script won't create a new server; it may clobber your existing database (a single server may house multiple databases; indeed, there are 4 sys* databases created when a server is initialized) but it won't harm your server otherwise.
Probable cause of the error
The normal problem with invalid locales is that you've not set $INFORMIXDIR. You need $INFORMIXDIR set unless /usr/informix is (a symlink to) the correct location. You also need $INFORMIXSERVER set, and you usually need $INFORMIXDIR/bin on $PATH. Strictly, $INFORMIXSERVER is the only mandatory variable; in practice, you worry about the other two too.
The $INFORMIXDIR setting is used to locate the locale information (which is found in $INFORMIXDIR/gls) and the message files (which are found in $INFORMIXDIR/msg).
Note that CP1252 is a Windows code page. Normally on Unix, you'd either not set CLIENT_LOCALE or DB_LOCALE, or you could set them to:
export CLIENT_LOCALE=en_us.8859-1
export DB_LOCALE=en_us.8859-1
or you can choose another more appropriate (to you) locale. The 8859-15 locale includes the Euro symbol, for example, or the utf-8 locale dictates UTF-8 in the database. But, for initial debugging, stick with the 8859-1 locale, aka 819 or 0333 (all based on the IBM CCSID). If it doesn't work with 8859-1, then we have one set of problems; if it works with 8859-1 but not some other codeset or locale, then we have a different set of problems.
Follow-up info if the solution above fails
If that isn't the trouble, then I'll ask for some more details — notably, your Informix environment as reported by the dbenv script below:
: "@(#)$Id: dbenv.sh,v 2.11 2007/09/02 00:18:58 jleffler Exp $"
#
# Printout INFORMIX database environment
informix1="DB[^=]|DELIMIDENT=|SQL|ONCONFIG|TBCONFIG|INFOR"
informix2="ARC_|CLIENT_LOCALE=|GL_|GLS8BITSYS|CC8BITLEVEL|ESQL|FET_BUF_SIZE="
informix3="INF_ROLE_SEP=|NODEFDAC=|ONCONFIG|OPTCOMPIND|PDQ|PSORT"
informix4="PLCONFIG|SERVER_LOCALE|FGL|C4GL|NE_"
informix5="TCL_LIBRARY|TK_LIBRARY"
informix="$informix1|$informix2|$informix3|$informix4|$informix5"
system="COLLCHAR=|LANG=|LC_|LD_LIBRARY_PATH(_64)?=|PATH=|SHLIB_PATH="
jlss="IXD(32|64)?="
env |
egrep "^($informix|$system|$jlss)" |
sort
It's an old script; that's why the shebang is missing.
Second set of diagnosis
I was hoping for the complete output of the dbenv script; it is surprising how often something shows up. However, given what you've said, it is likely to be OK.
The INFORMIXSERVER setting sounds fine.
I'm struck by the LANG=en_US.UTF-8 setting; Informix does pay attention to $LANG and the $LC_* variables (that's why dbenv prints those out). That may be a factor in the problem. However, I would have expected CLIENT_LOCALE and SERVER_LOCALE to deal with that if it was the problem. Also, on my Mac, I have LANG=en_US.UTF-8 and yet I can connect to (8859-1) databases OK.
This is beginning to look like an install problem...or sqldemo problem...
I transitioned from a Mac to a RHEL 5 (archaic) x86/64 machine, and tried running sqldemo over there:
$ dbenv
DBDATE=Y4MD-
DBEDIT=vim
INFORMIXDIR=/work4/informix/tools-7.50.FC4
INFORMIXSERVER=toru_31
INFORMIXSQLHOSTS=/work4/informix/ids-11.70.FC4/etc/sqlhosts
INFORMIXTERM=terminfo
IXD64=/work4/informix/ids-11.70.FC4
IXD=/work4/informix/tools-7.50.FC4
IXH=/work4/informix/ids-11.70.FC4/etc/sqlhosts
IXO=/work4/informix/ids-11.70.FC4/etc/onconfig.toru_31
IXS=toru_31
LANG=en_US.UTF-8
LD_LIBRARY_PATH=/lib64:/usr/lib64:/work4/informix/tools-7.50.FC4/lib:/work4/informix/tools-7.50.FC4/lib/tools:/work4/informix/tools-7.50.FC4/lib/esql:/work4/informix/ids-11.70.FC4/lib:/work4/informix/ids-11.70.FC4/lib/esql:/work4/informix/ids-11.70.FC4/lib/cli
ONCONFIG=onconfig.toru_31
PATH=/work4/informix/tools-7.50.FC4/bin:.:/work4/jleffler/bin:/u/jleffler/bin:/work4/informix/ids-11.70.FC4/bin:/u/jleffler/linux/x86_64/bin:/work4/informix/11.70.FC1:/usr/atria/bin:/work4/jleffler/perl/v5.12.1/bin:/usr/bin:/bin:/usr/X11R6/bin:/atria_release/cm_dist/vobs/imitools/bin:/opt/rational/clearcase/bin:/opt/rational/clearquest/bin
SQLCMDLOG=/work4/jleffler/.sqlcmdlog
SQLEXEC=sqlrm
TERMINFO=/work4/jleffler/terminfo
TERM=xterm-color
$ sqldemo st2 -log
Invalid Locales set !!
$
Oh yeah? No; my locales are fine, thank you!
Well, so be it...I can reproduce your problem! That's step 1. Step 2 is to look at the expletive deleted script.
PRODUCT=sql
DEMOFILE=sqldemo
DEFLANG=en_US.8859-1
INFORMIXDIR=${INFORMIXDIR:=/usr/informix}
INFENV=$INFORMIXDIR/bin/infenv
CONVLOC=$INFORMIXDIR/bin/convloc
if [ $# -gt 0 -a "X$1" = "X-e" ] ; then
    LOCALE=$DEFLANG                         # -e means en_US.8859-1 required
    shift
else
    LOCALE=`$INFENV DBLANG`                 # get DBLANG value
    if [ "x${LOCALE}" = "x" ]; then
        LOCALE=`$INFENV CLIENT_LOCALE`      # try CLIENT_LOCALE instead
        if [ "x${LOCALE}" = "x" ]; then
            LOCALE=`$INFENV DB_LOCALE`      # finally default to DB_LOCALE
        fi
    fi
fi
if [ "x${LOCALE}" = "x" ]; then
    LOCALE=$DEFLANG                         # finally default to DB_LOCALE
fi
export LOCALE
if [ "x${DEMOPATH}" = "x" ]; then
    echo "Invalid Locales set !!"
else
    exec $INFORMIXDIR/demo/$PRODUCT/$DEMOPATH/$DEMOFILE $*
fi
exit $?
Note that test for ${DEMOPATH}; note that DEMOPATH is not set in the script. So, we've got to get it set. What to? Well, ls $INFORMIXDIR/demo/sql shows that there are various locale-specific sub-directories (en_us, ja_jp, ko_kr, th_th, zh_cn, zh_tw), and under the en_us directory there's 0333 (only).
Please run:
DEMOPATH=en_us/0333 sqldemo stores2t -log
This more or less worked for me — I believe it would work for you. I have a slightly unusual setup in that I have just I4GL (p-code and c-code) and ISQL in the $INFORMIXDIR; the server is run out of a different directory. This means I don't have server utility programs like dbload (specifically) in $INFORMIXDIR/bin. When the sqldemo script tried to load the data with dbload, therefore, it failed for me. It would work for you because you have all the Informix software in a single directory. To add insult to injury, it runs the dbload program by explicit path, so I can't futz my PATH to make it available.
This should get you going. I have a bug to report...it is CQ idsdb00244894.
I'm sorry that you ran into so much trouble. You shouldn't have done so.

How to monitor elasticsearch using nagios

I would like to monitor elasticsearch using nagios.
Basically, I want to know if Elasticsearch is up.
I think I can use the Elasticsearch Cluster Health API (see here)
and use the 'status' that I get back (green, yellow or red), but I still don't know how to use Nagios for that matter (Nagios is on one server and Elasticsearch is on another).
Is there another way to do that?
EDIT :
I just found check_http_json. I think I'll try it.
After a while - I've managed to monitor Elasticsearch using NRPE.
I wanted to use the Elasticsearch Cluster Health API - but I couldn't use it from another machine - due to security issues...
So, on the monitoring server I created a new service - whose check_command is check_nrpe!check_elastic. And on the remote server, where Elasticsearch is, I've edited the nrpe.cfg file with the following:
command[check_elastic]=/usr/local/nagios/libexec/check_http -H localhost -u /_cluster/health -p 9200 -w 2 -c 3 -s green
Which is allowed, since this command is run from the remote server - so no security issues here...
It works!!!
I'll still try the check_http_json command that I posted in my question - but for now, my solution is good enough.
After playing around with the suggestions in this post, I wrote a simple check_elasticsearch script. It returns the status as OK, WARNING, and CRITICAL corresponding to the "status" parameter in the cluster health response ("green", "yellow", and "red" respectively).
It also grabs all the other parameters from the health page and dumps them out in the standard Nagios format.
Enjoy!
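For anyone who wants to roll their own, here is a minimal sketch of that idea (not the actual check_elasticsearch script), assuming only the standard Nagios plugin conventions: exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, with performance data after a | separator:
#!/usr/bin/env python
# Minimal sketch of a cluster-health check; the host/port defaults are
# assumptions, and the exit codes follow the Nagios plugin convention.
import json
import sys
import urllib.request

def main(host="localhost", port=9200):
    url = "http://%s:%d/_cluster/health" % (host, int(port))
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            health = json.load(resp)
    except Exception as exc:
        print("UNKNOWN - cannot query %s: %s" % (url, exc))
        sys.exit(3)
    status = health.get("status", "unknown")
    # Dump the numeric fields from the health response as perfdata.
    perf = " ".join("%s=%s" % (k, v) for k, v in sorted(health.items())
                    if isinstance(v, (int, float)) and not isinstance(v, bool))
    code, label = {"green": (0, "OK"),
                   "yellow": (1, "WARNING"),
                   "red": (2, "CRITICAL")}.get(status, (3, "UNKNOWN"))
    print("%s - cluster status is %s | %s" % (label, status, perf))
    sys.exit(code)

if __name__ == "__main__":
    main(*sys.argv[1:3])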
Shameless plug: https://github.com/jersten/check-es
You can use it with ZenOSS/Nagios to monitor cluster health, data indices, and individual node heap usage.
You can use this cool Python script for monitoring your Elasticsearch cluster. The script checks your IP:port for Elasticsearch status. This and more Python scripts for monitoring Elasticsearch can be found here.
#!/usr/bin/python
from nagioscheck import NagiosCheck, UsageError
from nagioscheck import PerformanceMetric, Status
import urllib2
import optparse

try:
    import json
except ImportError:
    import simplejson as json


class ESClusterHealthCheck(NagiosCheck):

    def __init__(self):
        NagiosCheck.__init__(self)
        self.add_option('H', 'host', 'host', 'The cluster to check')
        self.add_option('P', 'port', 'port', 'The ES port - defaults to 9200')

    def check(self, opts, args):
        host = opts.host
        port = int(opts.port or '9200')
        try:
            response = urllib2.urlopen(r'http://%s:%d/_cluster/health'
                                       % (host, port))
        except urllib2.HTTPError, e:
            raise Status('unknown', ("API failure", None,
                                     "API failure:\n\n%s" % str(e)))
        except urllib2.URLError, e:
            raise Status('critical', (e.reason))
        response_body = response.read()
        try:
            es_cluster_health = json.loads(response_body)
        except ValueError:
            raise Status('unknown', ("API returned nonsense",))
        cluster_status = es_cluster_health['status'].lower()
        if cluster_status == 'red':
            raise Status("CRITICAL", "Cluster status is currently reporting as "
                                     "Red")
        elif cluster_status == 'yellow':
            raise Status("WARNING", "Cluster status is currently reporting as "
                                    "Yellow")
        else:
            raise Status("OK",
                         "Cluster status is currently reporting as Green")


if __name__ == "__main__":
    ESClusterHealthCheck().run()
I wrote this a million years ago, and it might still be useful: https://github.com/radu-gheorghe/check-es
But it really depends on what you want to monitor. The above measures:
if Elasticsearch responds to HTTP
if ingestion rate drops under the defined levels
if the total number of documents drops under the defined levels
But of course there's much more that might be interesting. From query time to JVM heap usage. We wrote a blog post about the most important ones here: https://sematext.com/blog/top-10-elasticsearch-metrics-to-watch/
Elasticsearch has APIs for all these, so you may be able to use a generic check_http_json to get the needed metrics. Alternatively, you may want to use something like Sematext Monitoring for Elasticsearch, which gets these metrics out of the box, then forward threshold/anomaly alerts to Nagios. (disclosure: I work for Sematext)
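As a concrete example of what the last paragraph suggests, here is a rough sketch that pulls one such metric (JVM heap usage per node) from the node stats API; the endpoint and field names are from the stock Elasticsearch response, but they are worth re-checking against your version:
import json
import urllib.request

def heap_used_percent(host="localhost", port=9200):
    # /_nodes/stats/jvm limits the response to the JVM metric group.
    url = "http://%s:%d/_nodes/stats/jvm" % (host, port)
    with urllib.request.urlopen(url, timeout=10) as resp:
        stats = json.load(resp)
    # One entry per node: node name -> heap used, as a percentage.
    return {node["name"]: node["jvm"]["mem"]["heap_used_percent"]
            for node in stats["nodes"].values()}

if __name__ == "__main__":
    print(heap_used_percent())
Feeding such a value through warning/critical thresholds is then the same exercise as the cluster-status check above.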
