Otherwise working Munin not showing graphs when node is offline

I have a working Munin setup that is plotting graphs for around 50 machines. The machines are connected via a somewhat brittle UMTS connection. Whenever one of the machines goes down, the (historic) graphs are replaced by "Image not found" placeholders. This eliminates the graphs as a source of information on why the machine went down.
On the Munin IRC channel I was told that this is expected behaviour and that I should run munin-asyncd to buffer the values. I don't understand, however, how this would solve my problem. As far as I understand, it would introduce some delay but not fundamentally change anything.
Also, the underlying data (the RRD files) is still in /var/lib/munin, but copying it out every time and plotting it manually with rrdtool is cumbersome. So is there some configuration switch to make Munin show the graphs even when the machine is down?
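For reference, the manual workaround I mean looks roughly like this (the file path and the data-source name are assumptions based on a typical Munin layout; rrdtool info on the RRD file shows the real ones):

    # Plot the last 24 hours of load average straight from the Munin RRD file.
    # Munin usually names the single data source in its RRD files "42".
    rrdtool graph /tmp/load.png \
      --start -1d --title "Load average (last 24 hours)" \
      DEF:load=/var/lib/munin/example.com/host.example.com-load-load-g.rrd:42:AVERAGE \
      LINE2:load#0000ff:load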

Related

InfluxDB high CPU usage jumping to 80%?

I am relatively new to the time series DB world. I am running InfluxDB 1.8.x as a Docker container, and I have left the influxdb.conf file at its default configuration. Currently I am facing an issue of high CPU usage by InfluxDB: the CPU jumps to 80 to 90% and creates a problem for other processes running on the same machine.
I have tried a solution given here ->> Influx high CPU issue, but unfortunately it did not work. I am unable to understand the reason behind the issue and am also struggling to find help in the documentation or from the community.
What I have tried so far:
Updated the monitor section of the influxdb.conf file like this ->> monitor DB
Checked the series cardinality with SHOW SERIES CARDINALITY and it looks well within limits (about 9400; I am also not sure what the ideal threshold for a high-cardinality red flag is)
I am looking for an approach that will help me understand the root cause of this problem.
Please let me know if you need any further information.
After reading about InfluxDB debugging and CPU profiling (HTTP API influxdb), I was able to pin down the issue: the problem was in the way I was making the query. My query involved more complex functions and also a GROUP BY tag. I also tried query analysis using the EXPLAIN ANALYZE (query) command to check how much time a query takes to execute. I resolved that and noticed a huge improvement in CPU load.
Basically, I can suggest the following (a combined sketch follows this list):
Run the CPU profile analysis using the InfluxDB HTTP API with the command curl -o <file name> http://localhost:8086/debug/pprof/all?cpu=true and collect the result.
Visualize the result using a tool like pprof and find the problem.
Also, one can run basic commands like SHOW SERIES CARDINALITY and EXPLAIN ANALYZE <query> to understand how a query executes.
Before designing any schema or Influx client, check the hardware recommendations ->> Hardware sizing guidelines
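A rough combined sketch of those steps (the database, measurement and tag names below are placeholders, and the output file name is arbitrary):

    # 1. Capture a CPU profile bundle via the HTTP debug endpoint mentioned above.
    #    The resulting archive can be unpacked and inspected with go tool pprof
    #    (the exact file names inside the archive may vary).
    curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"

    # 2. Check cardinality and analyse a suspect GROUP BY query with the influx CLI.
    influx -database 'mydb' -execute 'SHOW SERIES CARDINALITY'
    influx -database 'mydb' -execute 'EXPLAIN ANALYZE SELECT mean("value") FROM "cpu_usage" WHERE time > now() - 1h GROUP BY "host"'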

Fetch data subset from gmond

This is in the context of a small data-center setup where the number of servers to be monitored is only in the double digits and may grow only slowly to a few hundred (if at all). I am a Ganglia newbie and have just completed setting up a small Ganglia test bed (and have been reading about and playing with it). A couple of things I realise -
gmetad supports interactive queries on port 8652, through which I can get metric data subsets - say, the data of a particular metric family in a specific cluster (see the sketch after this list)
gmond seems to always return the whole dump of data for all metrics from all nodes in a cluster (on doing 'netcat host 8649')
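For example, the interactive port can be asked for a subtree instead of the full dump, roughly like this (the cluster and host names are placeholders):

    # Ask gmetad (interactive port 8652) for a single cluster/host subtree;
    # the path sent on the connection filters the XML that comes back.
    printf '/mycluster/node01\n' | nc gmetad-host 8652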
In my setup, I don't want to use gmetad or RRD. I want to directly fetch data from the multiple gmond clusters and store it in a single data store. There are a couple of reasons not to use gmetad and RRD -
I don't want multiple data stores in the whole setup. I can have one dedicated machine fetch data from the multiple (few) clusters and store it
I don't plan to use gweb as the data front end. The data from Ganglia will be fed into a different monitoring tool altogether. With this setup, I want to eliminate the latency that another layer of gmetad would add. That is, if gmetad polls, say, every minute and my management tool polls gmetad every minute, that adds up to two minutes of delay, which I feel is unnecessary for a relatively small/medium-sized setup
There are a couple of problems with this approach for which I need help -
I cannot get filtered data from gmond. Is there some plugin that can help me fetch individual metric/metric-group information from gmond (since different metrics are collected at different intervals)? For now I can only filter client-side, as in the sketch after this list
gmond output is very verbose text. Is there some other (hopefully binary) format that I can configure for export?
Is my idea of eliminating gmetad/RRD completely a very bad idea? Has anyone tried this approach before? What should I be careful of in doing so, from a data-collection standpoint?
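The client-side filtering I mean looks roughly like this (the host and metric names are placeholders; this assumes nc and xmllint are available):

    # Pull the full XML dump from gmond and extract a single metric for a single host.
    nc gmond-host 8649 \
      | xmllint --xpath '//HOST[@NAME="node01.example.com"]/METRIC[@NAME="load_one"]' -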
Thanks in advance.

Why is the Membase server so slow in response time?

I have a problem: Membase is being very slow in my environment.
I am running several production servers (Passenger) on Rails 2.3.10 / Ruby 1.8.7.
Those servers communicate with 2 Membase machines in a cluster.
The Membase machines each have 64 GB of memory, a 100 GB EBS volume attached, and 1 GB of swap.
My problem is that Membase is being VERY slow in response time and is actually the slowest part right now in the whole application lifecycle.
My question is: why?
The Rails gem I am using is memcache-northscale.
The Membase server is 1.7.1 (the latest).
The server is doing between 2K and 7K ops per second (for the cluster).
The response time from Membase (based on NewRelic) is 250 ms on average, which is HUGE and unreasonable.
Does anybody know why this is happening?
What can I do in order to improve this time?
It's hard to say immediately with the data at hand, but there are a few things you may wish to dig into to narrow down where the issue may be.
First of all, do your Membase stats show a significant number of background fetches? This shows in the Web UI statistics as "disk reads per second". If so, that's the likely culprit for the higher latencies.
You can read more about the statistics and sizing in the manual, particularly the sections on statistics and cluster design considerations.
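If you want to check this outside the Web UI, a quick sketch using the memcached text protocol may help (this assumes the default bucket answers the ASCII "stats" command on the standard data port 11210; adjust the host and port for your setup):

    # Dump the engine stats and look at the background-fetch counters.
    printf 'stats\r\nquit\r\n' | nc membase-host 11210 | grep ep_bg_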
Second, you're reporting 250 ms on average. Is this a sliding average, or overall? Do you have something like 90th or 99th percentile latencies? Some outlying disk fetches can give you a large average, when most requests (for example, those served from RAM that don't need disk fetches) are actually quite speedy.
Are your systems spread across availability zones? What kind of instances are you using? Are the clients and servers in the same Amazon AWS region? I suspect the answer may be "yes" to the first, which means about 1.5 ms of overhead when using xlarge instances, based on recent measurements. This can matter if you're doing a lot of fetches synchronously and in serial in a given method.
I expect it's all in one region, but it's worth double checking since those latencies sound like WAN latencies.
Finally, there is an updated Ruby gem, backwards compatible with Fauna. Couchbase, Inc. has been working to contribute these changes back to Fauna upstream. If possible, you may want to try the gem referenced here:
http://www.couchbase.org/code/couchbase/ruby/2.0.0
You will also want to look at running Moxi on the client side. To access Membase, you need to go through a proxy (called Moxi). By default, it's installed on the server, which means you might make a request to one of the servers that doesn't actually have the key. Moxi will go get it... but then you're doubling the network traffic.
Installing Moxi on the client-side will eliminate this extra network traffic: http://www.couchbase.org/wiki/display/membase/Moxi
Perry

Detecting end-user connection speed problems in Apache for Windows

Our company provides web-based management software (servicedesk, helpdesk, timesheet, etc) for our clients.
One of them has been causing us a great headache for some months, complaining about the connection speed to our servers.
In our individual tests, the connection and response speeds are always great.
Some information about this specific client:
They have about 300 PCs on their local network, all using the same bandwidth/server for internet access.
They don't allow us to ping their server, so we can't establish a traceroute.
They claim every other site (Google, blogs, news, etc.) always responds fast. We know for a fact that they have no intention to mislead us and that this is true.
They might have up to 100 PCs simultaneously logged in to our software at any given time. They need to increase that number to as many as 300, so this is a major issue.
They are helpful and collaborative on this issue, which we have been trying to resolve for a long time.
Some information about our server and software:
We have been able to accommodate more than 400 users at a single time without major speed losses for other clients.
We have gone to great lengths to make good use of data caching and opcode caching in the software itself, and we did notice the improvement (from fast to faster).
There are no database, CPU or memory bottlenecks or leaks. Other clients are able to access the server just fine.
We have little to no knowledge of how to analyze specific end-user problems (Apache running under Windows Server), and this is where I could use a lot of help.
Anything that might be related to Apache configuration would also be helpful.
While all signs point to this being an internal problem in this specific client's network, we are dedicating effort to solving that too, if that is the case. However, we do not have capable or trained professionals to deal with network problems (they do, though their main argument remains that 'all other sites are fast, only yours is slow').
You might want to have a look at the tools from Google's Page Speed family: http://code.google.com/speed/page-speed/docs/overview.html
Your customer could run the Page Speed extension for you; maybe then you can find out what the problem is: http://code.google.com/speed/page-speed/docs/extension.html
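On the Apache side, one low-effort step is to log per-request timing and wire-level byte counts, so requests from this client's network can be compared with everyone else's. A minimal httpd.conf sketch, assuming Apache 2.2+ on Windows with mod_logio available (the log file name and the format nickname are arbitrary):

    # %D = time taken to serve the request in microseconds,
    # %I/%O = bytes received/sent on the wire (mod_logio).
    LoadModule logio_module modules/mod_logio.so
    LogFormat "%h %l %u %t \"%r\" %>s %O %I %D" timing
    CustomLog logs/timing.log timing

Comparing %D and the byte counts for requests from this client's IP range against other clients should show whether their requests are genuinely slower at the server, which helps separate an application problem from a network one.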

How to specifically identify server issues from a load test result (using LoadRunner)?

How do you isolate a performance issue to a specific component of the application infrastructure? Specifically, are there distinct markers in the result logs that distinguish between bottlenecks at web, application and/or database server levels?
I was asked this question in an interview and went blank on it. Seems this information is not available anywhere.
In addition to SiteScope and other agentless monitoring of system components, you need to make sure your scenario and scripts are working as expected. This includes proper error checking and use of transactions (and a host of other things). If the transactions are granular enough, this will give you insight into at least which requests have performance issues. Once you have these indicators, work with the infrastructure team to review logs and other information. Since this is an iterative process, subsequent tests can be made to focus on a smaller and smaller section of the infrastructure.
In addition, LoadRunner scripts don't have to be written strictly to 'come in through the front door'. If you have a multi-tiered system, scripts can be made to hit the web/app/database servers directly.
As for what to look for, focus on any measurements that show 'knee' or 'hockey stick' type behaviour. You can hook into any of the server resource measurements directly in the Controller and integrate other teams' stats in the analysis phase. Compare with benchmarks at lower virtual-user levels to determine what is acceptable and unacceptable.
Good luck!
If the interview is focused on LoadRunner and SiteScope is in the picture, I'd conclude that it's more focused on HP/Mercury solutions. In that case, I'd suggest you look into HP Diagnostics and its LoadRunner integration capabilities.
This type of information is usually not available by just looking at the standard results from a performance test.
Parts of the information you are looking for MAY be found by using SiteScope to monitor all the relevant servers in the test. SiteScope offers many counters to look at, such as CPU, memory, disk I/O and network I/O, as seen on each server.
This information can give clues as to where the bottleneck is, and the more counters you add to SiteScope, the bigger the chance of pinpointing the bottleneck.
It is a very common misconception that app-server and DB-server bottlenecks can be identified by just looking at the raw response times, hits, pages, etc. (web protocol), unless of course the URI accessed identifies the exact component(s) in the system...
