How to monitor a Windows service and process with Zabbix

I am new to Zabbix and am using version 3.4. I have installed the server on Linux and want to monitor the status of a Windows service using the Windows agent.
I can get the status of a service using the key below:
service.info[<serviceName>,state]
It returns the correct service status. Now I want to check how much CPU and memory a given process is using.
I tried the following keys, but they do not return proper values:
perf_counter[\Process(<processName>)\% User Time] // to get CPU utilization by process
proc_info[<processName>,wkset] // to get memory utilize by process
system.cpu.util[,system,avg5] // to get total CPU utilization
vm.memory.size[available] // to get total RAM utilization
None of the above work properly. I tried other keys as well, but the agent log says they are unsupported. I checked the forum and searched Google, but found nothing.

Usually there isn't a direct match between a Windows service and a specific process.
A service can spawn N processes for its internals and can also spawn additional processes to handle incoming connections, log requests, and so on.
Think of a classic httpd server: you will find at least one master process, several pre-forked worker processes, and php/php-fpm processes for the current requests.
Regarding the keys you provided, what do you mean by "not working properly"?
You can refer to the Zabbix documentation on Windows-specific items for the exact syntax of the items and the meaning of their return values.
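For reference, keys along these lines usually work on a Windows agent. Here myprocess is a placeholder: the perf_counter instance name is the executable name without the .exe extension, proc_info expects the image name as shown in Task Manager, and a counter path containing spaces is normally double-quoted:
perf_counter["\Process(myprocess)\% Processor Time",60]
proc_info[myprocess.exe,wkset]
If the agent log still reports the perf_counter item as unsupported, check whether Windows uses localized counter names; the English counter path will not resolve on a non-English installation.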

You can use the Zabbix item for average CPU utilization over 5 minutes:
system.cpu.util[,,avg5]
This will give you the average CPU usage over 5 minutes on the Windows server. You can then create an appropriate trigger for it, as in the sketch below.
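For example, a trigger expression along these lines would fire when the 5-minute average crosses a threshold (the host name Windows-host and the 90% threshold are placeholders, not from the original question):
{Windows-host:system.cpu.util[,,avg5].last()}>90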

Related

pgpool node detached for no reason?

I have a two-node PostgreSQL cluster running on VMs where each VM runs both the pgpool service and a Postgres server.
Due to an insufficient memory configuration, the Postgres server crashed, so I bumped the VM memory and changed the Postgres memory settings in postgresql.conf. Since those memory changes, the slave pgpool node detaches every night at a specific time, even though node_exporter metrics for CPU, load, processes, disk usage, and memory show no spikes or sudden changes.
The slave node had detached before, but not day after day. I stumbled upon this thread and read the part of the documentation about failover, but since the Postgres server didn't crash and existing connections to the slave node kept working (it kept serving existing connections but didn't accept new ones), network issues seemed irrelevant, especially after consulting our OPS team on whether they had noticed any abnormal network or DNS activity that could explain it. Unfortunately, they found nothing of interest.
I have pg_exporter, postgres_exporter, and node_exporter on each node to monitor the server and VM behavior. What should I be looking for to debug this? What, specifically, should I ask our OPS team to check? Our pgpool log only records the failure to access the other node, with no exact reason, as the aforementioned docs say:
Pgpool-II does not distinguish each case and just decides that the particular PostgreSQL node is not available if health check fails.
Could it still be a network/DNS issue? And if so, how would I confirm this?
Thanks for reading and taking the time to assist me with this conundrum.
That was interesting.
To sum up the gist of it: it was part of the OPS team's infrastructure backups.
The entire process went like this.
Setting the scene: we run on-prem on top of a VMware vCenter cluster, backed up on the infra side with VMware VM snapshots and Veeam VM backups, where the vmdk files/ESXi datastores reside on NetApp storage served over NFS.
When checking node_exporter metrics in the Node Exporter Full dashboard, I saw network traffic drop to as little as 2 packets per second for about 5 to 15 minutes, consistently over the last few months, with the episodes growing dramatically longer in the last month (around the same time late at night).
Rough illustration: [graph of the nightly traffic drop omitted]
After checking again with our OPS team, they indicated it could be the host configuration/Veeam backups.
It turns out that because the storage for the VMs (including the one that runs the Veeam backup) is attached over the network rather than directly to the ESXi hosts, the final snapshot is saved/consolidated at that same late-night time, which is exactly when the node detaches every night.
The way NFS handles disk locking (limiting IOPS on existing data), combined with the high IOPS demands of the Veeam backup, causes the server to hang/freeze and, on rare occasions, even restart the VM. Here's the quote from the Veeam issue doc:
The snapshot removal process significantly lowers the total IOPS that can be delivered by the VM because of additional locks on the VMFS storage due to the increase in metadata updates
the snapshot removal process will easily push that into the 80%+ mark and likely much higher. Most storage arrays will see a significant latency penalty once IOP's get into the 80%+ mark which will of course be detrimental to application performance.
This issue occurs when the target virtual machine and the backup appliance [proxy] reside on two different hosts, and the NFSv3 protocol is used to mount NFS datastores. A limitation in the NFSv3 locking method causes a lock timeout, which pauses the virtual machine being backed up [during snapshot removal].
Obviously, that would interfere at the very least with Postgres functionality, especially when configured as a cluster with replication, which requires a near-constant connection between the Postgres servers. In a similar thread on SO involving a different DB server, a solution is suggested, including a fix for the issue presented in the last quote, in this link. For the time being, though, we have removed Veeam backups for sensitive VMs until the solution can be verified locally (I will update in the future if we try it and it fixes the issue).
Additional incident documentation: a similar issue case, suggested solution info from Veeam, a third-party site solution (a temporary fix for the same problem, as I see it), and a Reddit thread acknowledging the issue and suggesting options.
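As a stopgap while the backup setup is being fixed, Pgpool-II's health-check retry settings can be tuned so that a brief storage freeze does not immediately detach a node. A minimal pgpool.conf sketch; the values are illustrative, not taken from our actual setup:
health_check_period = 30        # seconds between health checks
health_check_timeout = 20       # allow a frozen VM some time to answer
health_check_max_retries = 3    # retries before declaring the node down
health_check_retry_delay = 10   # seconds between retries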

Problem: empty graphs in GKE cluster node detail ("No data for this time interval"). How can I fix it?

I have a cluster in Google Cloud, and I need information about resource usage.
The interface for each node shows three graphs, for CPU, memory, and disk usage, but on every node all of these graphs show the warning "No data for this time interval" for any time interval.
I upgraded all clusters and nodes to the latest version (1.15.4-gke.22) and switched from "Legacy Stackdriver Logging" to "Stackdriver Kubernetes Engine Monitoring", but it didn't help.
In the Stackdriver Workspace, only "disk_read_bytes" produces a graph; every other query in Metrics Explorer shows only the message "No data for this time interval".
If I run "kubectl top nodes" on the command line, I see current data for CPU and memory, but I need to see it on the node detail page to understand peak load. How can I configure this?
In my case, I was missing permissions on the IAM service account associated with the cluster. Make sure it has the following roles (a gcloud sketch for granting them follows the list):
Monitoring Metrics Writer (roles/monitoring.metricWriter)
Logs Writer (roles/logging.logWriter)
Stackdriver Resource Metadata Writer (roles/stackdriver.resourceMetadata.writer)
This is documented here
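If a role is missing, it can be granted with gcloud; PROJECT_ID and SA_EMAIL below are placeholders for your project and the cluster's service account:
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:SA_EMAIL" --role="roles/monitoring.metricWriter"
Repeat the command with the other two roles.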
Actually, it sounds strange: if you can get the metrics on the command line but the Stackdriver interface doesn't show them, it may be a bug.
I recommend this: if you are able, create a cluster with minimal resources and check the same Stackdriver metrics. If the metrics appear there, the original behavior may well be a bug, and you can report it through the appropriate GCP channel.
Check the documentation about how to get support within GCP:
Best Practices for Working with Cloud Support
Getting support for Google Cloud

Monitoring applications on Zabbix

Hi everyone, I am trying to monitor an application's transactions with Zabbix. I have already installed the agent on the Windows server the application runs on; however, I only get stats such as CPU, disk space, network, etc. I want to monitor transactions and interfaces. How do I go about it?
I would consider using user parameters if you want a pure Zabbix solution: have your application output its transaction status, and capture that (see the sketch below). This approach can be cumbersome.
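A minimal sketch of that approach, assuming a hypothetical script C:\scripts\check_transactions.bat that prints the application's transaction status (the key name app.transactions.status is likewise made up). In zabbix_agentd.conf on the Windows host:
UserParameter=app.transactions.status,C:\scripts\check_transactions.bat
After restarting the agent, create an item with key app.transactions.status to collect the value.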
If you are not looking for a pure Zabbix solution, I would look at a product that supports transactional monitoring, something like http://wdt.io.

Erlang's maximum number of simultaneous open ports?

Does the Erlang TCP/IP library have any inherent limitations? I've done some searching but can't find a definitive answer.
I have set the ERL_MAX_PORTS environment variable to 12000 and configured Yaws to use unlimited connections.
I've written a simple client application that connects to an appmod I've written for Yaws, and I test the number of simultaneous connections by launching X clients at the same time.
I find that when I get to about 100 clients, the Yaws server stops accepting new TCP connections and the client errors out with:
Error in process with exit value: {{badmatch,{error,socket_closed_remotely}}
I know there must be a limit to the number of simultaneous open connections, but 100 seems really low. I've looked through all the Yaws documentation and removed every limit on connections.
This is on a 2.16Ghz Intel Core 2 Duo iMac running Snow Leopard.
A quick test on a Vista Machine shows that I get the same problems at about 300 connections.
Is my test unreasonable? I.e. is it silly to open 100+ connections simultaneously to test Yaws' concurrency?
Thanks.
It seems you hit a system limit; try increasing the maximum number of open files using:
$ ulimit -n 500
See also: Python on Snow Leopard, how to open >255 sockets?
Erlang itself has a limit of 1024:
From http://www.erlang.org/doc/man/erlang.html
The maximum number of ports that can be open at the same time is 1024 by default, but can be configured by the environment variable ERL_MAX_PORTS.
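You can check that the variable actually took effect from the Erlang shell; on reasonably recent Erlang/OTP releases, erlang:system_info(port_limit) reports the active limit (the 12000 below just mirrors the value set earlier in the question):
$ ERL_MAX_PORTS=12000 erl
1> erlang:system_info(port_limit).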
EDIT:
The listen() system call has a backlog parameter that determines how many pending connection requests can be queued. Please check whether a delay between connection attempts helps; this could be your problem.
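In Erlang, this corresponds to the backlog option of gen_tcp:listen/2; the port number and backlog value here are only illustrative:
{ok, ListenSocket} = gen_tcp:listen(8080, [binary, {reuseaddr, true}, {backlog, 1024}]).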
All Erlang system limits are reported in the Erlang Efficiency Guide:
http://erlang.org/doc/efficiency_guide/advanced.html#id2265856
Reading from the open ports section:
The maximum number of simultaneously open Erlang ports is by default 1024. This limit can be raised up to at most 268435456 at startup (see environment variable ERL_MAX_PORTS in erlang(3)). The maximum limit of 268435456 open ports will at least on a 32-bit architecture be impossible to reach due to memory shortage.
After trying out everybody's suggestions and scouring the Erlang docs, I've come to the conclusion that my problem is Yaws not being able to keep up with the load.
On the same machine, an Apache HttpComponents web server (non-blocking I/O) does not have the same problems handling connections at the same thresholds.
Thanks for all your help. I'm going to move on to other Erlang-based web servers, like Mochiweb.

Monitoring CPU Core Usage on Terminal Servers

I have multi-core Windows 2003 terminal servers, and I'm looking for a way to monitor individual CPU core usage on them. It is possible for an end user to have a runaway process (e.g. Internet Explorer or Outlook): the core for that process may spike to near 100% while the other cores stay normal. The overall CPU usage on the server is just the average across all cores, so if 7 of the cores on an 8-core server are idle and the 8th is running at 100%, the server shows only 1/8 = 12.5% usage.
What utility can I use to monitor multiple servers? If the CPU usage for a core is "high", what would I use to determine the offending process, and how could I then automatically kill that process if it is on the 'approved kill process' list?
A product from http://www.packettrap.com/ called PT360 would be perfect, except that it uses SNMP to get its data, and SNMP appears to give only total CPU usage, not a per-core breakdown. Take a look at their Dashboard option with the CPU gauge 'gadget'; that's exactly what I need, if only it worked at the core level.
Any ideas?
Individual CPU usage is available through the standard Windows performance counters. You can monitor this in perfmon.
However, it won't give you the result you are looking for. Unless a thread/process has been explicitly bound to a single CPU, a runaway process will not spike one core to 100% while all the others idle; it will bounce around between all the processors. I don't know why Windows schedules threads this way, presumably because there is no gain from forcing affinity and some loss due to having to handle interrupts on particular cores.
You can see this easily enough in Task Manager: watch the individual CPU graphs when you have a single compute-bound process running.
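For ad-hoc checks from the command line, the stock typeperf tool can sample the same per-core counters (the 5-second interval is arbitrary):
typeperf "\Processor(*)\% Processor Time" -si 5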
You can give Spotlight on Windows a try. You can graphically drill into all sorts of performance and load indicators. It's freeware.
perfmon from Microsoft can monitor each individual CPU. perfmon also works remotely, and you can monitor various aspects of Windows with it.
I'm not sure it helps find runaway processes, though, because the Windows scheduler does not always execute a process on the same CPU: on your 8-CPU machine you will see 12.5% usage on all CPUs if one process runs away.
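If you do pin down an offending process, the built-in command-line tools can list and kill it; iexplore.exe below stands in for whatever is on your 'approved kill process' list:
tasklist /FI "IMAGENAME eq iexplore.exe"
taskkill /IM iexplore.exe /F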
