I am struggling to get munin reporting working when running a Tsung load test.
My setup is as follows.
Web site staging server (staging4): 2 CPUs
Tsung server: 2 CPUs
My Tsung server has an SSH tunnel to staging4 on port 4950; see my tsung.xml configuration below:
<monitoring>
<monitor host="localhost" type="munin">
<munin port="4950" />
</monitor>
</monitoring>
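For completeness, the tunnel itself was created with something along these lines (the assumption here is that munin-node on staging4 listens on its default port 4949):
ssh -f -N -L 4950:localhost:4949 staging4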
When I start my load test I get the following error message every 10 seconds:
=INFO REPORT==== 16-Nov-2011::16:04:09 ===
ts_os_mon_munin:(4:<0.72.0>) CPU usage value from munin too high, skip (host "ip-10-48-177-212.housetrip.com" , cpu 8761644.1)
I may be wrong, but I think this is because our staging4 server has 2 CPUs, so the resulting CPU % is greater than 100%.
I checked through the Tsung code and there didn't seem to be an option to set the number of CPUs in the monitoring XML element: https://github.com/processone/tsung/blob/master/src/tsung_controller/ts_config.erl
However, there does seem to be a CPU setting in the munin plugin wrapper: https://github.com/processone/tsung/blob/master/src/tsung_controller/ts_os_mon_munin.erl
Has anyone come across this before? Is there any way I can get the munin values to be returned in my log file?
Any suggestions would be greatly appreciated.
Many thanks
I haven't worked with munin, but I know that Tsung doesn't handle multicore CPUs very well.
To avoid Tsung crashes when generating a massive load from a single client, I used the following workaround on a 4-core CPU.
<clients>
<client host="myhostname" use_controller_vm="false" weight="1"/>
<client host="myhostname" use_controller_vm="false" weight="1"/>
<client host="myhostname" use_controller_vm="false" weight="1"/>
<client host="myhostname" use_controller_vm="false" weight="1"/>
</clients>
As you can see, the trick is to set up one Tsung client Erlang node per available core.
Maybe this trick can solve your munin problem also.
I am currently trying to host a small Spring Boot backend with an OAuth server via Docker.
The vServer I chose unfortunately has a limit of 400 processes/kernel threads set in /proc/user_beancounters.
When starting the jwilder/nginx-proxy, the Spring Boot app with its DB (set to only 1 Tomcat thread), and the Keycloak server with its DB, this limit is exceeded and everything stalls.
My approach is to limit the workers in Keycloak, since I don't need that many.
14:35:23,948 INFO [org.wildfly.extension.io] (ServerService Thread Pool -- 40) WFLYIO001: Worker 'default' has auto-configured to 8 IO threads with 64 max task threads based on your 4 available processors
But I really can't find any explanation on how to configure this parameter in Keycloak.
So the question: How can I configure the limit? Or maybe there is a better approach to the problem?
You can configure this in the standalone.xml of the WildFly/Keycloak instance. The parameters are under the io subsystem:
<subsystem xmlns="urn:jboss:domain:io:3.0">
<worker name="default" io-threads="8" task-max-threads="64"/>
<buffer-pool name="default"/>
</subsystem>
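If you'd rather not edit the XML by hand, the same attributes can be changed through jboss-cli; a minimal sketch (the values here are just examples, pick whatever fits your 400-thread budget):
$JBOSS_HOME/bin/jboss-cli.sh --connect
/subsystem=io/worker=default:write-attribute(name=io-threads, value=2)
/subsystem=io/worker=default:write-attribute(name=task-max-threads, value=16)
:reload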
I'm trying to set up a dask distributed cluster. I've installed dask on three machines to get started:
laptop (where searchCV gets called)
scheduler (small box where the dask scheduler process lives)
HPC (Large box expected to do the work)
I have dask[complete] installed on the laptop and dask on the other machines.
The worker and scheduler start fine and I can see the dashboards, but I can't send them anything. Running GridSearchCV on the laptop gets a result, but it comes from the laptop alone; the worker sits idle.
All machines are Windows 7 (the HPC is Windows 10). I've checked the ports with netstat and it appears that it is really listening where it is supposed to.
When running a small example I get the following error:
from dask.distributed import Client

scheduler_address = 'tcp://10.X.XX.XX:8786'
client = Client(scheduler_address)

def square(x):
    return x ** 2

def neg(x):
    return -x

A = client.map(square, range(10))
B = client.map(neg, A)
total = client.submit(sum, B)
print(total.result())
INFO - Batched Comm Closed: in <closed TCP>: ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
distributed.comm.core.CommClosedError
tornado.application - ERROR - Exception in callback <bound method Client._heartbeat of <Client: scheduler='tcp://10.X.XX.XX:8786' processes=1 cores=10
I've also filed a bug report, as I don't know if this is a bug or ineptitude on my part (I'm guessing the latter).
Running client.get_versions(check=True) revealed all sorts of issues despite a clean install with -U. Making the environments identical fixed that. The laptop can have different versions of things installed; at least it worked for the differences I have, YMMV.
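A minimal sketch of the check that surfaced the mismatches (the scheduler address is a placeholder):
from dask.distributed import Client

client = Client('tcp://10.X.XX.XX:8786')
# complains about any packages whose versions differ between the client,
# the scheduler and the workers; returns a dict of versions otherwise
print(client.get_versions(check=True))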
We have a popular iPhone app where people duel each other a la Wordfeud. We have almost 1M registered users today.
During peak hours the app gets really long response times, and there are also quite a lot of timeouts. We have tried to find the bottleneck, but have had a hard time doing so.
CPU, memory and I/O are all under 50% on all servers. The problem ONLY appears during peak hours.
Our setup
1 VPS with nginx (1.1.9) as load balancer
4 front servers with Ruby (1.9.3p194) on Rails (3.2.5) / Unicorn (4.3.1)
1 database server with PostgreSQL 9.1.5
The database logs don't show enough long request times to explain all the timeouts shown in the nginx error log.
We have also tried to build and run the app directly against the front servers (during peak hours, when all other users are going through the load balancer). The surprising thing is that the app bypassing the load balancer is quick as a bullet even during peak hours.
NGINX SETTINGS
worker_processes=16
worker_connections=4096
multi_accept=on
LINUX SETTINGS
fs.file-max=13184484
net.ipv4.tcp_rmem="4096 87380 4194304"
net.ipv4.tcp_wmem="4096 16384 4194304"
net.ipv4.ip_local_port_range="32768 61000"
Why is the app bypassing the load balancer so fast?
Can nginx as a load balancer be the bottleneck?
Is there any good way to compare timeouts in nginx with timeouts in the unicorns to see where the problem resides?
Depending on your settings, nginx might be the bottleneck...
Check/tune the following settings in nginx (a combined sketch follows the two lists below):
the worker_processes setting (should be equal to the number of cores/cpus)
the worker_connections setting (should be very high if you have lots of connections at peak)
set multi_accept on;
if on Linux, make sure nginx is using epoll (the use epoll; directive)
Check/tune the following settings of your OS:
number of allowed open file descriptors (sysctl -w fs.file-max=999999 on Linux)
tcp read and write buffers (sysctl -w net.ipv4.tcp_rmem="4096 4096 16777216" and sysctl -w net.ipv4.tcp_wmem="4096 4096 16777216" on Linux)
local port range (sysctl -w net.ipv4.ip_local_port_range="1024 65536" on Linux)
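Put together, a sketch of what that could look like (the numbers are examples from the lists above, not recommendations):
# nginx.conf
worker_processes 4;            # = number of cores
events {
    worker_connections 4096;
    multi_accept on;
    use epoll;                 # on Linux
}

# /etc/sysctl.conf equivalent of the sysctl commands above
fs.file-max = 999999
net.ipv4.tcp_rmem = 4096 4096 16777216
net.ipv4.tcp_wmem = 4096 4096 16777216
net.ipv4.ip_local_port_range = 1024 65536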
Update:
so you have 16 workers and 4096 connections per worker
which means a maximum of 4096*16 = 65536 concurrent connections
you probably have multiple requests per browser (ajax, css, js, the page itself, any images on the page, ...), let's say 4 requests per browser
that allows for slightly over 16k concurrent users; is that enough for your peaks?
How do you set up your upstream server group and what is the load balancing method you use?
It's hard to imagine that Nginx itself is the bottleneck. Is it possible that some upstream app servers get hit much more than others and start to refuse connections because the backlog is full? See this load balancing issue on Heroku and see if you can get more help there.
As of nginx version 1.2.2, nginx provides least_conn. That might be an easy fix. I haven't tried it myself yet.
Specifies that a group should use a load balancing method where a
request is passed to the server with the least number of active
connections, taking into account weights of servers. If there are
several such servers, they are tried using a weighted round-robin
balancing method.
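For reference, a sketch of what that looks like in the nginx config (the upstream name and backend addresses are placeholders):
upstream unicorns {
    least_conn;
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    server 10.0.0.3:8080;
    server 10.0.0.4:8080;
}
server {
    listen 80;
    location / {
        proxy_pass http://unicorns;
    }
}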
I'm using Tomcat 6 and Hyperic HQ for monitoring via JMX.
The issue is the following:
Hyperic, over time, opens hundreds of JMX connections and never closes them. After a few hours our Tomcat server is using 100% CPU without doing anything.
Once I stop the Hyperic agent, Tomcat goes back to 0-1% CPU.
Here is what we are seeing in VisualVM:
http://forums.hyperic.com/jiveforums/servlet/JiveServlet/download/1-11619-37096-2616/Capture.PNG
I don't know if this is a Hyperic issue or not, but I wonder if there is an option to fix it via Tomcat/Java configuration? The reason I'm not sure whether this is a Hyperic or a Tomcat/Java configuration issue is that when we use Hyperic with other standard Java daemons we don't see the same connection leak.
The JMX is exposed using Spring, and it's working great when connecting with JMX clients (JConsole/VisualVM). When I close the client, I see that the number of connections drops by one.
Is there anything we can do to fix this via Java configuration (e.g. forcing it to close a connection that has been open for more than X seconds)?
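To make the question concrete, this is roughly the direction I was imagining, as a plain-Java sketch rather than our actual Spring setup; the jmx.remote.x.server.connection.timeout key is my guess at the relevant (non-standard) knob, not something I have verified:
import java.lang.management.ManagementFactory;
import java.rmi.registry.LocateRegistry;
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServer;
import javax.management.remote.JMXConnectorServer;
import javax.management.remote.JMXConnectorServerFactory;
import javax.management.remote.JMXServiceURL;

public class TimedJmxConnector {
    public static void main(String[] args) throws Exception {
        // RMI registry that the connector URL below points at
        LocateRegistry.createRegistry(9999);
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");

        Map<String, Object> env = new HashMap<String, Object>();
        // Ask the connector server to drop connections idle for more than 60s.
        // This "x" property is non-standard and my guess at the right setting.
        env.put("jmx.remote.x.server.connection.timeout", Long.valueOf(60000));

        JMXConnectorServer server =
                JMXConnectorServerFactory.newJMXConnectorServer(url, env, mbs);
        server.start();
    }
}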
One more thing, in tomcat we see (from time to time) the following message (while hyperic is running):
Mar 7, 2011 11:30:00 AM ServerCommunicatorAdmin reqIncoming
WARNING: The server has decided to close this client connection.
Thanks
I'm trying to run a NAS-UPC benchmark to study its profile. UPC uses MPI to communicate with remote processes.
When I run the benchmark with 64 processes, I get the following error:
upcrun -n 64 bt.C.64
"Timeout in making connection to remote process on <<machine name>>"
Can anybody tell me why this error occurs?
This probably means that you're failing to spawn the remote processes: upcrun delegates that to a per-conduit mechanism, which may involve your scheduler (if any). My guess is that you're depending on ssh-type remote access, and that's failing, probably because you don't have keys, an agent, or host-based trust set up. Can you ssh to your remote nodes without a password? Is the environment sane on the remote nodes (paths, etc.)?
"upcrun -v" may illuminate the problem, even without resorting to the man page ;)