I have a uwsgi process running a flask application. There is haproxy (running in mode http) sitting between the client and the application.
I occasionally see a haproxy termination state of "SD--", with Tc = 0, Tr = -1, and a returned HTTP code of -1. This means haproxy encountered an explicit TCP disconnection from the uwsgi server.
Looking at the uwsgi logs, I found that the server was processing other requests normally at the same time, but the affected request never reached the server.
The only strange thing in the uwsgi logs at that point in time is that the number of requests managed by the current uwsgi worker is greater than the total number of requests managed by the whole uwsgi app,
like this:
[pid: 22759|app: 0|req: **47188**/**47178**] * POST * => generated 84 bytes in 970 msecs (HTTP/1.1 200) 2 headers in 71 bytes (3 switches on core 98)
I am wondering whether this is abnormal, and in what scenarios these counters can end up this way.
I have an RPi running NGINX and uWSGI, serving a web page and an API via uWSGI.
The web page works fine, both locally and from the web.
The API works locally, but not via the web. My guess is that it's either the router or the NGINX configuration.
I am using cloudflare for the DNS, and all appears fine there.
I can GET / POST locally using Postman, but not via the web address. I would greatly appreciate any ideas on where to look.
Output from uwsgi is:
*** Starting uWSGI 2.0.20 (32bit) on [Sat May 14 12:35:08 2022] ***
compiled with version: 8.3.0 on 06 October 2021 05:59:48
os: Linux-5.10.103-v7l+ #1529 SMP Tue Mar 8 12:24:00 GMT 2022
nodename: xxx
machine: armv7l
clock source: unix
pcre jit disabled
detected number of CPU cores: 4
current working directory: /var/www/xxx.xxx/public
detected binary path: /home/pi/.local/bin/uwsgi
*** WARNING: you are running uWSGI without its master process manager ***
your processes number limit is 12393
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on :9090 fd 4
spawned uWSGI http 1 (pid: 3176)
uwsgi socket 0 bound to TCP address 127.0.0.1:34881 (port auto-assigned) fd 3
Python version: 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0xd5c950
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 64408 bytes (62 KB) for 1 cores
*** Operational MODE: single process ***
<<<<<<<<<<<<<<<< Loaded script >>>>>>>>>>>>>>>>
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0xd5c950 pid: 3175 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI worker 1 (and the only) (pid: 3175, cores: 1)
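For reference, a minimal sketch of how nginx is commonly wired to a uWSGI socket; the socket address comes from the log above, but the server_name and the /api location are assumptions, and the root just reuses the working directory from the log:

server {
    listen 80;
    server_name api.example.com;        # assumption: whatever name Cloudflare points at the Pi
    root /var/www/xxx.xxx/public;       # assumption: static pages served directly by nginx

    location /api {                     # assumption: the API lives under /api
        include uwsgi_params;
        uwsgi_pass 127.0.0.1:34881;     # the (auto-assigned) uwsgi socket from the log above
    }
}

Note that in the log above uWSGI is also running its own HTTP router on :9090, which serves requests directly rather than through nginx.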
I have a cluster consisting of 4 nodes in total, 3 servers and 1 management node, working properly.
At the beginning of the month we planned to patch the OS, and we started with the first server node using this procedure:
Stop service
OS patching
Server restart
Start service
The service on the first patched node, named "serverA", fails to restart with this error:
Log entries from the cluster join:
serverA:
| INFO | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: shared=true ordered=false failed to connect to peer 10.237.110.195( Server serverB:9993):1024 because: java.net.ConnectException: Connection timed out (Connection timed out)
| WARN | region-dm-12 | ache.geode.internal.tcp.Connection | --> Connection: Attempting reconnect to peer 10.237.110.195( Server serverB:9993):1024
ServerMgmt:
| WARN | pool-3-thread-1 | tributed.internal.ReplyProcessor21 | --> 15 seconds have elapsed while waiting for replies: <CreateRegionProcessor$CreateRegionReplyProcessor 44180 waiting for 1 replies from [10.237.110.194( Server serverA:632):1024]> on 10.237.110.225( Management:6033):1024 whose current membership list is: [[10.237.110.196( Server serverC:16805):1024, 10.237.110.225( Management:6033):1024, 10.237.110.195( Server serverB:9993):1024, 10.237.110.194( Server serverA:632):1024]]
The connection between the systems was verified with tcpdump; UDP on port 1024 is working fine.
We have tried redeploying the service and made numerous attempts, but we always get the same error during startup.
Any suggestions? Thank you.
Marco.
I think that to see this error message, serverA was probably able to send UDP messages to serverB but is failing to create a TCP connection. It's hard to say why, though: a firewall issue, some TCP configuration issue, ...?
Check whether serverB has anything interesting in its logs. Since you are using tcpdump, you should be watching for the TCP connection to serverB:9993, since it looks like that is what failed.
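As a quick way to test the TCP side outside of Geode, here is a minimal sketch in Python; the address is serverB's from the log, and the port follows the reading of the log line above, so treat both as assumptions:

import socket

def check_tcp(host, port, timeout=5.0):
    """Try a plain TCP connection and report the result."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except OSError as exc:
        return f"failed: {exc}"

# serverB's address and the port referenced above (assumed).
print(check_tcp("10.237.110.195", 9993))

Running this from serverA (and, for comparison, from serverC or the management node) would show whether the TCP connection itself is being blocked.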
There is no firewall between the systems. We analyzed the network connection again during startup of node A, and we can see that communication can be established between all systems. But what we detected is that on port 2323, which is configured as the locator, the node sends packets to the B and C nodes but only receives packets back from the C node, not from the B node. For us this is again a sign that the B node has an issue. Is there a way to verify our assumption on the B node?
Node A IP: .194
Node B IP: .195
Node C IP: .196
Management IP: .225
My haproxy.cfg is:
global
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 4096
user haproxy
group haproxy
daemon
stats socket /var/lib/haproxy/stats
defaults
option forwardfor
log global
option httplog
log 127.0.0.1 local3
option dontlognull
retries 3
option redispatch
timeout connect 5000ms
timeout client 5000ms
timeout server 5000ms
listen stats
bind *:9000
mode http
..................................
..............................................
backend testhosts
mode http
balance roundrobin
option tcplog
option tcp-check
# cookie SERVERID
option httpchk HEAD /sabrix/scripts/menu-common.js
server host1 11.11.11.11:9080 check cookie host1
server host2 22.22.22.22:9080 check cookie host2
The log shows:
2020-08-19T16:02:14+08:00 localhost haproxy[22439]: Server Host2 is DOWN, reason: Layer7 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2020-08-19T16:02:14+08:00 localhost haproxy[22439]: Server Host2 is DOWN, reason: Layer7 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2020-08-19T16:02:18+08:00 localhost haproxy[12706]: Server Host2 is DOWN, reason: Layer7 timeout, check duration: 2001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2020-08-19T16:02:19+08:00 localhost haproxy[12706]: Server Host2 is DOWN, reason: Layer7 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2020-08-19T16:02:27+08:00 localhost haproxy[12706]: Server Host2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 138ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2020-08-19T16:02:30+08:00 localhost haproxy[22439]: Server Host2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2020-08-19T16:02:30+08:00 localhost haproxy[22439]: Server Host2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 1ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
2020-08-19T16:02:30+08:00 localhost haproxy[12706]: Server Host2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
At that time (when the host is marked as down), the call returns a 504 error rather than 200.
2020-08-19T20:16:02+08:00 localhost haproxy[3774]: 39898 22.22.22.22 504 POST /url/services
2020-08-19T20:16:02+08:00 localhost haproxy[3774]: 39909 11.11.11.11 200 POST /url/services
My question:
I have set the timeouts to 5000ms, so why was the error reported when the response time of backend server #2 was over 2000ms? Can I increase the timeout to remove the error?
I believe that you are looking for timeout check.
If "timeout check" is not set, haproxy uses "inter" as the complete check timeout (connect + read).
If left unspecified, inter defaults to 2000 ms.
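For example, a minimal sketch based on the backend above (the 5s values are assumptions chosen to match your other timeouts, not recommendations):

backend testhosts
    mode http
    balance roundrobin
    option httpchk HEAD /sabrix/scripts/menu-common.js
    timeout check 5s                                   # read timeout for the health-check response
    server host1 11.11.11.11:9080 check inter 5s cookie host1
    server host2 22.22.22.22:9080 check inter 5s cookie host2

With "timeout check" set, haproxy uses it as the additional read timeout for the check instead of falling back to "inter".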
I'm running UWSGI on a server and trying to track when worker processes get OOMed without using dmesg since that requires root privileges. In this environment, if a child was killed with SIGKILL, it's a safe assumption that the OOM killer did that.
UWSGI reports in its logs what signal a child was killed with. This issue (https://github.com/unbit/uwsgi/issues/25) shows an example of logs where a child was reported to have exited with signal 9.
Example:
Oct 20 18:54:28 localhost app: DAMN ! worker 2 (pid: 16100) died, killed by signal 9 :( trying respawn ...
Here's the line of code in UWSGI that's responsible for this message:
if (WIFSIGNALED(waitpid_status)) {
    uwsgi_log("DAMN ! worker %d (pid: %d) died, killed by signal %d :( trying respawn ...\n", thewid, (int) diedpid, (int) WTERMSIG(waitpid_status));
}
https://github.com/unbit/uwsgi/blob/65a8d676f3e63a04b07fdcb4e1f92bb6502f024d/core/master.c#L1074
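Since the master logs the signal number in that message, one workaround sketch is to count those lines yourself; the log path below is hypothetical, and this assumes the log format shown above:

import re

# Matches the master's "DAMN ! worker N (pid: P) died, killed by signal S" line.
DIED_BY_SIGNAL = re.compile(r"died, killed by signal (\d+)")

def count_sigkilled_workers(log_path):
    """Count worker deaths reported with signal 9 (SIGKILL) in a uWSGI log."""
    count = 0
    with open(log_path, errors="replace") as fh:
        for line in fh:
            m = DIED_BY_SIGNAL.search(line)
            if m and int(m.group(1)) == 9:
                count += 1
    return count

# Hypothetical log location; point this at wherever uWSGI writes its log.
print(count_sigkilled_workers("/var/log/uwsgi/app.log"))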
Is there a way to count the number of child processes killed with SIGKILL and surface it as a metric in the metrics subsystem? I'm also wondering whether a child that exceeds the harakiri timeout is counted as being killed by a signal.
uWSGI does seem to keep a per-worker signal count, e.g. "signals": 0, but I'm not sure exactly what that field is counting.
Example from the same GitHub issue:
"pid": 11360,
"requests": 294,
"respawn_count": 38,
"rss": 226373632,
"running_time": 628263,
"signals": 0,
"status": "cheap",
"tx": 5178,
"vsz": 380694528
I deployed my app to DigitalOcean with Passenger and Nginx. I used ApacheBench to see how many requests per second I can get on a static page (a simple hello-world Rails view), but I am only getting 4 requests/s.
ab -n 100 http://107.170.100.242/fo
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking 107.170.100.242 (be patient).....done
Server Software: nginx/1.8.0
Server Hostname: 107.170.100.242
Server Port: 80
Document Path: /fo
Document Length: 5506 bytes
Concurrency Level: 1
Time taken for tests: 22.662 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 632600 bytes
HTML transferred: 550600 bytes
Requests per second: 4.41 [#/sec] (mean)
Time per request: 226.617 [ms] (mean)
Time per request: 226.617 [ms] (mean, across all concurrent requests)
Transfer rate: 27.26 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 0
Processing: 181 226 65.4 204 445
Waiting: 181 226 65.4 204 445
Total: 181 227 65.4 204 446
It should be thousands per second, since I am using Nginx. I have been researching this for an entire day without results; can someone please point me in the right direction?
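(For context: with Concurrency Level 1 the requests run one at a time, so the reported rate is just the reciprocal of the mean per-request time shown above. A quick check:)

# With ab -c 1 the requests are serial, so throughput is bounded by per-request latency.
mean_time_per_request = 0.226617   # seconds, "Time per request" from the ab output above
print(1 / mean_time_per_request)   # ~4.41, matching "Requests per second"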
This would be the nginx config directive that will cause it to bypass the app server and serve the static files directly:
root /var/www/my_app/public;
Are you sure that is right?