Provide server-side metrics on API usage - docker

I have a service deployed through a stack on a swarm.
Let's say:
someStack:
  SomeServer: ...
  myApplication: ...
Above this sits a traefik server that lets me call the various services, and also map different URLs to APIs/sub-services, since the service above actually provides many sub-services, such as (seen from the container's POV):
/myApplication/getUsers/For/Area/51
/myApplication/getUsers/Admins
/myApplication/ping/enclyclopedia&code=42
/myApplication/bricks/list&code=0937
along with other endpoints from other stacks / services (/otherApplication/toto, /yaApp/titi, etc.)
The matching endpoints (from traefik's POV) being:
/users&area=51
/getadmins
/ask&code=42
/listbricks&code=0937
These work well. Now I would like to compute usage stats for each endpoint (e.g. using Grafana), relative to myApplication's overall stats.
Something like:
/users: 57% of myApplication calls, 33% of myApplication total response time, 15% of myApplication total errors
/getadmins: 33% of myApplication calls, 7% of myApplication total response time, 85% of myApplication total errors
/ask: 7% of myApplication calls, 40% of myApplication total response time, 0% of myApplication total errors
/listbricks: 3% of myApplication calls, 20% of myApplication total response time, 0% of myApplication total errors
The metrics I have so far are those provided by cAdvisor and traefik itself; I am using Prometheus to pull them and build metrics on top of them.
Regarding traefik's metrics, I can't see any that match my need...
I do not own 'myApplication', and thus cannot really implement some kind of instrumentation from the inside (or at least not in a trivial way).
I could also build metrics over traefik's access logs, but I am mainly wondering whether the existing metrics, or some trick with them, could let me compute such stats on my application's usage; a sketch of what I mean is below.
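To illustrate, the kind of query I imagine (a rough sketch only: it assumes each endpoint is declared as its own traefik router/service so that traefik v2's traefik_service_requests_total counter carries a distinguishing label; I have not verified these names):
# share of myApplication calls going to /users over the last 5 minutes (all names assumed)
sum(rate(traefik_service_requests_total{service="users@docker"}[5m]))
/
sum(rate(traefik_service_requests_total{service=~"(users|getadmins|ask|listbricks)@docker"}[5m]))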
Any idea?

Related

Neo4j GraphSage training does not log anything

I am extracting graph embeddings by training the GraphSAGE algorithm. I am working on a large graph consisting of 82,339,589 nodes and 219,521,164 edges. When I check with the ":queries" command, the query is listed as running. The algorithm started 6 days ago. When I look at the logs with "docker logs xxx", the last entries listed are:
2021-12-01 12:03:16.267+0000 INFO Relationship Store Scan (RelationshipScanCursorBasedScanner): Imported 352,492,468 records and 0 properties from 16247 MiB (17,036,668,320 bytes); took 59.057 s, 5,968,663.57 Relationships/s, 275 MiB/s (288,477,487 bytes/s) (per thread: 1,492,165.89 Relationships/s, 68 MiB/s (72,119,371 bytes/s))
2021-12-01 12:03:16.269+0000 INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:56143] ] LOADING
INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:56143] ] LOADING Actual memory usage of the loaded graph: 8602 MiB
INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:64076] ] GraphSageTrain :: Start
Is there a way to see detailed logs about the training process? Is it normal for it to take 6 days on a graph of this size?
It is normal for GraphSAGE to take a long time compared to FastRP or Node2Vec. Starting in GDS 1.7, you can use:
CALL gds.beta.listProgress(jobId: String)
YIELD
jobId,
taskName,
progress,
progressBar,
status,
timeStarted,
elapsedTime
If you call without passing in a jobId, it will return a list of all running jobs. If you call with a jobId, it will give you details about a running job.
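For example, to list all currently running jobs (a minimal sketch):
CALL gds.beta.listProgress()
YIELD jobId, taskName, progress
RETURN jobId, taskName, progress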
This query will summarize the details for job 03d90ed8-feba-4959-8cd2-cbd691d1da6c.
CALL gds.beta.listProgress("03d90ed8-feba-4959-8cd2-cbd691d1da6c")
YIELD taskName, status
RETURN taskName, status, count(*)
Here's the documentation for progress logging. The system monitoring procedures might also be helpful to you.

Measure the duration of x amount of requests while using K6

I would like to use K6 to measure the time it takes for an API to process 1,000,000 requests in total.
Scenario
Execute 1,000,000 GET requests (1 million in total) with 50 concurrent users/threads, so every user/thread executes 20,000 requests.
I've managed to create such a scenario with Artillery.io, but I'm not sure how to create the same one using K6. Could you point me in the right direction? (Most examples use a pre-defined duration, but in this case I don't know the duration -> that is exactly what I want to measure.)
Artillery yml
config:
  target: 'https://localhost:44000'
  phases:
    - duration: 1
      arrivalRate: 50
scenarios:
  - flow:
      - loop:
          - get:
              url: "/api/Test"
        count: 20000
K6 js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  iterations: 1000000,
  vus: 50
};

export default function () {
  let res = http.get('https://localhost:44000/api/Test');
  check(res, { 'success': (r) => r.status === 200 });
}
The iterations + vus you've specified in your k6 script options would result in a shared-iterations executor, where VUs will "steal" iterations from the common pile of 1m iterations. So, the faster VUs will complete slightly more than 20k requests, while the slower ones will complete slightly less, but overall you'd still get 1 million requests. And if you want to see how quickly you can complete 1m requests, that's arguably the better way to go about it...
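For reference, here is the explicit scenario form of that shorthand (a sketch; the scenario name million_hits is arbitrary, and maxDuration is added per the advice below):
export let options = {
  scenarios: {
    million_hits: {
      executor: 'shared-iterations',
      vus: 50,
      iterations: 1000000,
      maxDuration: '2h',
    },
  },
};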
However, if having exactly 20k requests per VU is a strict requirement, you can easily do that with the aptly named per-vu-iterations executor:
export let options = {
  discardResponseBodies: true,
  scenarios: {
    'million_hits': {
      executor: 'per-vu-iterations',
      vus: 50,
      iterations: 20000,
      maxDuration: '2h',
    },
  },
};
In any case, I strongly suggest setting maxDuration to a high value, since the default value is only 10 minutes for either executor. And discardResponseBodies will probably help with the performance, if you don't care about the response body contents.
By the way, you can also do in k6 what you did in Artillery: have 50 VUs start a single iteration each, and just loop the http.get() call 20,000 times inside that one iteration, as sketched below. You won't get a very nice UX that way (the k6 progress bars will be frozen until the very end, since k6 has no idea of your actual progress inside each iteration), but it will also work.
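A minimal sketch of that approach, reusing the URL from the question (per-vu-iterations with a single iteration guarantees exactly one loop per VU):
import http from 'k6/http';

export let options = {
  scenarios: {
    million_hits: {
      executor: 'per-vu-iterations',
      vus: 50,
      iterations: 1,
      maxDuration: '2h',
    },
  },
};

export default function () {
  // 20,000 sequential requests inside one iteration; progress bars stay frozen meanwhile
  for (let i = 0; i < 20000; i++) {
    http.get('https://localhost:44000/api/Test');
  }
}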

Need help in understanding httperf num_calls and num_conns

When I run httperf with the following options, the output is easy to understand.
Options: make a total of 10 connections (--num-conns) at a rate of 10 connections/second (--rate), with 2 request calls per connection (--num-calls).
Output: 10 connections with 20 request calls.
httperf -v --server www.example.com --wlog=n,$HOME/tmp/reqs.txt_httperf --rate=10 --num-conns=10 --num-calls=2 --hog
Total: connections 10 requests 20 replies 10 test-duration 1.575 s
However, when I use the following options, the httperf output is confusing.
Options: make a total of 4 connections (--num-conns) at a rate of 10 connections/second (--rate), with 6 request calls per connection (--num-calls).
httperf -v --server www.example.com --wlog=n,$HOME/tmp/reqs.txt_httperf --rate=10 --num-conns=4 --num-calls=6 --hog
Total: connections 4 requests 8 replies 4 test-duration 0.455 s
It seems like when num-calls is greater than num-conns, the number of requests made is 2*num-conns.
I am not following how num-calls can be greater than num-conns. Am I missing anything?
The reason num-calls can be greater than num-conns is that on each connection you can make multiple HTTP transactions (a.k.a. "calls"): num-calls is a per-connection setting, so the expected total is num-conns * num-calls. For example, with num-conns = 4 and 2 transactions per connection, you get 8 requests in total. In your second run, only 2 of the requested 6 calls completed on each connection, most likely because the server closed the keep-alive connection early, so httperf could not issue the remaining calls.
Hope this helps.

Erlang/OTP - Timing Applications

I am interested in benchmarking different parts of my program for speed. I have tried using info(statistics) and erlang:now().
I need to know, down to the microsecond, what the average speed is. I don't know why I am having trouble with a script I wrote.
It should be able to start anywhere and end anywhere. I ran into a problem when I tried starting it on a process that may be running up to four times in parallel.
Is there anyone who already has a solution to this issue?
EDIT:
Willing to give a bounty if someone can provide a script to do it. It needs to follow spawns through multiple processes. I cannot accept a function like timer, at least in the implementations I have seen: it only traverses one process, and even then major editing is necessary for a full test of a full program. Hope I made it clear enough.
Here's how to use eprof, likely the easiest solution for you:
First you need to start it, like most applications out there:
23> eprof:start().
{ok,<0.95.0>}
Eprof supports two profiling modes. You can call it and ask to profile a certain function, but we can't use that because other processes would mess everything up. We need to manually start the profiling and tell it when to stop (this is why you won't have an easy script, by the way).
24> eprof:start_profiling([self()]).
profiling
This tells eprof to profile everything that will be run and spawned from the shell. New processes will be included here. I will run some arbitrary multiprocessing function I have, which spawns about 4 processes communicating with each other for a few seconds:
25> trade_calls:main_ab().
Spawned Carl: <0.99.0>
Spawned Jim: <0.101.0>
<0.100.0>
Jim: asking user <0.99.0> for a trade
Carl: <0.101.0> asked for a trade negotiation
Carl: accepting negotiation
Jim: starting negotiation
... <snip> ...
We can now tell eprof to stop profiling once the function is done running.
26> eprof:stop_profiling().
profiling_stopped
And we want the logs. Eprof will print them to screen by default. You can ask it to also log to a file with eprof:log(File). Then you can tell it to analyze the results. We tell it to collapse the run time from all processes into a single table with the option total (see the manual for more options):
27> eprof:analyze(total).
FUNCTION                 CALLS      %  TIME  [uS / CALLS]
--------                 -----    ---  ----  [----------]
io:o_request/3              46   0.00     0  [      0.00]
io:columns/0                 2   0.00     0  [      0.00]
io:columns/1                 2   0.00     0  [      0.00]
io:format/1                  4   0.00     0  [      0.00]
io:format/2                 46   0.00     0  [      0.00]
io:request/2                48   0.00     0  [      0.00]
...
erlang:atom_to_list/1        5   0.00     0  [      0.00]
io:format/3                 46  16.67  1000  [     21.74]
erl_eval:bindings/1          4  16.67  1000  [    250.00]
dict:store_bkt_val/3       400  16.67  1000  [      2.50]
dict:store/3               114  50.00  3000  [     26.32]
And you can see that most of the time (50%) is spent in dict:store/3. 16.67% is taken in outputting the result, and another 16.67% is taken by erl_eval (this is what you get for running short functions in the shell -- parsing them takes longer than running them).
You can then start going from there. Those are the basics of profiling run times with Erlang. Handle with care: eprof can be quite a load on a production system, or for functions that run for too long.
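If you want a per-process breakdown instead of the collapsed table, pass procs instead of total (see the eprof manual):
28> eprof:analyze(procs).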
You can use eprof or fprof.
The normal way to do this is with timer:tc. Here is a good explanation.
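For example, a minimal timer:tc sketch (timer:tc/3 returns the elapsed wall-clock time in microseconds together with the function's result; the lists:sort call here is just a stand-in workload):
%% time a single call, in microseconds
{Time, Value} = timer:tc(lists, sort, [lists:reverse(lists:seq(1, 100000))]),
io:format("took ~p microseconds~n", [Time]).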
I can recommend this tool: https://github.com/virtan/eep
You will get something like this as a result: https://raw.github.com/virtan/eep/master/doc/sshot1.png
Step-by-step instructions for profiling all processes on a running system:
On target system:
1> eep:start_file_tracing("file_name"), timer:sleep(20000), eep:stop_tracing().
$ scp -C $PWD/file_name.trace desktop:
On desktop:
1> eep:convert_tracing("file_name").
$ kcachegrind callgrind.out.file_name

Monitoring Mongrel queue length

I have an Apache + HAProxy + Mongrel cluster setup. I want to receive alerts whenever my Mongrel queue length gets too high.
How do I get the current Mongrel queue length and make it available to alerting tools such as Monit and Nagios?
I know that HAProxy has information about the Mongrel queue, as it intelligently sends requests to the least busy Mongrel in the cluster. I wonder how it finds out. I need a similar mechanism to generate alerts and/or restart Mongrels when such a condition arises.
Add this to your haproxy config
stats uri /haproxy/hastats
Then use lynx to get the stats like this:
(assuming haproxy runs on port 10000 - adjust to suit)
lynx --dump http://my-server:10000/haproxy/hastats
There will be a line for each of your server entries in the haproxy config file, telling you whether it's up or down and how long its queue is, like this:
Server                           Queue      Sessions                    Errors
Name      Weight Status Act. Bck. Curr. Max. Curr. Max. Limit Cumul.     Conn. Resp. Sec. Check Down
primary   1      UP     Y    -    0     0    68    386  -     134385861  207   699   0    7028  150
secondary 1      UP     Y    -    0     0    71    248  -     134464984  216   551   0    7129  98
Now all you need is a script to get the current queue (column 6) and feed it into nagios, and you're away!
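A rough sketch of such a script (the host, port, and the server names primary/secondary are the placeholders from above; adjust to your config):
#!/bin/sh
# dump the haproxy stats page and print each server's current queue depth (column 6)
lynx --dump http://my-server:10000/haproxy/hastats \
  | awk '/^(primary|secondary) /{ print $1, $6 }'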
New Relic's RPM product (www.newrelic.com) maintains information on Mongrel queue length. They have an API that you may be able to use to get near real-time feedback on queue length and adjust load balancing accordingly.
You can get more information on the API at: https://newrelic.tenderapp.com/faqs/docs/data-api
Hopefully that provides some help.
