Using the wmi_exporter or the scollector_exporter with Prometheus, I am finding it difficult to get accurate CPU usage. Here are the metric and the query I am using for scollector:
os_cpu, which returns: 1.54432653e+07
I do a query with rate:
rate(os_cpu{exported_instance="myHost"}[30s])
Here is the graph I have come up with from this query in Grafana
os_cpu returns the overall CPU usage, i.e. across all cores, and comparing this with Task Manager in Windows it does not add up, as Task Manager shows 100% max. It should not be possible to get 300% CPU usage.
What can I do with my query to get a more accurate measurement?
By now you might have found the answer, but anyway, this seems to be useful:
100 - (avg by (instance) (irate(windows_cpu_time_total{mode="idle", instance=~"$server"}[1m])) * 100)
from the Grafana dashboards library: https://grafana.com/grafana/dashboards/12566
If you have multiple cores, the usage can go above 100%.
I suggest you use
100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[5m])) * 100)
Here is a more detailed blog post about it
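If you want to keep the original os_cpu counter, a minimal sketch is to divide its rate by the number of cores so the result tops out at 100%. This assumes rate(os_cpu) already yields a percentage summed across cores (which the ~300% reading suggests) and that the host has 4 cores; replace 4 with the actual core count:
rate(os_cpu{exported_instance="myHost"}[5m]) / 4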
I'm trying to figure out the PromQL for an SLO for latency, where we want 90% of all requests to be served in 1000ms or less.
I can get the 90th percentile of requests with this:
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{instance="foo"}[1h]) ) )
And I can find what percentage of ALL requests were served in 1000ms or less with this:
((sum(rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]))) / (sum (rate(MyMetric_Request_Duration_count{instance="foo"}[1h])))) *100
Is it possible to combine these into one query that tells me what percentage of requests in the 90th percentile were served in 1000ms or less?
I tried the most obvious (to me anyway) solution, but got no data back.
histogram_quantile( 0.90, sum by (le) ( rate(MyMetric_Request_Duration_bucket{le="1000",instance="foo"}[1h]) ) )
The goal is to get a measure that shows: "For the 90th percentile of requests, how many of those requests were under 1000ms?" It seems like this should be simple, but I can't find a PromQL query that lets me do it.
Welcome to SO.
You are basically measuring what share of all requests is served under 1000ms. To find that, I would divide the number of requests served under 1000ms by the total number of requests. In my GCP world, that translates to a query like this:
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m]))/sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m])))*100
Once you have a graph set up with the above query in Grafana, you can set up an alert on anything below 93%; that way you are alerted even before you reach your SLO of 90%.
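For example, the alert condition could be the same ratio compared against a 93% threshold (the metric names, labels, and namespace here are the hypothetical ones from the query above):
(sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination",namespace="abcxyz",le="1000"}[1m])) / sum(rate(istio_request_duration_milliseconds_count{reporter="destination",namespace="abcxyz"}[1m]))) * 100 < 93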
Prometheus doesn't provide a function that can calculate the share (aka the percentage) of requests served in under one second from histogram buckets. But such a function exists in VictoriaMetrics, a Prometheus-like monitoring system I work on: histogram_share(). For example, the following query returns the share of requests with durations smaller than one second served during the last hour:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
Then the following query can be used for alerting when the share of requests served in less than one second drops below 90%:
histogram_share(1s, sum(rate(http_request_duration_seconds_bucket[1h])) by (le)) < 0.9
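For comparison, the same share can be approximated in plain Prometheus by dividing the bucket counter by the total count, as in the earlier answer; this sketch assumes the histogram has a bucket boundary at exactly one second (le="1"):
sum(rate(http_request_duration_seconds_bucket{le="1"}[1h])) / sum(rate(http_request_duration_seconds_count[1h])) < 0.9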
Please note that all functions that work over histogram buckets return estimated results. Their accuracy highly depends on the histogram buckets' boundaries. See this article for details.
I'm trying to find the execution time of GDS algorithms using the Community Edition of Neo4j. Is there any way to find it other than query logging, since that facility is specific to the Enterprise Edition?
Update:
I did as suggested. Why is the result 0 for computeMillis and preProcessingMillis?
Update 2:
The following table indicates the time in ms required for running Yen's algorithm to retrieve one path for each topology. However, the time does not depend on the graph size. Why? Is it normal to have such results?
When you are executing the mutate or the write mode of the algorithm, you can YIELD the computeMillis property, which tells you the execution time of the algorithm; see the example after the list below. Note that some algorithms like PageRank have more properties available to be YIELD-ed.
preProcessingMillis - Milliseconds for preprocessing the graph.
computeMillis - Milliseconds for running the algorithm.
postProcessingMillis - Milliseconds for computing the centralityDistribution.
writeMillis - Milliseconds for writing result data back.
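As a minimal sketch, assuming GDS 2.x and an already projected graph named 'myGraph' (both the graph name and the writeProperty are placeholders), the timing properties can be yielded from PageRank's write mode like this:
// each *Millis value is reported in milliseconds for that phase
CALL gds.pageRank.write('myGraph', { writeProperty: 'pagerank' })
YIELD preProcessingMillis, computeMillis, postProcessingMillis, writeMillis
RETURN preProcessingMillis, computeMillis, postProcessingMillis, writeMillis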
I have many InfluxDB continuous queries (CQs) used to downsample data over a period of time on several occasions. At one point, the load became high and InfluxDB ran out of memory while executing the continuous queries.
Say I have 10 CQs and all 10 execute in InfluxDB at the same time. That impacts memory heavily. I am not sure whether there is any way to evenly space them out or to add a delay so each CQ executes one by one. My speculation is that executing all the CQs at the same time makes InfluxDB crash. All the CQs are specified in the InfluxDB config. I hope there may be a way to include a time delay between the CQs in the InfluxDB config, but I don't know exactly how to do that. One sample CQ:
CREATE CONTINUOUS QUERY "cq_volume_reads" ON "metrics"
BEGIN
  SELECT sum(reads) AS reads INTO rollup1.tier_volume
  FROM "metrics".raw.tier_volume
  GROUP BY time(10m),*
END
Also, I don't know whether this is the best way to resolve the problem. Any thoughts on this approach, or suggestions for a better one, will be much appreciated. It would be great to get suggestions on debugging tools for InfluxDB as well. Thanks!
@Rajan - A few comments:
The canonical documentation for CQs is here. Much of what I'm suggesting is from there.
Are you using back-referencing? I see your example CQ uses GROUP BY time(10m),* - the * wildcard is usually used with backreferences. Otherwise, I don't believe you need to include the * to indicate grouping by all tags - it should already be grouped by all tags.
If you are using backreferences, that runs the CQ for each measurement in the metrics database. This is potentially very many CQ executions at the same time, especially if you have many CQs defined this way.
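For reference, a backreferenced CQ typically looks something like the sketch below; this is not your CQ, and the database and retention policy names are assumptions:
CREATE CONTINUOUS QUERY "cq_backref_example" ON "metrics"
BEGIN
  SELECT mean(*) INTO "metrics"."rollup1".:MEASUREMENT
  FROM /.*/
  GROUP BY time(10m),*
END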
You can set offsets with GROUP BY time(10m, <offset>) but this also impacts the time interval used for your aggregation function (sum in your example) so if your offset is 1 minute then timestamps will be a sum of data between e.g. 13:11->13:21 instead of 13:10 -> 13:20. This will offset execution but may not work for your downsampling use case. From a signal processing standpoint, a 1 minute offset wouldn't change the validity of the downsampled data, but it might produce unwanted graphical display problems depending on what you are doing. I do suggest trying this option.
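As a sketch based on the sample CQ above, with an assumed 1-minute offset (stagger each CQ with a different offset):
CREATE CONTINUOUS QUERY "cq_volume_reads_offset" ON "metrics"
BEGIN
  SELECT sum(reads) AS reads INTO rollup1.tier_volume
  FROM "metrics".raw.tier_volume
  GROUP BY time(10m, 1m),*
END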
Otherwise, you can try to reduce the number of downsampling CQs to reduce memory pressure, downsample on a larger timescale (e.g. 20m), or, lastly, increase the hardware resources available to InfluxDB.
For managing memory usage, look at this post. There are not many adjustments in 1.8 but there are some.
When I use cAdvisor to get information about the CPU in a Docker container, I get the following information:
My question is: how do I calculate the CPU usage and load from the information returned by cAdvisor so that it matches Prometheus? How does Prometheus calculate CPU usage?
The algorithm that Prometheus uses for rate() is a little intricate due to the handling of issues like alignment and counter resets, as explained in Counting with Prometheus.
The short version is to subtract the first value from the last value and divide by the time between them. It's probably easiest to use Prometheus rather than doing this yourself.
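As a minimal sketch, assuming the cAdvisor metrics are scraped by Prometheus and the container carries the label name="my-container" (a hypothetical name), a per-container CPU percentage could look like:
sum(rate(container_cpu_usage_seconds_total{name="my-container"}[5m])) * 100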
The query below should return the top 10 containers consuming the most CPU time:
topk(10, sum(irate(container_cpu_usage_seconds_total{container_label_com_docker_swarm_node_id=~".+", id=~"/docker/.*"}[$interval])) by (name)) * 100
I'm just testing InfluxDB 1.3.5 for storing a small number (~30-300) of very long integer series (worst case: 86400 * (12*365) [sec/day * (days/year * 12 years) * 1 device] = 378,432,000 points;
e.g. for 320 devices the total number of points would be 86400 * (12*365) * 320 = 121,098,240,000).
The series cardinality is low; it equals the number of devices. I'm using second-precision timestamps (that mode is enabled when I commit to InfluxDB via the PHP API).
Yes, I really need to keep all the samples, so downsampling is not an option.
I'm inserting the samples as point-arrays of size 86400 per request, sorted from oldest to newest. The behaviour is similar (OOM in both cases) for the inmem and tsi1 indexing modes.
Despite all that, I'm not able to insert this number of points into the database without crashing it due to out of memory. The host VM has 8 GiB of RAM and 4 GiB of swap, which fill up completely. I cannot find anything in the documentation indicating that this setup is problematic, or any notice that it should result in high RAM usage at all...
Does anyone have a hint on what could be wrong here?
Thanks and all the best!
b-
[ I asked the same question here but received no replies, that's the reason for crossposting: https://community.influxdata.com/t/ever-increasing-ram-usage-with-low-series-cardinality/2555 ]
I found out what the issue most likely was:
I had a bug in my feeder that caused timestamps not to be updated, so lots of points with distinct values were written over and over again to the same timestamp/tag combination.
If you experience something similar, try double-checking each step in the pipeline for a timestamp-related error.
Unfortunately, this was not the issue; the RAM usage still rises when importing more points than before.