Prometheus returns nothing after a while - monitoring

We are using Prometheus and Grafana for our monitoring, and we have a panel for response time. However, I noticed that after a while the metrics go missing and there are lots of gaps in the panel (only in the response time panel), and they come back as soon as I restart the app (redeploy it in OpenShift). The service is written in Go and the logic for gathering the response time is quite simple.
We declared the metric:
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    responseTime = promauto.NewSummaryVec(prometheus.SummaryOpts{
        Namespace: "app",
        Subsystem: "rest",
        Name:      "response_time",
    }, []string{
        "path",
        "code",
        "method",
    })
)
and record it in our handler:
func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    // do stuff
    // ....
    code := "200"
    path := r.URL.Path
    method := r.Method
    elapsed := float64(time.Since(start)) / float64(time.Second)
    responseTime.WithLabelValues(path, code, method).Observe(elapsed)
}
and the query in the Grafana panel looks like this:
sum(rate(app_rest_response_time_sum{path='/v4/content'}[5m]) /
rate(app_rest_response_time_count{path='/v4/content'}[5m])) by (path)
but the result still shows the gaps described above!
Can anyone explain what we are doing wrong or how to fix this issue? Is it possible that we are facing some kind of overflow issue (the average RPS is about 250)? I suspect this because it happens more often on the routes with higher RPS and response times.

Normally Prometheus records metrics continuously, and when you query it, it returns all the metrics it collected for the time range you queried.
If there is no metric when you query, there are typically three reasons:
The metric was not there. This happens when the instance restarts, you have a dynamic set of labels, and there has not yet been a request for the label value you queried (in your case, no request yet for path='/v4/content'). In that case you should still see other metrics of the same job (at least up). Pre-initializing the label combinations you dashboard on avoids this; see the sketch below.
Prometheus had problems storing the metrics (check the Prometheus log files for that timeframe).
Prometheus was down for that timeframe and therefore did not collect any metrics (in that case you should have no metrics at all for that timeframe).
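A minimal sketch of that pre-initialization, assuming the responseTime SummaryVec declared in the question; the paths, codes, and methods listed here are only illustrative:
func init() {
    // Touch the label combinations the dashboards query so the series exist
    // (with a zero count) right after a restart, before any request arrives.
    for _, path := range []string{"/v4/content"} {
        for _, method := range []string{"GET", "POST"} {
            responseTime.WithLabelValues(path, "200", method)
        }
    }
}
Note that this only addresses the first case (series missing after a restart); it does not help if Prometheus failed to scrape or store the data.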

Related

Breaking down statistics at API level or Flow level in K6?

I was taking a quick look at k6 from Load Impact.
The graphs that I have so far show TPS, response time, and error rates at the global level, and that is not too useful.
When I load test, I'd rather have those stats not only at the global level, but also at the flow level or at the API level. That way, if for example I see some high latency, I can tell right away whether it is caused by a single API or whether all APIs are slow.
Or I can tell whether a given API is returning, say, HTTP 500, or whether several different APIs are.
Can k6 show stats like TPS, response time, and HTTP status at the API level, the flow level, and the global level?
Thanks
Yes, it can, and you have 3 options here in terms of result presentation (all involve using custom metrics to some degree):
End of test summary printed to stdout.
You output the result data to InfluxDB+Grafana.
You output the result data to Load Impact Insights.
You get global stats with all three of the above, and per-API-endpoint stats out of the box with 2) and 3), but to get stats at the flow level you'd need to create custom metrics, which works with all three options. So something like this:
import http from "k6/http";
import { Trend, Rate } from "k6/metrics";
import { group, sleep } from "k6";

export let options = {
    stages: [
        { target: 10, duration: "30s" }
    ]
};

var flow1RespTime = new Trend("flow_1_resp_time");
var flow1TPS = new Rate("flow_1_tps");
var flow1FailureRate = new Rate("flow_1_failure_rate");

export default function() {
    group("Flow 1", function() {
        let res = http.get("https://test.loadimpact.com/");
        flow1RespTime.add(res.timings.duration);
        flow1TPS.add(1);
        flow1FailureRate.add(res.status == 0 || res.status > 399);
    });
    sleep(3);
};
This would expand the end-of-test summary stats printed to stdout to include the custom metrics.

New Relic alert when application stops

I have an application deployed on PCF with a New Relic service bound to it. In New Relic I want to get an alert when my application is stopped. I don't know whether this is possible or not. If it is possible, can someone tell me how?
Edit: I don't have access to New Relic Infrastructure
Although an 'app not reporting' alert condition is not built into New Relic Alerts, it's possible to rig one using NRQL alerts. Here are the steps:
Go to New Relic Alerts and begin creating a NRQL alert condition:
NRQL alert conditions
Query your app with:
SELECT count(*) FROM Transaction WHERE appName = 'foo'
Set your threshold to:
Static
sum of query results is below x
at least once in y minutes
The query runs once per minute. If the app stops reporting, count() turns the null values into 0, and we then sum them. When that number drops below your threshold, you get a notification. I recommend using the preview graph to determine how low you want your transaction count to get before receiving a notification. Here's some good information:
Relic Solution: NRQL alerting with “sum of the query results”
Basically you need to create a New Relic alert with conditions that check whether the application is available. Specifically, you can use the Host not reporting alert condition.
The Host not reporting event triggers when data from the Infrastructure agent does not reach the New Relic collector within the time frame you specify.
You could do something like this:
// ...
aggregation_method = "cadence" // Use cadence for process monitoring, otherwise it might not alert
// ...
nrql {
  // Limitation: only works for processes with ONE instance; otherwise use just uniqueCount() and set a LoS (loss of signal)
  query = "SELECT filter(uniqueCount(hostname), WHERE processDisplayName LIKE 'cdpmgr') OR -1 FROM ProcessSample WHERE GENERIC_CONDITIONS FACET hostname, entityGuid as 'entity.guid'"
}
critical {
  operator              = "below"
  threshold             = 0
  threshold_duration    = 5*60
  threshold_occurrences = "ALL"
}
Previous solution (it turned out not to be that robust):
// ...
critical {
  operator              = "below"
  threshold             = 0.0001
  threshold_duration    = 600
  threshold_occurrences = "ALL"
}
nrql {
  query = "SELECT percentage(uniqueCount(entityAndPid), WHERE commandLine LIKE 'yourExecutable.exe') FROM ProcessSample FACET hostname"
}
This calculates the percentage your process represents out of all processes on the host.
If the process is not running, the percentage drops to 0. On a system running a vast number of processes it could fall below 0.0001, but that is very improbable.
The advantage here is that the alert stays active even if the process slips out of your current alert time window after it stopped. This prevents the alert from auto-recovering (compared to just filtering with WHERE).

BigQueryIO loads not offloading rows to GCS when early trigger occurs

I'm playing around with BigQueryIO write using loads. My load trigger is set to 18 hours. I'm ingesting data from Kafka with a fixed daily window.
Based on https://github.com/apache/beam/blob/v2.2.0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L213-L231, it seems that the intended behavior is to offload rows to the filesystem when at least 500k records are in a pane.
I managed to produce ~600K records and waited around 2 hours to see whether the rows were uploaded to GCS; however, nothing was there. I noticed that the "GroupByDestination" step in "BatchLoads" shows 0 under "Output collections" size.
When I use a smaller load trigger, everything seems fine. Shouldn't AfterPane.elementCountAtLeast(FILE_TRIGGERING_RECORD_COUNT) be triggered?
Here is the code for writing to BigQuery:
BigQueryIO
  .writeTableRows()
  .to(new SerializableFunction[ValueInSingleWindow[TableRow], TableDestination]() {
    override def apply(input: ValueInSingleWindow[TableRow]): TableDestination = {
      val startWindow = input.getWindow.asInstanceOf[IntervalWindow].start()
      val dayPartition = DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC).print(startWindow)
      new TableDestination("myproject_id:mydataset_id.table$" + dayPartition, null)
    }
  })
  .withMethod(Method.FILE_LOADS)
  .withCreateDisposition(CreateDisposition.CREATE_NEVER)
  .withWriteDisposition(WriteDisposition.WRITE_APPEND)
  .withSchema(BigQueryUtils.schemaOf[MySchema])
  .withTriggeringFrequency(Duration.standardHours(18))
  .withNumFileShards(10)
The job id is 2018-02-16_14_34_54-7547662103968451637. Thanks in advance.
Panes are per key per window, and BigQueryIO.write() with dynamic destinations uses the destination as the key under the hood, so the "500k elements in a pane" threshold applies per destination per window.

How can I track the current number of viewers of an item?

I have an iPhone app where I want to show how many people are currently viewing an item.
I'm doing that by running this transaction when people enter a view (RubyMotion code below, but it functions exactly like the Firebase iOS SDK):
listing_reference[:listings][self.id][:viewing_amount].transaction do |data|
  data.value = data.value.to_i + 1
  FTransactionResult.successWithValue(data)
end
And when they exit the view:
listing_reference[:listings][self.id][:viewing_amount].transaction do |data|
  data.value = data.value.to_i - 1
  FTransactionResult.successWithValue(data)
end
It works fine most of the time, but sometimes things go wrong: the app crashes, people lose connectivity, or similar.
I've been looking at "onDisconnect" to solve this - https://firebase.google.com/docs/reference/ios/firebasedatabase/interface_f_i_r_database_reference#method-detail - but from what I can see, there's no "onDisconnectRunTransaction".
How can I make sure that the viewing amount on the listing gets decremented no matter what?
A Firebase Database transaction runs as a compare-and-set operation: given the current value of a node, your code specifies the new value. This requires at least one round-trip between the client and server, which means that it is inherently unsuitable for onDisconnect() operations.
The onDisconnect() handler is instead a simple set() operation: you specify, when you attach the handler, what write operation you want to happen when the server detects that the client has disconnected (either cleanly or, as in your problem case, involuntarily).
The solution is (as is often the case with NoSQL databases) to use a data model that deals with the situation gracefully. In your case it seems most natural to not store the count of viewers, but instead the uid of each viewer:
itemViewers
  $itemId
    uid_1: true
    uid_2: true
    uid_3: true
Now you can get the number of viewers with a simple value listener:
ref.child('itemViewers').child(itemId).on('value', function(snapshot) {
  console.log(snapshot.numChildren());
});
And use the following onDisconnect() to clean up:
ref.child('itemViewers').child(itemId).child(authData.uid).onDisconnect().remove();
Both code snippets are in JavaScript syntax, because I only noticed you're using Swift after typing them.

Gumbo Apache Storm Monitoring Metrics

I am trying to configure Gumbo for Storm topology monitoring from here.
There is no clear example or usage given on the site, so I need some clarification on what the parameters are and where to add this code given on the site above:
MonitorClient mclient = MonitorClient.forConfig(conf);
// There are multiple metric groups, each with multiple metrics.
// Components have names and multiple instances, each of which has an integer ID
mclient.declare(metricGroup,metric,task_id,component_id);
mclient.increment(metricGroup,metric, 1L , task_id);
TaskHook.registerTo(config);
Now, what values do we need to provide for metricGroup, metric, task_id, and component_id? If we need to find them from each Spout and Bolt, how can we do that? Where should this code be placed: in the topology builder before submitting the topology, in the individual Spout/Bolt classes under the open/prepare methods, or somewhere else? I appreciate any help on this question.
I tried a few options and below is the config that worked for me. The group name can be anything, the metric name is the name of the stream going out from one component to another, and the task id can be any unique task number:
conf.put("gumbo.server.kind", "local");
conf.put("gumbo.local.port", 8086); //Any port it must be same in the html file
conf.put("gumbo.start", System.currentTimeMillis()); // should be the same for all calls
conf.put("gumbo.bucketSize", 1000L);
conf.put("gumbo.enabled", true);
conf.put("gumbo.http.host", "hostname");
conf.put("gumbo.http.port", 8086);//Any port it must be same in the html file
conf.put("gumbo.http.app", "gumbo");
conf.put("gumbo.enabled", true);
conf.put("gumbo.server.key", topology_id);
MonitorClient mclient = MonitorClient.connect(conf);
GumboTaskHook.registerTo(conf);
mclient.declare("Backlog",RTConstants.MATCH_LEFT_STREAM,3,RTConstants.TRANSFORM_LEFT_BOLT);
mclient.increment("Backlog",RTConstants.MATCH_LEFT_STREAM, 1L , 3);
mclient.declare("Backlog",RTConstants.MATCH_RIGHT_STREAM,4,RTConstants.TRANSFORM_RIGHT_BOLT);
mclient.increment("Backlog",RTConstants.MATCH_RIGHT_STREAM, 1L , 4);
