I have a CSV file containing time-stamped events, and I want to integrate the external time stamps in Esper pattern.
I know how to use the window ext_timed.
For example this works:
select * from stream.win:_ext_timed(timestamps, 5 sec).
But I don't know how to use the external times inside patterns.
For example in the following query the engine internal time is used. I want to use external times with the within guard.
select * from pattern [ a=stream -> stream where timer:within(5 sec) ]
The API for advancing time is in [1].
The EsperIO CSV adapter does pretty much the same thing that you are planning to do and you could look at its source code. It will use the same API [1].
[1] http://www.espertech.com/esper/release-5.2.0/esper-reference/html_single/index.html#api-controlling-time
Related
I'm trying to figure out the best or a reasonable approach to defining alerts in InfluxDB. For example, I might use the CPU batch tickscript that comes with telegraf. This could be setup as a global monitor/alert for all hosts being monitored by telegraf.
What is the approach when you want to deviate from the above setup for a host, ie instead of X% for a specific server we want to alert on Y%?
I'm happy that a distinct tickscript could be created for the custom values but how do I go about excluding the host from the original 'global' one?
This is a simple scenario but this needs to meet the needs of 10,000 hosts of which there will be 100s of exceptions and this will also encompass 10s/100s of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a tick script
var data = stream
|from()
.database(db)
.retentionPolicy(rp)
.measurement(measurement)
.groupBy(groupBy)
.where(whereFilter)
|eval(lambda: "numMeasurements")
.as('value')
var customized = data
|sideload()
.source('file:///etc/kapacitor/customizations/demo/')
.order('hosts/host-{{.hostname}}.yaml')
.field('maxNumMeasurements',100)
|log()
var trigger = customized
|alert()
.crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows
maxNumMeasurements: 10
A critical alert will be triggered if value and hence numMeasurements will exceed 10 AND the hostname tag equals influxdb OR if value exceeds 100.
There is an example in the documentation handling scheduled downtimes using sideload
Furthermore, I have created an example available on github using docker-compose
Note that there is a caveat with the example: The alert flaps because of a second database dynamically generated. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually directly in Chronograph/Kapacitor is not feasible for big number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, customer_objects. The number can go into the 1000s. We've opted for a custom solution where keep a standard set of template tickscripts (not to be confused with Kapacitor templates), and we provide an interface to the user where only expose relevant variables. After that a service (written in python) combines the values for those variables with a tickscript and using the Kapacitor API deploys (updates, or deletes) the task on the Kapacitor server. This is then automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
I'm collecting measurements with a Telegraf client. Unfortunately, the measurement name isn't static. Rather, it encodes a timestamp (terrible design choice, but out of my hands) as part of its name.
For example, the following 3 lines represent 3 instances of the same measurement, but have different names:
info.quorum.2902864.agree: 6
info.quorum.2902865.agree: 6
info.quorum.2902866.agree: 5
...
is there a way to transform these measurement names into one static name? In other words, I'd like to transform these entries above to:
info.quorum.hello.agree: 6
info.quorum.hello.agree: 6
info.quorum.hello.agree: 5
I saw the rename processor (https://github.com/influxdata/telegraf/tree/master/plugins/processors/rename) - but that doesn't support wildcard.
I also saw the regex processor (https://github.com/influxdata/telegraf/tree/master/plugins/processors/regex) but that doesn't support measurement names.
any ideas on how to get this done?
EDIT: some background: the measurements are collected with http input, usin a GJSON path like a.b.*.c
EDIT2: here's what i'm trying to parse. the problem is with the key '2931747', which grows on each subsequent reading:
"quorum" : {
"2931747" : {
"agree" : 8,
"disagree" : 0,
}
So they put actual value there as the key... Downright, eh, unsmart, let's put it this way.
And I won't blame the writers of JSON format parser for not putting a handle for that situation.
So, the answer is: in current form, with HTTP plugin, available parsers & processors - there's no way to shape it in any proper form (unless you can drop that damned number-key completely - then it's trivial).
I would suggest you to push on data providers to make them stop this stupidity.
If that is not an option - you need to write your own processor for that, alas.
It could be either fully standalone (poll the http endpoint, parse the stuff, form a batch of line protocol records, send to influx) - or it could cut it off on producing lines in Influx line protocol as the its output, and be executed with Exec input plugin
I have the current use case:
We have a system that computes different response time metrics for messages that we want to insert in InfluxDB. This system writes JSON entries to a file.
We use telegraf with JSON plugin to extract the fields we want and insert into InfluxDB.
So far so good.
But we have an issue with 1 particular information.
The system will emit messages where mId is the Unique identifier, in the below examples we have 2 uuidXXXX and uuidYYYY:
{“meta1”:“value”, “mId”:“uuidXXXX”, “resTime1”:1232332233, “timeWeEnterBus”:startTimestamp}
{“meta1”:“value2”, “mId”:“uuidYYYY”, “resTime1”:1232331111, “timeWeEnterBus”:startTimestamp}
{“meta1”:“value”, “mId”:“uuidXXXX”, “resTime1”:1232332233, “timeWeExitBus”:endTimestamp}
{“meta1”:“value2”, “mId”:“uuidYYYY”, “resTime1”:1232331111, “timeWeEnterBus”:startTimestamp}
And what we want here is to graph the timeInBus which is equal to “timeWeExitBus-timeWeEnterBus” for each unique mId.
So my questions are:
IMU, uuid would be a field not a tag as it is unlimited, same for timeWeExitBus and timeWeEnterBus which would be numeric fields since we want to use functions on them. And timeInBus would be the measurement. Am I right ?
Is this use case a good one for Influx / Telegraf or are we misusing it for this ? IMU, it doesn’t look like a good use case to try to compute this on telegraf side, but I don’t see how to do it in InfluxDB, I initially thought ELAPSED function could help but I end up thinking it doesn’t work here
If it’s a good use case, could you point me to documentation helping implementing this ?
I'm reaching you hoping to find answers about Pentaho data integrator limitation.
I'm currentlty working on a 1 to 1 data source integration and would like to make it n to 1-n. This requires dynamic jobs creation and would like to know if any of came across such issue. My 1 to 1 is working perfectly, it integration form differents data source types (CSV, databases "Mysql, Oracle ...) to same date destination and need to make it n to 1-n.
There is a Metadata Injection Step just for that.
A use case similar to yours is described by Diethard here.
Because it seams that you have a lot of different source format, it may be a good investment to read the use case of Jens, the author of the step, here, which (apart for the automation) is precisely your case.
AFAIK in Pentaho DI, it is not possible to create dynamic transformations for any random data sources. PDI looks for the input columns to be available in the input stream before it loads the data to the target database. For example, if you are using 1 data source (in MySQL) and loading the same to the csv output, the csv output step is expecting the presence of input columns in the data source step (Table input). If you are trying to load any n random data sources you need to define input columns/fields for each of them individually.
Alternatively there are few things which you can explore:
1. Fast Dump in Text File Output step:
There is an option to fast data dump the data set in Text file output step. Here you don't need to define any output column. The input fields will be automatically dumped without formatting as it is. You can use this to map all of the input sources to a csv format and then load it to their targets.
2. Extending Java and Kettle together to build a solution:
PDI allows you to create custom JAVA codes on top of kettle. You can check this blog for more. You can use this idea to create custom code to pass n data sources fields to the kettle as a parameter and execute them. {note: i haven't tried this step, just thinking out loud here}
Hope this helps :)
I'm trying to write a Rails action to stream data where the resulting CSV / XML / JSON file is much larger than the memory limit for the web server. The tricky part is that each item in the dataset is composed from two sources. One is a Postgres DB where I plan to open a CURSOR (or just use id > Y LIMIT X) to batch process the data. The latter is a custom data store but there is basically a cursor object I can use to batch that as well.
My problem is I'm not sure what the best way to iterate over the second data source is. I imagine I'll need a structure to open the cursor and as I consume the data in each batch I'll load the next batch.
This problem seems like it might have been solved already so I'm hoping there's an established pattern I can use.