I'm using Kapacitor to pre-process complex regex searches to optimise my Grafana rendering performance.
This is my Kapacitor script:
dbrp "telegraf"."autogen"
stream
|from()
.database('telegraf')
.measurement('access-log')
.where(lambda: ("request" =~ /\/service\/endpoint1.*/ OR "request" =~ /\/service\/.*\/endpoint1.*/ ))
|eval(lambda: 'endpoint1')
.as('service')
.keep('service','request')
|influxDBOut()
.database('telegraf')
.retentionPolicy('autogen')
.measurement('access-log')
.tag('kapacitor', 'true')
The issue is that when I check the database, Kapacitor has created new points for the processed entries instead of adding the new field to the existing points.
Is there any way to make Kapacitor enrich the existing data instead of duplicating it?
I have a CSV source with a large number of columns and I want to build two things:
1. Load the CSV into my Neo4j graph and turn every row into a node.
2. Build relationships between nodes based on cosine similarity (above some threshold alpha).
Here is what I have already done for (1):
WITH "https://drive.google.com/u/0/ucid=1FRi5YmWNQJZ2xeTKNO4b12xYGOTWk1jL&export=download" AS data_url
LOAD CSV WITH HEADERS FROM data_url AS requests
But it returns me an error "Query cannot conclude with LOAD CSV (must be RETURN or an update clause)"
Should I transform the data to be in a long format (with data_key and data_value columns) and use the following?:
// For each request (request_id), collect its attributes into key-value pairs
WITH requests.request_id AS request_id,
COLLECT([requests.data_key, requests.data_value]) AS keyValuePairs
WITH request_id,
apoc.map.fromPairs(keyValuePairs) AS map
// Each request converts to a node with its attributes:
MERGE (r:Requests {request_id:request_id})
SET r += map
// Show all nodes:
MATCH (n) RETURN n
You can simply return the rows with a RETURN statement:
WITH "https://drive.google.com/u/0/ucid=1FRi5YmWNQJZ2xeTKNO4b12xYGOTWk1jL&export=download" AS data_url LOAD CSV WITH HEADERS FROM data_url AS requests
RETURN * LIMIT 5
I don't know what's inside the CSV, but let's assume it's a bunch of columns with float values. With LOAD CSV you have to manually cast each value to a float.
WITH "https://drive.google.com/u/0/ucid=1FRi5YmWNQJZ2xeTKNO4b12xYGOTWk1jL&export=download" AS data_url LOAD CSV WITH HEADERS FROM data_url AS requests
CREATE (r:Request)
SET r.data = [toFloat(requests.column1), toFloat(requests.column2)]
Notice that I've stored an array of floats as the property.
This is due to the way the kNN algorithm in the GDS library works.
Next, it is advisable to use the Graph Data Science library, specifically the k-Nearest Neighbors (kNN) algorithm.
First, you have to project the in-memory graph.
CALL gds.graph.create('requests', 'Request', '*', {nodeProperties: ['data']})
Then run the kNN algorithm and write the results back to the database:
CALL gds.beta.knn.write('requests', {
nodeWeightProperty: 'data',
writeRelationshipType: 'SIMILAR',
writeProperty: 'score',
topK: 10
})
With the kNN algorithm you can't define a similarity threshold, only the topK neighbors. However, you could use the plain cosine similarity algorithm to define the similarity threshold.
MATCH (r:Request)
WITH {item:id(r), weights: r.data} AS userData
WITH collect(userData) AS data
CALL gds.alpha.similarity.cosine.write({
data: data,
similarityCutoff: 0.5
})
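To sanity-check the result, a quick Cypher query could be run afterwards (assuming the relationships were written with the SIMILAR type and score property used in the examples above):
// inspect the strongest similarity relationships written by kNN / cosine similarity
MATCH (a:Request)-[s:SIMILAR]->(b:Request)
RETURN id(a) AS from, id(b) AS to, s.score AS score
ORDER BY score DESC
LIMIT 10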
In a simple TICKscript, how can I query points and edit some key/values?
I have this TICKscript:
var data = batch
|query(''' SELECT * FROM "telegraf"."autogen"."cpu" ''')
.period(5m)
.every(10s)
.groupBy(*)
|influxDBOut()
.database('telegraf')
.retentionPolicy('autogen')
.measurement('modified_data')
That queries some data; I want to change the CPU field on each point and add 5 to its value.
How can I do that?
Thanks,
Dave.
Normally, you would change the fields inside the CPU measurement.
For example, let's say your CPU measurement contains a field named time_idle; then you just have to insert an eval node before the output node.
var data = batch
|query(''' SELECT * FROM "telegraf"."autogen"."cpu" ''')
.period(5m)
.every(10s)
.groupBy(*)
|eval(lambda: "time_idle" + 5)
.as('time_idle_plus_5')
|influxDBOut()
.database('telegraf')
.retentionPolicy('autogen')
.measurement('modified_data')
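Note that by default the eval node keeps only the fields it evaluates, so the points written above would contain just time_idle_plus_5. If the goal is to overwrite the original field and keep the rest of the point, a minimal variation (same hypothetical time_idle field) would be:
// keep() with no arguments retains all existing fields alongside the evaluated one
|eval(lambda: "time_idle" + 5)
.as('time_idle')
.keep()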
It would be a good idea to read more about the eval node, and about TICKscript nodes in general, in the Kapacitor documentation.
Good afternoon,
I have created the following TICKscript with a standard TICK stack setup, which includes InfluxDB (latest version) and Kapacitor (latest version):
dbrp "derivatives"."default"
var data = batch
|query('select sum(value) from "derivatives"."default".derivative_test where time > now() - 10m')
.every(1m)
.period(2m)
var slope = data
|derivative('value')
.as('slope')
.unit(2m)
slope
|eval(lambda: ("slope" - "value") / "value")
.as('percentage')
|alert()
.crit(lambda: "percentage" <= -50)
.id('derivative_test_crit')
.message('{{ .Level }}: DERIVATIVE FOUND!')
.topic('derivative')
// DEBUGGING
|influxDBOut()
.database('derivatives')
.measurement('derivative_logs')
.tag('sum', 'sum')
.tag('slope', 'slope')
.tag('percentage', 'percentage')
But every time I try to define it I get the following message:
batch query is not allowed to request data from "derivatives"."autogen"
I never had this problem before with streams, but every batch TICKscript I write returns the same message.
My Kapacitor user has full admin privileges and I am able to get the data via a curl request. Does anyone have any idea what could possibly be the problem here?
My thanks in advance.
Change this
dbrp "derivatives"."default"
var data = batch
|query('select sum(value) from "derivatives"."default".derivative_test where time > now() - 10m')
to this:
dbrp "derivatives"."autogen"
var data = batch
|query('select sum(value) from "derivatives"."autogen".derivative_test where time > now() - 10m')
It might not be obvious, but the retention policy is most likely incorrect.
If you run SHOW RETENTION POLICIES on the derivatives database you will see the RPs. I suspect you have an RP of autogen, which is the default RP. However, "default" doesn't normally exist as an RP unless you create it; it just signifies which RP is the default, if that makes sense?
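For reference, the check from the influx CLI would be (output depends on your setup; on a stock install you would typically see only autogen, marked as the default):
SHOW RETENTION POLICIES ON "derivatives"
If "default" does not appear in that list, both the dbrp line and the query need to reference an RP that actually exists, such as autogen.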
The retention policy documentation in the InfluxDB database docs might help clear it up.
I’m trying to aggregate data over the previous week hourly, and compute summary statistics like median, mean, etc.
I’m using something like this:
var weekly_median = batch
|query('''SELECT median("duration") as week_median
FROM "db"."default"."profiling_metrics"''')
.period(1w)
.every(1h)
.groupBy(*)
.align()
|influxDBOut()
.database('default')
.measurement('summary_metrics')
The query works as expected, except that when recording and replaying data to test with
kapacitor record batch -task medians -past 30d
kapacitor replay -task medians -recording $rid -rec-time
the data is missing for the last period (1 week in this case). If I change the period to 1 day, all data is replayed except the final day’s worth.
Is this a problem with my tickscript, the way I’m recording data, or the way I’m replaying it?
I see, I need to do the aggregation in Kapacitor, not Influx. This seems to be a known issue, but finding documentation on it was tricky. https://github.com/influxdata/kapacitor/issues/1257 and https://github.com/influxdata/kapacitor/issues/1258 were helpful. The solution is to instead do something like:
var weekly_median = batch
|query('''SELECT "duration"
FROM "db"."default"."profiling_metrics"
WHERE "result" =~ /passed/''')
.period(1w)
.every(1h)
.groupBy(*)
.align()
|median('duration')
.as('week_median')
|influxDBOut()
.database('default')
.measurement('summary_metrics')
I've just started learning about Rails security, and I'm wondering how I can avoid security issues while allowing users to upload CSV files into our database. We're using Postgres' "copy from stdin" functionality to upload the data from the CSV into a temp table, which is then used for upserts into another table. This is the basic code (thanks to this post):
conn = ActiveRecord::Base.connection_pool.checkout
raw = conn.raw_connection
raw.exec("COPY temp_table (col1, col2) FROM STDIN DELIMITER '|'")
# read column values from the CSV line by line in the following format:
# attributes = {column_1: 'column 1 data', column_2: 'column 2 data'}
# line = "#{attributes.values.join('|')}\n"
raw.put_copy_data line
raw.put_copy_end
# wrap up copy process & insert into & update primary table
ActiveRecord::Base.connection_pool.checkin(conn)
I am wondering what I can or should do to sanitize the column values. We're using Rails 3.2 and Postgres 9.2.
No action is required; COPY never interprets the values as SQL syntax. Malformed CSV will produce an error due to bad quoting / incorrect column count. If you're sending your own data line-by-line you should probably exclude a line containing a single \. followed by a newline, but otherwise it's rather safe.
PostgreSQL doesn't sanitize the data in any way, it just handles it safely. So if you accept a string ');DROP TABLE customer;-- in your CSV it's quite safe in COPY. However, if your application reads that out of the database, assumes that "because it came from the database not the user it's safe," and interpolates it into an SQL string you're still just as stuffed.
Similarly, incorrect use of PL/pgSQL functions where EXECUTE is used with unsafe string concatenation will create problems. You must use format with the %I or %L specifiers, use quote_literal / quote_ident, or (for literals) use EXECUTE ... USING.
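For illustration, a hedged sketch of the difference inside a PL/pgSQL function body (the customer table, name column, and some_value variable are hypothetical):
-- UNSAFE: splicing a value read back from the database into dynamic SQL re-opens the injection hole
EXECUTE 'INSERT INTO customer(name) VALUES (''' || some_value || ''')';
-- SAFE: format() with %L quotes the value as a literal
EXECUTE format('INSERT INTO customer(name) VALUES (%L)', some_value);
-- SAFE: pass the value as a parameter with USING
EXECUTE 'INSERT INTO customer(name) VALUES ($1)' USING some_value;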
This is not just true of COPY, it's the same if you do an INSERT of the manipulated data then use it unsafely after reading it back from the DB.