Kapacitor task with multiple percentiles - influxdb

I want to aggregate the last minute of data from Telegraf with Kapacitor before writing it to InfluxDB, and I also need to calculate a few percentiles. So I wrote a simple TICKscript as a test:
var firstPerc = stream
|from()
.measurement('my_tmp_measurement_from_telegraf')
var secondPerc = stream
|from()
.measurement('my_tmp_measurement_from_telegraf')
firstPerc
|join(secondPerc)
.as('fp', 'sp')
|percentile('fp.myAggVal', 50.0)
|eval(lambda: "percentile")
.as('50p')
|percentile('sp.myAggVal', 90.0)
|eval(lambda: "percentile")
.as('90p')
|window()
.period(60s)
.every(60s)
.align()
|influxDBOut()
.database('myDBInInflux')
.retentionPolicy('autogen')
In my database I only have values for the 50th percentile. I am not surprised by that, since I use "percentile" in both evals, but I still cannot find any clue in the Kapacitor documentation about how to get the result I need.
Here is the "visual" result I am after:
time 50p 90p someOtherP's otherDataICanPropablyHandle
Halp!

You are reading from the same measurement stream (and the same data in it) twice, so the data is consumed. First, save the measurement stream in a variable:
var myStream = stream
|from()
.measurement('my_tmp_measurement_from_telegraf')
Next, define the streams using the saved measurement. This is where you should define proper grouping, evaluations, etc.:
var firstPerc = myStream
|percentile('myAggVal', 50.0)
|eval(lambda: "percentile")
.as('percentile')
|window()
.period(60s)
.every(60s)
.align()
var secondPerc = myStream
|percentile('myAggVal', 90.0)
|eval(lambda: "percentile")
.as('percentile')
|window()
.period(60s)
.every(60s)
.align()
Finally, define the joined stream:
var joinedStreams = firstPerc
|join(secondPerc)
.as('50', '90')
.tolerance(1s)
.streamName('measurementName')
|influxDBOut()
.database('myDBInInflux')
.retentionPolicy('autogen')
.create()
The output:
time 50.percentile 90.percentile
I strongly suggest using .tolerance(), which joins points whose timestamps fall within the same tolerance period.
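To make the target output (one row per window carrying both percentiles) concrete, here is a minimal Python sketch of the same idea outside Kapacitor. It assumes timestamps in Unix seconds and a nearest-rank percentile; the function names are hypothetical, and Kapacitor's own percentile implementation may differ in detail.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    # Multiply before dividing so that e.g. 90 * 10 / 100 is exactly 9.
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


def percentiles_per_window(points, period_s=60, percentiles=(50.0, 90.0)):
    """Group (unix_seconds, value) points into aligned windows and return
    one row per window that carries every requested percentile."""
    windows = defaultdict(list)
    for ts, value in points:
        windows[ts - ts % period_s].append(value)

    rows = []
    for start in sorted(windows):
        row = {"time": start}
        for p in percentiles:
            row[f"{p:g}p"] = nearest_rank_percentile(windows[start], p)
        rows.append(row)
    return rows


# Hypothetical sample: one point every 6 seconds over two minutes.
sample = [(t, float(t % 60 + 1)) for t in range(0, 120, 6)]
for row in percentiles_per_window(sample):
    print(row)
```

Each output row has the shape `time, 50p, 90p`, which is what the join in the TICKscript produces as `50.percentile` and `90.percentile`.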

Related

How to get VWAP using DolphinDB TimeSeriesEngine or ReactiveStateEngine

I am receiving live tick data consisting of Time, Symbol Name, Last Traded Price, and Cumulative Volume (Daily).
How can I compute VWAP using (1) a custom function, (2) TimeSeriesEngine, or (3) ReactiveStateEngine in DolphinDB? Please help. The relevant code is below.
This is the stream table that receives ticks from Python
t_colNames=`ts`symbol`price`vol`upd_tick
t_colTypes=`TIMESTAMP`SYMBOL`DOUBLE`DOUBLE`TIMESTAMP
This is stream table to store 1 min OHLC data
ohlc_colNames=`ts`symbol`open`high`low`close`volume`tp`last_tick`upd_1m
ohlc_colTypes=`TIMESTAMP`SYMBOL`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`DOUBLE`TIMESTAMP`TIMESTAMP
This is 1 min OHLC TimeSeriesEngine
OHLC_sm1 = createTimeSeriesEngine(name="OHLC_sm1", windowSize=60000, step=60000, metrics=<[first(price) as open, max(price) as high, min(price) as low, last(price) as close, sum(vol) as volume, (max(price)+min(price)+last(price))/3 as tp, last(upd_tick) as last_tick, now() as upd_1m]>, dummyTable=tmp, outputTable=sm1 , timeColumn=`ts, useSystemTime=true, keyColumn=`symbol, updateTime=60000, useWindowStartTime=false);
This is the function to convert cumulative volume to volume
def calcVolume(mutable dictVolume, mutable tsAggrOHLC, msg){
t = select ts,symbol,price,vol,upd_tick from msg context by symbol limit -1
update t set prevVolume = dictVolume[symbol]
dictVolume[t.symbol] = t.vol
tsAggrOHLC.append!(t.update!("vol", <vol-prevVolume>))
}
dictVol = dict(STRING, DOUBLE)
subscribeTable(tableName="t", actionName="OHLC_sm1", offset=0, handler=calcVolume{dictVol,OHLC_sm1}, msgAsTable=true, hash=1)
I recommend using ReactiveStateEngine to convert cumulative volume to volume and then connecting two engines in series. Here is an example:
tradesData = your_tick_data
//define Trade Table
x=tradesData.schema().colDefs
share streamTable(100:0, x.name, x.typeString) as Trade
//define OHLC outputTable
share streamTable(100:0, `datetime`symbol`open`high`low`close`volume`updatetime,[TIMESTAMP,SYMBOL,DOUBLE,DOUBLE,DOUBLE,DOUBLE,LONG,TIMESTAMP]) as OHLC
//1 min OHLC TimeSeriesEngine
tsAggrOHLC = createTimeSeriesAggregator(name="aggr_ohlc", windowSize=60000, step=60000, metrics=<[first(Price),max(Price),min(Price),last(Price),wavg(Price,Volume),now()]>, dummyTable=Trade, outputTable=OHLC, timeColumn=`Datetime, keyColumn=`Symbol)
//ReactiveStateEngine:convert cumulative volume to volume
rsAggrOHLC = createReactiveStateEngine(name="calc_vol", metrics=<[Datetime, Price, deltas(Volume) as Volume]>, dummyTable=Trade, outputTable=tsAggrOHLC, keyColumn=`Symbol)
//subscribe table and insert data into engines
subscribeTable(tableName="Trade", actionName="minuteOHLC2", offset=0, handler=append!{rsAggrOHLC}, msgAsTable=true)
replay(inputTables=tradesData, outputTables=Trade, dateColumn=`Datetime)
You can use user-defined functions in any of the engine's metrics.
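The two-stage idea above (first turn cumulative volume into per-tick volume, then take a volume-weighted average of price per one-minute window) can be sketched in plain Python. This is an illustration of the arithmetic only, not DolphinDB code; the function name and the tuple layout are assumptions, and the first tick of each symbol is skipped because no previous cumulative value exists for it.

```python
from collections import defaultdict


def vwap_per_minute(ticks):
    """ticks: time-ordered iterable of (epoch_ms, symbol, price, cumulative_volume).
    Returns {(minute_start_ms, symbol): vwap}."""
    last_cumulative = {}
    price_volume = defaultdict(float)  # sum of price * traded volume per window
    volume = defaultdict(float)        # sum of traded volume per window

    for ts, symbol, price, cumulative in ticks:
        previous = last_cumulative.get(symbol)
        last_cumulative[symbol] = cumulative
        if previous is None:
            continue  # no delta can be computed for the first tick of a symbol
        traded = cumulative - previous
        key = (ts - ts % 60_000, symbol)
        price_volume[key] += price * traded
        volume[key] += traded

    return {k: price_volume[k] / volume[k] for k in price_volume if volume[k] > 0}


# Hypothetical ticks for one symbol.
ticks = [
    (0, "A", 10.0, 100),
    (10_000, "A", 11.0, 150),
    (20_000, "A", 12.0, 250),
    (70_000, "A", 20.0, 300),
]
print(vwap_per_minute(ticks))
```

Stage one corresponds to `deltas(Volume)` in the reactive state engine and stage two to `wavg(Price, Volume)` in the time-series engine.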

How to see failed machine learning records

I am using the following code to create my machine learning model. The accuracy of the model is 0.76. I am just curious to know which records from my test data failed? Is there a way I can see those data?
// 1. Load the dataset for training and testing
var trainData = ctx.Data.LoadFromTextFile<SentimentData>(trainDataPath, hasHeader: true);
var testData = ctx.Data.LoadFromTextFile<SentimentData>(testDataPath, hasHeader: true);
// 2. Build a transformer/estimator that turns the input data into features the machine learning algorithm can understand
IEstimator<ITransformer> estimator = ctx.Transforms.Text.FeaturizeText("Features", nameof(SentimentData.Text));
// 3. - set the training algorithm and create the pipeline for model builder
var trainer = ctx.BinaryClassification.Trainers.SdcaLogisticRegression();
var trainingPipeline = estimator.Append(trainer);
// 4. - Train the model
var trainedModel = trainingPipeline.Fit(trainData);
// 5. - Perform the predictions on the test data
var predictions = trainedModel.Transform(testData);
// 6. - Evaluate the model
var metrics = ctx.BinaryClassification.Evaluate(data: predictions);
By using the GetColumn and CreateEnumerable methods, you can find the data that the model didn't predict correctly.
After you get the metrics, use the GetColumn method on the predictions made from the test data set to get the original label values. Then use the CreateEnumerable method to get the predictions, which hold the predicted values. Optionally, you can get the sentiment text as well.
// SentimentPrediction is assumed to be your prediction class with a bool Prediction property.
var originalLabels = predictions.GetColumn<bool>("Label").ToArray();
var sentimentText = predictions.GetColumn<string>(nameof(SentimentData.Text)).ToArray();
var predictedLabels = ctx.Data.CreateEnumerable<SentimentPrediction>(predictions, reuseRowObject: false).ToArray();
After getting the data, just loop through one of them (I did a count of the original labels) and you can access the data at each iteration. From there you can check if the actual label doesn't equal the predicted value to only print out the values that the model didn't get correctly.
for (int i = 0; i < originalLabels.Count(); i++)
{
string outputText = String.Empty;
if (originalLabels[i] != predictedLabels[i].Prediction)
{
outputText = $"Text - {sentimentText[i]} | ";
outputText += $"Original - {originalLabels[i]} | ";
outputText += $"Predicted - {predictedLabels[i].Prediction}";
Console.WriteLine(outputText);
}
}
With that you have the data that you need. :)
Hope that helps!
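The comparison itself is independent of ML.NET. As a language-neutral illustration, here is a small Python sketch of the same filter; the function name and the sample data are hypothetical, and it also shows how the number of failed records relates to accuracy.

```python
def misclassified(texts, labels, predictions):
    """Return one dict per test record whose predicted label differs from the true label."""
    if not (len(texts) == len(labels) == len(predictions)):
        raise ValueError("texts, labels and predictions must have the same length")
    return [
        {"index": i, "text": text, "label": label, "predicted": predicted}
        for i, (text, label, predicted) in enumerate(zip(texts, labels, predictions))
        if label != predicted
    ]


# Hypothetical sample data, for illustration only.
texts = ["great product", "terrible service", "not bad at all", "would buy again"]
labels = [True, False, True, True]
predictions = [True, False, False, True]

failures = misclassified(texts, labels, predictions)
accuracy = 1 - len(failures) / len(labels)

for row in failures:
    print(f"Text - {row['text']} | Original - {row['label']} | Predicted - {row['predicted']}")
```

Keeping the row index in the result makes it easy to look the failed record up again in the original test file.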
From your comment, I believe the method you are looking for can be found in the Keras library. The method should be predict_classes on a keras.models.Sequential model, as described on their documentation page.
This will provide you with an array of predicted outputs, which you can then compare to the ground truths. Visit the documentation to see the parameters.
Hope this helps!

Derivative node in kapacitor batch

I am using a derivative node to calculate the bandwidth utilization of network devices; the script is below.
I am using a where clause because I want alerts only for specific interfaces on a specific IP.
// database
var database = 'router'
// measurement from where data is coming
var measurement = 'cisco_router'
// RP from where data is coming
var RP = 'autogen'
// which influx cluster to use
var clus = 'network'
// durations
var period = 7m
var every = 10s
// alerts
var crit = 320
var alertName = 'cisco_router_bandwidth_alert'
var triggerType = 'threshold'
batch
|query(''' SELECT (mean("bandwidth_in") * 8) as "value" FROM "router"."autogen"."cisco_router" where host = '10.1.11.1' and ( interface_name = 'GigabitEthernet0/0/0' or interface_name = 'GigabitEthernet0/0/1') ''')
.cluster('network')
.period(7m)
.every(6m)
.groupBy(*)
|derivative('value')
.unit(1s)
.nonNegative()
.as('value')
|alert()
.crit(lambda: "value" > crit)
.stateChangesOnly()
.message(' {{.Level}} for {{ index .Tags "device_name" }} on Port {{ index .Tags "name" }} {{ .Time.Local.Format "2006.01.02 - 15:04:05" }} ')
.details('''
<pre>
------------------------------------------------------------------
CLIENT NAME : XXXXXXXX
ENVIRONMENT : Prod
DEVICE TYPE : Router
CATEGORY : {{ index .Tags "type" }}
IP ADDRESS : {{ index .Tags "host" }}
DATE : {{ .Time.Local.Format "2006.01.02 - 15:04:05" }}
INTERFACE NAME : {{ index .Tags "name" }}
VALUE : {{ index .Fields "value" }}
SEVERITY : {{.Level}}
------------------------------------------------------------------
</pre>
''')
.log('/tmp/chronograf/cisco_router_interface_alert.log')
.levelTag('level')
.idTag('id')
.messageField('message')
.email()
.to('XXXXXXX')
|influxDBOut()
.database('chronograf')
.retentionPolicy(RP)
.measurement('alerts')
.tag('alertName', alertName)
But nothing shows up when I run kapacitor watch, and there are no errors in the logs.
derivative() and some other nodes, such as stateDuration(), reset their state on each new batch query, as opposed to stream mode, where their state is kept the whole time.
This is because in batch mode these nodes are designed to track changes only within the current batch of points.
Since your query returns a single point, derivative() produces no result.
Try moving the derivative into the query, and use the |httpOut() node to inspect the results at each step; it is really helpful for understanding Kapacitor's logic.
Here is an example:
dbrp "telegraf"."autogen"
var q= batch
|query('SELECT derivative(mean("bytes_recv"), 1s) AS "bytes_recv_1s" FROM "telegraf"."autogen"."net" WHERE time < now() AND "interface"=\'eth0\' GROUP BY time(10m) fill(none)')
.period(10m)
.every(30s).align()
.groupBy(time(10m))
.fill('none')
|last('bytes_recv_1s').as('value')
|httpOut('query')
Note that there are bugs associated with query parsing that require specifying GROUP BY in both the query and the TICKscript:
https://github.com/influxdata/kapacitor/issues/971
https://github.com/influxdata/kapacitor/issues/622
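Why a batch containing a single point produces nothing can be seen from a minimal Python sketch of a derivative over a list of points. This is an illustration under stated assumptions (timestamps in seconds, a hypothetical function name), not Kapacitor's implementation: a rate of change needs a pair of consecutive points, so one point yields an empty result.

```python
def derivative(points, unit_s=1.0, non_negative=False):
    """points: time-ordered list of (unix_seconds, value).
    Returns (timestamp, rate per unit_s) for each pair of consecutive points."""
    results = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        rate = (v1 - v0) / elapsed * unit_s
        if non_negative and rate < 0:
            continue  # drop negative rates, e.g. after a counter reset
        results.append((t1, rate))
    return results


print(derivative([(0, 100.0)]))               # single point: nothing to compare against
print(derivative([(0, 100.0), (10, 150.0)]))  # two points: one rate value
```

If state is discarded between batches, every batch starts over with its own list, so a query that aggregates the whole period into one mean value never supplies the second point.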

Tensorflow: Separating Training and Evaluation Data in TFRecords

I have a .tfrecords file filled with labeled data. I'd like to use X% of them for training and (1-X)% for evaluation/testing. Obviously there shouldn't be any overlap. What is the best way of doing this?
Below is my small block of code for reading tfrecords. Is there some way I can get shuffle_batch to split the data into training and evaluation data? Am I going about this incorrectly?
reader = tf.TFRecordReader()
files = tf.train.string_input_producer([TFRECORDS_FILE], num_epochs=num_epochs)
read_name, serialized_examples = reader.read(files)
features = tf.parse_single_example(
serialized = serialized_examples,
features={
'image': tf.FixedLenFeature([], tf.string),
'value': tf.FixedLenFeature([], tf.string),
})
image = tf.decode_raw(features['image'], tf.uint8)
value = tf.decode_raw(features['value'], tf.uint8)
image, value = tf.train.shuffle_batch([image, value],
enqueue_many = False,
batch_size = 4,
capacity = 30,
num_threads = 3,
min_after_dequeue = 10)
Although the question was asked over a year ago, I had a similar question recently.
I used tf.data.Dataset with filters on input hash. Here is a sample:
dataset = tf.data.TFRecordDataset(files)
if is_evaluation:
    dataset = dataset.filter(
        lambda r: tf.string_to_hash_bucket_fast(r, 10) == 0)
else:
    dataset = dataset.filter(
        lambda r: tf.string_to_hash_bucket_fast(r, 10) != 0)
dataset = dataset.map(tf.parse_single_example)
return dataset
One downside I have noticed so far is that each evaluation may have to traverse about 10x as much data to collect enough examples, because only one record in ten passes the evaluation filter. To avoid this, you may want to separate the data at preprocessing time.
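The hash-bucket split can be tried without TensorFlow. The sketch below is an illustration only: it swaps in SHA-256 from the standard library in place of tf.string_to_hash_bucket_fast, and the function names are hypothetical. The property that matters is the same: a record always hashes to the same bucket, so the training and evaluation sets never overlap and the split is reproducible across runs.

```python
import hashlib


def bucket(record: bytes, num_buckets: int = 10) -> int:
    """Map a serialized record to a stable bucket in [0, num_buckets)."""
    digest = hashlib.sha256(record).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def split(records, eval_buckets=1, num_buckets=10):
    """Send records in the first eval_buckets buckets to evaluation, the rest to training."""
    train, evaluation = [], []
    for record in records:
        target = evaluation if bucket(record, num_buckets) < eval_buckets else train
        target.append(record)
    return train, evaluation


# Hypothetical serialized examples.
records = [f"example-{i}".encode() for i in range(1000)]
train, evaluation = split(records)
print(len(train), len(evaluation))
```

Running the same split once during preprocessing and writing the two lists to separate files avoids filtering out most of the records on every evaluation pass.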

Polynomial regression in spark/ or external packages for spark

After a good amount of searching online on this topic, I am ending up here in the hope of getting some pointers. Please read on.
After analyzing Spark 2.0, I concluded that polynomial regression is not possible with Spark alone, so is there some extension to Spark that can be used for polynomial regression?
- With Rspark it could be done (but I am looking for a better alternative)
- RFormula in Spark does prediction, but the coefficients are not available (which is my main requirement, as I am primarily interested in the coefficient values)
Polynomial regression is just another case of linear regression (see "Polynomial regression is linear regression" and "Polynomial regression"). Since Spark has a method for linear regression, you can call that method after transforming the inputs into the ones suited to polynomial regression. For instance, if you have only one independent variable x and you want to do quadratic regression, you change your independent input matrix to [x x^2].
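As a small illustration of this point, the Python sketch below expands a single variable x into the columns [1, x, x^2] and then runs ordinary least squares on the expanded columns, which recovers the polynomial coefficients. It is a minimal sketch with hypothetical function names and made-up sample data, solving the normal equations directly; it is not what Spark does internally.

```python
def expand(x, degree):
    """Map a scalar x to the feature row [1, x, x^2, ..., x^degree]."""
    return [x ** d for d in range(degree + 1)]


def solve(a, b):
    """Solve the square system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


def fit_polynomial(xs, ys, degree):
    """Linear least squares on the expanded features, via the normal equations."""
    rows = [expand(x, degree) for x in xs]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data generated from y = 1 + 2x + 3x^2.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1.0 + 2.0 * x + 3.0 * x * x for x in xs]
coefficients = fit_polynomial(xs, ys, degree=2)
print(coefficients)  # intercept, then the coefficients of x and x^2
```

The model stays linear in its coefficients; only the inputs are transformed, which is why a linear regression routine is sufficient.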
I would like to add some information to @Mehdi Lamrani's answer:
If you want to do polynomial linear regression in SparkML, you can use the PolynomialExpansion class.
For more information, see the class in the SparkML documentation or in the Spark API documentation.
Here is an implementation example:
Let's assume we have train and test datasets, stored in two CSV files, with headers containing the names of the columns (features, label).
Each data set contains three features named f1,f2,f3, each of type Double (this is the X matrix), as well as a label feature (the Y vector) named mylabel.
For this code I used Spark+Scala:
Scala version : 2.12.8
Spark version 2.4.0.
We assume that the SparkML library has already been declared as a dependency in build.sbt.
First of all, import the libraries:
import org.apache.spark.ml.{Pipeline, PipelineModel}
// PolynomialExpansion and VectorAssembler are used further down
import org.apache.spark.ml.feature.{PolynomialExpansion, VectorAssembler}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}
Create Spark Session and Spark Context :
val ss = org.apache.spark.sql
.SparkSession.builder()
.master("local")
.appName("Read CSV")
.enableHiveSupport()
.getOrCreate()
// Reuse the session's context: only one SparkContext may be active per JVM
val sc: SparkContext = ss.sparkContext
Instantiate the variables you are going to use :
val f_train:String = "path/to/your/train_file.csv"
val f_test:String = "path/to/your/test_file.csv"
val degree:Int = 3 // Set the degree of your choice
val maxIter:Int = 10 // Set the max number of iterations
val lambda:Double = 0.0 // Regularization parameter (passed to setRegParam)
val alpha:Double = 0.3 // Elastic net mixing parameter (passed to setElasticNetParam), not a learning rate
First, let's create a few UDFs, which will be used for reading and pre-processing the data.
The type parameters of the UDF toFeatures are Vector (the return type) followed by the types of the feature arguments: (Double, Double, Double)
val toFeatures = udf[Vector, Double, Double, Double] {
(a,b,c) => Vectors.dense(a,b,c)
}
val encodeIntToDouble = udf[Double, Int](_.toDouble)
Now let's create a function that extracts the data from a CSV file and creates new features from the existing ones, using PolynomialExpansion:
def getDataPolynomial(
currentfile:String,
sc:SparkSession,
sco:SparkContext,
degree:Int
):DataFrame =
{
val df_rough:DataFrame = sc.read
.format("csv")
.option("header", "true") //first line in file has headers
.option("mode", "DROPMALFORMED")
.option("inferSchema", value=true)
.load(currentfile)
.toDF("f1", "f2", "f3", "myLabel")
// the .toDF(...) call above is optional
val df:DataFrame = df_rough
.withColumn("featNormTemp", toFeatures(df_rough("f1"), df_rough("f2"), df_rough("f3")))
.withColumn("label", encodeIntToDouble(df_rough("myLabel")))
val polyExpansion = new PolynomialExpansion()
.setInputCol("featNormTemp")
.setOutputCol("polyFeatures")
.setDegree(degree)
val polyDF:DataFrame=polyExpansion.transform(df.select("featNormTemp"))
val datafixedWithFeatures:DataFrame = polyDF.withColumn("features", polyDF("polyFeatures"))
val datafixedWithFeaturesLabel = datafixedWithFeatures
.join(df,df("featNormTemp") === datafixedWithFeatures("featNormTemp"))
.select("label", "polyFeatures")
datafixedWithFeaturesLabel
}
Now, run the function both for the train and test datasets, using the chosen degree for the Polynomial expansion.
val X:DataFrame = getDataPolynomial(f_train,ss,sc,degree)
val X_test:DataFrame = getDataPolynomial(f_test,ss,sc,degree)
Run the algorithm in order to get a model of linear regression, using a pipeline :
val assembler = new VectorAssembler()
.setInputCols(Array("polyFeatures"))
.setOutputCol("features2")
val lr = new LinearRegression()
.setMaxIter(maxIter)
.setRegParam(lambda)
.setElasticNetParam(alpha)
.setFeaturesCol("features2")
.setLabelCol("label")
// Fit the model:
val pipeline:Pipeline = new Pipeline().setStages(Array(assembler,lr))
val lrModel:PipelineModel = pipeline.fit(X)
// Get prediction on the test set :
val result:DataFrame = lrModel.transform(X_test)
Finally, evaluate the result using the root mean squared error (RMSE):
def leastSquaresError(result:DataFrame):Double = {
val rm:RegressionMetrics = new RegressionMetrics(
result
.select("label","prediction")
.rdd
.map(x => (x(0).asInstanceOf[Double], x(1).asInstanceOf[Double])))
Math.sqrt(rm.meanSquaredError)
}
val error:Double = leastSquaresError(result)
println("Error : "+error)
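Note that leastSquaresError returns the square root of the mean squared error, so the printed value is in the same units as the label. A minimal Python sketch of the two measures, with hypothetical function names and made-up numbers, makes the difference explicit:

```python
import math


def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equally long")
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


def root_mean_squared_error(labels, predictions):
    """Square root of the MSE, expressed in the units of the label."""
    return math.sqrt(mean_squared_error(labels, predictions))


# Made-up values: every prediction is off by exactly 2.
labels = [0.0, 0.0, 0.0, 0.0]
predictions = [2.0, 2.0, 2.0, 2.0]
print(mean_squared_error(labels, predictions))
print(root_mean_squared_error(labels, predictions))
```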
I hope this might be useful !

Resources