Esper provides a linear regression view, stat:linest(x, y).
A typical example looks like this (and it works great):
select * from StockTickEvent.win:time(10 seconds).stat:linest(price, offer)
However, I am trying to get a linear regression over all data in the window grouped by symbol (say, INTC), and Esper does not allow it. I tried using "having symbol='GE'" and that was not right either. Here is what I tried:
select * from StockTickEvent.win:time(10 seconds).stat:linest(current_timestamp(), price) group by symbol
Any help to resolve this is appreciated.
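To group data windows, "std:groupwin" is the right approach: put std:groupwin(symbol) in the view chain ahead of the win:time(...) window so that stat:linest is computed separately for each symbol.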
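To make the intended result concrete, here is a minimal Python sketch of what "one regression per symbol" means. It is not Esper code and does not model the time window; the tick tuples and the function names are made up for illustration.

```python
from collections import defaultdict


def least_squares(points):
    """Return (slope, intercept) of the ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_by_symbol(ticks):
    """Group (symbol, x, y) ticks by symbol and fit one line per group."""
    groups = defaultdict(list)
    for symbol, x, y in ticks:
        groups[symbol].append((x, y))
    return {symbol: least_squares(points) for symbol, points in groups.items()}


# Hypothetical ticks: (symbol, timestamp, price)
ticks = [
    ("INTC", 0, 1.0), ("GE", 0, 5.0),
    ("INTC", 1, 3.0), ("GE", 1, 4.0),
    ("INTC", 2, 5.0), ("GE", 2, 3.0),
]
fits = regression_by_symbol(ticks)
for symbol, (slope, intercept) in sorted(fits.items()):
    print(symbol, slope, intercept)
```

Each symbol gets its own slope and intercept, which is the behaviour a grouped data window is meant to provide.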
I have an ML.NET project, and so far everything has gone great. I have a motor that collects a power reading 256 times per rotation, and I push those readings into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values at a time, so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine the state on every rotation. If I just pick 38 evenly spaced samples, my model degrades a lot. I know I am not feeding the model the features it considers most important; I am just guessing and selecting data for the model more or less at random.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this, and I found the statement below about it.
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out things such as the weight for each column, and if so, how would I do that?
I have only been working with ML.NET for a couple of weeks, so I apologize if the question is naive. I assure you I have googled this in as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried, it threw the error "expecting single but got string". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and it then threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later or someone else here can upvote it. Below is the code example.
// Create MLContext
MLContext mlContext = new MLContext();

// Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column name of input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label").ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features matter most to the model by shuffling the values of one feature at a time and measuring how much that change degrades the model's performance metrics.
The doc "Interpret model predictions using Permutation Feature Importance" describes how to use this API in ML.NET.
You can also use open-source explainer packages, which are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local (per-instance) explanations as well as global model breakdowns, details, diagnostics, feature interactions, etc.
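To show the idea outside of ML.NET, here is a minimal, library-free Python sketch of permutation importance. It is not the ML.NET API: the toy model, the synthetic data, and the function names are assumptions made for illustration, and mean squared error stands in for whatever metric you care about.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of model predictions against targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=3, seed=0):
    """Average increase in MSE when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Synthetic data: the target depends strongly on column 0, weakly on column 2,
# and not at all on column 1.
data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [3 * r[0] + r[2] for r in rows]


def model(r):
    # Stand-in for a trained model (hypothetical)
    return 3.0 * r[0] + 1.0 * r[2]


importances = permutation_importance(model, rows, targets)
ranking = sorted(range(3), key=lambda c: importances[c], reverse=True)
print(ranking)
```

Columns whose shuffling barely changes the error are candidates to drop, which is the kind of ranking you would want before cutting 256 readings down to 38.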
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX
I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).
So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.
EDIT 1
Thanks @jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):
case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop
By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:
Image: final flow diagram with dataset splitting
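The same logic can be sketched in plain Python, assuming the thresholds from the wrangle steps above (below 0.6 is train, 0.8 and above is test, the rest is validation). The function name is hypothetical. Note that the resulting proportions hold in expectation, not exactly, for any finite dataset.

```python
import random


def assign_split(rows, train_below=0.6, test_from=0.8, seed=0):
    """Label each row 'train', 'validation' or 'test' from one random draw per row."""
    rng = random.Random(seed)
    labelled = []
    for row in rows:
        # Drawn once per row, like the separate split_condition column
        u = rng.random()
        if u < train_below:
            label = "train"
        elif u >= test_from:
            label = "test"
        else:
            label = "validation"
        labelled.append((row, label))
    return labelled


labelled = assign_split(range(10000))
counts = {}
for _, label in labelled:
    counts[label] = counts.get(label, 0) + 1
print(counts)
```

Drawing the random number once and reusing it in both comparisons is what keeps the three buckets at roughly 60/20/20.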
EDIT 2
I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here are the rough steps:
1. Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
2. For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
3. For each set type (i.e. train, test, validation):
   - Create a new recipe from one of the sets
   - Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.
I hope that's helpful to someone!
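For anyone prototyping this outside Dataprep, the steps above can be sketched in plain Python. This is a variation, not the Dataprep flow: it shuffles each subpopulation and slices it, which gives exact per-stratum proportions instead of random thresholds. The function name, the 60/20/20 fractions, and the toy rows are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, target_of,
                     fractions=(("train", 0.6), ("validation", 0.2), ("test", 0.2)),
                     seed=0):
    """Split rows so each target value keeps the same proportions in every set."""
    rng = random.Random(seed)
    # Step 1: split into subpopulations by target value
    strata = defaultdict(list)
    for row in rows:
        strata[target_of(row)].append(row)
    splits = {name: [] for name, _ in fractions}
    for members in strata.values():
        # Step 2: random split within the subpopulation
        rng.shuffle(members)
        start, cumulative = 0, 0.0
        for name, fraction in fractions:
            cumulative += fraction
            end = round(cumulative * len(members))
            # Step 3: union with the same set type from the other subpopulations
            splits[name].extend(members[start:end])
            start = end
    return splits


# Toy rows: (row id, target), with an imbalanced target column
rows = [(i, 0) for i in range(100)] + [(i, 1) for i in range(100, 150)]
splits = stratified_split(rows, target_of=lambda row: row[1])
print({name: len(members) for name, members in splits.items()})
```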
I'm looking at the same problem and was able to partially solve it using the "case on custom condition" and "Random" functions. I create a new column named target and apply the following logic:
After applying this you'll have a new column with these 3 labels, and you can generate 3 new datasets by applying row filtering rules based on those values. Keep in mind that each time you run the job you'll get a different validation set. So if you want to keep it fixed, you need to use the dataset created in the first run as input for future runs (and randomise only the train and test sets).
If you need more control over the distribution of labels in your datasets, there is the ROWNUMBER window function that could potentially be used, but I haven't been able to make it work yet.
I'm studying SVMs and implemented this code. It is basic, primitive, and slow, but I just wanted to see how an SVM actually works. Unfortunately, it gives me bad results. What did I miss: a coding error or a mathematical mistake? If you want to look at the dataset, its link is here. I took it from the UCI Machine Learning Repository. Thanks for your help.
import numpy as np

def hypo(x, q):
    # Sigmoid of the linear score x.q
    return 1 / (1 + np.exp(-x.dot(q)))

data = np.loadtxt('LSVTVoice', delimiter='\t')
# Design matrix: a column of ones (bias) followed by every column except the last
x = np.ones(data.shape)
x[:, 1:] = data[:, 0:data.shape[1] - 1]
# Labels are in the last column
y = data[:, data.shape[1] - 1]
q = np.zeros(data.shape[1])
C = 0.002
## mean normalization
for i in range(q.size - 1):
    x[:, i + 1] = (x[:, i + 1] - x[:, i + 1].mean()) / (x[:, i + 1].max() - x[:, i + 1].min())
for i in range(2000):
    h = x.dot(q)
    for j in range(q.size):
        q[j] = q[j] - (C * np.sum(-y * np.log(hypo(x, q)) - (1 - y) * np.log(1 - hypo(x, q)))) + (0.5 * np.sum(q ** 2))
for i in range(y.size):
    if h[i] >= 0:
        print(y[i], '1')
    else:
        print(y[i], '0')
Depending on your data, it is very common for a simple SVM implementation to give bad results. Try a mature SVM implementation (e.g. scikit-learn's SVM); you can also check this: https://github.com/scikit-learn/scikit-learn/tree/master/sklearn/svm
SVMs come in several variants with different parameters, such as different kernels (e.g. RBF). Try them with different parameter values (depending on your data) and compare the results.
You can use a grid search for the comparison (see http://scikit-learn.org/stable/modules/grid_search.html).
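For comparison with a hand-rolled loop, here is a minimal sketch of a linear SVM trained by stochastic subgradient descent on the regularised hinge loss. It is an illustration under stated assumptions, not scikit-learn's implementation: labels are assumed to be +1/-1 (not 0/1), and the learning rate, regularisation strength, and toy data are made up.

```python
import random


def train_linear_svm(rows, labels, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Minimise lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))) by per-sample subgradient steps."""
    rng = random.Random(seed)
    w = [0.0] * len(rows[0])
    b = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = rows[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The regulariser shrinks w on every step
            w = [wj - lr * lam * wj for wj in w]
            # The hinge term contributes only when the margin is violated
            if margin < 1:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Classify by the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data with labels in {+1, -1}
rows = [(2.0, 2.0), (3.0, 1.0), (2.0, 3.0), (1.5, 2.5),
        (-2.0, -2.0), (-3.0, -1.0), (-2.0, -3.0), (-1.5, -2.5)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
w, b = train_linear_svm(rows, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(rows, labels)) / len(rows)
print(w, b, accuracy)
```

Two things differ from a logistic-regression-style loop: the loss is the hinge loss rather than the log loss, and each weight moves along its own partial derivative rather than by the total cost.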
I am trying to implement a Support Vector Machine to understand its ins and outs, but I am stuck on how to implement it.
Everywhere it is explained how to find a hyperplane that separates the different classes. My question is how to get the data from input space I to feature space Y.
For example, consider the data below:
date                 userId   pc       activity
01/04/2010 07:12:31  RES0962  PC-3736  Connect
01/04/2010 07:35:40  RES0962  PC-2588  Disconnect
01/04/2010 08:02:14  ZKH0388  PC-1021  Connect
01/04/2010 08:20:17  ZKH0388  PC-3736  Disconnect
Q) Assume we are trying to build a user behavior model. We can extract features for each user and use them for training, but how does that work in terms of code? I have no idea. If someone could explain, it would be of great help.
Mapping to feature space requires a weight for each of the distinct features that determine the classes of your input. Choosing the weights depends on clearly understanding the theoretical basis of your project. For example, suppose your financial worth is determined by money in the bank and investments. The weight of money in the bank might be 2, while the weight of investments might be 5. Therefore, somebody with more investments and less money in the bank will likely have a higher net worth.
The two features, money in the bank and investments, are then treated as the x and y coordinates of each input data point (with two features, of course). Imagine plotting each data point at its weighted (x, y) coordinates. Finding the hyperplane is then the next challenge. I hope this helps. Good luck.
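In terms of code, one way to get from raw log rows to feature vectors is to aggregate the events of each user into a fixed-length list of numbers. The sketch below is only an illustration: the three features (number of connects, number of distinct PCs, mean hour of activity) are hypothetical choices, and the date format is assumed to be month/day/year.

```python
from collections import defaultdict
from datetime import datetime

LOG = """\
01/04/2010 07:12:31 RES0962 PC-3736 Connect
01/04/2010 07:35:40 RES0962 PC-2588 Disconnect
01/04/2010 08:02:14 ZKH0388 PC-1021 Connect
01/04/2010 08:20:17 ZKH0388 PC-3736 Disconnect
"""


def user_features(log_text):
    """Map each userId to [connect count, distinct PCs, mean hour of activity]."""
    per_user = defaultdict(list)
    for line in log_text.splitlines():
        date, time, user, pc, activity = line.split()
        # Assumed format: month/day/year
        when = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
        per_user[user].append((when, pc, activity))
    features = {}
    for user, events in per_user.items():
        connects = sum(1 for _, _, activity in events if activity == "Connect")
        distinct_pcs = len({pc for _, pc, _ in events})
        mean_hour = sum(w.hour + w.minute / 60 for w, _, _ in events) / len(events)
        features[user] = [connects, distinct_pcs, mean_hour]
    return features


features = user_features(LOG)
for user, vector in features.items():
    print(user, vector)
```

Each user is now a point in a three-dimensional feature space, and those vectors (plus a label per user) are what an SVM would be trained on.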
First-time user of this forum, so guidance on how to provide enough information is much appreciated. I am trying to replicate a presentation of data used in the medical education field. This will help improve the quality of examiners' marking of trainees in a clinical exam. What I would like to communicate is similar to what the College of General Practitioners already communicates about one of its own exams; please see www.gp10.com.au/slides/thursday/slide29.pdf to understand what I want to present. I have access to Excel, SPSS and R, so help with any of these would be great. As a first attempt I have used SPSS and created 3 variables: a dummy variable, a "station score" and a "global rating score" (GRS). The "station score" (ST) is a value between 0 and 10 (non-integer) and is on the y-axis, similar to the "Candidate Final Marks" in the pdf. The x-axis is the "global rating scale", an integer from 1 to 6, represented in the pdf as the "Overall Performance Scale". When I use SPSS's boxplot I get a boxplot as depicted.
What I would like to do is overlay a single examiner's own scoring of X examinees. For example, one examiner (examiner A) provided the following marks:
ST: 5.53,7.38,7.38,7.44,6.81
GRS: 3,4,4,5,3
(this is transposed into two columns).
Whether it be SPSS, Excel or R, how can I overlay the box-and-whisker plots with the individual data points provided by the one examiner? This would help show the degree to which the examiner's marking style is in concordance with the expected distribution of ST scores across GRS. Any help is greatly appreciated! I like Excel graphics, but I have found Excel very difficult to work with when adding the examiner's data as a separate series: somehow the examiner's GRS scores do not line up nicely on the x-axis. I am very new to R but very interested in it, and would spend the time to get a good result in R if one is viable. I understand JMP may be preferable for this type of thing, but I may not be able to get access to it.