detect a peak in a continuous feed of numerical data - machine-learning

This is a "very simplified" example of what I'm trying to do and I'm new to machine learning.
I have continuous feed of numerical data coming in as input. And I want to detect a specific change in the data, like a peak in a heart beat monitor.
How would I go about to accomplish this using machine learning ?

It depends what you are calling a peak. You can try something like this, if you call a peak value rising on 10 points or more:
import numpy as np
min_peak_diff = 10
arr = np.array([3,5,10,42,4,3,6,66,8,12,7,5])
ind = np.add(np.where(arr[1:] >= arr[:-1] + min_peak_diff), 1)
print('peak indexes:', ind)
print('peak values:', arr[ind])
Result:
peak indexes: [[3 7]]
peak values: [[42 66]]
But if you really want to use machine learning methods you can look at:
scikit-learn page about this topic (it has plenty of methods) https://scikit-learn.org/stable/modules/outlier_detection
Python Outlier Detection library https://github.com/yzhao062/pyod

Related

Derive the right k in k-means clustering (including k = 1) in pyspark

I want to check if a clustering would be helpful or not on my coordinates.
I'm dealing with trajectories and want to check if all of them are starting on a same area (the trajectories are different). Thus the aim here is to characterise the most frequent departure points.
However, sometimes there is no need for clustering. I'm using K-means here. I had thought of using the Silhouette Score but I don't see if it is mathematically correct for the case where there is only one cluster. DBScan will not be a good clustering as density are not similar in the clusters I wanted to build.
Would you have an idea to create a kind of check between k=1 and k=3, which would be the best split for my data? I'm dealing here with data with coordinates (latitude/longitude) where the starting point is not 100% fixed but can vary within 2km around a kind of barycentre.
Simple extract with k=2 :
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "lon"], outputCol="features")
df1= vecAssembler.transform(df)
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Loads data.
# Trains a k-means model.
kmeans = KMeans().setK(2).setSeed(1)
model = kmeans.fit(df1.select('features'))
# Make predictions
transformed = model.transform(df1)
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='features', \
metricName='silhouette', distanceMeasure='squaredEuclidean')
evaluator.evaluate(transformed)
Is there a way to compute in pySpark a case with k=1 ? in order to derive Elbow or gap statistics ?

How to add Reduce Learning Rate On Plateau in XGBoost?

You know in TensorFlow, we use a callback named ReduceLROnPlateau which reduce our learning rate slightly when our model stops learning. Does anyone know how to do this in XGBoost? I want to know if there is any way to reduce learning when XGBoost model stops learning.
This article claims:
A quick Google search reveals that there has been some work done utilizing a decaying learning rate, one that starts out large and shrinks at each round. But we typically see no accuracy gains in Cross Validation and when looking at test error graphs there are minimal differences between its performance and using a regular old constant.
They wrote a Python package called BetaBoost to find an optimal sequence for the learning rate scheduler.
In principle they seem to use a function which returns learning rates for the LearningRateScheduler
from scipy.stats import beta
def beta_pdf(scalar=1.5,
a=26,
b=1,
scale=80,
loc=-68,
floor=0.01,
n_boosting_rounds=100):
"""
Get the learning rate from the beta PDF
Returns
-------
lrs : list
the resulting learning rates to use.
"""
lrs = [scalar*beta.pdf(i,
a=a,
b=b,
scale=scale,
loc=loc)
+ floor for i in range(n_boosting_rounds)]
return lrs
[...]
xgb.train(
[...],
callbacks=[xgb.callback.LearningRateScheduler(beta_pdf())]
)

machine learning algorithm with list of arrays as data set

i'm new in data science and i'm searching for machine learning algorithm that take data set as List of arrays each array have sequence of floats data
A little bit of context: we have some angels that took from user motion ,
by these angels we determines if the user make the correct motion or not ,
the motion represented in our system in list of array each array has sequence of angels
any help please ? i searched for a lot of time but have no result !
Check out neupy. It is a great library for new machine learning users. I would suggest just the standard back propagation algorithm with momentum. It has been proven that newer adaptive learning techniques don't do as well as the simple gradient back propagation algorithm with momentum.
It is easy to implement. It would be implemented for example using the following code,
A: Create data set
x = np.zeros((len(list[0]),len(list)))
for i in np.arange(len(list)):
for j in np.arange(len(list[0]):
x[i][j] = list[i][j]
This would be the input. Then you create the architecture
B: Create Architecture
network = layers.Input(len(list[0])) > layers.Sigmoid(int(len(list[0])/2)) > layers.Sigmoid(2)
C: Use Gradient Descent With Momentum
gdnet = layers.Algorithms.Momentum(network,momentum=0.1)
gdnet.train(x,y, max_iter=1000)
Where y is the movement of interest.
D: Predict Motion
y_predicted = gdnet(x)
In general, most libraries take in numpy arrays as inputs.
There are a number of ways to wrangle your data into that format. I find pandas (https://pandas.pydata.org/pandas-docs/stable/) to be the most convenient way. If you have the data in .csv file, excel sheet or some other common, structured format, pandas has functions for loading that in with no pain at all
If you give some more details (Are you using a machine learning library (like sci-kit), what format the data is in) i can be of more help.

Classification with numerical label?

I know of a couple of classification algorithms such as decision trees, but I can't use any of them to the problem I have at hands.
I have a dataset in which each row contains information about a purchase. It's columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Everything that I've tried there doesn't allow me to have a real number (amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
import numpy as np
import pylab as pl
from sklearn import datasets
from sklearn.tree import DecisionTreeRegressor
from sklearn import cross_validation
from sklearn import metrics
from sklearn import grid_search
def fit_predict_model(data_import):
"""Find and tune the optimal model. Make a prediction on housing data."""
# Get the features and labels from your data
X, y = data_import.data, data_import.target
# Setup a Decision Tree Regressor
regressor = DecisionTreeRegressor()
parameters = {'max_depth':(4,5,6,7), 'random_state': [1]}
scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
## fit your data to it ##
reg = grid_search.GridSearchCV(estimator = regressor, param_grid = parameters, scoring=scoring_function, cv=10, refit=True)
fitted_data = reg.fit(X, y)
print "Best Parameters: "
print fitted_data.best_params_
# Use the model to predict the output of a particular sample
x = [## input a test sample in this list ##]
y = reg.predict(x)
print "Prediction: " + str(y)
fit_predict_model(##your data in here)
I took this from a project I was working on almost directly to predict housing prices so there are probably some unnecessary libraries and without doing validation you have no clue how accurate this case would be, but this should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
In RapidMiner type regression into the Operators menu and you'll see several options under Modelling-> Functions. Linear Regression, Polynomical, Vector, etc. (There's more, but as a beginner let's start here).
Right click any of these operators and press Show Operator Info and you'll see numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.

how can fixed parameters cost and gamma using libsvm matlab to improve accuracy?

I use libsvm to classify a data base that contain 1000 labels. I am new in libsvm and I found a problem to choose the parameters c and g to improve performance. First, here is the program that I use to set the parameters:
bestcv = 0;
for log2c = -1:3,
for log2g = -4:1,
cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
cv = svmtrain(yapp, xapp, cmd);
if (cv >= bestcv),
bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
end
fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
end
end
as a result, this program gives c = 8 and g = 2 and when I use these values
c and g, I found an accuracy rate of 55%. for classification, I use svm one against all.
numLabels=max(yapp);
numTest=size(ytest,1);
%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
model{k} = svmtrain(double(yapp==k),xapp, ' -c 1000 -g 10 -b 1 ');
end
%# get probability estimates of test instances using each model
prob_black = zeros(numTest,numLabels);
for k=1:numLabels
[~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
prob_black(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end
%# predict the class with the highest probability
[~,pred_black] = max(prob_black,[],2);
acc = sum(pred_black == ytest) ./ numel(ytest) %# accuracy
The problem is that I need to change these parameters to increase performance. for example, when I put randomly c = 10000 and g = 100, I found a better accuracy rate: 70%.
Please I need help, how can I set theses parameters ( c and g) so to find the optimum accuracy rate? thank you in advance
Hyperparameter tuning is a nontrivial problem in machine learning. The simplest approach is what you've already implemented: define a grid of values, and compute the model on the grid until you find some optimal combination. A key assumption is that the grid itself is a good approximation of the surface: that it's fine enough to not miss anything important, but not so fine that you waste time computing values that are essentially the same as neighboring values. I'm not aware of any method to, in general, know ahead of time how fine a grid is necessary. As illustration: imagine that the global optimum is at $(5,5)$ and the function is basically flat elsewhere. If your grid is $(0,0),(0,10),(10,10),(0,10)$, you'll miss the optimum completely. Likewise, if the grid is $(0,0), (-10,-10),(-10,0),(0,-10)$, you'll never be anywhere near the optimum. In both cases, you have no hope of finding the optimum itself.
Some rules of thumb exist for SVM with RBF kernels, though: a grid of $\gamma\in\{2^{-15},2^{-14},...,2^5\}$ and $C \in \{2^{-5}, 2^{-4},...,2^{15}\}$ is one such recommendation.
If you found a better solution outside of the range of grid values that you tested, this suggests you should define a larger grid. But larger grids take more time to evaluate, so you'll either have to commit to waiting a while for your results, or move to a more efficient method of exploring the hyperparameter space.
Another alternative is random search: define a "budget" of the number of SVMs that you want to try out, and generate that many random tuples to test. This approach is mostly just useful for benchmarking purposes, since it's entirely unintelligent.
Both grid search and random search have the advantage of being stupidly easy to implement in parallel.
Better options fall in the domain of global optimization. Marc Claeson et al have devised the Optunity package, which uses particle swarm optimization. My research focuses on refinements of the Efficient Global Optimization algorithm (EGO), which builds up a Gaussian process as an approximation of the hyperparameter response surface and uses that to make educated predictions about which hyperparameter tuples are most likely to improve upon the current best estimate.
Imagine that you've evaluated the SVM at some hyperparameter tuple $(\gamma, C)$ and it has some out-of-sample performance metric $y$. An advantage to EGO-inspired methods is that it assumes that the values $y^*$ nearby $(\gamma,C)$ will be "close" to $y$, so we don't necessarily need to spend time exploring those tuples nearby, especially if $y-y_{min}$ is very large (where $y_{min}$ is the smallest $y$ value we've discovered). EGO will identify and evaluate the SVM at points where it estimates there is a high probability of improvement, so it will intelligently move through the hyper-parameter space: in the ideal case, it will skip over regions of low performance in favor of focusing on regions of high performance.

Resources