Pattern recognition in time series [closed] - machine-learning

By processing a time series graph, I would like to detect patterns that look similar to this:
Using a sample time series as an example, I would like to be able to detect the patterns as marked here:
What kind of AI algorithm (I am assuming machine learning techniques) do I need to use to achieve this? Is there any library (in C/C++) out there that I can use?

Here is a sample result from a small project I did to partition ECG data.
My approach was a "switching autoregressive HMM" (google this if you haven't heard of it) where each datapoint is predicted from the previous datapoint using a Bayesian regression model. I created 81 hidden states: a junk state to capture data between each beat, and 80 separate hidden states corresponding to different positions within the heartbeat pattern. The 80 pattern states were constructed directly from a subsampled single-beat pattern and had two transitions: a self-transition and a transition to the next state in the pattern. The final state in the pattern transitioned either to itself or to the junk state.
I trained the model with Viterbi training, updating only the regression parameters.
Results were adequate in most cases. A similarly structured Conditional Random Field would probably perform better, but training a CRF would require manually labeling patterns in the dataset if you don't already have labeled data.
Edit:
Here's some example Python code - it is not perfect, but it shows the general approach. It implements EM rather than Viterbi training, which may be slightly more stable.
The ECG dataset is from http://www.cs.ucr.edu/~eamonn/discords/ECG_data.zip
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
import scipy.linalg as lin
import re

# Parse the whitespace-separated ECG file into an array (columns become rows after .T).
data = np.array([[float(tok) for tok in re.split(r'\s+', line) if len(tok) > 0]
                 for line in open('chfdb_chf01_275.txt')]).T

dK = 230
pattern = data[1, :dK]
data = data[1, dK:]
def create_mats(dat):
    '''
    create
        A  - an initial transition matrix
        pA - pseudocounts for A
        w  - emission distribution regression weights
        K  - number of hidden states
    '''
    step = 5   # adjust this to change the granularity of the pattern
    eps = .1
    dat = dat[::step]
    K = len(dat) + 1
    A = np.zeros((K, K))
    A[0, 1] = 1.
    pA = np.zeros((K, K))
    pA[0, 1] = 1.
    for i in range(1, K - 1):
        A[i, i] = (step - 1. + eps) / (step + 2 * eps)
        A[i, i + 1] = (1. + eps) / (step + 2 * eps)
        pA[i, i] = 1.
        pA[i, i + 1] = 1.
    A[-1, -1] = (step - 1. + eps) / (step + 2 * eps)
    A[-1, 1] = (1. + eps) / (step + 2 * eps)
    pA[-1, -1] = 1.
    pA[-1, 1] = 1.
    w = np.ones((K, 2), dtype=float)
    w[0, 1] = dat[0]
    w[1:-1, 1] = (dat[:-1] - dat[1:]) / step
    w[-1, 1] = (dat[0] - dat[-1]) / step
    return A, pA, w, K
# Initialize stuff
A,pA,w,K=create_mats(pattern)
eta=10. # precision parameter for the autoregressive portion of the model
lam=.1 # precision parameter for the weights prior
N=1 #number of sequences
M=2 #number of dimensions - the second variable is for the bias term
T=len(data) #length of sequences
x=np.ones( (T+1,M) ) # sequence data (just one sequence)
x[0,1]=1
x[1:,0]=data
# Emissions
e=np.zeros( (T,K) )
# Residuals
v=np.zeros( (T,K) )
# Store the forward and backward recurrences
f=np.zeros( (T+1,K) )
fls=np.zeros( (T+1) )
f[0,0]=1
b=np.zeros( (T+1,K) )
bls=np.zeros( (T+1) )
b[-1,1:]=1./(K-1)
# Hidden states
z=np.zeros( (T+1),dtype=np.int )
# Expected hidden states
ex_k=np.zeros( (T,K) )
# Expected pairs of hidden states
ex_kk=np.zeros( (K,K) )
nkk=np.zeros( (K,K) )
def fwd(xn):
    global f, e
    for t in range(T):
        f[t + 1, :] = np.dot(f[t, :], A) * e[t, :]
        sm = np.sum(f[t + 1, :])
        fls[t + 1] = fls[t] + np.log(sm)
        f[t + 1, :] /= sm
        assert f[t + 1, 0] == 0
def bck(xn):
    global b, e
    for t in range(T - 1, -1, -1):
        b[t, :] = np.dot(A, b[t + 1, :] * e[t, :])
        sm = np.sum(b[t, :])
        bls[t] = bls[t + 1] + np.log(sm)
        b[t, :] /= sm
def em_step(xn):
    global A, w, eta
    global f, b, e, v
    global ex_k, ex_kk, nkk
    x = xn[:-1]      # current data vectors
    y = xn[1:, :1]   # next data values, predicted from the current vectors
    # Compute residuals
    v = np.dot(x, w.T)   # (T,K) <- (T,M) (M,K)
    v -= y
    e = np.exp(-eta / 2 * v**2, e)
    fwd(xn)
    bck(xn)
    # Compute expected hidden states
    for t in range(len(e)):
        ex_k[t, :] = f[t + 1, :] * b[t + 1, :]
        ex_k[t, :] /= np.sum(ex_k[t, :])
    # Compute expected pairs of hidden states
    for t in range(len(f) - 1):
        ex_kk = A * f[t, :][:, np.newaxis] * e[t, :] * b[t + 1, :]
        ex_kk /= np.sum(ex_kk)
        nkk += ex_kk
    # Maximize with respect to the transition probabilities
    A = pA + nkk
    A /= np.sum(A, 1)[:, np.newaxis]
    # Solve the weighted regression problem for the emission weights
    # (x and y are from above)
    for k in range(K):
        ex = ex_k[:, k][:, np.newaxis]
        dx = np.dot(x.T, ex * x)
        dy = np.dot(x.T, ex * y)
        dy.shape = (2,)
        w[k, :] = lin.solve(dx + lam * np.eye(x.shape[1]), dy)
    # Return the log-probability of the sequence (computed by the forward algorithm)
    return fls[-1]
if __name__ == '__main__':
    # Run the EM algorithm
    for i in range(20):
        print(em_step(x))
    # Get rough boundaries by taking the maximum expected hidden state for each position
    r = np.arange(len(ex_k))[np.argmax(ex_k, 1) < 3]
    # Plot
    plt.plot(range(T), x[1:, 0])
    yr = [np.min(x[:, 0]), np.max(x[:, 0])]
    for i in r:
        plt.plot([i, i], yr, '-r')
    plt.show()

Why not use a simple matched filter? Or its general statistical counterpart, cross-correlation. Given a known pattern x(t) and a noisy compound time series containing your pattern shifted to positions a, b, ..., z, i.e. y(t) = x(t-a) + x(t-b) + ... + x(t-z) + n(t), the cross-correlation function between x and y should show peaks at a, b, ..., z.
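As a minimal NumPy sketch of that idea (the pattern, offsets, noise level and peak threshold below are made-up illustration values, not something from the question):
import numpy as np

rng = np.random.default_rng(0)

# Known pattern x(t): a small triangular bump.
x = np.concatenate([np.linspace(0, 1, 10), np.linspace(1, 0, 10)])

# Compound series y(t): the pattern inserted at a few offsets, plus noise n(t).
y = 0.1 * rng.standard_normal(500)
for offset in (50, 210, 400):
    y[offset:offset + len(x)] += x

# Cross-correlate (equivalently, matched-filter with the time-reversed pattern).
corr = np.correlate(y, x, mode='valid')

# Peaks in the correlation mark candidate pattern locations.
threshold = 0.8 * corr.max()
candidates = np.where(corr > threshold)[0]
print(candidates)  # indices clustered around 50, 210 and 400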

Weka is a powerful collection of machine-learning software, and supports some time-series analysis tools, but I do not know enough about the field to recommend a best method. However, it is Java-based; and you can call Java code from C/C++ without great fuss.
Packages for time-series manipulation are mostly directed at the stock-market. I suggested Cronos in the comments; I have no idea how to do pattern recognition with it, beyond the obvious: any good model of a length of your series should be able to predict that, after small bumps at a certain distance to the last small bump, big bumps follow. That is, your series exhibits self-similarity, and the models used in Cronos are designed to model it.
If you don't mind C#, you should request a version of TimeSearcher2 from the folks at HCIL - pattern recognition is, for this system, drawing what a pattern looks like, and then checking whether your model is general enough to capture most instances with a low false-positive rate. Probably the most user-friendly approach you will find; all others require quite a background in statistics or pattern recognition strategies.

I'm not sure what package would work best for this. I did something similar at one point in college where I tried to automatically detect certain similar shapes on an x-y axis for a bunch of different graphs. You could do something like the following.
Class labels like:
no class
start of region
middle of region
end of region
Features like:
relative and absolute y-axis difference to each of the surrounding points in a window 11 points wide
difference from the window average
relative difference between the point before and the point after
A rough sketch of this windowed-feature setup follows.
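Here is a rough sketch of that feature/label setup in Python with scikit-learn. The toy series, the RandomForestClassifier choice and the placeholder labels are assumptions for illustration; in practice you would hand-label points as 0 = no class, 1 = start, 2 = middle, 3 = end of a region.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def window_features(series, half_width=5):
    # For each point, collect the differences to its neighbours in an 11-point
    # window, plus the difference from the window average.
    feats = []
    for i in range(half_width, len(series) - half_width):
        window = series[i - half_width:i + half_width + 1]
        rel = window - series[i]   # differences to the centre point
        feats.append(np.concatenate([rel, [series[i] - window.mean()]]))
    return np.array(feats)

series = np.sin(np.linspace(0, 20, 300)) + 0.1 * np.random.randn(300)
X = window_features(series)
y = np.zeros(len(X), dtype=int)   # placeholder labels; replace with your hand labels (0..3)

clf = RandomForestClassifier(n_estimators=100).fit(X, y)
pred = clf.predict(X)             # per-point region labels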

I am using deep learning, if it's an option for you. It's done in Java with Deeplearning4j. I have been experimenting with LSTMs; I tried one hidden layer and two hidden layers to process time series.
return new NeuralNetConfiguration.Builder()
        .seed(HyperParameter.seed)
        .iterations(HyperParameter.nItr)
        .miniBatch(false)
        .learningRate(HyperParameter.learningRate)
        .biasInit(0)
        .weightInit(WeightInit.XAVIER)
        .momentum(HyperParameter.momentum)
        .optimizationAlgo(
                OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT // RMSE: ????
        )
        .regularization(true)
        .updater(Updater.RMSPROP) // NESTEROVS
        // .l2(0.001)
        .list()
        .layer(0,
                new GravesLSTM.Builder().nIn(HyperParameter.numInputs).nOut(HyperParameter.nHNodes_1).activation("tanh").build())
        .layer(1,
                new GravesLSTM.Builder().nIn(HyperParameter.nHNodes_1).nOut(HyperParameter.nHNodes_2).dropOut(HyperParameter.dropOut).activation("tanh").build())
        .layer(2,
                new GravesLSTM.Builder().nIn(HyperParameter.nHNodes_2).nOut(HyperParameter.nHNodes_2).dropOut(HyperParameter.dropOut).activation("tanh").build())
        .layer(3, // "identity" make regression output
                new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE).nIn(HyperParameter.nHNodes_2).nOut(HyperParameter.numOutputs).activation("identity").build()) // "identity"
        .backpropType(BackpropType.TruncatedBPTT)
        .tBPTTBackwardLength(100)
        .pretrain(false)
        .backprop(true)
        .build();
I found a few things:
LSTM or RNN is very good at picking out patterns in time series.
I tried it on one time series and on a group of different time series; patterns were picked out easily.
It also picks out patterns at more than one cadence: if there are patterns by week and by month, both will be learned by the net.

Related

Feedback in NaiveBayes Text Classification

I am a newbie in machine learning. I am building a complaint categorizer and I want to provide a feedback model so that it can improve over time.
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

value = [
    'drought',
    'robber',
]
targets = [
    'water_department',
    'police_department',
]

classifier = MultinomialNB()
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(value)

classifier.partial_fit(counts[:1], targets[:1], classes=numpy.unique(targets))
for c, t in zip(counts[1:], targets[1:]):
    classifier.partial_fit(c, t.split())

value.append('dogs')                  # new value to train
targets.append('animal_department')   # new target

vectorize = CountVectorizer()
counts = vectorize.fit_transform(value)

print(counts)
print(targets)
print(vectorize.vocabulary_)

#### problem lies here
classifier.partial_fit(counts["""dont know the index of new value"""], targets[-1])
#### problem lies here
Even if I somehow find the index of the newly inserted value, it gives the error
ValueError: Number of features 3 does not match previous data 2.
even though I made it insert one value at a time.
I will try to answer the question from a general point of view. There are two sources of problems in the Naive Bayes (NB) approach described here:
Out-of-vocabulary (OOV) problem
Incremental training of NB
OOV problem: The simplest way to tackle the OOV problem is to decompose every word into character 3-grams. How many such 3-grams are possible? Assuming lower-casing, there are only 26 possible ways to fill each place, and hence the total number of possible character 3-grams is 26^3 = 17576, which is significantly lower than the number of possible English words that you're likely to see in text.
Hence, generally speaking, while training NB a good idea is to use probabilities of character n-grams (n = 3, 4, 5). This will drastically reduce the OOV problem.
Incremental training: For incremental training, given a new sentence, decompose it into terms (character n-grams). Update the count of each term for its corresponding observed class label. For example, if count(t,c) denotes how many times the term t was observed in class c, simply update the count if you see t in class 0 (or class 1) during incremental training. Updating the counts will update the maximum likelihood probability estimates as well.
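As a minimal sketch of both ideas (character 3-gram features plus incremental count updates), here is some illustrative Python; the class names, smoothing constant and vocabulary size are assumptions for the example, not part of the question:
from collections import defaultdict
import math

count = defaultdict(lambda: defaultdict(int))   # count[c][t] = times term t was seen in class c
class_totals = defaultdict(int)                 # total term count per class

def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def incremental_train(sentence, label):
    # Update the per-class term counts with one new labelled sentence.
    for term in char_ngrams(sentence):
        count[label][term] += 1
        class_totals[label] += 1

def log_likelihood(sentence, label, vocab_size=26 ** 3, alpha=1.0):
    # Laplace-smoothed log P(sentence | label) from the current counts.
    total = class_totals[label] + alpha * vocab_size
    return sum(math.log((count[label][t] + alpha) / total) for t in char_ngrams(sentence))

incremental_train('drought', 'water_department')
incremental_train('robber', 'police_department')
incremental_train('dogs', 'animal_department')   # a new class needs no retraining from scratch
print(log_likelihood('robbery', 'police_department'))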

Best strategy to reduce false positives: Google's new Object Detection API on Satellite Imagery

I'm setting up the new Tensorflow Object Detection API to find small objects in large areas of satellite imagery. It works quite well - it finds all 10 objects I want, but I also get 50-100 false positives [things that look a little like the target object, but aren't].
I'm using the sample config from the 'pets' tutorial, to fine-tune the faster_rcnn_resnet101_coco model they offer. I've started small, with only 100 training examples of my objects (just 1 class). 50 examples in my validation set. Each example is a 200x200 pixel image with a labeled object (~40x40) in the center. I train until my precision & loss curves plateau.
I'm relatively new to using deep learning for object detection. What is the best strategy to increase my precision? e.g. hard-negative mining? Increasing my training dataset size? I've yet to try the most accurate model they offer, faster_rcnn_inception_resnet_v2_atrous_coco, as I'd like to maintain some speed, but I will do so if needed.
Hard-negative mining seems to be a logical step. If you agree, how do I implement it w.r.t setting up the tfrecord file for my training dataset? Let's say I make 200x200 images for each of the 50-100 false positives:
Do I create 'annotation' xml files for each, with no 'object' element?
...or do I label these hard negatives as a second class?
If I then have 100 negatives to 100 positives in my training set - is that a healthy ratio? How many negatives can I include?
I've revisited this topic recently in my work and thought I'd update with my current learnings for any who visit in the future.
The topic appeared on Tensorflow's Models repo issue tracker. SSD allows you to set the ratio of how many negative:positive examples to mine (max_negatives_per_positive: 3), but you can also set a minimum number for images with no positives (min_negatives_per_image: 3). Both of these are defined in the model-ssd-loss config section.
That said, I don't see the same option in Faster-RCNN's model configuration. It's mentioned in the issue that models/research/object_detection/core/balanced_positive_negative_sampler.py contains the code used for Faster-RCNN.
One other option discussed in the issue is creating a second class specifically for lookalikes. During training, the model will attempt to learn class differences which should help serve your purpose.
Lastly, I came across this article on Filter Amplifier Networks (FAN) that may be informative for your work on aerial imagery.
===================================================================
The following paper describes hard negative mining for the same purpose you describe:
Training Region-based Object Detectors with Online Hard Example Mining
In section 3.1 they describe using a foreground and background class:
Background RoIs. A region is labeled background (bg) if its maximum IoU with ground truth is in the interval [bg_lo, 0.5). A lower threshold of bg_lo = 0.1 is used by both FRCN and SPPnet, and is hypothesized in [14] to crudely approximate hard negative mining; the assumption is that regions with some overlap with the ground truth are more likely to be the confusing or hard ones. We show in Section 5.4 that although this heuristic helps convergence and detection accuracy, it is suboptimal because it ignores some infrequent, but important, difficult background regions. Our method removes the bg_lo threshold.
In fact this paper is referenced and its ideas are used in Tensorflow's object detection losses.py code for hard mining:
class HardExampleMiner(object):
  """Hard example mining for regions in a list of images.

  Implements hard example mining to select a subset of regions to be
  back-propagated. For each image, selects the regions with highest losses,
  subject to the condition that a newly selected region cannot have
  an IOU > iou_threshold with any of the previously selected regions.
  This can be achieved by re-using a greedy non-maximum suppression algorithm.
  A constraint on the number of negatives mined per positive region can also be
  enforced.

  Reference papers: "Training Region-based Object Detectors with Online
  Hard Example Mining" (CVPR 2016) by Srivastava et al., and
  "SSD: Single Shot MultiBox Detector" (ECCV 2016) by Liu et al.
  """
Based on your model config file, the HardMinerObject is returned by losses_builder.py in this bit of code:
def build_hard_example_miner(config,
                             classification_weight,
                             localization_weight):
  """Builds hard example miner based on the config.

  Args:
    config: A losses_pb2.HardExampleMiner object.
    classification_weight: Classification loss weight.
    localization_weight: Localization loss weight.

  Returns:
    Hard example miner.
  """
  loss_type = None
  if config.loss_type == losses_pb2.HardExampleMiner.BOTH:
    loss_type = 'both'
  if config.loss_type == losses_pb2.HardExampleMiner.CLASSIFICATION:
    loss_type = 'cls'
  if config.loss_type == losses_pb2.HardExampleMiner.LOCALIZATION:
    loss_type = 'loc'

  max_negatives_per_positive = None
  num_hard_examples = None
  if config.max_negatives_per_positive > 0:
    max_negatives_per_positive = config.max_negatives_per_positive
  if config.num_hard_examples > 0:
    num_hard_examples = config.num_hard_examples
  hard_example_miner = losses.HardExampleMiner(
      num_hard_examples=num_hard_examples,
      iou_threshold=config.iou_threshold,
      loss_type=loss_type,
      cls_loss_weight=classification_weight,
      loc_loss_weight=localization_weight,
      max_negatives_per_positive=max_negatives_per_positive,
      min_negatives_per_image=config.min_negatives_per_image)
  return hard_example_miner
which is returned by model_builder.py and called by train.py. So basically, it seems to me that simply generating your true-positive labels (with a tool like LabelImg or RectLabel) should be enough for the training algorithm to find hard negatives within the same images. The related question gives an excellent walkthrough.
In the event you want to feed in data that has no true positives (i.e. nothing should be classified in the image), just add the negative image to your tfrecord with no bounding boxes.
I think I went through the same or a very similar scenario, so it's worth sharing.
I managed to solve it by passing images without annotations to the trainer.
In my scenario I'm building a project to detect assembly failures in my client's products, in real time.
I achieved very robust results (for a production environment) by using detection+classification for components that have an explicit negative pattern (e.g. a screw that is either screwed in or not, i.e. just the hole) and detection only for things that don't have a negative pattern (e.g. a tape that can be placed anywhere).
In the system the user must record two videos, one containing the positive scenario and another containing the negative one (or n videos, containing n positive and negative patterns, so the algorithm can generalize).
After some testing I found that if I registered only tape to be detected, the detector gave very confident (0.999) false positive detections of tape: it was learning the pattern of where the tape was inserted instead of the tape itself. When I had another component (like a screw in its negative form) I was passing the negative pattern of tape without being explicitly aware of it, so the false positives didn't happen.
So I found that, in this scenario, I necessarily had to pass images without tape so the model could differentiate between tape and no-tape.
I considered two alternatives to try to solve this behavior:
Training with a considerable number of images without any annotations (10% of all my negative samples) along with all images that have real annotations.
Creating a dummy annotation with a dummy label on the images without annotations so I could force the detector to train with those images (thus learning the no-tape pattern), and later simply ignoring the dummy predictions.
I concluded that both alternatives worked perfectly in my scenario.
The training loss got a little messy, but the predictions are robust for my very controlled scenario (the system's camera has its own box and illumination to reduce variables).
I had to make two little modifications for the first alternative to work:
For all images that didn't have any annotation I passed a dummy annotation (class=None, xmin/ymin/xmax/ymax=-1).
When generating the tfrecord files I use this information (xmin == -1, in this case) to add an empty list for the sample:
def create_tf_example(group, path, label_map):
    with tf.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'

    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    for index, row in group.object.iterrows():
        if not pd.isnull(row.xmin):
            if not row.xmin == -1:
                xmins.append(row['xmin'] / width)
                xmaxs.append(row['xmax'] / width)
                ymins.append(row['ymin'] / height)
                ymaxs.append(row['ymax'] / height)
                classes_text.append(row['class'].encode('utf8'))
                classes.append(label_map[row['class']])

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example
Part of the training progress:
Currently I'm using tensorflow object detection along with tensorflow==1.15, using faster_rcnn_resnet101_coco.config.
I hope this solves someone's problem, as I didn't find any solution on the internet. I have read a lot of people saying that faster_rcnn is not suited to negative training for false-positive reduction, but my tests proved the opposite.

Does the Izhikevich neuron model use weights?

I've been working a bit with neural networks and I'm interested on implementing a spiking neuron model.
I've read a fair amount of tutorials, but most of them seem to be about generating pulses, and I haven't found any application of it to a given input train.
Say, for example, I have the input train:
Input[0] = [0,0,0,1,0,0,1,1]
When it enters the Izhikevich neuron, does the input get multiplied by a weight, or does the model only make use of the parameters a, b, c and d?
Izhikevich equations are:
v[n+1] = 0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I
u[n+1] = a*(b*v[n] - u[n])
where v[n] is input voltage and u[n] is a general recovery variable.
Are there any texts on implementations of Izhikevich or similar spiking neuron models on a practical problem? I'm trying to understand how information is encoded in these models, but it looks different from what's done with standard second-generation neurons. The only tutorial I've found that deals with a spike train and a set of weights is [1], but I haven't seen the same for Izhikevich.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model, by itself, does not include weights.
The two equations you mentioned model the membrane potential (v[]) of a point neuron over time. To use weights, you could connect two or more such cells with synapses.
Each synapse could include some sort of spike detection mechanism on the source (pre-synaptic) cell, and a synaptic current mechanism on the target (post-synaptic) cell side. That synaptic current could then be multiplied by a weight term, and then become part of the I term (in the 1st equation above) for the target cell.
As a very simple example of a two cell network, at every time step, you could check if pre- cell v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post- cell. weightPrePost would range from 0 to 1, and could be modified in response to things like firing rate, or Hebbian-like spike synchrony like in STDP.
With multiple synaptic currents going into a cell, you could devise various schemes how to sum them. The simplest one would be just a simple sum, more complicated ones could include things like distance and dendrite diameters (e.g. simulated neural morphology).
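A minimal Python sketch of such a two-cell network follows. It uses the spike reset itself as the detection event rather than a raw 0 mV threshold, and the regular-spiking parameters, the injected currents and the weight value are illustrative assumptions, not part of the answer above:
a, b, c, d = 0.02, 0.2, -65.0, 8.0    # regular-spiking Izhikevich parameters
dt = 1.0                              # time step (ms)
weight_pre_post = 0.5                 # synaptic weight, 0..1

def step(v, u, I):
    # One Euler step of the Izhikevich equations, with the standard spike reset.
    v_new = v + dt * (0.04 * v * v + 5 * v + 140 - u + I)
    u_new = u + dt * a * (b * v - u)
    spiked = v_new >= 30.0
    if spiked:                        # spike: reset v and bump the recovery variable
        v_new, u_new = c, u_new + d
    return v_new, u_new, spiked

v_pre, u_pre = c, b * c
v_post, u_post = c, b * c
pre_spiked = False
trace_post = []

for t in range(1000):
    i_pre = 10.0                                              # constant drive into the pre cell
    i_post = 15.0 * weight_pre_post if pre_spiked else 0.0    # weighted synaptic current
    v_pre, u_pre, pre_spiked = step(v_pre, u_pre, i_pre)
    v_post, u_post, _ = step(v_post, u_post, i_post)
    trace_post.append(v_post)

print(max(trace_post))   # the peak post-synaptic depolarization grows with weight_pre_post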
This chapter is a nice introduction to other ways to model synapses: Modelling Synaptic Transmission.

How can I set the parameters cost and gamma in libsvm (MATLAB) to improve accuracy?

I use libsvm to classify a database that contains 1000 labels. I am new to libsvm and I have a problem choosing the parameters c and g to improve performance. First, here is the program that I use to set the parameters:
bestcv = 0;
for log2c = -1:3,
    for log2g = -4:1,
        cmd = ['-v 5 -c ', num2str(2^log2c), ' -g ', num2str(2^log2g)];
        cv = svmtrain(yapp, xapp, cmd);
        if (cv >= bestcv),
            bestcv = cv; bestc = 2^log2c; bestg = 2^log2g;
        end
        fprintf('%g %g %g (best c=%g, g=%g, rate=%g)\n', log2c, log2g, cv, bestc, bestg, bestcv);
    end
end
As a result, this program gives c = 8 and g = 2, and when I use these values of c and g I get an accuracy rate of 55%. For classification, I use SVM one-against-all:
numLabels = max(yapp);
numTest = size(ytest,1);

%# train one-against-all models
model = cell(numLabels,1);
for k=1:numLabels
    model{k} = svmtrain(double(yapp==k), xapp, ' -c 1000 -g 10 -b 1 ');
end

%# get probability estimates of test instances using each model
prob_black = zeros(numTest,numLabels);
for k=1:numLabels
    [~,~,p] = svmpredict(double(ytest==k), xtest, model{k}, '-b 1');
    prob_black(:,k) = p(:,model{k}.Label==1); %# probability of class==k
end

%# predict the class with the highest probability
[~,pred_black] = max(prob_black,[],2);
acc = sum(pred_black == ytest) ./ numel(ytest) %# accuracy
The problem is that I need to change these parameters to increase performance. For example, when I randomly set c = 10000 and g = 100, I got a better accuracy rate: 70%.
How can I set these parameters (c and g) so as to find the optimum accuracy rate? Thank you in advance.
Hyperparameter tuning is a nontrivial problem in machine learning. The simplest approach is what you've already implemented: define a grid of values, and compute the model on the grid until you find some optimal combination. A key assumption is that the grid itself is a good approximation of the surface: that it's fine enough to not miss anything important, but not so fine that you waste time computing values that are essentially the same as neighboring values. I'm not aware of any method to, in general, know ahead of time how fine a grid is necessary. As an illustration: imagine that the global optimum is at $(5,5)$ and the function is basically flat elsewhere. If your grid is $(0,0),(0,10),(10,0),(10,10)$, you'll miss the optimum completely. Likewise, if the grid is $(0,0), (-10,-10),(-10,0),(0,-10)$, you'll never be anywhere near the optimum. In both cases, you have no hope of finding the optimum itself.
Some rules of thumb exist for SVM with RBF kernels, though: a grid of $\gamma\in\{2^{-15},2^{-14},...,2^5\}$ and $C \in \{2^{-5}, 2^{-4},...,2^{15}\}$ is one such recommendation.
If you found a better solution outside of the range of grid values that you tested, this suggests you should define a larger grid. But larger grids take more time to evaluate, so you'll either have to commit to waiting a while for your results, or move to a more efficient method of exploring the hyperparameter space.
Another alternative is random search: define a "budget" of the number of SVMs that you want to try out, and generate that many random tuples to test. This approach is mostly just useful for benchmarking purposes, since it's entirely unintelligent.
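As a rough sketch of random search over (C, gamma), written in Python for brevity (scikit-learn's SVC wraps libsvm); the budget, the sampling ranges and the digits dataset are illustrative assumptions:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)

budget = 20
best_score, best_params = -np.inf, None
for _ in range(budget):
    # Sample log-uniformly over wide ranges, e.g. C in 2^-5..2^15 and gamma in 2^-15..2^5.
    C = 2.0 ** rng.uniform(-5, 15)
    gamma = 2.0 ** rng.uniform(-15, 5)
    score = cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, (C, gamma)

print(best_params, best_score)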
Both grid search and random search have the advantage of being stupidly easy to implement in parallel.
Better options fall in the domain of global optimization. Marc Claesen et al. have devised the Optunity package, which uses particle swarm optimization. My research focuses on refinements of the Efficient Global Optimization (EGO) algorithm, which builds up a Gaussian process as an approximation of the hyperparameter response surface and uses that to make educated predictions about which hyperparameter tuples are most likely to improve upon the current best estimate.
Imagine that you've evaluated the SVM at some hyperparameter tuple $(\gamma, C)$ and it has some out-of-sample performance metric $y$. An advantage to EGO-inspired methods is that it assumes that the values $y^*$ nearby $(\gamma,C)$ will be "close" to $y$, so we don't necessarily need to spend time exploring those tuples nearby, especially if $y-y_{min}$ is very large (where $y_{min}$ is the smallest $y$ value we've discovered). EGO will identify and evaluate the SVM at points where it estimates there is a high probability of improvement, so it will intelligently move through the hyper-parameter space: in the ideal case, it will skip over regions of low performance in favor of focusing on regions of high performance.

Recommended anomaly detection technique for simple, one-dimensional scenario?

I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.
For example, with the following example data:
a = 10
b = 14
c = 25
d = 467
e = 12
d is clearly an anomaly, and I would want to perform a specific action based on this.
I was tempted to just try and use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.
Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.
Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.
The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than on analysis; if this is a bad thing to assume, please let me know. In terms of clustering, unless there are also standard algorithms to choose a k-value, I would find it hard to provide this value to a k-means algorithm.
The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point basically be removed from the data set (I won't get into how they would do that, but it makes sense for my domain), so it will not be used as input to another function.
So far I have tried three-sigma and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, while three-sigma points out instances which better fit my intuition of the domain.
Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.
What is a recommended anomaly detection technique for simple, one-dimensional data?
Check out the three-sigma rule:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
An alternative method is the IQR outlier test:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
This test is the one usually employed by box plots (indicated by the whiskers).
EDIT:
For your case (simple 1D univariate data), I think my first answer is well suited.
That however isn't applicable to multivariate data.
@smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.
A better suited technique is DBSCAN: a density-based clustering algorithm. Basically it grows regions with sufficiently high density into clusters, which will be maximal sets of density-connected points.
DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.
If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
If the number of neighbors is less than minPoints, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.
Finally the set of all points marked as noise are considered outliers.
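As a small sketch of applying this to the 1-D example data via scikit-learn (the eps and min_samples values are illustrative assumptions):
import numpy as np
from sklearn.cluster import DBSCAN

x = np.array([10, 14, 25, 467, 12]).reshape(-1, 1)   # 1-D data as a column vector

labels = DBSCAN(eps=20, min_samples=2).fit_predict(x)
outliers = x[labels == -1].ravel()   # DBSCAN marks noise points with label -1
print(outliers)                      # [467]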
There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-Means. This would allow you to identify whether there are more than one related sets of data, such as a bimodal distribution. This does require you having some knowledge of how many clusters to expect but is fairly efficient and easy to implement.
After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want, but I would recommend the suggestions by @Amro as a good starting point.
For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.
This is an old topic, but it still lacks some information.
Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:
Detection of outliers with the mean and sigma has the obvious disadvantage that the mean and sigma themselves depend on the outliers.
The case of the small-sample limit (see the question for an example) is not adequately covered by 3-sigma, K-means, IQR, etc.
And I could go on... However, the statistical literature offers a simple metric: the median absolute deviation (MAD). (Medians are insensitive to outliers.)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing
I think this problem can be solved in a few lines of Python code, like this:
import numpy as np
import scipy.stats as sts
x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion
Subsequently you reject values above a certain threshold (the 97.5th percentile of the distribution of the data); for an assumed normal distribution the threshold is 2.24. Here it translates to:
array([ 0.6745 , 0. , 1.854875, 76.387125, 0.33725 ])
i.e. the 467 entry is rejected.
Of course, one could argue that the MAD (as presented) also assumes a normal distribution. So why doesn't argument 2 above (the small sample) apply here? The answer is that MAD has a very high breakdown point. It is easy to choose different threshold points for different distributions and come to the same conclusion: 467 is the outlier.
Both the three-sigma rule and the IQR test are often used, and they give a couple of simple algorithms to detect anomalies.
The three-sigma rule is correct:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
The IQR test should be:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (x > Q75 + 1.5*IQR) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (x > Q75 + 3.0*IQR) THEN x is an extreme outlier
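As a quick numeric check of both rules on the example data from the question (plain NumPy, nothing beyond the two formulas above):
import numpy as np

x = np.array([10, 14, 25, 467, 12])

# Three-sigma rule
mu, std = x.mean(), x.std()
three_sigma_outliers = x[np.abs(x - mu) > 3 * std]

# IQR test
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
mild = x[(x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)]
extreme = x[(x < q25 - 3.0 * iqr) | (x > q75 + 3.0 * iqr)]

# On this tiny sample the outlier drags the mean and std along with it, so three-sigma
# flags nothing, while the IQR test flags 467 as both a mild and an extreme outlier.
print(three_sigma_outliers, mild, extreme)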
