What is a proper Learning Technique for the given data sample - machine-learning

I am working in matlab.
I have data samples of two unrelated variables at 256 time steps. Their plots, with the value on the Y-axis and the time step on the X-axis, are shown below.
Typical Plot for the first variable say Pos is
Typical Plot for the second variable say Vel is
Now I need to predict the values of these variables at the next 10 time steps. To compare various machine learning techniques for this, I took the values of the variables at the first 246 time steps, predicted the next 10 time steps, and then compared the predictions with the actual values by calculating the mean squared error, say ms_error.
I have done this using time-series modelling (NAR), linear regression, fuzzy input systems, and neural networks, but none of these gives an ms_error lower than 2.
Can someone suggest a learning algorithm for predicting future values of data samples like these two?
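For reference, the evaluation protocol described above (fit on the first 246 steps, forecast 10, compute ms_error) can be sketched in Python with NumPy as follows. This is only a sketch under stated assumptions: the sine wave is a synthetic stand-in for the real Pos series, and the least-squares autoregressive model with its helper names (fit_ar, forecast_ar) is just one simple baseline, not one of the methods listed above.

```python
import numpy as np

def fit_ar(series, order):
    """Least-squares fit of x[t] = c + a1*x[t-1] + ... + ap*x[t-p]."""
    rows = np.array([series[t - order:t][::-1] for t in range(order, len(series))])
    X = np.column_stack([np.ones(len(rows)), rows])
    coef, *_ = np.linalg.lstsq(X, series[order:], rcond=None)
    return coef

def forecast_ar(history, coef, steps):
    """Roll the fitted model forward, feeding predictions back in as inputs."""
    order = len(coef) - 1
    recent = list(history[-order:][::-1])  # most recent value first
    preds = []
    for _ in range(steps):
        nxt = coef[0] + float(np.dot(coef[1:], recent))
        preds.append(nxt)
        recent = [nxt] + recent[:-1]
    return np.array(preds)

# Synthetic stand-in for the 256-step Pos series (assumption, not the real data).
t = np.arange(256)
pos = np.sin(2 * np.pi * t / 32.0)

train, test = pos[:246], pos[246:]
coef = fit_ar(train, order=2)
preds = forecast_ar(train, coef, steps=10)
ms_error = float(np.mean((preds - test) ** 2))

# Naive baseline: repeat the last observed value for all 10 steps.
naive_error = float(np.mean((train[-1] - test) ** 2))
print(ms_error, naive_error)
```

Comparing every candidate method against the naive baseline on the same 10 held-out steps shows whether an ms_error of 2 is actually poor for the scale of the data.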

You could try symbolic regression via Genetic Programming.
Genetic Programming makes no assumption about the structure of the function that fits your data points, so it's well suited to this sort of discovery task.
Symbolic regression was one of the earliest applications of GP and continues to be widely studied.
There are many ready-to-use environments for every major programming language and many tutorials on the subject e.g.
C++: Beagle
Java: ECJ
Matlab: GPTIPS - GPLAB
Python: DEAP
(I don't mean these are the best, just well known. Of course Google search could bring up other software that is more suitable for your needs).

Related

Gaussian Process Regression use case

While reading the paper "Tactile-based active object discrimination and target object search in an unknown workspace", there is something that I just cannot understand:
The paper is about finding an object's position and other properties using only tactile information. In section 4.1.2, the author says that he uses GPR to guide the exploratory process, and in section 4.1.4 he describes how he trained his GPR:
Using the example from section 4.1.2, the input is (x,z) and the output is y.
Whenever there is a contact, the corresponding y-value is stored.
This procedure is repeated several times.
This trained GPR is used to estimate the next exploration point, which is the point where the variance is at its maximum.
You can also see a demonstration at the following link: https://www.youtube.com/watch?v=ZiLq3i-BJcA&t=177s . In the first part of the video (0:24-0:29), the initialization takes place, where the robot samples 4 times. Then, in the next 25 seconds, the robot explores from the corresponding direction. I do not understand how this tiny initialization of the GPR can guide the exploratory process. Could someone please explain how the input points (x,z) from the first exploring part could be estimated?
Any regression algorithm simply maps the input (x,z) to an output y in some way unique to the specific algorithm. For a new input (x0,z0), the algorithm will likely predict something very close to the true output y0 if many data points similar to it were included in the training. If training data was only available in a vastly different region, the predictions will likely be very bad.
GPR includes a measure of confidence of the predictions, namely the variance. The variance will naturally be very high in regions where no training data has been seen before and low very close to already seen data points. If the 'experiment' takes much longer than evaluating the Gaussian Process, you can use the Gaussian Process fit to make sure you sample regions where you are very uncertain of your answer.
If the goal is to fully explore the entire input space, you could draw a lot of random values of (x,z) and evaluate the variance at these values. Then you could perform the costly experiment at the input point where you are most uncertain in y. Then you can retrain the GPR with all the explored data so far and repeat the process.
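The loop described above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the RBF kernel, its length scale, the unit-square workspace, the 4 starting contacts and the 500 random candidates are all assumptions chosen for illustration. Note that for fixed kernel hyperparameters the predictive variance of a GP depends only on where you have sampled, not on the measured y values, which is why even a handful of initial contacts is enough to pick the next exploration point.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3):
    """Squared-exponential kernel between two sets of (x, z) points."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def posterior_variance(X_train, X_cand, noise=1e-6):
    """GP predictive variance at X_cand given samples taken at X_train."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf_kernel(X_cand, X_train)
    solved = np.linalg.solve(K, Ks.T)
    # The prior variance of this kernel is 1 at every point.
    return 1.0 - np.einsum("ij,ji->i", Ks, solved)

rng = np.random.default_rng(0)
# Four initial contact locations (hypothetical values).
X_explored = np.array([[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.9]])
# Random candidate (x, z) points covering the workspace.
X_cand = rng.uniform(0.0, 1.0, size=(500, 2))

queries = []
for _ in range(5):
    var = posterior_variance(X_explored, X_cand)
    q = X_cand[np.argmax(var)]  # most uncertain candidate
    queries.append(q)
    # Here the costly experiment would run at q and the measured y be stored.
    X_explored = np.vstack([X_explored, q])

print(np.round(np.array(queries), 2))
```

The measured y values are needed for the predicted mean, and for re-fitting hyperparameters if you do that between iterations; the selection step itself only uses the variance.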
For optimization problems (Not the OP's question)
If you wish to find the lowest value of y across the input space, you are not interested in running the experiment in regions that you know give high values of y, even if you are uncertain of exactly how high those values will be. So instead of choosing the (x,z) point with the highest variance, you might choose the point that minimizes the predicted value of y minus one standard deviation. Optimizing this way is named Bayesian Optimization, and this specific scheme is a confidence-bound rule: a Lower Confidence Bound when minimizing, and an Upper Confidence Bound (UCB), the predicted value plus one standard deviation, when maximizing. Expected Improvement (EI), the expected amount by which a point improves on the previously best score, is also commonly used.
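To make the difference between pure exploration and these acquisition rules concrete, here is a small sketch for a minimization problem. The candidate names, means, standard deviations and best_y are made-up numbers, and the function names are mine; the formulas are the usual confidence bound (mean minus a multiple of the standard deviation when minimizing) and the closed-form Expected Improvement for a Gaussian prediction.

```python
import math

def lower_confidence_bound(mu, sigma, kappa=1.0):
    """Optimistic estimate of y when minimizing: mean minus kappa std devs."""
    return mu - kappa * sigma

def expected_improvement(mu, sigma, best_y):
    """Expected amount by which y falls below best_y under N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best_y - mu, 0.0)
    z = (best_y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best_y - mu) * cdf + sigma * pdf

# Hypothetical GP predictions (mean, std) at four candidate points.
candidates = {
    "A": (1.2, 0.05),  # well known, mediocre
    "B": (1.5, 0.60),
    "C": (1.1, 0.50),
    "D": (3.0, 1.00),  # most uncertain, but clearly bad
}
best_y = 1.0  # lowest value measured so far (assumed)

by_variance = max(candidates, key=lambda k: candidates[k][1])
by_lcb = min(candidates, key=lambda k: lower_confidence_bound(*candidates[k]))
by_ei = max(candidates, key=lambda k: expected_improvement(*candidates[k], best_y))
print(by_variance, by_lcb, by_ei)
```

Picking by variance alone goes to the most uncertain point even though it is predicted to be poor, whereas both acquisition rules prefer a point that is uncertain and plausibly good.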

an algorithm for clustering visually separable clusters

I have visualized a dataset in 2D after employing PCA. The X dimension is time and the Y dimension is the first PCA component. As the figure shows, there is relatively good separation between the points (A, B). Unfortunately, clustering methods (DBSCAN, SMO, KMEANS, Hierarchical) are not able to split these points into 2 clusters. As you can see, section A is relatively continuous; this continuous process ends, section B starts, and the gap between A and B is rather big compared with the gaps in the past data.
I would be very grateful if you could point me to any method or algorithm (or a metric devised from the data and its distribution) that can separate A from B without visualization. Thank you so much.
This is the plot of the 2 PCA components for the above plot (the first one). The other one is the plot of the components of another dataset, for which I also get bad results.
This is a time series, and apparently you are looking for change points or want to segment this time series.
Do not treat this data set as a two dimensional x-y data set, and don't use clustering here; rather choose an algorithm that is actually designed for time series.
As a starter, plot series[x] - series[x-1], i.e. the first difference (a discrete first derivative). You may need to remove seasonality to improve results. No clustering algorithm will do this; they do not have a notion of seasonality or time.
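As a rough illustration of the first-difference idea, the sketch below splits a series at its largest jump. The toy values and the function name largest_jump are assumptions for illustration; real data will usually need smoothing or seasonality removal first, and a proper change point detection method if there is more than one break.

```python
def largest_jump(series):
    """Return (index where the new segment starts, size of the jump)."""
    diffs = [series[i] - series[i - 1] for i in range(1, len(series))]
    idx = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return idx + 1, diffs[idx]

# Toy stand-in for the first PCA component ordered by time.
series = [0.10, 0.12, 0.11, 0.13, 0.12, 0.45, 0.47, 0.46, 0.48]

split, jump = largest_jump(series)
labels = ["A" if i < split else "B" for i in range(len(series))]
print(split, labels)
```

Because the points are processed in time order, the gap between A and B shows up as one large difference, which distance-based clustering on the (time, value) plane does not exploit.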
If PCA gives you a good separation, you can just try to cluster after projecting your data through your PCA eigenvectors. If you don't want to use PCA, you will need an alternative data projection method anyway, because failing clustering methods imply that your data is not separable in the original dimensions. You can take a look at non-linear clustering methods such as kernel-based ones or spectral clustering, for example. Or you can define your own non-Euclidean metric, which is in fact just another data projection method.
But using PCA clearly seems to be the best fit in your case (Occam's razor: use the simplest model that fits your data).
I don't know that you'll have an easy time devising an algorithm to handle this case, which is dangerously (by present capabilities) close to "read my mind" clustering. You have a significant alley where you've marked the division. You have one nearly as good around (1700, +1/3), and an isolate near (1850, 0.45). These will make it hard to convince a general-use algorithm to make exactly one division at the spot you want, although that one is (I think) still the most computationally obvious.
Spectral clustering works well at finding gaps; I'd try that first. You might have to ask it for 3 or 4 clusters to separate the one you want in general. You could also try playing with SVM (good at finding alleys in data), but doing that in an unsupervised context is the tricky part.
No, KMeans is not going to work; it isn't sensitive to density or connectivity.

HMM application in speech recognition

This is my first time posting a question here, so I apologize if the approach is not standard. I understand there are lots of questions out there on this, and I have read tons of theses, questions, articles and tutorials, yet I still seem to have a problem, and it's always best to ask. I am creating a speech recognition application using phoneme-level processing (not isolated words) with continuous HMMs based on Gaussian mixture models, involving the Baum-Welch, forward-backward, and Viterbi algorithms.
I have implemented a very good feature extraction and pre-processing method (MFCC); the feature vectors consist of the MFCC, delta and acceleration coefficients, and that part is working pretty well. However, when it comes to HMMs, I seem to either have a major misunderstanding about how HMMs are supposed to help recognize speech, or I am missing a small point here. I have tried so hard that at this point I can't really tell what's wrong and what's right.
First off, I recorded around 50 words, 6 utterances each, ran them through a compatibility and conversion program that I wrote myself, and then extracted the features so that they can be used for Baum-Welch.
Please tell me where I am making a mistake in this procedure. I will also mention a few doubts I have about it so that you can help me understand this whole subject better.
Here are the steps in my application concerning anything related to training:
Steps for the initial parameters of the HMM model:
1 - assign all observations from each training sample of each model to their corresponding discrete state (in other words, decide which feature vector belongs to which alphabetic state).
2 - use k-means to find the initial continuous emission parameters; clustering is done over all observations of each state, and the number of clusters is 6 (equal to the number of mixtures for the probability density function). The parameters are the sample means, sample covariances and mixture weights for each cluster.
3 - create the initial-state and transition probability matrices for each model and training sample individually (a left-right structure is used in this case): for transitions, 0 for previous states and 1 for up to 1 next state; for the state initials, 1 for the initial state and 0 for the others.
4 - calculate the Gaussian mixture model based probability density function for each state -> its corresponding cluster -> assigned to all the vectors in all the training samples for each model
5 - calculate the initial emission parameters using the pdf and mixture weights for the clusters.
6 - now calculate the gamma variables using the initial parameters (transitions, emissions, initials) in forward-backward and the initial PDFs, using the continuous formula for gamma (gamma = probability of being in a certain state at a certain time for any of the mixtures)
7 - estimate new state initials
8 - estimate new state transitions
9 - estimate new sample means
10 - estimate new sample covariances
11 - estimate new pdfs
12 - estimate new emissions using the new pdfs
Repeat steps 6 to 12 using the newly estimated values on each iteration, use Viterbi to get an overview of how the estimation is going, and when the probability is no longer changing, stop and save.
Now my issues:
First, I don't know whether the entire procedure I have followed is correct, or whether there is a better way to approach this. All I know is that convergence is pretty fast: after 4-5 iterations the values are no longer changing. However, assuming I am right:
It's not possible for me to sit down and pre-assign each feature vector to its state at the beginning, in step 1, and I don't think it's a standard procedure either. Again, I don't even know whether I necessarily have to do it; from all my studies, it was the best method I could find to get rapid convergence.
Second, say Baum-Welch has done a great job of re-estimating and finding local maxima. What raises my doubt about my Baum-Welch implementation is this: how are the estimated parameters later going to help me recognize speech? I assume they are used in Viterbi to find the optimal state sequence for every spoken utterance. If so, then the emission parameters are not known, because if you look closely you will see that the final emission parameters in my algorithm assign each alphabetic state of each model to all the observed signals in that model. Beyond that, no emission parameters can be found if the signal does not exactly match the ones used in re-estimation, so it obviously won't work, and any attempt to match the signals and find emissions will make the whole HMM lose its purpose.
Again, I might have the wrong idea about almost everything here. I would appreciate your help in understanding what I am doing wrong. If ANYTHING is wrong, please let me know. Thank you.
You're attempting to determine the most likely set of phonemes that would have generated the sounds that you're observing - you're not attempting to work out emission parameters, you're working out the most likely set of inputs that would have produced them.
Also, your input corpus is quite small, so it's unsurprising that training converges so quickly. If you're doing this while involved with a university, see if they have access to one of the larger speech corpora commonly used to train this kind of algorithm.
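To illustrate why a new utterance does not have to match the training signals, here is a minimal Viterbi sketch. It is heavily simplified and all of its numbers are assumptions: scalar features instead of MFCC vectors, one Gaussian per state instead of a 6-component mixture, and a hand-written 3-state left-right model. The point is that training only yields the means, variances and transition probabilities; the emission likelihood of any unseen feature value is then computed from those densities at decoding time.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    return math.log(p) if p > 0.0 else NEG_INF

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian, evaluated at an arbitrary (unseen) x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, init, trans, means, variances):
    """Most likely state sequence for obs under a Gaussian-emission HMM."""
    n = len(means)
    score = [safe_log(init[s]) + log_gauss(obs[0], means[s], variances[s])
             for s in range(n)]
    back = []
    for x in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + safe_log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + safe_log(trans[best][s])
                         + log_gauss(x, means[s], variances[s]))
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # follow the back-pointers to the start
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical trained parameters for a 3-state left-right model.
init = [1.0, 0.0, 0.0]
trans = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
means = [0.0, 5.0, 10.0]
variances = [1.0, 1.0, 1.0]

# A new utterance whose feature values never appeared in training.
obs = [0.3, -0.2, 4.6, 5.4, 9.1, 10.2]
path = viterbi(obs, init, trans, means, variances)
print(path)
```

With mixture emissions, log_gauss would be replaced by the log of the weighted sum of the component densities; the decoding logic stays the same.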

What is the appropriate Machine Learning Algorithm for this scenario?

I am working on a Machine Learning problem which looks like this:
Input Variables
Categorical
a
b
c
d
Continuous
e
Output Variables
Discrete(Integers)
v
x
y
Continuous
z
The major issue I am facing is that the output variables are not totally independent of each other, yet no explicit relation can be established between them. That is, there is a dependence, but it is not causal (one value being high doesn't imply that the other will be high too, but the chances of the other being higher improve).
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an Ad to be clicked, it has to first appear on a search, so Click is somewhat dependent on Impression.
Again, for an Ad to be Converted, it has to be first clicked, so again Conversion is somewhat dependent on Click.
So running 4 instances of the problem, each predicting one of the output variables, doesn't make sense to me. In fact, there should be some way to predict all 4 together while taking care of their implicit dependencies.
But as you can see, there won't be a direct relation; in fact, a probability is involved, but it can't be worked out manually.
Plus, the output variables are not categorical but are in fact discrete and continuous.
Any input on how to go about solving this problem? Please also point me to existing implementations and to a toolkit that would let me implement the solution quickly.
Just a random guess - I think this problem can be targeted by Bayesian Networks. What do you think ?
Bayesian Networks will do fine in your case. Your network won't be that huge either so you can live with exact inference algorithms like graph elimination or junction tree. If you decide to use BNs, then you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance you say that given that today is cloudy, there is a 0.8 probability of rain. You define all probability distributions, where the graph shows the causality relations (i.e. if cloud then rain etc.) Then as query you ask to your inference algorithm questions like, given that grass was wet; was it cloudy, was it raining, was the sprinkler on and so on.
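As a rough sketch of what such a query looks like in code, the snippet below does exact inference by brute-force enumeration on the sprinkler network. Only the 0.8 probability of rain given cloudy comes from the text above; the remaining table entries are the values commonly used with this textbook example and should be treated as assumptions, as should the helper names.

```python
from itertools import product

# Conditional probability tables; each entry is P(variable = True | parents).
P_C = 0.5                                   # P(cloudy)
P_S = {True: 0.1, False: 0.5}               # P(sprinkler | cloudy)
P_R = {True: 0.8, False: 0.2}               # P(rain | cloudy)
P_W = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}  # P(wet | sprinkler, rain)

def bern(p_true, value):
    """Probability of a boolean value given P(value is True)."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    """Joint probability, factorized along the graph C -> S, C -> R, (S, R) -> W."""
    return (bern(P_C, c) * bern(P_S[c], s)
            * bern(P_R[c], r) * bern(P_W[(s, r)], w))

def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over all assignments."""
    names = ("c", "s", "r", "w")
    num = den = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        den += p
        if world[query]:
            num += p
    return num / den

p_sprinkler = posterior("s", {"w": True})
p_rain = posterior("r", {"w": True})
print(round(p_sprinkler, 4), round(p_rain, 4))
```

Enumeration is fine for a handful of binary nodes; for larger networks you would use junction tree or a sampling method, as a BN toolbox does.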
To use BNs, one needs a system model that is described in terms of causality relations (a Directed Acyclic Graph) and probability transitions. If you want to learn your system parameters, there are techniques like the EM algorithm. However, learning the graph structure is a really hard task, and supervised machine learning approaches will do better in that case.

Ordinal classification packages and algorithms

I'm attempting to make a classifier that chooses a rating (1-5) for a item i. For each item i, I have a vector x containing about 40 different quantities pertaining to i. I also have a gold standard rating for each item. Based on some function of x, I want to train a classifier to give me a rating 1-5 that closely matches the gold standard.
Most of the information I've seen on classifiers deal with just binary decisions, while I have a rating decision. Are there common techniques or code libraries out there to deal with this sort of problem?
I agree with you that ML problems in which the response variable is on an ordinal scale require special handling: 'machine-mode' (i.e., returning a class label) seems insufficient because the class labels ignore the relationship among the labels ("1st, 2nd, 3rd"); likewise, 'regression-mode' (i.e., treating the ordinal labels as floats, {1, 2, 3}) seems insufficient because it assumes a metric distance between the response values that need not hold (e.g., 3 - 2 != 1).
R has (at least) several packages directed at ordinal regression. One of these is actually called Ordinal, but I haven't used it. I have used the Design Package in R for ordinal regression and I can certainly recommend it. Design contains a complete set of functions for solution, diagnostics, testing, and results presentation of ordinal regression problems via the Ordinal Logistic Model. (Both packages are available from CRAN.) A step-by-step solution of an ordinal regression problem using the Design Package is presented on the UCLA Stats Site.
Also, i recently looked at a paper by a group at Yahoo working on ordinal classification using Support Vector Machines. I have not attempted to apply their technique.
Have you tried using Weka? It supports binary, numerical, and nominal attributes out of the box, the latter two of which might work well enough for your purposes.
Furthermore, it looks like one of the classifiers that's available is a meta-classifier called OrdinalClassClassifier.java, which is the result of this research:
Eibe Frank and Mark Hall, A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning, 2001, pp. 145-156.
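As I understand the Frank and Hall approach, a K-class ordinal problem is turned into K-1 binary problems ("is the rating greater than k?"), and the binary probability estimates are then recombined into class probabilities. The sketch below shows only those two bookkeeping steps; the function names and the example probabilities are made up, and any binary classifier that outputs probabilities could sit in between.

```python
def binary_targets(labels, n_classes):
    """One 0/1 target list per threshold k = 1..K-1: is the rating > k?"""
    return [[1 if y > k else 0 for y in labels] for k in range(1, n_classes)]

def class_probabilities(p_greater):
    """Combine estimates p_greater[k-1] = P(rating > k) into P(rating = k)."""
    probs = [1.0 - p_greater[0]]            # P(rating = 1)
    for lo, hi in zip(p_greater, p_greater[1:]):
        probs.append(lo - hi)               # P(rating = k) = P(> k-1) - P(> k)
    probs.append(p_greater[-1])             # P(rating = K)
    # Caveat: independently trained classifiers can give inconsistent
    # estimates, in which case a difference may come out negative.
    return probs

# Training side: targets for the four binary problems of a 1-5 rating.
targets = binary_targets([1, 3, 5, 2], n_classes=5)

# Prediction side: hypothetical outputs of the four binary classifiers
# for one item, i.e. P(rating > 1), ..., P(rating > 4).
p_greater = [0.95, 0.80, 0.35, 0.10]
probs = class_probabilities(p_greater)
rating = 1 + max(range(len(probs)), key=lambda i: probs[i])
print(targets, probs, rating)
```

Unlike plain one-vs-rest classification, each binary problem here uses the ordering of the ratings, which is the information a nominal classifier throws away.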
If you don't need a pre-made approach, then these references (in addition to doug's note about the Yahoo SVM paper) might be useful:
W Chu and Z Ghahramani, Gaussian processes for ordinal regression. Journal of Machine Learning Research, 2006.
Wei Chu and S. Sathiya Keerthi, New approaches to support vector ordinal regression. In Proceedings of the 22nd international conference on Machine Learning, 2005, 145-152.
The problems that doug has raised are all valid. Let me add another one. You didn't say how you would like to measure the agreement between the classification and the "gold standard". You have to formulate the answer to that question as soon as possible, as this will have a huge impact on your next step. In my experience, the most problematic part of any (ok, not any, most) optimization task is the score function. Try asking yourself whether all errors are equal. Does misclassifying a "3" as a "4" have the same impact as classifying a "4" as a "3"? What about "1" vs "5"? Can mistakenly missing one case have disastrous consequences (missing an HIV diagnosis, activating pilot ejection in a plane)?
The simplest way to measure the agreement between categorical classifiers is Cohen's Kappa. More complicated methods are described in the following links here, here, here, and here
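For reference, Cohen's Kappa is the observed agreement corrected for the agreement expected by chance. Here is a minimal sketch with made-up ratings (the function name is mine). Note that plain kappa treats every disagreement the same, so confusing "1" with "5" costs no more than confusing "3" with "4"; weighted variants of kappa address that.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two equally long label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two marginal frequencies per label.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

gold = [1, 2, 3, 3, 2, 1]       # hypothetical gold standard ratings
predicted = [1, 2, 3, 2, 2, 1]  # hypothetical classifier output

kappa = cohens_kappa(gold, predicted)
print(round(kappa, 3))
```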
Having said that, sometimes picking a solution that "just works" instead of "the right one" is faster and easier. If I were you, I would pick a machine learning library (R, Weka, I personally love Orange) and see what I get. Only if you don't get reasonably good results with that should you look for more complex solutions.
If you are not interested in fancy statistics, a one-hidden-layer back-propagation neural network with 3 or 5 output nodes will probably do the trick if the training data is sufficiently large. Most NN classifiers try to minimize the mean squared error, which is not always desired. Support Vector Machines, mentioned earlier, are a good alternative.
FANN is a good library for back propagation NNs, it also has some tools to assist in training of the network.
There are two packages in R that might help taming ordinal data
ordinalForest on CRAN
rpartScore on CRAN
I'm working on an OrdinalClassifier that is based on the sklearn framework (specifically the OVR multiclass classifier) and which works well with sklearn workflow such as pipelines, cross validation, and scoring.
Through testing, I'm finding that it performs very well vs. standard non-ordinal multiclass classification using SVC. And it gives much greater control over optimizing for precision and recall on the positive class (in my testing, I used sklearn's diabetes dataset and transformed the disease progression target (y) into a low, medium, high class label). Testing via cross validation is on my repo along with attribution. Scoring is based on weighted f1.
https://github.com/leeprevost/OrdinalClassifier

Resources