Feature Hashing / Avalanche Effect - machine-learning

I’ve been reading a bit about feature hashing for dimensionality reduction. I understand that it’s important to use a hash function that has a uniform output distribution (the chance of an input being mapped to a specific value is the same as for every other value in the range), as well as an avalanche/cascade effect (a small change in input produces a big change in output). These properties will ensure that collisions between features are independent of their frequency. However, I’m still unclear on how the avalanche effect (specifically) impacts this. Could anyone explain why/how it matters here? What constitutes a ‘big change’ in output?
References:
http://blog.someben.com/2013/01/hashing-lang/
http://metaoptimize.com/qa/questions/6943/what-is-the-hashing-trick#6945

The idea is that if you have a tight cluster of input data, you still want the hashing function to splatter the outputs all over the map. The effect is that a collision will be a uniformly random event, as opposed to that tight cluster giving you a spate of collisions -- or a spate of collisions with the mappings of another tight cluster.
"Big change" suggests that your hashing function, h, should show that h(a) - h(b) is stochastically independent of (a-b).
Is that enough? Follow up if you need more explanation.

The avalanche effect ensures that a tiny change in the input (e.g. words: cloud vs clouds) will produce a big change in the output, that is, that close input values will produce distant and unpredictable output values.
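To make this concrete, here is a minimal sketch in Python (MD5 from the standard library is used purely as a stand-in with good avalanche behavior; real feature hashers typically use a fast non-cryptographic hash such as MurmurHash):

import hashlib

def bucket(token, num_buckets=2**20):
    # Map a token to a feature index in [0, num_buckets).
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % num_buckets

for word in ("cloud", "clouds"):
    print(word, "->", bucket(word))

The one-character difference between the inputs flips roughly half of the output bits, so the two bucket indices are unrelated; collisions then look like uniformly random events rather than clusters.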

Related

Machine learning algorithm to predict/find/converge to correct parameters in mathematical model

I am currently trying to find a machine learning algorithm that can predict about 5-15 parameters used in a mathematical model (MM). The MM has 4 different ordinary differential equations (ODEs), and a few more will be added, so more parameters will be needed. Most of the parameters can be measured, but others need to be guessed. We know all 15 parameters, but we want the computer to guess 5 or even 10 of them. To test whether guessed parameters are correct, we plug them into the MM and solve the ODEs with a numerical method. Subsequently we calculate the error between the model evaluated with the parameters we know (and want to guess) and the model evaluated with the guessed parameters. The model's ODEs are evaluated repeatedly: each evaluation represents one minute of real time and we simulate 24 hours, thus 1440 evaluations.
Currently we are using a particle filter to guess the parameters, which works okay, but we want to see if there are any better methods out there for guessing parameters in a model. The particle filter takes a random value for each parameter within a range we know for that parameter, e.g. 0.001-0.01.
If you can run a lot of full simulations (tens of thousands) you can try black-box optimization. I'm not sure if black-box is the right approach for you (I'm not familiar with particle filters). But if it is, CMA-ES is a clear match here and easy to try.
You have to specify a loss function (e.g. the total sum of squared errors for a whole simulation) and an initial guess (mean and sigma) for your parameters. Among black-box algorithms, CMA-ES is a well-established baseline. It is hard to beat if you have only a few (at most a few hundred) continuous parameters and no gradient information. However, anything less black-box-ish that can, for example, exploit the ODE nature of your problem will do better.
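As a rough sketch of what that looks like with the pip-installable cma package (simulate and observed are hypothetical stand-ins for your ODE solver and your measured data; the parameter count, initial guess, and sigma are made up):

import cma
import numpy as np

def loss(params):
    predicted = simulate(params)  # run the 1440-step ODE model (yours)
    return float(np.sum((predicted - observed) ** 2))  # total squared error

x0 = [0.005] * 10   # initial guess for 10 unknown parameters
sigma0 = 0.002      # initial search spread, in the same units as the parameters

es = cma.CMAEvolutionStrategy(x0, sigma0)
while not es.stop():
    candidates = es.ask()  # sample a population of candidate parameter vectors
    es.tell(candidates, [loss(c) for c in candidates])
best_params = es.result.xbest

Each ask/tell round costs one full simulation per candidate, which is why this only pays off if you can afford many runs.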

Accurate general description of Regression versus Classification

So I have the following problem: I realized (while writing my master thesis) that I am still not sure/have vague descriptions of some of the machine learning principles.
For instance, I vaguely remember that at some point I heard the following description:
The output (label) of a classification task is discrete and finite while the output (label) of a regression task is continuous and can be infinite
The one word that I am unsure of is infinite for regression in this description.
For instance, assume that (for whatever reason) you have 2D data points that are roughly distributed like a sine wave (with some noise), and you use polyfit to fit a polynomial of degree k on it (say, k = 8). Now you have some data in a specific range, e.g. here the range of available points in the x-direction is [0, 12], which is used to fit the polynomial.
However, wouldn't you be able to quickly get the y-result for the value x = 1M (or an arbitrarily large number), since you have the general shape of the polynomial? Is that not what infinite labels means?
Maybe I am just misremembering stuff that I learned years ago ;).
Best regards
First of all, this is a question more fitting for the more theoretically inclined sites of StackExchange, like Stats StackExchange, Math StackExchange, or the Data Science StackExchange, which conveniently also provide answers to your question.
But your reading of it is not quite right. In any case, your problem seems to be with the distinction between input and output. The type of task (i.e. either classification or regression) is based solely on the output of your model and has nothing to do with the input.
You could have a ton of continuous input variables (or even a mixture with discrete ones) and still call it a classification task, as long as the output takes a discrete set of values.
Furthermore, "infinite" simply refers to the fact that these values are not bounded, i.e. you cannot easily restrict your regression task to a specific range. If you input a value completely outside of your training range (as in your example), you will likely get an extreme, effectively unbounded y value, since your network has only been trained on that specific range; the same problem occurs with polynomial fitting.
The learned function behaves like the fitted polynomial: go far beyond the known values and you will likely get some extreme output (unless you have trained very well).
Opposed to that, a classification network would still predict one of the given classes. I like to imagine it as a kind of Voronoi diagram: even if your point is arbitrarily far from any of the previous points, it will still belong to some category.
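To see the regression side of this concretely, here is a minimal sketch along the lines of the question's own example (the numbers are illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 12, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)  # noisy sine data on [0, 12]

coeffs = np.polyfit(x, y, deg=8)   # fit a degree-8 polynomial
print(np.polyval(coeffs, 6.0))     # inside the range: a sane value near sin(6)
print(np.polyval(coeffs, 100.0))   # far outside: an astronomically large value

The degree-8 term dominates far from the training range, so the output at x = 100 is huge and essentially meaningless, while a classifier queried at the same point would still return one of its fixed classes.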

How to exclude poses of a wheel-based robot that fall behind the previous pose

I am currently working on coding sensor fusion of a wheel-based robot pose from GPS, lidar, vision, and vehicle measurements. The model is basic kinematics in an EKF, with no discrimination between sensors, i.e. data is processed purely in timestamp order.
I am having difficulty fusing these sensors due to the following issue:
Sometimes, when the latest measurement comes from a different sensor than the one that produced the previous state, the new pose lands behind the previous pose. As a result the fused trajectory is not smooth and zigzags.
I would like to discard measurements that plot behind/backwards of the previous pose and keep only those that lie forward/ahead of the previous state, even when the sensor providing the data changes between timestamp t and timestamp t+1. Since the data is in a global frame, I cannot simply reject poses with a negative x coordinate to achieve this.
Please let me know if you had some idea on this. Thank you so much in advance.
Best,
Preliminary warning
Let me slip in a warning before suggesting possible solutions to your problem: be careful with discarding data based on your current estimate, since you never know whether the latest measurement is "pulling the pose back" or the previous one was wrong and caused your estimate to move forward too much.
Possible solutions
In a Kalman-like filter, observations are assumed to provide independent, uncorrelated information about the state vector variables. These observations are assumed to have a random error distributed as a zero-mean Gaussian variable. Real life is harder, though :-(
Sometimes, measurements are affected by a "bias" (a fixed term, similar to the Gaussian error having a non-zero mean). For example, tropospheric perturbations are known to introduce a position error in GPS fixes that drifts slowly over time.
If you take several sensors observing the same variable, such as GPS and lidar for position, but they have different biases, your estimate will jump back and forth. Scaling problems can have a similar effect.
I will assume this is the root of your problem. If not, please refine your question.
How can you mitigate this problem? I see several alternatives:
Introduce a bias/scale correction term in your state vector to compensate for sensor bias/drift. This is a very common trick in EKFs for inertial sensor fusion (gyro/accelerometer) and can work nicely when tuned properly.
Apply some preprocessing to sensory inputs to correct known problems. It can be difficult to tune a filter for estimating state vector and sensor parameters at the same time.
Change how observations are interpreted. For example, use difference between consecutive position observations so that you are creating a fake odometer sensor. This greatly reduces the drift problem.
Post-process your output. Instead of discarding observations, integrate them and keep the "jumping" state vector internally, but smooth the output vector to eliminate the jumps. This is done in some UAV autopilots because such jumps affect the performance of PID controllers.
Finally, the most obvious and simple approach: discard observations based on some statistical test. A chi-square test of the residual can be used to determine whether an observation is too far from the expected values and must be discarded (a sketch of such a gating test is given below). Be careful with this option, though: observation rejection schemes must be complemented with state vector reinitialization logic to result in stable behavior.
Almost all of these solutions require knowing the source of each observation, so you would no longer be able to treat them indistinctly.
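As a minimal sketch of that chi-square gate (the names and shapes are assumptions, not taken from your filter: z is the observation, z_pred the predicted observation, H the observation Jacobian, P the state covariance, and R the observation noise covariance):

import numpy as np
from scipy.stats import chi2

def passes_gate(z, z_pred, H, P, R, confidence=0.99):
    # Innovation (residual) and its covariance.
    nu = z - z_pred
    S = H @ P @ H.T + R
    # Squared Mahalanobis distance of the innovation.
    d2 = float(nu.T @ np.linalg.solve(S, nu))
    # Under the filter's assumptions d2 is chi-square distributed with
    # dim(z) degrees of freedom; reject observations in the far tail.
    return d2 <= chi2.ppf(confidence, df=z.size)

Observations that fail the gate are skipped rather than fused; per the warning above, pair this with reinitialization logic in case the filter itself has diverged.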

Most appropriate normalization / transformation method for skewed features?

I am trying to pre-process biological data to train a neural network, and despite an extensive search through the various normalization methods I am none the wiser as to which method should be used when. In particular, I have a number of input variables which are positively skewed, and I have been trying to establish whether there is a normalization method that is most appropriate.
I was also worried about whether the nature of these inputs would affect the performance of the network, and as such I have experimented with data transformations (log transformation in particular). However, some inputs have many zeros but may also take small decimal values, and they seem to be highly sensitive to log(x + c) (for any c from 1 down to 0.0000001, for that matter), with the resulting distribution failing to approach normal (it either remains skewed or becomes bimodal with a sharp peak at the minimum value).
Is any of this relevant to neural networks? I.e. should I be using specific feature transformation / normalization methods to account for the skewed data, or should I just ignore it, pick a normalization method, and push ahead?
Any advice on the matter would be greatly appreciated!
Thanks!
As the features in your input vector are of a different nature, you should use a different normalization algorithm for each feature. The network should be fed uniformly scaled data on every input for better performance.
As you wrote that some data is skewed, I suppose you can run some algorithm to "normalize" it. If applying a logarithm does not work, other functions and methods such as rank transforms can be tried.
If the small decimal values occur entirely within one specific feature, then normalize that feature separately, so that its values get transformed into your working range: either [0, 1] or [-1, +1], I suppose.
If some inputs have many zeros, consider removing them from the main neural network and creating an additional neural network which operates on the vectors with non-zero features. Alternatively, you may try Principal Component Analysis (for example, via an autoassociative memory network with structure N-M-N, M < N) to reduce the input space dimension and thereby eliminate the zeroed components (they will still be accounted for, in some combined form, in the new inputs). As a bonus, the new M inputs will be automatically normalized. Then you can pass the new vectors to your actual worker neural network.
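As a small sketch of the transform options mentioned above (log and rank), on a made-up zero-inflated, positively skewed feature:

import numpy as np
from scipy.stats import rankdata

x = np.array([0.0, 0.0, 0.0003, 0.002, 0.05, 1.7, 40.0])

log_x = np.log1p(x)                        # log(1 + x): compresses the long right tail
rank_x = (rankdata(x) - 1) / (len(x) - 1)  # ranks rescaled into [0, 1]

print(log_x)   # still dominated by the largest value
print(rank_x)  # near-uniform spacing regardless of the original distribution

The rank transform spreads any distribution roughly uniformly over [0, 1] at the cost of discarding the magnitudes, which is often an acceptable trade for heavily skewed inputs.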
This is an interesting question. Normalization is meant to keep features' values in one scale to facilitate the optimization process.
I would suggest the following:
1- Check whether you need to normalize your data at all. If, for example, the means of the variables or features are within the same scale of values, you may proceed without normalization. MSVMpack uses such a normalization check condition in its SVM implementation. Even if you do need to normalize, you are still advised to first run the models on the unnormalized data as a baseline.
2- If you know the actual maximum and minimum values of a feature, use them to normalize the feature. I think this kind of normalization would preserve the skewness of the values.
3- Try decimal scaling normalization for other features, if applicable.
Finally, you are still advised to apply different normalization techniques and compare the MSE for every technique, including the z-score, which may harm the skewness of your data.
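A small sketch contrasting points 1-2 with the z-score on the same skewed feature (the known bounds here are assumed domain knowledge, not computed from the sample):

import numpy as np

x = np.array([0.01, 0.02, 0.05, 0.1, 0.4, 2.0, 9.5])

known_min, known_max = 0.0, 10.0                    # assumed true bounds
minmax = (x - known_min) / (known_max - known_min)  # into [0, 1]
zscore = (x - x.mean()) / x.std()                   # zero mean, unit variance

print(minmax)
print(zscore)

The min-max version stays within [0, 1] while the z-score recenters the data around zero; comparing validation MSE across such variants is the practical way to choose.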
I hope that I have answered your question and provided some support.

How to interpolate between data points?

I am currently developing a piece of software using OpenCV and Qt that plots data points. I need to be able to fill in an image from incomplete data. I want to interpolate between the points I have. Can anyone recommend a library or function that could help me? I thought maybe the OpenCV remap method, but I can't seem to get that to work.
The data is a 2-D matrix of intensity values. I want to create an image of some sort. It's a school project.
Interpolation is a complex subject. There are infinitely many ways to interpolate a set of points, and this is assuming that you truly do wish to do interpolation and not smoothing of any sort. (An interpolant reproduces the original data points exactly.) And of course, the 2-D nature of this problem makes things more difficult.
There are several common schemes for interpolation of scattered data in 2-d. Actually, for those who have access to it, a very nice paper is available (Richard Franke, "Scattered data interpolation: Tests of some methods", Mathematics of Computation, 1982.)
Perhaps the most common method used is based on a triangulation of your data. Merely build a triangulation of the domain from your data points. Then any point inside the convex hull of the data must lie inside exactly one of the triangles, or on a shared edge. This allows you to interpolate linearly inside the triangle. (If you are using MATLAB, then the function griddata is available for this express purpose.)
The problem when trying to populate a complete rectangular image from scattered points is that very likely the data does not extend to the 4 corners of the array. In that event, a triangulation-based scheme will fail, since the corners of the array do not lie inside the convex hull of the scattered points. An alternative then is to use "radial basis functions" (often abbreviated RBF). There are many such schemes to be found, including Kriging, as it is known in the geostatistics community.
http://en.wikipedia.org/wiki/Kriging
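For those working in Python rather than MATLAB, here is a minimal SciPy sketch of both schemes on made-up scattered data (the sample function and grid size are arbitrary; RBFInterpolator requires SciPy 1.7+):

import numpy as np
from scipy.interpolate import griddata, RBFInterpolator

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(200, 2))                # scattered sample locations
vals = np.sin(6 * pts[:, 0]) * np.cos(6 * pts[:, 1])  # values at those locations

gx, gy = np.mgrid[0:1:64j, 0:1:64j]                   # regular target grid
grid_pts = np.column_stack([gx.ravel(), gy.ravel()])

tri_based = griddata(pts, vals, grid_pts, method="linear")  # NaN outside the convex hull
rbf_based = RBFInterpolator(pts, vals)(grid_pts)            # defined everywhere, corners included

The NaNs in the triangulation result are exactly the convex-hull problem described above; the RBF version has no such gap, at the cost of more computation.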
Finally, inpainting is the name for a scheme of interpolation where elements are given in an array, but where there are missing elements. The name obviously refers to that done by an art conservator who needs to repair a tear or rip in a valuable piece of artwork.
http://en.wikipedia.org/wiki/Inpainting
The idea behind inpainting is typically to formulate a boundary value problem. That is, define a partial differential equation on the region where there is a hole. Using the known boundary values, fill in the hole by solving the PDE for the unknown elements. This can be computationally intensive if there are a huge number of unknown elements, since it typically requires the solution of at least a massive sparse system of linear equations. If the PDE is a nonlinear one, then it becomes a more intensive problem yet. A simple, reasonably good choice for the PDE is the Laplacian, which results in a linear system that extrapolates well. Again, I can offer a solution for a MATLAB user.
http://www.mathworks.com/matlabcentral/fileexchange/4551
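For non-MATLAB users, here is a minimal SciPy sketch of that Laplacian idea (the function name is mine; each missing pixel is constrained to equal the average of its 4 neighbors, giving one sparse linear system over the unknowns):

import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve

def inpaint_laplace(img, mask):
    # img: 2-D float array; mask: True where the value is missing.
    h, w = img.shape
    unknowns = np.flatnonzero(mask)
    index = {p: i for i, p in enumerate(unknowns)}
    A = lil_matrix((len(unknowns), len(unknowns)))
    b = np.zeros(len(unknowns))
    for i, p in enumerate(unknowns):
        r, c = divmod(p, w)
        A[i, i] = 4.0
        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= rr < h and 0 <= cc < w:
                q = rr * w + cc
                if mask.flat[q]:
                    A[i, index[q]] = -1.0  # neighbor is also unknown
                else:
                    b[i] += img.flat[q]    # known boundary value
            else:
                A[i, i] -= 1.0             # neighbor falls off the image: drop the term
    out = img.copy()
    out.flat[unknowns] = spsolve(A.tocsr(), b)
    return out

This is the linear, computationally cheap end of the spectrum discussed here; it extrapolates smoothly into the holes but will not reproduce texture.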
Better choices for the PDE may come from nonlinear PDEs. One such is the Navier-Stokes equation. It is well suited to modeling the types of surfaces typically seen, but it is also more difficult to deal with. As in many facets of life, you get what you pay for.
Phew! Big subject.
The "right" answer depends a lot on your problem domain and various details of what you're doing.
Interpolating in more than 1 dimension requires making some choices. I'll assume that you are plotting on a regular grid, but that some of your grid points have no data. Big question: are the missing points sparse, or do they make big blobs?
You can't add information, so you're just trying to establish something that will look OK.
Conceptually simple suggestion (but the implementation may be some work):
For each region of missing data, identify all the edge points. That is, find the x's in this figure:
oooxxooo
oox..xoo
oox...xo
ox..xxoo
oox.xooo
oooxoooo
where the .'s are the points missing data, and the x's and o's have data (for a single missing point, this will be the four nearest neighbors). Fill in each missing data point with an average over the edge points around its blob. To make it smooth, weight each point by 1/d, where d is the taxicab distance (delta x + delta y) between the two points.
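A small sketch of that idea (simplified: instead of tracing each blob's edge ring, this version takes the 1/d weighted average over all known points, which gives a similar smooth fill on small grids):

import numpy as np

def fill_holes(grid, missing):
    # grid: 2-D array of values; missing: boolean mask, True where data is absent.
    known_r, known_c = np.nonzero(~missing)
    known_vals = grid[known_r, known_c]
    out = grid.copy()
    for r, c in zip(*np.nonzero(missing)):
        d = np.abs(known_r - r) + np.abs(known_c - c)  # taxicab distance, always >= 1 here
        w = 1.0 / d
        out[r, c] = np.sum(w * known_vals) / np.sum(w)
    return out

Restricting the average to the true edge points (the x's above) would be faster on large grids and slightly sharper near the blob boundaries.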
From before the question included any details:
In the absence of that kind of information, have you tried straightforward linear interpolation? If your data is reasonably dense this might do it for you, and it is simple enough to code inline when you need it.
The next step up is usually a cubic spline, but for that you'll probably want to grab an existing implementation.
When I need something more powerful than a quick linear interpolation, I usually use ROOT (and pick one of the TSpline classes), but this may be more overhead than you need.
As noted in the comments, ROOT is big, and while it is fast, it does try to force you to do things the ROOT way, so it can have a big effect on your program.
A linear interpolation between (or indeed extrapolation from) two points (x1, y1) and (x2, y2) gives you
y_i = y1 + (x_i - x1) * (y2 - y1) / (x2 - x1)
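Or, as a trivial sketch in code:

def lerp(x1, y1, x2, y2, x):
    # Linear interpolation/extrapolation through (x1, y1) and (x2, y2).
    return y1 + (x - x1) * (y2 - y1) / (x2 - x1)

print(lerp(0.0, 0.0, 10.0, 5.0, 4.0))   # 2.0  (interpolation)
print(lerp(0.0, 0.0, 10.0, 5.0, 15.0))  # 7.5  (extrapolation)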
Considering this is a simple school project, probably the easiest interpolation technique to implement is nearest neighbor.
For each missing data point you find the nearest "filled" data point and use that as the value.
If you want to improve the results a little bit more, then you can, let's say, find the K nearest data points and use their weighted average as the value of your missing data point.
The weight could be inversely proportional to the distance of the point from the missing data point.
There are zillion other techniques, but nearest neighbor is probably the easiest to implement.
If I understand correctly, your need is as follows:
I think you have a subset of (x, y, intensity) samples for a domain of L by W, and you want to fill in the intensity for all x ranging from 0 to L and y ranging from 0 to W.
If this is your question, then the solution is to obtain the other intensities by using filters.
I think a Bayer filter or a Gaussian filter would do the job for you.
You can google these filters and you will find implementations.
Best of luck.