Does optimizing a bijection-transformed value affect performance or accuracy? - machine-learning

If I want to optimize a function with respect to some constrained value, I can find a bijective map between an unconstrained space and the constrained space, then optimize the composition of the original function and the bijective map with respect to the unconstrained value.
Does optimizing in a different space affect the performance or accuracy of optimization? And does it vary between bijective maps?
My use case is training constrained Gaussian process model hyperparameters in GPflow using TensorFlow Probability's bijectors.

If I understand you correctly, you might have for example some variable that is constrained to be positive and want to optimize it. And for that you train the variable in the unconstrained space?
That would be pretty common in machine learning where you for example enforce a variance (of let's say the likelihood) to be positive by taking the exponent of the unconstrained value.
I guess the effect on the optimization very much depends on how you optimize it. For gradient-based methods it does have an effect, and sometimes small tricks help mitigate the resulting issues (e.g. shifting, so that your transformation is tf.exp(shift_val + unconstrained_variable) ).
And yes, afaik it varies between different mappings. In my example, the softplus and exponential transformations result in different gradients. Though I'm not sure if there's a consensus on which one is preferable.
I'd just try a few different ones. As long as it doesn't lead to numerical issues, either transformation/bijection should be fine.
EDIT: just to clarify. The bijection should not affect the solution space, just the optimization path itself.
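To make the last point concrete, here is a minimal NumPy sketch (not GPflow or TensorFlow Probability code). It assumes a toy loss (x - 2)^2 with the constraint x > 0, and the helper names `minimise`, `loss_grad` and `softplus` are made up for the illustration. Both bijections reach the same constrained optimum, but the gradient seen by the optimizer differs.

```python
import numpy as np

def softplus(u):
    # log(1 + exp(u)), written in a numerically stable form
    return np.logaddexp(0.0, u)

def sigmoid(u):
    # derivative of softplus
    return 1.0 / (1.0 + np.exp(-u))

def loss_grad(x):
    # d/dx of the toy loss (x - 2)^2, which is minimised at x = 2 > 0
    return 2.0 * (x - 2.0)

def minimise(forward, dforward, u0, lr=0.05, steps=2000):
    """Plain gradient descent on the unconstrained variable u."""
    u = u0
    for _ in range(steps):
        x = forward(u)
        u -= lr * loss_grad(x) * dforward(u)  # chain rule: dL/du = dL/dx * dx/du
    return forward(u)

# Same solution in the constrained space for both bijections
x_exp = minimise(np.exp, np.exp, u0=0.0)
x_softplus = minimise(softplus, sigmoid, u0=0.0)

# Different gradients with respect to u at the same constrained value x = 1
g_exp = loss_grad(1.0) * 1.0                  # exp: dx/du = x = 1
u_sp = np.log(np.expm1(1.0))                  # softplus inverse of 1
g_softplus = loss_grad(1.0) * sigmoid(u_sp)   # softplus: dx/du = sigmoid(u)
```

The step size and iteration count are arbitrary choices for this toy problem; on a real model the two mappings can behave quite differently depending on the scale of the parameter.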


If we can clip gradient in WGAN, why bother with WGAN-GP?

I am working on WGAN and would like to implement WGAN-GP.
In its original paper, WGAN-GP is implemented with a gradient penalty because of the 1-Lipschitz constraint. But packages out there like Keras can clip the gradient norm at 1 (which by definition is equivalent to the 1-Lipschitz constraint), so why do we bother to penalize the gradient? Why don't we just clip the gradient?
The reason is that clipping in general is a pretty hard constraint in a mathematical sense, not in the sense of implementation complexity. If you check the original WGAN paper, you'll notice that the clipping procedure takes the model's weights and a hyperparameter c, which controls the clipping range.
If c is small, the weights are severely clipped to a tiny range of values. The question is how to determine an appropriate value of c. It depends on your model, the dataset in question, the training procedure, and so on. So why not try soft penalizing instead of hard clipping? That's why the WGAN-GP paper adds a term to the loss function that pushes the gradient norm to be as close to 1 as possible, avoiding a hard collapse to predefined values.
The answer by CaptainTrunky is correct but I also wanted to point out one, really important, aspect.
Citing the original WGAN-GP paper:
Implementing k-Lipshitz constraint via weight clipping biases the critic towards much simpler functions. As stated previously in [Corollary 1], the optimal WGAN critic has unit gradient norm almost everywhere under Pr and Pg; under a weight-clipping constraint, we observe that our neural network architectures which try to attain their maximum gradient norm k end up learning extremely simple functions.
So as you can see, weight clipping may lead to undesired behaviour (it depends on the data you want to generate; the authors of the paper state that it doesn't always behave like that). When you try to train a WGAN to generate more complex data, the task has a high chance of failing.
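The difference between the two constraints can be sketched on a deliberately tiny case. The snippet assumes a linear critic f(x) = w @ x, for which the gradient with respect to the input is w everywhere, so no interpolated samples are needed. The function names and the value lam=10.0 are assumptions for the illustration, not the API of any GAN library.

```python
import numpy as np

def clip_weights(w, c):
    """WGAN-style hard constraint: force every weight into [-c, c]."""
    return np.clip(w, -c, c)

def gradient_penalty(w, lam=10.0):
    """WGAN-GP-style soft constraint for a linear critic f(x) = w @ x.

    The input gradient of this critic is w, so the penalty reduces to
    lam * (||w|| - 1)^2. lam is an assumed penalty weight.
    """
    grad_norm = np.linalg.norm(w)
    return lam * (grad_norm - 1.0) ** 2

w = np.array([3.0, -4.0])                  # input-gradient norm is 5
clipped = clip_weights(w, c=0.01)          # weights squashed to [0.01, -0.01]
penalty = gradient_penalty(w)              # 10 * (5 - 1)^2 = 160
penalty_unit = gradient_penalty(w / 5.0)   # norm 1, so the penalty vanishes
```

Clipping overwrites the weights regardless of what the critic has learned, while the penalty only adds a cost that the optimizer trades off against the rest of the loss.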

Convergence and regularization in linear regression classifier

I am trying to implement a binary classifier using logistic regression for data drawn from 2 point sets (classes y ∈ {-1, 1}). As seen below, we can use the parameter a to prevent overfitting.
Now I am not sure how to choose a "good" value for a.
Another thing I am not sure about is how to choose a "good" convergence criterion for this sort of problem.
Value of 'a'
Choosing "good" things is a sort of meta-regression: pick any value for a that seems reasonable. Run the regression. Try again with values of a larger and smaller by a factor of 3. If either works better than the original, try another factor of 3 in that direction -- but round it from 9x to 10x for readability.
You get the idea ... play with it until you get in the right range. Unless you're really trying to optimize the result, you probably won't need to narrow it down much closer than that factor of 3.
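The factor-of-3 search can be sketched as a small helper. `search_by_factor` and the toy `score` function are hypothetical; in practice `score(a)` would train the classifier with regularization strength a and return something like validation accuracy. The cosmetic rounding from 9x to 10x is left out.

```python
import math

def search_by_factor(score, a0, factor=3.0, max_steps=20):
    """Hill-climb over a on a multiplicative grid; a higher score is better."""
    best_a, best_score = a0, score(a0)
    for step in (factor, 1.0 / factor):     # try larger values first, then smaller
        moved = False
        for _ in range(max_steps):
            candidate = best_a * step
            s = score(candidate)
            if s <= best_score:             # no improvement: stop in this direction
                break
            best_a, best_score = candidate, s
            moved = True
        if moved:                           # improved this way, skip the other direction
            break
    return best_a, best_score

def score(a):
    # Toy stand-in for a validation score, peaking at a = 0.01
    return -(math.log10(a) + 2.0) ** 2

best_a, best_score = search_by_factor(score, a0=1.0)   # ends at 1/81
```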
Data Set Partition
ML folks have spent a lot of words analysing the best split. The optimal split depends very much on your data space. As a global heuristic, use half or a bit more for training; of the rest, no more than half should be used for testing, the rest for validation. For instance, 50:20:30 is a viable approximation for train:test:validate.
Again, you get to play with this somewhat ... except that any true test of the error rate would be entirely new data.
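As a minimal sketch of the 50:20:30 heuristic, the helper below shuffles sample indices and cuts them into train, test and validation parts. The name `split_indices` and the fixed seed are assumptions made for the example.

```python
import numpy as np

def split_indices(n, fractions=(0.5, 0.2, 0.3), seed=0):
    """Shuffle range(n) and split it into train/test/validate index arrays."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_test = int(fractions[1] * n)
    train = order[:n_train]
    test = order[n_train:n_train + n_test]
    validate = order[n_train + n_test:]     # whatever is left over
    return train, test, validate

train, test, validate = split_indices(100)
```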
Convergence Criterion
This depends very much on the characteristics of your empirical error space near the best solution, as well as near local regions of low gradient.
The first consideration is to choose an error function that is likely to be convex and have no flattish regions. The second is to get some feeling for the magnitude of the gradient in the region of a desired solution (normalizing your data will help with this); use this to help choose the convergence radius; you might want to play with that 3x scaling here, too. The final one is to play with the learning rate, so that it's scaled to the normalized data.
Does any of this help?
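One way to put these pieces together is sketched below: L2-regularized logistic regression on normalized data with labels in {-1, +1}, stopping when the gradient norm falls below a tolerance. The synthetic data, the function name `fit_logistic`, and the values of a, lr and tol are all assumptions chosen for the illustration.

```python
import numpy as np

def fit_logistic(X, y, a=0.1, lr=0.5, tol=1e-6, max_iter=10_000):
    """L2-regularised logistic regression with labels y in {-1, +1}.

    Stops when the gradient norm drops below tol (the convergence
    criterion) or after max_iter steps, whichever comes first.
    """
    w = np.zeros(X.shape[1])
    for it in range(max_iter):
        margins = y * (X @ w)
        # gradient of mean(log(1 + exp(-margin))) + (a / 2) * ||w||^2
        coeff = y / (1.0 + np.exp(margins))
        grad = -(X * coeff[:, None]).mean(axis=0) + a * w
        if np.linalg.norm(grad) < tol:
            break
        w -= lr * grad
    return w, it

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
X = (X - X.mean(axis=0)) / X.std(axis=0)    # normalise the features
y = np.concatenate([-np.ones(50), np.ones(50)])

w, iterations = fit_logistic(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

The regularization term makes the objective strictly convex, which is what lets a simple gradient-norm threshold work as a stopping rule here.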

Homography and projective transformation

I'm trying to write code that will do a projective transformation, but with more than 4 key points. I found this helpful guide, but it uses 4 points of reference.
I know that MATLAB has a function cp2tform that handles that, but I haven't found a way so far.
Can anyone give me some guidance on how to do so? I can solve the equations using least squares, but I'm stuck since I have a matrix that is larger than 3*3 and I can't multiply the homogeneous coordinates.
If you have more than four control points, you have an overdetermined system of equations. There are two possible scenarios. Either your points are all compatible with the same transformation. In that case, any four points can be used, and the rest will match the transformation exactly. At least in theory. For the sake of numeric stability you'd probably want to choose your points so that they are far from being collinear.
Or your points are not all compatible with a single projective transformation. In this case, all you can hope for is an approximation. If you want the best approximation, you'll have to be more specific about what “best” means, i.e. some kind of error measure. Measuring things in a projective setup is inherently tricky, since there are usually a lot of arbitrary decisions involved.
What you can try is fixing one matrix entry (e.g. the lower right one to 1), then writing the conditions for the remaining 8 entries as a system of linear equations, and performing a least squares approximation. But the choice of matrix representative (i.e. fixing one entry here) affects the least squares error measure while it has no effect on the geometric meaning, so this is a pretty arbitrary choice. If the lower right entry of the desired matrix should happen to be zero, your computation will run into numeric problems due to overflow.
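The approach of fixing the lower right entry to 1 can be sketched in NumPy as follows. Each point pair contributes two rows to the linear system, so extra points only make the system taller; the result is still a 3x3 matrix. The function names `fit_homography` and `apply_homography` and the test matrix are made up for the example.

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares homography with H[2, 2] fixed to 1 (needs >= 4 point pairs)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        # u = (h1 x + h2 y + h3) / (h7 x + h8 y + 1), rearranged to be linear in h
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        # v = (h4 x + h5 y + h6) / (h7 x + h8 y + 1)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Map 2-D points through H using homogeneous coordinates."""
    pts = np.asarray(pts, float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Six control points generated from a known transformation
H_true = np.array([[1.2, 0.1, 5.0], [-0.2, 0.9, 3.0], [0.001, 0.002, 1.0]])
src = np.array([[0, 0], [10, 0], [10, 10], [0, 10], [5, 2], [3, 8]], float)
dst = apply_homography(H_true, src)
H_est = fit_homography(src, dst)
```

With noisy points the recovered matrix minimizes an algebraic error that depends on the normalization, which is exactly the arbitrariness described above.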

How do I make a U-matrix?

How exactly is a U-matrix constructed in order to visualise a self-organizing map? More specifically, suppose that I have an output grid of 3x3 nodes (that have already been trained), how do I construct a U-matrix from this? You can e.g. assume that the neurons (and inputs) have dimension 4.
I have found several resources on the web, but they are not clear or they are contradictory. For example, the original paper is full of typos.
A U-matrix is a visual representation of the distances between neurons in the input data dimension space. Namely you calculate the distance between adjacent neurons, using their trained vector. If your input dimension was 4, then each neuron in the trained map also corresponds to a 4-dimensional vector. Let's say you have a 3x3 hexagonal map.
The U-matrix will be a 5x5 matrix with interpolated elements for each connection between two neurons like this
The {x,y} elements are the distance between neuron x and y, and the values in {x} elements are the mean of the surrounding values. For example, {4,5} = distance(4,5) and {4} = mean({1,4}, {2,4}, {4,5}, {4,7}). For the calculation of the distance you use the trained 4-dimensional vector of each neuron and the distance formula that you used for the training of the map (usually Euclidean distance). So, the values of the U-matrix are only numbers (not vectors). Then you can assign a light gray colour to the largest of these values, a dark gray to the smallest, and corresponding shades of gray to the values in between. You can use these colours to paint the cells of the U-matrix and get a visual representation of the distances between neurons.
Have also a look at this web article.
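The construction can be sketched in NumPy. Note the assumptions: the sketch uses a rectangular 3x3 grid rather than the hexagonal one described above, it fills the diagonal cells with the mean of the two diagonal distances (one convention among several), and the random `weights` array only stands in for trained neuron vectors.

```python
import numpy as np

def u_matrix(weights):
    """Build a (2*rows-1, 2*cols-1) U-matrix from (rows, cols, dim) SOM weights."""
    rows, cols, _ = weights.shape
    um = np.zeros((2 * rows - 1, 2 * cols - 1))

    def dist(a, b):
        return np.linalg.norm(a - b)        # Euclidean distance

    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:                # cell between horizontal neighbours
                um[2 * i, 2 * j + 1] = dist(weights[i, j], weights[i, j + 1])
            if i + 1 < rows:                # cell between vertical neighbours
                um[2 * i + 1, 2 * j] = dist(weights[i, j], weights[i + 1, j])
            if i + 1 < rows and j + 1 < cols:   # diagonal cell (assumed convention)
                um[2 * i + 1, 2 * j + 1] = 0.5 * (
                    dist(weights[i, j], weights[i + 1, j + 1])
                    + dist(weights[i, j + 1], weights[i + 1, j]))

    # Neuron cells: mean of the surrounding distance cells
    for i in range(rows):
        for j in range(cols):
            r, c = 2 * i, 2 * j
            around = [um[rr, cc]
                      for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                      if 0 <= rr < um.shape[0] and 0 <= cc < um.shape[1]]
            um[r, c] = np.mean(around)
    return um

rng = np.random.default_rng(0)
weights = rng.random((3, 3, 4))             # stand-in for a trained 3x3 map, dim 4
um = u_matrix(weights)                      # 5x5 array of scalars
```

The resulting array can be passed to any grayscale image plot to get the shaded picture.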
The original paper cited in the question states:
A naive application of Kohonen's algorithm, although preserving the topology of the input data is not able to show clusters inherent in the input data.
Firstly, that's true; secondly, it is a deep misunderstanding of the SOM; thirdly, it is also a misunderstanding of the purpose of calculating the SOM.
Just take the RGB color space as an example: are there 3 colors (RGB), or 6 (RGBCMY), or 8 (+BW), or more? How would you define that independent of the purpose, ie inherent in the data itself?
My recommendation would be not to use maximum likelihood estimators of cluster boundaries at all - not even such primitive ones as the U-matrix - because the underlying argument is already flawed. No matter which method you then use to determine the clusters, you would inherit that flaw. More precisely, the determination of cluster boundaries is not interesting at all, and it loses information regarding the true intention of building a SOM. So, why do we build SOMs from data?
Let us start with some basics:
Any SOM is a representative model of a data space, for it reduces the dimensionality of the latter. Because it is a model, it can be used as a diagnostic as well as a predictive tool. Yet, neither use is justified by some universal objectivity. Instead, models are deeply dependent on the purpose and the accepted associated risk for errors.
Let us assume for a moment the U-Matrix (or similar) would be reasonable. So we determine some clusters on the map. It is not only an issue how to justify the criterion for it (outside of the purpose itself), it is also problematic because any further calculation destroys some information (it is a model about a model).
The only interesting thing on a SOM is the accuracy itself viz the classification error, not some estimation of it. Thus, the estimation of the model in terms of validation and robustness is the only thing that is interesting.
Any prediction has a purpose and the acceptance of the prediction is a function of the accuracy, which in turn can be expressed by the classification error. Note that the classification error can be determined for 2-class models as well as for multi-class models. If you don't have a purpose, you should not do anything with your data.
Inversely, the concept of "number of clusters" is completely dependent on the criterion "allowed divergence within clusters", so it is masking the most important thing of the structure of the data. It is also dependent on the risk and the risk structure (in terms of type I/II errors) you are willing to take.
So, how could we determine the number of classes on a SOM? If there is no exterior a priori reasoning available, the only feasible way would be an a posteriori check of the goodness-of-fit. On a given SOM, impose different numbers of classes and measure the deviations in terms of misclassification cost, then choose (subjectively) the most pleasing one (using some fancy heuristics, like Occam's razor).
Taken together, the U-matrix is pretending objectivity where no objectivity can be. It is a serious misunderstanding of modeling altogether.
IMHO it is one of the greatest advantages of the SOM that all the parameters implied by it are accessible and open for being parameterized. Approaches like the U-matrix destroy just that, by disregarding this transparency and closing it again with opaque statistical reasoning.

Why do we maximize variance during Principal Component Analysis?

I'm trying to read through PCA and saw that the objective was to maximize the variance. I don't quite understand why. Any explanation of other related topics would be helpful
Variance is a measure of the "variability" of the data you have. Potentially the number of components is infinite (actually, after numerization it is at most equal to the rank of the matrix, as #jazibjamil pointed out), so you want to "squeeze" the most information in each component of the finite set you build.
If, to exaggerate, you were to select a single principal component, you would want it to account for the most variability possible: hence the search for maximum variance, so that the one component collects the most "uniqueness" from the data set.
Note that PCA does not actually increase the variance of your data. Rather, it rotates the data set in such a way as to align the directions in which it is spread out the most with the principal axes. This enables you to remove those dimensions along which the data is almost flat. This decreases the dimensionality of the data while keeping the variance (or spread) among the points as close to the original as possible.
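This can be checked numerically with a small NumPy sketch. The synthetic 2-D data set and all variable names are assumptions for the illustration. PCA is done through the eigendecomposition of the covariance matrix: the first principal axis has the largest variance of any direction, and rotating onto the principal axes leaves the total variance unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stretched 2-D cloud, rotated by 30 degrees so the spread is not axis-aligned
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.5])
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T
X -= X.mean(axis=0)                           # centre the data

cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = X @ eigvecs                               # rotate onto the principal axes
var_first = Z[:, 0].var(ddof=1)               # variance along the first component

# Variance of the projection onto many other unit directions
angles = np.linspace(0.0, np.pi, 180)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
var_dirs = (X @ dirs.T).var(axis=0, ddof=1)

total_before = X.var(axis=0, ddof=1).sum()
total_after = Z.var(axis=0, ddof=1).sum()
```

Dropping the second column of `Z` keeps the fraction `eigvals[0] / eigvals.sum()` of the total variance, which is the sense in which the flat dimension can be removed.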
Maximizing the component vector variances is the same as maximizing the 'uniqueness' of those vectors. Thus your vectors are as distant from each other as possible. That way, if you only use the first N component vectors, you're going to capture more space with highly varying vectors than with like vectors. Think about what Principal Component actually means.
Take for example a situation where you have 2 lines that are orthogonal in a 3D space. You can capture the environment much more completely with those orthogonal lines than 2 lines that are parallel (or nearly parallel). When applied to very high dimensional states using very few vectors, this becomes a much more important relationship among the vectors to maintain. In a linear algebra sense you want independent rows to be produced by PCA, otherwise some of those rows will be redundant.
See this PDF from Princeton's CS Department for a basic explanation.
Maximizing variance basically means choosing the axes along which the data points are spread out the most. Why? Because the direction of these axes is what really matters: it more or less explains the correlations, and later on we compress/project the points along those axes to get rid of some dimensions.
