If data has linear frelationship, won't linear regression lead to zero in-sample error - machine-learning

I am using the Learning from Data textbook by Yaser Abu-Mostafa et al. I am curious about the following statement in the linear regression chapter and would like to verify that my understanding is correct.
After talking about the "pseudo-inverse" way to get the "best weights" (best for minimizing squared error), i.e w_lin = (X^T X)^-1 X^T y
The statement is "The linear regression weight vector is an attempt to map the inputs X to the outputs y. However, w_lin does not produce y exactly, but produces an estimate X w_lin which differs from y due to in sample error.
If the data is on a hyper-plane, won't X w_lin exactly match y (i.e in-sample error = 0)? I.e above statement is only talking about data that is not linearly separable.

Here, 'w_lin' is the not the same for all data points (all pairs of (X,y)).
The linear regression model finds the best possible weight vector (or best possible 'w_lin') considering all data points such that X*w_lin gives a result very close to 'y' for any data point.
Hence the error will not be zero unless all data points line on a straight line.

The community might not get whole context unless the book is opened because not everything that the author of the book says might have been covered in your post. But let me try to answer.
Whenever any model is formed, there are certain constants used whose value is not known beforehand but are used to fit the line/curve as good as possible. Also, the equations, many a times, contain an element of randomness. Variables that take random values cause some errors when actual and expected outputs are computed.
Suggested reading: Errors and residuals

Related

Search for the optimal value of x for a given y

Please help me find an approach to solving the following problem: Let X is a matrix X_mxn = (x1,…,xn), xi is a time series and a vector Y_mx1. To predict values ​​from Y_mx1, let's train some model, let linear regression. We get Y = f (X). Now we need to find X for some given value of Y. The most naive thing is brute force, but what are the competent ways to solve such problems? Perhaps there is a use of the scipy.optimize package here, please enlighten me.
get an explanation or matherial to read for understanding
Most scipy-optimize algorithm use gradient method, for those optimization problem, we could apply these into re-engineering of data (find the best date to invest in the stock market...)
If you want to optimize the result, you should choose a good step size and suitable optimize method.
However, we should not classify tge problem as "predict" of xi because what we are doing is to find local/global maximum/minimum.
For example Newton-CG, your data/equation should contain all the information needed/a simulation, but no prediction is made from the method.
If you want to do a pretiction on "time", you could categorize the time data in "year,month..." then using unsupervise learning to "group" the data. If trend is obtained, then we can re-enginning the result to know the time

How to Address Noise Resulting from Inverse-Scaling for a Machine Learning Task

I'm still a little unsure of whether questions like these belong on stackoverflow. Is this website only for questions with explicit code? The "How to Format" just tells me it should be a programming question, which it is. I will remove my question if the community thinks otherwise.
I have created a neural network and am predicting reasonable values for most of my data (the task is multi-variate time series forecasting).
I scale my data before inputting it using scikit-learn's MinMaxScaler(0,1) or MinMaxScaler(-1,1) (the two primary scalings I am using).
The model learns, predicts, and I inverse the scaling using MinMaxScaler()'s inverse_transform method to visually see how close my predictions were to the actual values. However, I notice that the inverse_scaled values for a particular part of the vector I predicted have now become very noisy. Here is what I mean (inverse_scaled prediction):
Left end: noisy; right end: not-so noisy.
I initially thought that perhaps my network didn't learn that part of the vector well, so is just outputting ~random values. BUT, I notice that the predicted values before the inverse scaling seem to match the actual values very well, but that these values are typically near 0 or -1 (lower limit of the feature scale) because of the fact that these values have a very large spread (unscaled mean= 1E-1, max= 1E+1 [not an outlier]). Example (scaled prediction):
So, when inverse transforming these values (again, often near -1 or 0), the transformed values exhibit loud noise, as shown in the images.
Questions:
1.) Should I be using a different scaler/scaling differently, perhaps one that exponentially/nonlinearly scales? MinMaxScaler() scales each column. Simply dropping the high-magnitude data isn't an option since they are real, meaningful data. 2.) What other solutions can help this?
Please let me know if you'd like anything else clarified.

Why the hypothesis has to introduce two parameters, namely θ0 and θ1

I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
hθ(x) = θ0 + θ1(x)
In supervised learning, we have some training data and based on that we try to "deduce" a function which closely maps the inputs to the corresponding outputs. To deduce the function, we introduce the hypothesis as a linear function of input (x). My question is, why the function involving two θs is chosen? Why it can't be as simple as y(i) = a * x(i) where a is a co-efficient? Later we can go about finding a "good" value of a for a given example (i) using an algorithm? This question might look very stupid. I apologize but I'm not very good at machine learning I am just a beginner. Please help me understand this.
Thanks!
The a corresponds to θ1. Your proposed linear model is leaving out the intercept, which is θ0.
Consider an output function y equal to the constant 5, or perhaps equal to a constant plus some tiny fraction of x which never exceeds .01. Driving the error function to zero is going to be difficult if your model doesn't have a θ0 that can soak up the D.C. component.

How to decide numClasses parameter to be passed to Random Forest algorithm in SPark MLlib with pySpark

I am working on Classification using Random Forest algorithm in Spark have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row will serve as a label and rest serve as features. But I want to treat label as a category and not a number. So 165.22352099 will denote a category and so will -552.8234. For this I have encoded my features as well as label into categorical data. Now what I am having difficulty in is deciding what should I pass for numClasses parameter in Random Forest algorithm in Spark MlLib? I mean should it be equal to number of unique values in my label? My label has like 10000 unique values so if I put 10000 as value of numClasses then wouldn't it decrease the performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
numTrees=3, featureSubsetStrategy="auto",
impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something that you should not do. You problem is clearly a regression/ranking, not a classification. Why would you think about it as a classification? Try to answer these two questions:
Do you have at least 100 samples per each value (100,000 * 100 = 1,000,000)?
Is there completely no structure in the classes, so for example - are objects with value "200" not more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some weird reason you answered twice yes, then the answer is: "yes, you should encode each distinct value as a different class" thus leading to 10000 unique classes, which leads to:
extremely imbalanced classification (RF, without balancing meta-learner will nearly always fail in such scenario)
extreme number of classes (there are no models able to solve it, for sure RF will not solve it)
extremely small dimension of the problem- looking at as small is your number of features I would be surprised if you could predict from that binary classifiaction. As you can see how irregular are these values, you have 3 points which only diverge in first value and you get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem, you should either:
regress on last value (keyword: reggresion)
build a ranking (keyword: learn to rank)
bucket your values to at most 10 different values and then - classify (keywords: imbalanced classification, sparse binary representation)

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate Fibonacci Sequence. I using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize the data that could be potentially infinite or scale exponentially? I then tried to create an ANN to calculate the slope-intercept formula y = mx+b (2x+2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a+2b+c^2+3d-5e) modulo 2), where the formula is unknown, but the inputs (some) a,b,c,d,and e are given as well as the output? Essentially classifying whether the calculations output is odd or even and the inputs are between -+infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.

Resources