Normalize a feature in this table - machine-learning

This has become quite a frustrating question, but I've asked in the Coursera discussions and they won't help. Below is the question:
I've gotten it wrong 6 times now. How do I normalize the feature? Hints are all I'm asking for.
I'm assuming x_2^(2) is the value 5184, unless I am adding the x_0 column of 1's, which they don't mention but he certainly mentions in the lectures when talking about creating the design matrix X. In which case x_2^(2) would be the value 72. Assuming one or the other is right (I'm playing a guessing game), what should I use to normalize it? He talks about 3 different ways to normalize in the lectures: one using the maximum value, another with the range/difference between max and mins, and another the standard deviation -- they want an answer correct to the hundredths. Which one am I to use? This is so confusing.

...use both feature scaling (dividing by the
"max-min", or range, of a feature) and mean normalization.
So for any individual feature f:
f_norm = (f - f_mean) / (f_max - f_min)
e.g. for x2,(midterm exam)^2 = {7921, 5184, 8836, 4761}
> x2 <- c(7921, 5184, 8836, 4761)
> mean(x2)
6675.5
> max(x2) - min(x2)
4075
> (x2 - mean(x2)) / (max(x2) - min(x2))
0.306 -0.366 0.530 -0.470
Hence norm(5184) = -0.366
(using R language, which is great at vectorizing expressions like this)
I agree it's confusing they used the notation x2 (2) to mean x2 (norm) or x2'
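EDIT: in practice people often call the builtin scale(...) function, but note that by default it centers on the mean and divides by the standard deviation, not the range, so it will not reproduce the numbers above; scale(x2, scale = max(x2) - min(x2)) would.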
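For readers not using R, the same calculation can be sketched in plain Python. The function name mean_normalize is just an illustrative choice; the data are the squared midterm scores quoted above.

```python
def mean_normalize(values):
    """Subtract the mean, then divide by the range (max - min)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]

x2 = [7921, 5184, 8836, 4761]  # (midterm exam)^2 from the table
normalized = mean_normalize(x2)
print([round(v, 2) for v in normalized])  # [0.31, -0.37, 0.53, -0.47]
```

Rounded to the hundredths, the second entry is -0.37.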

The question asks you to normalize the second feature (second column) of the second training example, using both feature scaling and mean normalization. Therefore,
(5184 - 6675.5) / 4075 = -0.366

Usually we normalize all of them to have zero mean and lie in [-1, 1].
You can do that easily by dividing by the maximum of the absolute value and then removing the mean of the samples.

"I'm assuming x_2^(2) is the value 5184" is this because it's the second item in the list and using the subscript _2? x_2 is just a variable identity in maths, it applies to all rows in the list. Note that the highest raw mid-term exam result (i.e. that which is not squared) goes down on the final test and the lowest raw mid-term result increases the most for the final exam result. Theta is a fixed value, a coefficient, so somewhere your normalisation of x_1 and x_2 values must become (EDIT: not negative, less than 1) in order to allow for this behaviour. That should hopefully give you a starting basis, by identifying where the pivot point is.

I had the same problem, in my case the thing was that I was using as average the maximum x2 value (8836) minus minimum x2 value (4761) divided by two, instead of the sum of each x2 value divided by the number of examples.

For the same training set, I got the question as
Q. What is the normalized feature x^(3)_1?
Thus, 3rd training ex and 1st feature makes out to 94 in above table.
Now, normalized form is
x = (x - mean(x's)) / range(x)
Values are :
x = 94
mean = (89+72+94+69) / 4 = 81
range = 94 - 69 = 25
Normalized x = (94 - 81) / 25 = 0.52

I'm taking this course at the moment, and a really trivial mistake I made the first time I answered this question was using a comma instead of a dot in the answer, since I worked it out by hand and in my country we use a comma as the decimal separator (e.g. 0,52 instead of 0.52).
On my second attempt I used a dot and it worked fine.

Related

Coefficients and Confidence Intervals - GLM Binomial (Logit)

I've run an Interrupted Time Series Analysis using a Binomial logistic regression in R.
glm(`Subject Refused Ratio` ~ Quarter + int2 + time_since_intervention2 , df, family = "binomial"(link='logit'), weights = sub_weight)
I want to derive the coefficients and confidence intervals for each of my outcomes and am currently doing so with the margins package, with the following outcome:
summary(margins(rrfit1a))
factor AME SE z p lower upper
int2 0.0963 0.1064 0.9050 0.3654 -0.1122 0.3047
Quarter -0.0006 0.0049 -0.1162 0.9075 -0.0101 0.0089
time_since_intervention2 -0.0056 0.0209 -0.2695 0.7875 -0.0466 0.0353
These seem largely consistent with the modelled data. For example it suggests the intervention (int2) could range between a 0.11 decrease and 0.30 increase.
However, I really need to add similar coefficient values and confidence intervals for the original Intercept. I have tried to do so using simple exp(coefficients) and the confint function within the MASS package. But the outcome doesn't quite tie in with what I would anticipate seeing.
exp(coefficients(rrfit1a))
(Intercept) Quarter int2 time_since_intervention2
0.9093160 0.9977377 1.4720697 0.9776187
For context the fitted value of the model in the first observation is around 0.47, which looks correct. So I wonder whether it is just a case of me misinterpreting the above or is there something more fundamental wrong with it?
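One possible reading, sketched in Python rather than R: in a logit model, exp(coefficient) is an odds (for the intercept) or an odds ratio (for the other terms), not a probability. Assuming the other predictors are zero or negligible at the baseline, which the post does not state, the intercept odds can be converted to a probability as follows. The helper name odds_to_probability is hypothetical.

```python
def odds_to_probability(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1.0 + odds)

intercept_odds = 0.9093160  # exp(Intercept) as reported in the post
baseline_probability = odds_to_probability(intercept_odds)
print(round(baseline_probability, 3))  # 0.476
```

Under that assumption the result is close to the fitted value of about 0.47 mentioned above.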
Secondly, the confint outcome is:
> confint(rrfit1a, level = 0.90)
Waiting for profiling to be done...
5 % 95 %
(Intercept) -0.38990085 0.19896064
Quarter -0.03437363 0.02981353
int2 -0.31682909 1.09669529
time_since_intervention2 -0.16144941 0.11569710
This isn't what we'd expect to see or what our plotted confidence intervals look anything like.
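A possible explanation, offered as a sketch and not a diagnosis: confint on a logit glm reports bounds on the log-odds scale, whereas the margins output is on the probability scale, so the two are not directly comparable. The Python below applies the inverse logit to the intercept bounds quoted above, again assuming the other predictors are zero at the baseline. The name inverse_logit is an illustrative helper, equivalent to R's plogis.

```python
import math

def inverse_logit(log_odds):
    """Map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# 90% profile-likelihood bounds for (Intercept), copied from the confint output
lower_log_odds, upper_log_odds = -0.38990085, 0.19896064
lower_probability = inverse_logit(lower_log_odds)
upper_probability = inverse_logit(upper_log_odds)
print(round(lower_probability, 2), round(upper_probability, 2))
```

The back-transformed interval brackets the fitted value of about 0.47 reported for the first observation.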

arbitrarily weighted moving average (low- and high-pass filters)

Given an input signal x (e.g. a voltage, sampled a thousand times per second over a couple of minutes), I'd like to calculate e.g.
/ this is not q
y[3] = -3*x[0] - x[1] + x[2] + 3*x[3]
y[4] = -3*x[1] - x[2] + x[3] + 3*x[4]
. . .
I'm aiming for variable window length and weight coefficients. How can I do it in q? I'm aware of mavg, signal processing in q, and the moving sum q idiom.
In the DSP world this is called applying a filter kernel by convolution. The weight coefficients define the kernel, which makes a high- or low-pass filter. The example above calculates the slope from the last four points by fitting a straight line with the least squares method.
Something like this would work for parameterisable coefficients:
q)x:10+sums -1+1000?2f
q)f:{sum x*til[count x]xprev\:y}
q)f[3 1 -1 -3] x
0n 0n 0n -2.385585 1.423811 2.771659 2.065391 -0.951051 -1.323334 -0.8614857 ..
Specific cases can be made a bit faster (running 0 xprev is not the best thing)
q)g:{prev[deltas x]+3*x-3 xprev x}
q)g[x]~f[3 1 -1 -3]x
1b
q)\t:100000 f[3 1 -1 -3] x
4612
q)\t:100000 g x
1791
There's a kx white paper of signal processing in q if this area interests you: https://code.kx.com/q/wp/signal-processing/
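For comparison, here is the same sliding weighted sum as a minimal Python sketch (not q, and not tuned for speed). The function name weighted_moving_sum is made up for illustration; it returns None where the q version returns nulls, i.e. until a full window is available.

```python
def weighted_moving_sum(x, weights):
    """y[i] = sum of weights[k] * x[i - n + 1 + k] over the last n = len(weights) samples."""
    n = len(weights)
    out = []
    for i in range(len(x)):
        if i < n - 1:
            out.append(None)  # window not yet full
        else:
            window = x[i - n + 1 : i + 1]
            out.append(sum(w * v for w, v in zip(weights, window)))
    return out

x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = weighted_moving_sum(x, [-3, -1, 1, 3])
# y[3] = -3*x[0] - x[1] + x[2] + 3*x[3], as in the question
```

Note that the weights here are listed oldest sample first, whereas f in the q answer takes them newest first (3 1 -1 -3).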
This may be a bit old but I thought I'd weigh in. There is a paper I wrote last year on signal processing that may be of some value. Working purely within KDB, dependent on the signal sizes you are using, you will see much better performance with a FFT based convolution between the kernel/window and the signal.
However, I've only written up a simple radix-2 FFT, although in my github repo I do have the untested work for a more flexible Bluestein algorithm which will allow for more variable signal length. https://github.com/callumjbiggs/q-signals/blob/master/signal.q
If you wish to go down the path of performing a full manual convolution by a moving sum, then the best method would be to break it up into blocks equal to the kernel/window size (which was based on some work Arthur W did many years ago)
q)vec:10000?100.0
q)weights:30?1.0
q)wsize:count weights
q)(weights$(((wsize-1)#0.0),vec)til[wsize]+) each til count vec
32.5931 75.54583 100.4159 124.0514 105.3138 117.532 179.2236 200.5387 232.168.
If your input list is not big, you could use the technique mentioned here:
https://code.kx.com/q/cookbook/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
That uses the 'scan' adverb. The process creates multiple lists, which might be inefficient for big lists.
Other solution using scan is:
q)f:{sum y*next\[z;x]} / x-input list, y-weights, z-window size-1
q)f[x;-3 -1 1 3;3]
This function also creates multiple lists so again might not be very efficient for big lists.
Another option is to use indices to fetch the target items from the input list and perform the calculation. This operates only on the input list.
q) f:{[l;w;i]sum w*l i+til count w} / w- weight, l- input list, i-current index
q) f[x;-3 -1 1 3] each til count x
This is a very basic function. You can add more variables to it as per your requirements.

Predictors of different size for time series prediction using LSTM with Keras

I would like to predict time series values X using another time series Y and the past values of X. In detail, I would like to predict X at time t (Xt) using (Xt-p,...,Xt-1) and (Yt-p,...,Yt-1,Yt), with p the dimension of the "look back".
So, my problem is that I do not have the same length for my 2 predictors.
Let's use an example to be clearer.
If I use a timestep of 2, I would have for one observation :
[(Xt-p,Yt-p),...,(Xt-1,Yt-1),(??,Yt)] as input and Xt as output. I do not know what to use instead of the ??
I understand that mathematically speaking I need to have the same length for my predictors, so I am looking for a value to replace the missing value.
I really do not know if there is a good solution here or if I could do something, so any help would be greatly appreciated.
Cheers !
PS: you could see my problem as if I wanted to predict the number of ice creams sold one day in advance in a city using the weather forecast for the next day. X would be the number of ice creams and Y could be the temperature.
You could e.g. do the following:
from keras.layers import Input, LSTM, Dense, Concatenate
from keras.models import Model

input_x = Input(shape=input_shape_x)  # sequences of past X values
input_y = Input(shape=input_shape_y)  # sequences of Y values (one step longer)
lstm_for_x = LSTM(50, return_sequences=False)(input_x)
lstm_for_y = LSTM(50, return_sequences=False)(input_y)
# merged = merge([lstm_for_x, lstm_for_y], mode="concat")  # for keras < 2.0
merged = Concatenate()([lstm_for_x, lstm_for_y])  # for keras >= 2.0
output = Dense(1)(merged)
model = Model([input_x, input_y], output)
model.compile(..)
model.fit([X, Y], X_next)
Where X is an array of sequences, X_next is X p-steps ahead and Y is an array of sequences of Ys.
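To make the two input shapes concrete, here is a small plain-Python sketch of how the windows might be built. It assumes x and y are equally long lists aligned in time; make_windows is a hypothetical helper, not part of Keras. Because each LSTM branch gets its own input, the X windows (length p) and the Y windows (length p + 1) do not need a placeholder for the missing value.

```python
def make_windows(x, y, p):
    """Return X windows of length p, Y windows of length p + 1, and targets X_t."""
    x_windows, y_windows, targets = [], [], []
    for t in range(p, len(x)):
        x_windows.append(x[t - p : t])      # X_{t-p} .. X_{t-1}
        y_windows.append(y[t - p : t + 1])  # Y_{t-p} .. Y_t
        targets.append(x[t])                # X_t, the value to predict
    return x_windows, y_windows, targets

x = [10, 11, 12, 13, 14]
y = [20, 21, 22, 23, 24]
xw, yw, tg = make_windows(x, y, p=2)
```

Before being passed to model.fit, the lists would still need converting to arrays with a trailing feature dimension.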

Estimating change of a cyclic boolean variable

We have a boolean variable X which is either true or false and alternates at each time step with a probability p. I.e. if p is 0.2, X would alternate once every 5 time steps on average. We also have a time line and observations of the value of this variable at various non-uniformly sampled points in time.
How would one learn, from observations, the probability that after t+n time steps where t is the time X is observed and n is some time in the future that X has alternated/changed value at t+n given that p is unknown and we only have observations of the value of X at previous times? Note that I count changing from true to false and back to true again as changing value twice.
I'm going to approach this problem as if it were on a test.
First, let's name the variables.
Bx is value of the boolean variable after x opportunities to flip (and B0 is the initial state). P is the chance of changing to a different value every opportunity.
Given that each flip opportunity is not related to other flip opportunities (there is, for example, no minimum number of opportunities between flips) the math is extremely simple; since events are not affected by the events of the past, we can consolidate them into a single computation, which works best when considering Bx not as a boolean value, but as itself a probability.
Here is the domain of the computations we will use: Bx is a probability (with a value between 0 and 1 inclusive) representing the likelihood of truth. P is a probability (with a value between 0 and 1 inclusive) representing the likelihood of flipping at any given opportunity.
The probability of falseness, 1 - Bx, and the probability of not flipping, 1 - P, are probabilistic identities which should be quite intuitive.
Assuming these simple rules, the general probability of truth of the boolean value is given by the recursive formula Bx+1 = Bx*(1-P) + (1-Bx)*P.
Code (in C++, because it's my favorite language and you didn't tag one):
#include <cstdio>

int main()
{
    int max_opportunities = 8;      // Total number of chances to flip.
    float flip_chance = 0.2f;       // Probability of flipping each opportunity.
    float probability_true = 1.0f;  // Starting probability of truth.
    // 1.0 is "definitely true" and 0.0 is
    // "definitely false", but you can extend this
    // to situations where the initial value is not
    // certain (say, 0.8 = 80% probably true) and
    // it will work just as well.
    for (int opportunities = 0; opportunities < max_opportunities; ++opportunities)
    {
        probability_true = probability_true * (1 - flip_chance) +
                           (1 - probability_true) * flip_chance;
    }
    std::printf("%f\n", probability_true);  // Probability of truth after all opportunities.
    return 0;
}
Here is that code on ideone (the answer for P=0.2 and B0=1 and x=8 is B8=0.508398). As you would expect, given that the value becomes less and less predictable as more and more opportunities pass, the final probability will approach Bx=0.5. You will also observe oscillations between more and less likely to be true if your chance of flipping is high (for instance, with P=0.8, the beginning of the sequence is B={1.0, 0.2, 0.68, 0.392, 0.5648, 0.46112, ...}).
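As a cross-check, the recursion has a closed form: subtracting 1/2 from both sides gives B(x+1) - 1/2 = (1 - 2P)(Bx - 1/2), so Bx = 1/2 + (B0 - 1/2)(1 - 2P)^x. The Python sketch below compares the two; the function names are chosen only for illustration.

```python
def probability_true(b0, p, steps):
    """Iterate B_{x+1} = B_x * (1 - P) + (1 - B_x) * P."""
    b = b0
    for _ in range(steps):
        b = b * (1 - p) + (1 - b) * p
    return b

def probability_true_closed_form(b0, p, steps):
    """Closed form: B_x = 1/2 + (B_0 - 1/2) * (1 - 2P)^x."""
    return 0.5 + (b0 - 0.5) * (1 - 2 * p) ** steps

iterated = probability_true(1.0, 0.2, 8)
closed = probability_true_closed_form(1.0, 0.2, 8)
print(round(iterated, 6), round(closed, 6))
```

The factor (1 - 2P) is negative when P > 0.5, which accounts for the oscillation described above.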
For a more complete solution that will work for more complicated scenarios, consider using a stochastic matrix (page 7 has an example).
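A minimal sketch of the stochastic-matrix view for this two-state case, in Python: the state is a row vector [P(true), P(false)] and each opportunity multiplies it by the transition matrix. The helper step is illustrative and is not taken from the linked material.

```python
def step(state, matrix):
    """Row vector times matrix: new_state[j] = sum over i of state[i] * matrix[i][j]."""
    return [
        sum(state[i] * matrix[i][j] for i in range(len(state)))
        for j in range(len(matrix[0]))
    ]

p = 0.2
transition = [
    [1 - p, p],  # from true:  stay true, flip to false
    [p, 1 - p],  # from false: flip to true, stay false
]
state = [1.0, 0.0]  # start as definitely true
for _ in range(8):
    state = step(state, transition)
```

After eight steps state[0] agrees with the B8 value quoted above. The same pattern extends to more states or to asymmetric flip probabilities.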

Mathematica NMinimize runs into memory problems

I'm trying to minimize my function "FunctionToMinimize", which is defined as follows:
FunctionToMinimize[a_, b_, c_, d_] :=
  (2.35*Sqrt[Variance[1/2*(a*#1 + b*#2 + c*#3 + d*#4)]]/
     Mean[1/2*(a*#1 + b*#2 + c*#3 + d*#4)]) &[
   DataList1[[1 ;; 1000]], DataList2[[1 ;; 1000]],
   DataList3[[1 ;; 1000]], DataList4[[1 ;; 1000]]]
The four parameters a, b, c and d are restricted to be somewhere between 0.5 and 1.5. My problem is that if I call
NMinimize[{FunctionToMinimize[w, x, y, z],
  0.75 < w < 1.25 && 0.75 < y < 1.25 && 0.75 < x < 1.25 && 0.75 < z < 1.25},
 {w, x, y, z}]
the Mathematica kernel shuts down because it does not have enough memory. If I use only the first 100 entries in my DataLists, it finds results (in 4.1 sec), but if I use DataList[[1;;1000]] or even more entries, the kernel crashes.
Does anybody have an idea why the NMinimize function uses so much memory? I need to run the minimization for 150'000 events in each list...
Thanks for your answer,
Cheers,
Andreas
I would guess (but haven't in any way checked) that the problem is that on each call to your function, Mathematica is trying to construct a symbolic expression derived from all your data and that occupies much more memory than you'd expect.
Regardless, the good news -- if you haven't long since moved on and forgotten about this problem -- is that you can turn the function into something much simpler.
So, first of all, the 2.35 and the 1/2s just change your function by a constant factor and don't affect where the minimum is, so let's ignore them. Next, your function is always non-negative, so minimizing it is the same as minimizing its square, so let's do that.
So now you're trying to minimize var(aw+bx+cy+dz)/mean(aw+bx+cy+dz)^2 where w,x,y,z are (perhaps quite long) vectors.
Now your numerator and denominator are both just quadratic forms in a,b,c,d whose coefficients depend (in fixed ways) on those vectors. Specifically, suppose your vectors have length N. Then your function is just
[sum((aw+bx+cy+dz)^2)/N - (sum(aw+bx+cy+dz))^2/N^2] / ((sum(aw+bx+cy+dz))^2/N^2)
which you might prefer to write as N sum((aw+bx+cy+dz)^2) / (sum(aw+bx+cy+dz))^2 - 1
and in that fraction, e.g., the coefficient of bc in the numerator is 2 sum(xy), and the coefficient in the denominator is 2 sum(x) sum(y).
So you can take your big vectors, compute the relevant coefficients once, and then just ask Mathematica to optimize a function of the form (quadratic / quadratic), which should be pretty painless.
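The precomputation can be sketched in Python (not Mathematica) to show the idea: build the matrix of pairwise sums and the vector of sums once, after which each evaluation costs a handful of multiplications regardless of N. All names and the tiny data set are made up for illustration. This uses the divide-by-N variance from the formula above; Mathematica's Variance divides by N-1, which only rescales the objective by a constant and leaves the minimizer unchanged.

```python
def precompute(vectors):
    """One pass over the data: pairwise sums of products and per-vector sums."""
    n = len(vectors[0])
    gram = [[sum(u * v for u, v in zip(vi, vj)) for vj in vectors] for vi in vectors]
    sums = [sum(v) for v in vectors]
    return n, gram, sums

def objective(coeffs, n, gram, sums):
    """N * sum((a w + b x + ...)^2) / (sum(a w + b x + ...))^2 - 1, from precomputed sums."""
    k = len(coeffs)
    numerator = sum(coeffs[i] * coeffs[j] * gram[i][j] for i in range(k) for j in range(k))
    denominator = sum(c * s for c, s in zip(coeffs, sums)) ** 2
    return n * numerator / denominator - 1.0

def direct(coeffs, vectors):
    """Reference: variance / mean^2 of the combined vector, computed the slow way."""
    combined = [sum(c * v[t] for c, v in zip(coeffs, vectors)) for t in range(len(vectors[0]))]
    mean = sum(combined) / len(combined)
    variance = sum((z - mean) ** 2 for z in combined) / len(combined)
    return variance / mean ** 2

vectors = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [0.5, 0.5, 1.5], [3.0, 2.0, 1.0]]
coeffs = [1.0, 0.8, 1.2, 0.9]
n, gram, sums = precompute(vectors)
fast = objective(coeffs, n, gram, sums)
slow = direct(coeffs, vectors)
```

Here fast and slow agree up to floating-point rounding, and only objective needs to be handed to the optimizer.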
