How to draw a Huffman tree properly - huffman-code

I have the following symbols and probabilities and I would like to draw a Huffman tree for them:
s = 0.04 || i = 0.1 || n = 0.2 || b = 0.04 || a = 0.3 || d = 0.26 || ~ = 0.06
Based on Huffman algorithm, I generated the following tree:
This was done by:
Join s + i
Join the result of 1 and n
Join ~ + d
Join b + a
Join the result of 3 and 4
Join the result of 5 and 2
My questions:
is what I have done right or not? If so, is it acceptable that the final probability (result of 6) greater than 1?
Thanks

No, what you have done is not right, and no, the only thing that is acceptable is that the final number must equal the sum of the starting numbers.
The sums do match in your case, since 0.34 + 0.66 = 1, so I don't know why you're asking that. By the way, the numbers do not have to be probabilities, so the sum does not have to be 1. Often the numbers are frequencies, i.e. the count of the number of times that symbol appeared.
As for your tree, you must always join the two lowest numbers, be they a leaf or the top of a sub-tree. At the start, that's s = 0.04 and b = 0.04. You didn't do that, so your tree does not represent the application of Huffman's algorithm. Then to that 0.08 you add ~ = 0.06. And so on.

Related

How this method or formula for calculating ROC AUC works?

I was trying to calculate the AUC using MySQL for the data in table like below:
y p
1 0.872637
0 0.130633
0 0.098054
...
...
1 0.060190
0 0.110938
I came across the following SQL query which is giving the correct AUC score (I verified using sklearn method).
SELECT (sum(y*r) - 0.5*sum(y)*(sum(y)+1)) / (sum(y) * sum(1-y)) AS auc
FROM (
SELECT y, row_number() OVER (ORDER BY p) r
FROM probs
) t
Using pandas this can be done as follows:
temp = df.sort_values(by="p")
temp['r'] = np.arange(1, len(df)+1, 1)
temp['yr'] = temp['y']*temp['r']
print( (sum(temp.yr) - 0.5*sum(temp.y)*(sum(temp.y)+1)) / (sum(temp.y) * sum(1-temp.y)) )
I did not understand how we are able to calculate AUC using this method. Can somebody please give intuition behind this?
I am already familiar with the trapezoidal method which involves summing the area of small trapezoids under the ROC curve.
Short answer: it is Wilcoxon-Mann-Whitney statistic, see
https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
The page has proof as well.
The bottom part of your formula is identical to the formula in the wiki. The top part is trickier. f in wiki corresponds to p in your data and t_0 and t_1 are indexes in the data frame. Note that we first sort by p, which makes our life easier.
Note that the double sum may be decomposed as
Sum_{t_1 such that y(t_1)=1} #{t_0 such that p(t_0) < p(t_1) and y(t_0)=0}
Here # stands for the total number of such indexes.
For each row index t_1 (such that y(t_1) =1), how many t_0 are such that p(t_0) < p(t_1) and y(t_0)=0? We know that there are exactly t_1 values of p that are less or equal than t_1 because values are sorted. We conclude that
#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = t_1 - #{t_0: t_0 <= t_1 and y(t_0)=1}
Now imagine scrolling down the sorted dataframe. For the first time we meet y=1, #{t_0: t_0 <= t_1 and y(t_0)=1}=1, for the second time we meet y=1, the same quantity is 2, for the third time we meet y=1, the quantity is 3, and so on. Therefore, when we sum the equality over all indexes t_1 when y=1, we get
Sum_{t_1: y(t_1)=1}#{t_0: p(t_0) < p(t_1) and y(t_0)=1) = Sum_{t_1: y(t_1)=1} t_1 - (1 + 2 + 3 + ... + n),
where n is the total number of ones in y column. Now we need to do one more simplification. Note that
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1)
If y(t_1) is not one, it is zero. Therefore,
Sum_{t_1: y(t_1)=1} t_1 = Sum_{t_1: y(t_1)=1} t_1 y(t_1) = Sum_{t} t y(t)
Plugging this to our formula and using that
1 + 2+ 3 + ... + n = n(n+1)/2
finished the proof of the formula you found.
P.S. I think that posting this question on math or stats overflow would make more sense.

How are the matrix values calculated in Octave when we divide a scalar with a vector?

I am starting to use Octave and I am trying to understand how is the underlying calculation done for dividing a Scalar by vector ?
I am able to understand how ./ is operating to give us the results - dividing 1 by every element of the matrix column. However, I am not able to get my head around how we get the values in the second case ? 1 / (1 + a)
Example :
g = 1 ./ (1 + a)
g =
0.50000
0.25000
0.20000
>> g = 1 / (1 + a)
g =
0.044444 0.088889 0.111111
When you divide 1 by a vector, it gives you a vector that yields 1 when multiplied on the left by the first vector. In this sense, it is a sort of 'inverse' of the vector, although it will only be a one way inverse. In your example:
>> (1/(1+a))*(1+a)
ans = 1
>> (1+a)*(1/(1+a))
ans =
0.088889 0.177778 0.222222
0.177778 0.355556 0.444444
0.222222 0.444444 0.555556
You could say 1/(1+a) is the left inverse of 1+a. This would also explain why the dimensions of the vector are transposed. Another way to put it: given a vector v, 1/v is the solution (w) of the vector equation w*v=1.

Dealing with NaN (missing) values for Logistic Regression- Best practices?

I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
I get errors due to these missing values, as the values of my cost-function and gradient vector become NaN, when I try to perform logistic regression using the following Matlab code (from Andrew Ng's Coursera Machine Learning class) :
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
fminunc(#(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costfunction are working functions I created for overall ease of use.
The calculations can be performed smoothly if I replace all NaN values with 1 or 0. However I am not sure if that is the best way to deal with this issue, and I was also wondering what replacement value I should pick (in general) to get the best results for performing logistic regression with missing data. Are there any benefits/drawbacks to using a particular number (0 or 1 or something else) for replacing the said missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Enforcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data it can be drastic, for example, setting all missing values to 1 in a binary sparse data having more zeroes than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point by its closest neighbor.
From my experience, I often found knnimpute useful. However, it may fall short when there are too many missing sites as in your data; the neighbors of a missing site may be incomplete as well, thereby leading to inaccurate estimation. Below, I figured out a walk-around solution to that; it begins with imputing the least incomplete columns, (optionally) imposing a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that impose a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covarites
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
% 'median' - Median of the nearest neighboring values
% 'weighted' - Weighted average of the nearest neighboring values
% 'default' - Unweighted average of the nearest neighboring values
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
% function - A distance function specified using #, for
% example #DISTFUN.
%
if nargin < 3
option = 'mean';
end
if nargin < 4
distMetric = 'euclidean';
end
nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
[~,leastNans] = min(nanValsPerCov);
noNansCov(leastNans) = true;
first = data(nanVals(:,noNansCov),:);
nanRows = find(nanVals(:,noNansCov)==true); i = 1;
for row = first'
data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
i = i+1;
end
end
nSamples = size(data,1);
if nargin < 2
dataNoNans = data(:,noNansCov);
distances = pdist(dataNoNans);
distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
imputeRows = 1:nSamples;
imputeRows = imputeRows(nanVals(:,c));
for r = reshape(imputeRows,1,length(imputeRows))
% Calculate distances
distR = inf(nSamples,1);
%
noNansCov_r = find(isnan(data(r,:))==0);
noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
%
for i = find(nanVals(:,c)'==false)
distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
end
tmp = min(distR(distR>0));
% Impute the missing data at sample r of covariate c
switch option
case 'weighted'
data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
case 'median'
data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
case 'mean'
data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
end
% The missing data in sample r is imputed. Update the sample
% indices of c which are imputed.
nanVals(r,c) = false;
end
fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can just delete the ones with missing values.
If you have many features and it is affordable to lose some information, delete the entire feature with missing values.
The best method is to fill some value (mean, median) in place of missing value. You can calculate the mean of the rest of the training examples for that feature and fill all the missing values with the mean. This works out pretty well as the mean value stays in the distribution of your data.
Note: When you replace the missing values with the mean, calculate the mean only using training set. Also, store that value and use it to change the missing values in the test set also.
If you use 0 or 1 to replace all the missing values then the data may get skewed so it is better to replace the missing values by an average of all the other values.

ML coursera submission (week 2) Feature Normalization

I have written the following code for section "feature normalization"
Here X is the Feature matrix (m*n) such that
m = number of examples
n = number of features
Code
mu = mean(X);
sigma = std(X);
m = size(X,1);
% Subtracting the mean from each row
for i = 1:m
X_norm(i,:) = X(i,:)-mu;
end;
% Dividing the STD from each row
for i = 1:m
X_norm(i,:) = X(i,:)./sigma;
end;
But on submitting it to the server built for Andrew Ng's class, It's not giving me any confirmation if it's wrong or correct.
==
== Part Name | Score | Feedback
== --------- | ----- | --------
== Warm-up Exercise | 10 / 10 | Nice work!
== Computing Cost (for One Variable) | 40 / 40 | Nice work!
== Gradient Descent (for One Variable) | 50 / 50 | Nice work!
== Feature Normalization | 0 / 0 |
== Computing Cost (for Multiple Variables) | 0 / 0 |
== Gradient Descent (for Multiple Variables) | 0 / 0 |
== Normal Equations | 0 / 0 |
== --------------------------------
== | 100 / 100 |
Is this a bug in the web frontend presentation layer or my code?
When the submit() does not give you any points, this means your answer is not correct.
This usually means, that either you have not implemented it yet, or there is a mistake in your implementation.
From what I can see, your indices are not correct. However, in order not to violate the code of conduct of this course, you should ask your question in the Coursera forum (without posting your code).
There are also tutorials with each programming exercise. Those are usually very helpful and guide you through the entire exercise.
You need to iterate for EACH feature
m = size(X,1);
What you are actually getting with m is the number of ROWS (example), but you want to get the number of COLUMNS (Features)
solution:
m = size(X,2);
Try this it worked for and notice that you're making mistake with dividing each row of X without subtracting to mean.
Combine both and do with less code like this -
% Subtracting the mean and Dividing the STD from each row:
for i = 1:m
X_norm(i,:) = (X(i,:) - mu) ./ sigma;
end;
At the end of the class, the final correct answer was given as featureNormalize.m:
function [X_norm, mu, sigma] = featureNormalize(X)
%description: Normalizes the features in X
% FEATURENORMALIZE(X) returns a normalized version of X where
% the mean value of each feature is 0 and the standard deviation
% is 1. This is often a good preprocessing step to do when
% working with learning algorithms.
X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));
% Instructions: First, for each feature dimension, compute the mean
% of the feature and subtract it from the dataset,
% storing the mean value in mu. Next, compute the
% standard deviation of each feature and divide
% each feature by it's standard deviation, storing
% the standard deviation in sigma.
%
% Note that X is a matrix where each column is a
% feature and each row is an example. You need
% to perform the normalization separately for
% each feature.
mu = mean(X);
sigma = std(X);
X_norm = (X - mu)./sigma;
end
If you're taking this class and feel the urge to copy and paste, you're in a grey-area on academic honesty. You're supposed to figure it out from first principles, not google it and regurgitate the answer.

Erlang Calculating Pi to X decimal places

I have been given this question to work on a solution. I'm struggling to get my head around the recursion. Some break down of the question would be very helpful.
Given that Pi can be estimated using the function 4 * (1 – 1/3 + 1/5 – 1/7 + …) with more terms giving greater accuracy, write a function that calculates Pi to an accuracy of 5 decimal places.
I have got some example code however I really don't understand where/why the variables are entered like this. Possible breakdown of this code and why it is not accurate would be appreciated.
-module (pi).
-export ([pi/0]).
pi() -> 4 * pi(0,1,1).
pi(T,M,D) ->
A = 1 / D,
if
A > 0.00001 -> pi(T+(M*A), M*-1, D+2);
true -> T
end.
The formula comes from the evaluation of tg(pi/4) which is equal to 1. The inverse:
pi/4 = arctg(1)
so
pi = 4* arctg(1).
using the technique of the Taylor series:
arctg (x) = x - x^3/3 + ... + (-1)^n x^(2n+1)/(2n+1) + o(x^(2n+1))
so when x = 1 you get your formula:
pi = 4 * (1 – 1/3 + 1/5 – 1/7 + …)
the problem is to find an approximation of pi with an accuracy of 0.00001 (5 decimal). Lookinq at the formula, you can notice that
at each step (1/3, 1/5,...) the new term to add:
is smaller than the previous one,
has the opposite sign.
This means that each term is an upper estimation of the error (the term o(x^(2n+1))) between the real value of pi and the evaluation up to this term.
So it can be use to stop the recursion at a level where it is guaranty that the approximation is better than this term. To be correct, the program
you propose multiply the final result of the recursion by 4, so the error is no more guaranteed to be smaller than term.
looking at the code:
pi() -> 4 * pi(0,1,1).
% T = 0 is the initial estimation
% M = 1 is the sign
% D = 1 initial value of the term's index in the Taylor serie
pi(T,M,D) ->
A = 1 / D,
% evaluate the term value
if
A > 0.00001 -> pi(T+(M*A), M*-1, D+2);
% if the precision is not reach call the pi function with,
% new serie's evaluation (the previous one + sign * term): T+(M*A)
% new inverted sign: M*-1
% new index: D+2
true -> T
% if the precision is reached, give the result T
end.
To be sure that you have reached the right accuracy, I propose to replace A > 0.00001 by A > 0.0000025 (= 0.00001/4)
I can't find any error in this code, but I can't test it right now, anyway:
T is probably "total", M is "multiplicator", and D is "divisor".
By every step you:
check (the 'if' is in some way similar to a switch/case in c/c++/java) if the next term (A = 1/D) is bigger than 0.00001. If not, you can stop the recursion, you've got the 5 decimal places you were looking for. So "if true (default case) -> return T"
if it's bigger, you multiply A by M, add to the total, then multiply M by -1, add 2 to D, and repeat (so you get the next term, add again, and so on).
pi(T,M,D) ->
A = 1 / D,
if
A > 0.00001 -> pi(T+(M*A), M*-1, D+2);
true -> T
end.
I don't know Erlang myself but from the looks of it you are checking if 1/D is < 0.00001 when in reality you should be checking 4 * 1/D because that 4 is going to be multiplied through. For example in your case if 1/D was 0.000003 you would stop four function, but your total would actually have changed by 0.000012. Hope this helps.

Resources