Estimated size of the self-join operation on a relation R, given a histogram for R - histogram

Query optimizers typically use summaries of data distributions to estimate the sizes of the intermediate tables generated during query processing. One popular such summarization scheme is a histogram, whereby the input range is partitioned into buckets and a cumulative count is maintained of the number of tuples falling in each bucket. The distribution within a bucket is assumed to be uniform for the purposes of estimation.
The following shows one such histogram for a relation R on a discrete attribute a with domain [1..10]:
Bucket 1: range = [1..2] Cumulative tuple count = 6
Bucket 2: range = [3..8] Cumulative tuple count = 30
Bucket 3: range = [9..10] Cumulative tuple count = 10
What is the estimated size of the self-join operation R x R
A) 46
B) 218
C) 248
D) 1,036
E) 5,672
Answer given in solutions : B
How is the answer to be calculated?

The size of a self-join on attribute R is equal to the summation of the frequency of each value of attribute R.
Here the frequency is given in buckets, e.g. the first bucket has 2 values r with frequency = 6, so we can assume the frequency of each value in bucket one is frequency = 3, similarly for bucket two frequency of each = 30/6 = 5, and for bucket three frequency of each value = 10/2 = 5.
Therefore, the size is
Size = [(3^2)*2] + [(5^2)*6] + [(5^2)*2]
= 218

I've been trying to figure this one out myself (it's from the GRE Computer Science subject test preparation exam).
So far I haven't found an answer as to why the answer is 218, but I have found a connection between the numbers given and the correct answer.
It turns out that that sum of the square of the cumulative tuple counts divided by the number of discrete values in each bucket, you get 218. Less abstractly: 6²/2 + 30²/5 + 10²/2 = 218.
It's not an answer, but at least there's a connection =)

Related

Calculating mean of normal distribution

How can I calculate the mean of normal distribution knowing the sd, a percentile and its value ?
I've got a question where the :
sd = 100
the 18 percentile's value is 1200
I standardized the distribution converting it to the Z score and using fi function and Z table.
then tried to calculate by P(Z > ((1200-mean)/100)) = 0.18
I got that mean = 1142.858 but it is a wrong answer.
what did I do wrong ?
There are two different solutions of this question which are depends on the how the percentile is selected.
If we assume data are arranged in ascending order and pick the 1200 as initial 18 percentile value, the appropriate probabilty function would be P(z<((1200-mean)/100)) = 0.18 and then, if we apply the InvNorm function (Inverse Normal Probability Distribution Function), the corresponding z-score for a probability will be -0.915 which will make the equation as follows;
P(z<-0.915)=0.18 -> -0.915 = (1200-mean)/100 -> the mean will be 1291.5
If we assume data are arranged in ascending order and pick the 1200 as last 18 percentile value,the appropriate probabilty function would be P(z>((1200-mean)/100)) = 0.18 and then, if we apply the InvNorm function (Inverse Normal Probability Distribution Function), the corresponding z-score for a probability will be 0.915 which will make the equation as follows;
P(z> 0.915)=0.18 -> 0.915 = (1200-mean)/100 -> the mean will be 1108.5

Dealing with NaN (missing) values for Logistic Regression- Best practices?

I am working with a data-set of patient information and trying to calculate the Propensity Score from the data using MATLAB. After removing features with many missing values, I am still left with several missing (NaN) values.
I get errors due to these missing values, as the values of my cost-function and gradient vector become NaN, when I try to perform logistic regression using the following Matlab code (from Andrew Ng's Coursera Machine Learning class) :
[m, n] = size(X);
X = [ones(m, 1) X];
initial_theta = ones(n+1, 1);
[cost, grad] = costFunction(initial_theta, X, y);
options = optimset('GradObj', 'on', 'MaxIter', 400);
[theta, cost] = ...
fminunc(#(t)(costFunction(t, X, y)), initial_theta, options);
Note: sigmoid and costfunction are working functions I created for overall ease of use.
The calculations can be performed smoothly if I replace all NaN values with 1 or 0. However I am not sure if that is the best way to deal with this issue, and I was also wondering what replacement value I should pick (in general) to get the best results for performing logistic regression with missing data. Are there any benefits/drawbacks to using a particular number (0 or 1 or something else) for replacing the said missing values in my data?
Note: I have also normalized all feature values to be in the range of 0-1.
Any insight on this issue will be highly appreciated. Thank you
As pointed out earlier, this is a generic problem people deal with regardless of the programming platform. It is called "missing data imputation".
Enforcing all missing values to a particular number certainly has drawbacks. Depending on the distribution of your data it can be drastic, for example, setting all missing values to 1 in a binary sparse data having more zeroes than ones.
Fortunately, MATLAB has a function called knnimpute that estimates a missing data point by its closest neighbor.
From my experience, I often found knnimpute useful. However, it may fall short when there are too many missing sites as in your data; the neighbors of a missing site may be incomplete as well, thereby leading to inaccurate estimation. Below, I figured out a walk-around solution to that; it begins with imputing the least incomplete columns, (optionally) imposing a safe predefined distance for the neighbors. I hope this helps.
function data = dnnimpute(data,distCutoff,option,distMetric)
% data = dnnimpute(data,distCutoff,option,distMetric)
%
% Distance-based nearest neighbor imputation that impose a distance
% cutoff to determine nearest neighbors, i.e., avoids those samples
% that are more distant than the distCutoff argument.
%
% Imputes missing data coded by "NaN" starting from the covarites
% (columns) with the least number of missing data. Then it continues by
% including more (complete) covariates in the calculation of pair-wise
% distances.
%
% option,
% 'median' - Median of the nearest neighboring values
% 'weighted' - Weighted average of the nearest neighboring values
% 'default' - Unweighted average of the nearest neighboring values
%
% distMetric,
% 'euclidean' - Euclidean distance (default)
% 'seuclidean' - Standardized Euclidean distance. Each coordinate
% difference between rows in X is scaled by dividing
% by the corresponding element of the standard
% deviation S=NANSTD(X). To specify another value for
% S, use D=pdist(X,'seuclidean',S).
% 'cityblock' - City Block distance
% 'minkowski' - Minkowski distance. The default exponent is 2. To
% specify a different exponent, use
% D = pdist(X,'minkowski',P), where the exponent P is
% a scalar positive value.
% 'chebychev' - Chebychev distance (maximum coordinate difference)
% 'mahalanobis' - Mahalanobis distance, using the sample covariance
% of X as computed by NANCOV. To compute the distance
% with a different covariance, use
% D = pdist(X,'mahalanobis',C), where the matrix C
% is symmetric and positive definite.
% 'cosine' - One minus the cosine of the included angle
% between observations (treated as vectors)
% 'correlation' - One minus the sample linear correlation between
% observations (treated as sequences of values).
% 'spearman' - One minus the sample Spearman's rank correlation
% between observations (treated as sequences of values).
% 'hamming' - Hamming distance, percentage of coordinates
% that differ
% 'jaccard' - One minus the Jaccard coefficient, the
% percentage of nonzero coordinates that differ
% function - A distance function specified using #, for
% example #DISTFUN.
%
if nargin < 3
option = 'mean';
end
if nargin < 4
distMetric = 'euclidean';
end
nanVals = isnan(data);
nanValsPerCov = sum(nanVals,1);
noNansCov = nanValsPerCov == 0;
if isempty(find(noNansCov, 1))
[~,leastNans] = min(nanValsPerCov);
noNansCov(leastNans) = true;
first = data(nanVals(:,noNansCov),:);
nanRows = find(nanVals(:,noNansCov)==true); i = 1;
for row = first'
data(nanRows(i),noNansCov) = mean(row(~isnan(row)));
i = i+1;
end
end
nSamples = size(data,1);
if nargin < 2
dataNoNans = data(:,noNansCov);
distances = pdist(dataNoNans);
distCutoff = min(distances);
end
[stdCovMissDat,idxCovMissDat] = sort(nanValsPerCov,'ascend');
imputeCols = idxCovMissDat(stdCovMissDat>0);
% Impute starting from the cols (covariates) with the least number of
% missing data.
for c = reshape(imputeCols,1,length(imputeCols))
imputeRows = 1:nSamples;
imputeRows = imputeRows(nanVals(:,c));
for r = reshape(imputeRows,1,length(imputeRows))
% Calculate distances
distR = inf(nSamples,1);
%
noNansCov_r = find(isnan(data(r,:))==0);
noNansCov_r = noNansCov_r(sum(isnan(data(nanVals(:,c)'==false,~isnan(data(r,:)))),1)==0);
%
for i = find(nanVals(:,c)'==false)
distR(i) = pdist([data(r,noNansCov_r); data(i,noNansCov_r)],distMetric);
end
tmp = min(distR(distR>0));
% Impute the missing data at sample r of covariate c
switch option
case 'weighted'
data(r,c) = (1./distR(distR<=max(distCutoff,tmp)))' * data(distR<=max(distCutoff,tmp),c) / sum(1./distR(distR<=max(distCutoff,tmp)));
case 'median'
data(r,c) = median(data(distR<=max(distCutoff,tmp),c),1);
case 'mean'
data(r,c) = mean(data(distR<=max(distCutoff,tmp),c),1);
end
% The missing data in sample r is imputed. Update the sample
% indices of c which are imputed.
nanVals(r,c) = false;
end
fprintf('%u/%u of the covariates are imputed.\n',find(c==imputeCols),length(imputeCols));
end
To deal with missing data you can use one of the following three options:
If there are not many instances with missing values, you can just delete the ones with missing values.
If you have many features and it is affordable to lose some information, delete the entire feature with missing values.
The best method is to fill some value (mean, median) in place of missing value. You can calculate the mean of the rest of the training examples for that feature and fill all the missing values with the mean. This works out pretty well as the mean value stays in the distribution of your data.
Note: When you replace the missing values with the mean, calculate the mean only using training set. Also, store that value and use it to change the missing values in the test set also.
If you use 0 or 1 to replace all the missing values then the data may get skewed so it is better to replace the missing values by an average of all the other values.

Difference between the weight parameter in xgb.DMatrix and scale_pos_weight in hyper params list?

I am having a little difficulty understanding what's the difference between the weight function in xgb.DMatrix and the sum_pos_weight parameter in the param list. I am going through the following code which is using the Higgs data;
Due to the data being unbalanced, the author defines a weight parameter:
weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
sumwpos <- sum(weight * (label==1.0))
sumwneg <- sum(weight * (label==0.0))
However column 32 is already a weight variable, so the author is modifying an already defined weight variable?
Then, the modified weight variable is being set as the "weight" argument of xgb.DMatrix:
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
Additionally, in the param list the author has: "scale_pos_weight" = sumwneg / sumwpos,.
so scale_pos_weight is a function of sumneg which is a function of weight which is a function of a previously defined weight (column 32). So I am confused.
What does the author do in the following line: weight <- as.numeric(dtrain[[32]]) * testsize / length(label)
What is the difference in setting the weight in xgb.DMatrix and again in sum_pos_weight?
When you set
xgmat <- xgb.DMatrix(data, label = label, weight = weight, missing = -999.0)
weight should be a vector corresponding to your data rows
If for example you have the following data:
A B C
1 1 1 1
2 2 2 2
you need to set weight as a vector of 2 weights
weight <- c(1, 2)
So you will have a weight of 1 to the first event and weight of 2 to the 2nd event. You ask your self why is it good? Assume event 1 has happened 1 time and event 2 happened 2 times, you'd like co responsive weights to them specifically mentioning the amount of time that event has occurred.
Here are few more examples for using weights:
If you want recent events to have more "value"
The amount of confidence you have in a data row. you will set all weights to be between 0 to 1 and the weight will represent how much you sure of that data. for example if weight = 0.88 you gave that row 88% confidence
If you have repetitive events. instead of creating more rows, you can set them once and give them a weight as the number they've repeated
scale_pos_weight is usually used when you have "imbalanced data". for example, assuming you have a classification problem where you have 5% of the data as 1 and 95% of the data as 0, you would like to give more weight for every positive "event". So you can just set scale_pos_weight = 19 (or as the author wrote: sumneg/sumpos)
As for the "author" re defining weight. I cannot know without the full code what he did there, but I assume he's doing some sort of normalization to the weights.

if (freq) x$counts else x$density length > 1 and only the first element will be used

for my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employee at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what is the problem and write me the right command?
Thank you!
It is worth pointing out that the message displayed is a Warning message, and should not prevent the results being plotted. However, it does indicate there are some issues with the data.
Without the full dataset, it is not 100% obvious what may be the problem. I believe it is caused by the data not being in the correct format, with two potential issues. Firstly, some values have a value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two datasets:
A dataframe which has been aggregated grouped into consistently sized bins.
A list of values X which in the data
I prefer the second technique. As originally shown here The expandRows() function in the package splitstackshape can be used to repeat the number of rows in the dataframe by the number of observations:
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = runif(900, 0, 1000))
library(splitstackshape)
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)

How to computer Document Length and Average Document Length in BM25

Please tell me anyone as how to compute document(dl) length and average document length(avdl) in BM25. For example we have the following 4 documents:
new york times east // Doc1
los angeles times west //Doc2
washington post district columbia //Doc3
wall street journal north //Doc4
The first step is to remove stop-words and perform stemming so that we can consider a document d as a set of constituent terms with corresponding term frequencies {tf(t,d) : t \in d}.
Now, the notion of document length is slightly different in vector space and probabilistic models, e.g. BM25, language model etc. While in the former, document length refers to the norm of a vector, in the latter it typically refers to total number of terms in a document.
Nonetheless, the vector norm notion of documents can, in principle, be also applied to probabilistic models as well because the term frequency values still remain normalized between 0 and 1. However, the normalized term frequency values would no longer sum to 1.
To illustrate with your example: In the case of vector space model, the length is defined as the norm of a vector, which is the case of doc1, is norm(doc1) = square root of the sum of squares of the term frequency values for each unique term in doc1 = sqrt(1^2 + 1^2 + 1^2 + 1^2) = sqrt(4) = 2.
For the probabilistic models, length would be defined as summation of term frequencies of the component terms = 1 + 1 + 1 + 1 = 4. The normalized term frequency values of a term t would be P(t,d) = tf(t,d)/dl(d) so that \sum{P(t,d) t \in d} = 1, e.g. 1/4+1/4+1/4+1/4=1.
The BM25Similarity implementation of Lucene uses vector norms as document lengths whereas the Terrier uses sum of tfs of constituent terms as document lengths.

Resources