Problem
Let's say we have a dataframe that looks like this:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
...
If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?
Age is pretty straightforward since it is already numerical.
Job can be hashed / indexed since it is a categorical variable.
Friends is a list of categorical variables. How would I go about representing this feature?
Approaches:
Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:
NULL -> 0
engineer -> 42069
World of Warcraft -> 9001
Netflix -> 14
9gag -> 9
manager -> 250
Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right-hand side. If the friends list is longer than 5, only the first 5 elements are kept.
Approach 1: Hash and Stack
The dataframe after the feature transformation would look like this:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
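A minimal sketch of this transform in Python, assuming the lookup table above stands in for the hashing function (hash_id, MAX_FRIENDS, and hash_and_stack are names invented for this sketch):

# Sketch of Approach 1: hash each field, then stack age, job, and padded friends.
hash_id = {None: 0, 'engineer': 42069, 'World of Warcraft': 9001,
           'Netflix': 14, '9gag': 9, 'manager': 250}
MAX_FRIENDS = 5

def hash_and_stack(age, job, friends):
    friends = friends or []                          # NULL becomes an empty list
    hashed = [hash_id[f] for f in friends[:MAX_FRIENDS]]
    hashed += [0] * (MAX_FRIENDS - len(hashed))      # zero-pad on the right
    return [age, hash_id[job]] + hashed

hash_and_stack(23, 'engineer', ['World of Warcraft', 'Netflix', '9gag'])
# -> [23, 42069, 9001, 14, 9, 0, 0]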
Limitations
Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
...
Compare the features of the first and third record:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 14, 9, 9001, 0, 0] 1
Both records have the same set of friends, but the friends are listed in a different order, which produces different feature vectors even though they should be identical.
Approach 2: Hash, Order, and Stack
To solve the limitation of Approach 1, simply order the hashes from the friends feature. This would result in the following feature transform (assuming descending order):
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 9001, 14, 9, 0, 0] 1
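The only change from the Approach 1 sketch is sorting the hashed friends before padding (reusing hash_id and MAX_FRIENDS from that sketch; descending order to match the table above):

def hash_order_and_stack(age, job, friends):
    friends = friends or []
    hashed = sorted((hash_id[f] for f in friends[:MAX_FRIENDS]), reverse=True)
    hashed += [0] * (MAX_FRIENDS - len(hashed))
    return [age, hash_id[job]] + hashed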
This approach has a limitation too. Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
42 'manager' ['Netflix', '9gag'] 1
...
Applying feature transform with ordering we get:
row feature label
1 [23, 42069, 9001, 14, 9, 0, 0] 1
2 [35, 250, 0, 0, 0, 0, 0] 0
3 [26, 42069, 9001, 14, 9, 0, 0] 1
4 [42, 250, 14, 9, 0, 0, 0] 1
What is the problem with the above features? The hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array, but not in row 4. This inconsistency would interfere with training.
Approach 3: Convert Array to Columns
What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?
Well, let's assume the friends vocabulary size is large (>100k). It would then be madness to go and create >100k columns, one for each hashed vocabulary element.
Approach 4: One-Hot-Encoding and then Sum
How about this: convert each hash to a one-hot vector, and add up all these vectors.
In this case, the feature in row one for example would look like this:
[23, 42069, 01x8, 1, 01x4, 1, 01x8986, 1, 01x(max_hash_size - 9001)]
Where 01x8 denotes a run of 8 zeros.
The problem with this approach is that these vectors will be huge and sparse.
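One way to keep Approach 4 tractable is to store the summed one-hot vectors sparsely; here is a hedged sketch using scikit-learn's MultiLabelBinarizer (the example lists are from the table above, and the dense age/job columns would still be stacked alongside, e.g. with scipy.sparse.hstack):

# Sketch of Approach 4 with a sparse multi-hot encoding of the friends lists.
# sparse_output=True avoids materialising the huge dense one-hot sums.
from sklearn.preprocessing import MultiLabelBinarizer

friends_lists = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],                                    # NULL becomes an empty list
    ['Netflix', '9gag', 'World of Warcraft'],
]
mlb = MultiLabelBinarizer(sparse_output=True)
friends_features = mlb.fit_transform(friends_lists)   # shape (3, vocabulary size)
# Rows 1 and 3 are identical, so the ordering problem disappears.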
Approach 5: Use Embedding Layer and 1D-Conv
With this approach, we feed each word in the friends array to the embedding layer, then convolve. Similar to the Keras IMDB example: https://keras.io/examples/imdb_cnn/
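For reference, a rough Keras sketch of that setup for the friends feature alone (layer sizes and vocab_size are arbitrary assumptions; age and job would be fed in through a separate input in practice):

# Rough sketch of Approach 5: embed the hashed friends, convolve, then classify.
from tensorflow.keras import layers, models

vocab_size = 100_000    # assumed size of the hashing space
model = models.Sequential([
    layers.Embedding(input_dim=vocab_size, output_dim=32),  # (batch, 5) -> (batch, 5, 32)
    layers.Conv1D(filters=64, kernel_size=3, activation='relu'),
    layers.GlobalMaxPooling1D(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])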
Limitation: this requires a deep learning framework. I want something that works with traditional machine learning; I want to use a logistic regression or a decision tree.
What are your thoughts on this?
As another answer mentioned, you've already listed a number of alternatives that could work, depending on the dataset and the model and such.
For what it's worth, a typical logistic regression model that I've encountered would use Approach 3, and convert each of your friends strings into a binary feature. If you're opposed to having 100k features, you could treat these features like a bag-of-words model and discard the stopwords (very common features).
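A hedged sketch of that binary encoding with scikit-learn's CountVectorizer, treating each friends list as an already-tokenized document (the min_df / max_df cut-offs are assumed knobs for dropping rare and "stopword"-like friends):

# Sketch: one binary column per friend, with frequency cut-offs to limit the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

friends_lists = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],
    ['Netflix', '9gag', 'World of Warcraft'],
]
vectorizer = CountVectorizer(analyzer=lambda doc: doc,  # lists are already tokenized
                             binary=True, min_df=1, max_df=1.0)
X_friends = vectorizer.fit_transform(friends_lists)     # sparse (n_rows, n_kept_friends)
# Raise min_df or lower max_df to drop rare or near-universal friends.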
I'll also throw a hashing variant into the mix:
Bloom Filter
You could store the strings in question in a bloom filter for each training example, and use the bits of the bloom filter as a feature in your logistic regression model. This is basically a hashing solution like you've already mentioned, but it takes care of some of the indexing/sorting issues, and provides a more principled tradeoff between sparsity and feature uniqueness.
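A minimal sketch of that idea in Python, with an md5-based family of hash functions; the filter size M_BITS and the number of hashes K_HASHES are assumptions you would tune:

# Sketch: each friend sets K_HASHES bits in an M_BITS-wide Bloom filter,
# and the resulting bits become binary features for the model.
import hashlib

M_BITS = 256      # filter size = number of binary features
K_HASHES = 3      # hash functions per element

def bloom_features(friends):
    bits = [0] * M_BITS
    for friend in friends or []:
        for i in range(K_HASHES):
            digest = hashlib.md5(f"{i}:{friend}".encode()).hexdigest()
            bits[int(digest, 16) % M_BITS] = 1
    return bits

row1 = bloom_features(['World of Warcraft', 'Netflix', '9gag'])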
First, there is no definitive answer to this problem. You presented five alternatives, and all five are valid; it all depends on the dataset you are using.
Considering this, I will list the options that I find most advantageous. To me, option 5 is the best, but since you want to use traditional machine learning techniques, I will discard it. So I would go for option 4, provided you have the hardware to deal with it. If not, I would try approach 2. As you pointed out, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array but not in row 4; that won't be a problem if you have enough data for training (again, it all depends on the data available). Even if I ran into problems with this approach, I would apply a data augmentation technique before discarding it.
Option 1 seems to me the worst: it carries a high risk of overfitting and will certainly use a lot of computational resources.
Hope this helps!
Approach 1 (Hash and Stack) and Approach 2 (Hash, Order, and Stack) lose their limitations if the output of the hashing function is treated as the index into a sparse vector whose entries are set to 1, rather than as the value stored at each position of the vector.
Then, whenever "World of Warcraft" is in the friends array, the feature vector will have a value of 1 in position 9001, regardless of the position of "World of Warcraft" in the friends array (the limitation of Approach 1) and regardless of which other elements are in the friends array (the limitation of Approach 2). If "World of Warcraft" is not in the friends array, then the value of the feature vector in position 9001 will most likely be 0 (look up hashing-trick collisions to learn more).
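This is essentially the hashing trick as implemented by scikit-learn's FeatureHasher; a hedged sketch (n_features sets the collision/sparsity trade-off and is an assumed value):

# Sketch of the sparse hashing-trick encoding of the friends lists.
# alternate_sign=False keeps the entries as plain presence indicators.
from sklearn.feature_extraction import FeatureHasher

friends_lists = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],
    ['Netflix', '9gag', 'World of Warcraft'],
]
hasher = FeatureHasher(n_features=2**18, input_type='string', alternate_sign=False)
X_friends = hasher.transform(friends_lists)   # sparse, order-independent, fixed width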
Using a word2vec representation (as feature values) and then doing supervised classification can also be a good idea.
I have a Poisson distribution that looks similar to the one below:
https://i0.wp.com/www.real-statistics.com/wp-content/uploads/2012/11/poisson-distribution-chart.png
I've been asked to find the mean, and then the three logical groups above and below the mean for a total of seven groups.
Were this a normal distribution where the min was 0, max was 12 and mean was 6, the logical groups might be:
-3: 1
-2: 2.666
-1: 4.333
0: 6
1: 7.666
2: 9.333
3: 11
But with a Poisson distribution (such as the image above), I would expect it to be more like:
-3: 0.625
-2: 1.25
-1: 1.875
0: 2.5
1: 4.25
2: 6.5
3: 10
Is there a faster way of looking for where these points would be than eyeballing it? I need to do this with more than a hundred histograms...
I apologize if I have the language wrong; this is my first time doing something like this.
Imagine that you need 7 bins that store the values you need.
For a Poisson distribution, the mean is the lambda itself, which in your case is 3. So bin[3] = 3.
Consider the formula:
bins = []
for n in range(groups):                       # typically n runs from 0 to groups - 1
    bins.append(lo + (hi - lo) * n / groups)  # lo and hi stand in for min and max
Now you need 2 different ranges:
n = 0 to 2, min = 0, max = 3, range = (3 - 0) = 3, groups = 3
n = 4 to 6, min = 3, max = 12, range = (12 - 3) = 9, groups = 3
You may apply the values in above formula to get your bins.
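A quick sketch of that computation as I read it, with three evenly spaced bins below the mean and three above (the mean of 3 and the maximum of 12 are the values assumed in this answer):

# Sketch of the two-range binning: lo..mean split into `groups` bins, then the
# mean itself, then mean..hi split into `groups` bins.
def two_range_bins(lo, mean, hi, groups=3):
    below = [lo + (mean - lo) * n / groups for n in range(groups)]
    above = [mean + (hi - mean) * n / groups for n in range(1, groups + 1)]
    return below + [mean] + above

two_range_bins(0, 3, 12)   # -> [0.0, 1.0, 2.0, 3, 6.0, 9.0, 12.0]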
HTH. My memory is a little out of practice, but I think the general idea is correct.
Edit: this might not work for a Poisson distribution. Poisson is a discrete distribution, while my solution works only for continuous distributions. I will leave my answer here anyway.
I made a survey where users could vote on a subject. They were allowed to yay it (+1), nay it (-1), or say they don't care (0).
I only have the aggregate results in Google Sheets like
yay nay dontcare
Option A: 32 14 23
Option B: 12 37 20
Option C: 40 17 12
Option D: 64 3 2
The number of votes are always the same on every option.
Now I need to find out how controversial the answers are. I thought about STDEVP, but I do not have a list of cells, just the aggregates.
How do I find the standard deviation here with Google Sheets?
Assuming you ignore the don't-cares, you can just take the prevalence p of yays and use sd = sqrt(p(1 - p)),
so if yays are in column B and nays in column C, you use
=SQRT(B2/SUM(B2:C2) * (C2/SUM(B2:C2)))
Note that this is the standard deviation for a population.
If you want to include them, you can calculate the mean in E2 with
=SUMPRODUCT(B2:D2, {1, -1, 0}) / SUM(B2:D2)
Then you can calculate variance like this in F2
=SUMPRODUCT(ArrayFormula({1, -1, 0}-E2)^2, B2:D2) / (SUM(B2:D2)-1)
which just takes every 1, -1, or 0, subtracts the mean, squares this deviation, and averages over n - 1 degrees of freedom (for a sample; leave the -1 out if you assume you have the population).
The Standard deviation is
=SQRT(F2)
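If you want to sanity-check the spreadsheet, the same weighted mean, sample variance, and standard deviation can be computed from the aggregates in a few lines of Python (using the Option A counts from the question):

# Sanity check for Option A: 32 yays (+1), 14 nays (-1), 23 don't cares (0).
counts = {1: 32, -1: 14, 0: 23}
n = sum(counts.values())

mean = sum(value * c for value, c in counts.items()) / n
variance = sum(c * (value - mean) ** 2 for value, c in counts.items()) / (n - 1)  # sample
sd = variance ** 0.5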
I am trying to compute the similarity between n entities that are being described by entity_id, type_of_order, total_value.
An example of the data might look like:
NR entity_id type_of_order total_value
1 1 A 10
2 1 B 90
3 1 C 70
4 2 B 20
5 2 C 40
6 3 A 10
7 3 B 50
8 3 C 20
9 4 B 50
10 4 C 80
My question would be: what is a good way of measuring the similarity between entity_id 1 and 2, for example, with regard to the type_of_order and the total_value for that type of order?
Would a simple KNN give satisfactory results or should I consider other algorithms?
Any suggestion would be much appreciated.
The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.
You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...
relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else -- even more than 1+c in some applications.
Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.
relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
For example, two-digit numbers have 4 combinations: 11, 12, 21, 22. Three-digit numbers have 8 combinations: 111, 112, ..., 222.
How do I get the number of combinations for numbers that have 4, 5, ..., 10 or more digits?
Thanks
P.S. This refers to Delphi :)
The answer is 2^N, where N is the number of digits.
This is a purely mathematical problem, and concerns very basic combinatorics. It is easy to see why 2^N is the right answer. Indeed, there are two ways to choose the first digit. For each such choice, there are two ways to choose the second digit. Hence, there are 2 × 2 ways to choose a two-digit number. For each such number, there are two ways to add a third digit, making 2 × 2 × 2 ways to construct a three-digit number. Hence, there are
2 × 2 × ... × 2 = 2^N
ways to construct an N-digit number.
In Delphi, you compute 2^N with Power(2, N) (uses Math). [A less naïve way, which works for N < 31, is 1 shl N.]
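As a quick brute-force check of the 2^N count, here is a small enumeration over the digits {1, 2} (sketched in Python rather than Delphi for brevity):

# Enumerate all N-digit strings over the digits {1, 2} and compare with 2^N.
from itertools import product

for n in range(1, 6):
    combos = list(product('12', repeat=n))
    print(n, len(combos), 2 ** n)   # the two counts agree: 2, 4, 8, 16, 32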