How to deal with array of string features in traditional machine learning?

How to deal with array of string features in traditional machine learning? - machine-learning

Problem
Let's say we have a dataframe that looks like this:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
...
If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?
Age is pretty straightforward since it is already numerical.
Job can be hashed / indexed since it is a categorical variable.
Friends is a list of categorical variables. How would I go about representing this feature?
Approaches:
Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:
NULL -> 0
engineer -> 42069
World of Warcraft -> 9001
Netflix -> 14
9gag -> 9
manager -> 250
Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right hand side. If friends size is larger than 5, then the first 5 elements are selected.
Approach 1: Hash and Stack
dataframe after feature transformation would look like this:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
Limitations
Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
...
Compare the features of the first and third record:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 14, 9, 9001, 0] 1
Both records have the same set of friends, but are ordered differently resulting in a different feature hashing even though they should be the same.
Approach 2: Hash, Order, and Stack
To solve the limitation of Approach 1, simply order the hashes from the friends feature. This would result in the following feature transform (assuming descending order):
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 9001, 14, 9, 0, 0] 1
This approach has a limitation too. Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
42 'manager' ['Netflix', '9gag'] 1
...
Applying feature transform with ordering we get:
row feature label
1 [23, 42069, 9001, 14, 9, 0, 0] 1
2 [35, 250, 0, 0, 0, 0, 0] 0
3 [26, 42069, 9001, 14, 9, 0, 0] 1
4 [44, 250, 14, 9, 0, 0, 0] 1
What is the problem with the above features? Well, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array but not in row 4. This would mess up with the training.
Approach 3: Convert Array to Columns
What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?
Well, let's assume the friends vocabulary size is large (>100k). It would then be madness to go and create >100k columns where each column is responsible for the hash of the respective vocab element.
Approach 4: One-Hot-Encoding and then Sum
How about this? Convert each hash to one-hot-vector, and add up all these vectors.
In this case, the feature in row one for example would look like this:
[23, 42069, 01x8, 1, 01x4, 1, 01x8986, 1, 01x(max_hash_size-8987)]
Where 01x8 denotes a row of 8 zeros.
The problem with this approach is that these vectors will be very huge and sparse.
Approach 5: Use Embedding Layer and 1D-Conv
With this approach, we feed each word in the friends array to the embedding layer, then convolve. Similar to the Keras IMDB example: https://keras.io/examples/imdb_cnn/
Limitation: requires using deep learning frameworks. I want something which works with traditional machine learning. I want to do logistic regression or decision tree.
What are your thoughts on this?

As another answer mentioned, you've already listed a number of alternatives that could work, depending on the dataset and the model and such.
For what it's worth, a typical logistic regression model that I've encountered would use Approach 3, and convert each of your friends strings into a binary feature. If you're opposed to having 100k features, you could treat these features like a bag-of-words model and discard the stopwords (very common features).
I'll also throw a hashing variant into the mix:
Bloom Filter
You could store the strings in question in a bloom filter for each training example, and use the bits of the bloom filter as a feature in your logistic regression model. This is basically a hashing solution like you've already mentioned, but it takes care of some of the indexing/sorting issues, and provides a more principled tradeoff between sparsity and feature uniqueness.

First, there is no definitive answer to this problem, you presented 5 alternatives, and the five are valid, it all depends on the dataset you are using.
Considering this, I will list the options that I find most advantageous. For me option 5 is the best, but as in your case you want to use traditional machine learning techniques, I will discard it. So I would go for option 4, but in this case I need to know if you have the hardware to deal with this problem, if the answer is yes, I would go with this option, considering the answer is no, i would try approach 2, as you pointed out, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array, but not in row 4, but that won't be a problem if you have enough data for training ( again, it all depends on the data available ), even if I have some problems with this approach, I would apply a Data Augmentation technique before discarding it.
Option 1 seems to me the worst, in it you have a great chance of overfitting and certainly a use of a lot of computational resources.
Hope this helps!

Approach 1 (Hash and Stack) and 2 (Hash, Order, and Stack) resolve their limitations if the result of the hashing function is considered as the index of a sparse vector with values of 1 instead of the values of each position of the vector.
Then, whenever "World of Warcraft" is in friends array, the feature vector will have a value of 1 in position 9001, regardless of the position of "World of Warcraft" in friends array (limitation of approach 1) and regardless of the existence of other elements in friends array (limitation of approach 2). If "World of Warcraft" is not in friends array, then the value of features vector in position 9001 will most likely be 0 (look up hashing trick collisions to learn more).

Using word2vec representation (as a feature value), then do a supervised classification also can be a good idea.

Related

Error: "DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")" using Flux in Julia

i'm still new in Julia and in machine learning in general, but I'm quite eager to learn. In the current project i'm working on I have a problem about dimensions mismatch, and can't figure what to do.
I have two arrays as follow:
x_array:
9-element Array{Array{Int64,N} where N,1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73]
[11, 12, 13, 14, 15, 16, 17, 72, 73]
[18, 12, 19, 20, 21, 22, 72, 74]
[23, 24, 12, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 72, 74]
[36, 37, 38, 39, 40, 38, 41, 42, 72, 73]
[43, 44, 45, 46, 47, 48, 72, 74]
[49, 50, 51, 52, 14, 53, 72, 74]
[54, 55, 41, 56, 57, 58, 59, 60, 61, 62, 63, 62, 64, 72, 74]
[65, 66, 67, 68, 32, 69, 70, 71, 72, 74]
y_array:
9-element Array{Int64,1}
75
76
77
78
79
80
81
82
83
and the next model using Flux:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
softmax
)
I zip both arrays, and then feed them into the model using Flux.train!
data = zip(x_array, y_array)
Flux.train!(loss, Flux.params(model), data, opt)
and immediately throws the next error:
ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")
Now, I know that the first dimension of matrix A is the sum of the hidden layers (256 + 256 + 128 + 128 + 128 + 128) and the second dimension is the input layer, which is 10. The first thing I did was change the 10 for a 9, but then it only throws the error:
ERROR: DimensionMismatch("dimensions must match")
Can someone explain to me what dimensions are the ones that mismatch, and how to make them match?

Introduction
First off, you should know that from an architectural standpoint, you are asking something very difficult from your network; softmax re-normalizes outputs to be between 0 and 1 (weighted like a probability distribution), which means that asking your network to output values like 77 to match y will be impossible. That's not what is causing the dimension mismatch, but it's something to be aware of. I'm going to drop the softmax() at the end to give the network a fighting chance, especially since it's not what's causing the problem.
Debugging shape mismatches
Let's walk through what actually happens inside of Flux.train!(). The definition is actually surprisingly simple. Ignoring everything that doesn't matter to us, we are left with:
for d in data
gs = gradient(ps) do
loss(d...)
end
end
Therefore, let's start by pulling the first element out of your data, and splatting it into your loss function. You didn't specify your loss function or optimizer in the question. Although softmax usually means you should use crossentropy loss, your y values are very much not probabilities, and so if we drop the softmax we can just use the dead-simple mse() loss. For optimizer, we'll default to good old ADAM:
model = Chain(
LSTM(10, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
#softmax, # commented out for now
)
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM(0.001)
data = zip(x_array, y_array)
Now, to simulate the first run of Flux.train!(), we take first(data) and splat that into loss():
loss(first(data)...)
This gives us the error message you've seen before; ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 12"). Looking at our data, we see that yes, indeed, the first element of our dataset has a length of 12. And so we will change our model to instead expect 12 values instead of 10:
model = Chain(
LSTM(12, 256),
LSTM(256, 128),
LSTM(128, 128),
Dense(128, 9),
)
And now we re-run:
julia> loss(first(data)...)
50595.52542674723 (tracked)
Huzzah! It worked! We can run this again:
julia> loss(first(data)...)
50578.01417593167 (tracked)
The value changes because the RNN holds memory within itself which gets updated each time we run the network, otherwise we would expect the network to give the same answer for the same inputs!
The problem comes, however, when we try to run the second training instance through our network:
julia> loss([d for d in data][2]...)
ERROR: DimensionMismatch("matrix A has dimensions (1024,12), vector B has length 9")
Understanding LSTMs
This is where we run into Machine Learning problems more than programming problems; the issue here is that we have promised to feed that first LSTM network a vector of length 10 (well, 12 now) and we are breaking that promise. This is a general rule of deep learning; you always have to obey the contracts you sign about the shape of the tensors that are flowing through your model.
Now, the reasons you're using LSTMs at all is probably because you want to feed in ragged data, chew it up, then do something with the result. Maybe you're processing sentences, which are all of variable length, and you want to do sentiment analysis, or somesuch. The beauty of recurrent architectures like LSTMs is that they are able to carry information from one execution to another, and they are therefore able to build up an internal representation of a sequence when applied upon one time point after another.
When building an LSTM layer in Flux, you are therefore declaring not the length of the sequence you will feed in, but rather the dimensionality of each time point; imagine if you had an accelerometer reading that was 1000 points long and gave you X, Y, Z values at each time point; to read that in, you would create an LSTM that takes in a dimensionality of 3, then feed it 1000 times.
Writing our own training loop
I find it very instructive to write our own training loop and model execution function so that we have full control over everything. When dealing with time series, it's often easy to get confused about how to call LSTMs and Dense layers and whatnot, so I offer these simple rules of thumb:
When mapping from one time series to another (E.g. constantly predict future motion from previous motion), you can use a single Chain and call it in a loop; for every input time point, you output another.
When mapping from a time series to a single "output" (E.g. reduce sentence to "happy sentiment" or "sad sentiment") you must first chomp all the data up and reduce it to a fixed size; you feed many things in, but at the end, only one comes out.
We're going to re-architect our model into two pieces; first the recurrent "pacman" section, where we chomp up a variable-length time sequence into an internal state vector of pre-determined length, then a feed-forward section that takes that internal state vector and reduces it down to a single output:
pacman = Chain(
LSTM(1, 128), # map from timepoint size 1 to 128
LSTM(128, 256), # blow it up even larger to 256
LSTM(256, 128), # bottleneck back down to 128
)
reducer = Chain(
Dense(128, 9),
#softmax, # keep this commented out for now
)
The reason we split it up into two pieces like this is because the problem statement wants us to reduce a variable-length input series to a single number; we're in the second bullet point above. So our code naturally must take this into account; we will write our loss(x, y) function to, instead of calling model(x), it will instead do the pacman dance, then call the reducer on the output. Note that we also must reset!() the RNN state so that the internal state is cleared for each independent training example:
function loss(x, y)
# Reset internal RNN state so that it doesn't "carry over" from
# the previous invocation of `loss()`.
Flux.reset!(pacman)
# Iterate over every timepoint in `x`
for x_t in x
y_hat = pacman(x_t)
end
# Take the very last output from the recurrent section, reduce it
y_hat = reducer(y_hat)
# Calculate reduced output difference against `y`
return Flux.mse(y_hat, y)
end
Feeding this into Flux.train!() actually trains, albeit not very well. ;)
Final observations
Although your data is all Int64's, it's pretty typical to use floating point numbers with everything except embeddings (an embedding is a way to take non-numeric data such as characters or words and assign numbers to them, kind of like ASCII); if you're dealing with text, you're almost certainly going to be working with some kind of embedding, and that embedding will dictate what the dimensionality of your first LSTM is, whereupon your inputs will all be "one-hot" encoded.
softmax is used when you want to predict probabilities; it's going to ensure that for each input, the outputs are all between [0...1] and moreover that they sum to 1.0, like a good little probability distribution should. This is most useful when doing classification, when you want to wrangle your wild network output values of [-2, 5, 0.101] into something where you can say "we have 99.1% certainty that the second class is correct, and 0.7% certainty it's the third class."
When training these networks, you're often going to want to batch multiple time series at once through your network for hardware efficiency reasons; this is both simple and complex, because on one hand it just means that instead of passing a single Sx1 vector through (where S is the size of your embedding) you're instead going to be passing through an SxN matrix, but it also means that the number of timesteps of everything within your batch must match (because the SxN must remain the same across all timesteps, so if one time series ends before any of the others in your batch you can't just drop it and thereby reduce N halfway through a batch). So what most people do is pad their timeseries all to the same length.
Good luck in your ML journey!

How does correlation work for an even-sized filter in this example?

(I know a question like this exists, but I wanted help with a specific example)
If the linear filter has even dimensions, how is the "center" defined? i.e. in the following scenario:
filter = np.array([[a, b],
[c, d]])
and the image was:
image = np.array([[0, 1, 0],
[1, 1, 1],
[0, 1, 0]])
what would be the result of correlation of the image with the linear filter?

Which of the elements of the even-sized filter is considered the origin is an arbitrary choice. Each implementation will make a different choice. Though a and d are the two most likely choices for reasons of similarity of the two image dimensions.
For example, MATLAB's imfilter (which implements correlation, not convolution) does the following:
f = [1,2;4,8];
img = [0,1,0;1,1,1;0,1,0];
imfilter(img,f,'same')
ans =
14 13 4
11 7 1
2 1 0
meaning that a is the origin of the kernel in this case. Other implementations might make a different choice.

LSTM, pattern and noise gap

I want to find sequence patterns in a time series with random noise gap.
For example, this is the pattern I wan to find:
1, 2, 3, 4
But, my samples are:
*1*, 10, *2*, *3*, 11, 12, *4*
*1*, *2*, 10, 14, 15, *3*, 10, 13, *4*
10, *1*, 10, 10, 10, *2*, 11, 12, *3*, *4*
I don't know that the "good" elements are 1, 2, 3 and 4.
I started with a LSTM decoder, but "the noise" hide the good elements. For example, with the 3 samples, I get:
*1*, 10, 13, 10, ...
and 2, 3 and 4 are hidden
Have you an idea to find those patterns ?
Thanks.
Frédéric

As a starting point you can use a sequence-to-sequence (seq2seq) model. The linked repo has nice explanation how these models work and what type of problems they cover. The crucial point would be how to encode your sequence. Often they are encoded as one-hot vectors. So if you have a fixed upper bound on the number of distinct numbers/items in your sequence you can use it.
Instead of generating a new sequence without noise from the original one, you can also try to classify each point as noise or not and eliminate those classified as noise as your output. Something along the lines of:
seq = Input(shape=(timesteps, features))
hidden = LSTM(HIDDEN_UNITS, return_sequences=True)(seq)
out = TimeDistributed(Dense(1, activation='sigmoid'))(hidden)
You will have to know before hand whether each data point is noise or not.

How to convert a single word into vector for neural network input

I am new to ML and thinking about one simple hello world problem as a practice of using ANN. Here is the problem, say I have a training data set with English name and its corresponding gender:
ALEX,M
BONNIE,F
CARLO,F
DAVID,M
EDDY,M
...
I would like to build a model to predict the gender of name. Since the input to ANN must be a form of vector, I am thinking to convert a name into a vector with a number of features same as the longest name in the data set (i.e. 10) and then put A=1, B=2, ... , Z=26, and null=-1 to the vector.
For example:
ALEX will be [1, 12, 5, 24, -1, -1, -1, -1, -1, -1]
The output layer will be {0, 1} which represent either male or female.
It sounds quite strange to me. Is it a good way to feed a single word into ANN like this?

Use one-hot encoding. This means you have a large input size, but it has proven to work. Using A=1, B=2, Z=26 gives the network the impression that B is closer to A than Z is to A, and it requires a lot of hidden nodes to map the function.
With one-hot encoding:
A: [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
B: [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
Z: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1]
none: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
So it requires 26 inputs per letter.

How to predict users' preferences using item similarity?

I am thinking if I can predict if a user will like an item or not, given the similarities between items and the user's rating on items.
I know the equation in collaborative filtering item-based recommendation, the predicted rating is decided by the overall rating and similarities between items.
The equation is:
http://latex.codecogs.com/gif.latex?r_{u%2Ci}%20%3D%20\bar{r_{i}}%20&plus;%20\frac{\sum%20S_{i%2Cj}%28r_{u%2Cj}-\bar{r_{j}}%29}{\sum%20S_{i%2Cj}}
My question is,
If I got the similarities using other approaches (e.g. content-based approach), can I still use this equation?
Besides, for each user, I only have a list of the user's favourite items, not the actual value of ratings.
In this case, the rating of user u to item j and average rating of item j is missing. Is there any better ways or equations to solve this problem?
Another problem is, I wrote a python code to test the above equation, the code is
mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print mat
def prediction(u, i):
target = mat[u,i]
r = numpy.mean(mat[:,i])
a = 0.0
b = 0.0
for j in range(5):
if j != i:
simi = 1 - spatial.distance.cosine(mat[:,i], mat[:,j])
dert = mat[u,j] - numpy.mean(mat[:,j])
a += simi * dert
b += simi
return r + a / b
for u in range(4):
lst = []
for i in range(5):
lst.append(str(round(prediction(u, i), 2)))
print " ".join(lst)
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the second one is the predicted values, they looks not close, anything wrong here?

Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.

What you are referring to is a typical situation of implicit ratings (i.e. users do not give explicit ratings to items, let's say you just have likes and dislikes).
As for the approches you can use Neighbourhood models or latent factor models.
I will suggest you to read this paper that proposes a well known machine-learning based solution to the problem.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart