Many-state nominal variables modelling - machine-learning

I was reading about neural networks and found this:
"Many-state nominal variables are more difficult to handle. ST Neural Networks has facilities to convert both two-state and many-state nominal variables for use in the neural network. Unfortunately, a nominal variable with a large number of states would require a prohibitive number of numeric variables for one-of-N encoding, driving up the network size and making training difficult. In such a case it is possible (although unsatisfactory) to model the nominal variable using a single numeric index; a better approach is to look for a different way to represent the information."
This is exactly what is happening when I am building my input layer. One-of-N encoding makes the model very complex to design. However, the passage above mentions that you can use a numeric index, and I am not sure what the author means by that. What is a better approach to represent the information? Can neural networks solve a problem with many-state nominal variables?
References:
http://www.uta.edu/faculty/sawasthi/Statistics/stneunet.html#gathering

Solving this task is very often crucial for modelling. Depending on the complexity of the nominal variable's distribution, it is often truly important to find a proper embedding between its values and R^n for some n.
One of the most successful examples of such an embedding is word2vec, where a mapping between words and vectors is learned. In other cases you should either use a ready-made solution, if one exists, or prepare your own by representation learning (e.g. with autoencoders or RBMs).
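For example, here is a minimal sketch (assuming PyTorch; the number of states and the embedding dimension are arbitrary choices, not from the question) of replacing one-of-N coding with a learned embedding layer that maps a many-state nominal variable into R^n:

```python
# Minimal sketch (assuming PyTorch): a learned embedding instead of one-of-N coding.
import torch
import torch.nn as nn

num_states = 10_000      # e.g. a nominal variable with 10k distinct values
embedding_dim = 16       # each state is mapped to a point in R^16

embed = nn.Embedding(num_states, embedding_dim)

# Each row stores the nominal value as an integer index, not a 10k-wide
# one-hot vector; the embedding weights are learned jointly with the rest
# of the network.
state_indices = torch.tensor([3, 42, 9_999])
dense = embed(state_indices)
print(dense.shape)       # torch.Size([3, 16])
```

The index-to-vector table is trained like any other layer, so states that behave similarly with respect to the target end up close together in R^n.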

Related

Machine learning: do unbalanced non-numeric variable classes matter?

If I have a non-numeric variable in my data set that contains many of one class but few of another, does this cause the same issues as when the target classes are unbalanced?
For example, suppose one of my variables is title and the aim is to identify whether a person is obese. The obese class is split 50:50 in the data, but there is only one row with the title 'Duke', and that row is in the obese class. Does this mean that an algorithm like logistic regression (after numeric encoding) would start predicting that all Dukes are obese (or give a disproportionate weighting to the title 'Duke')? If so, are some algorithms better or worse at handling this case? Is there a way to prevent this issue?
Yes, any vanilla machine learning algorithm will treat categorical data the same way as numerical data in terms of the information entropy contributed by a specific feature.
Consider this: before applying any machine learning algorithm, you should analyze your input features and identify how much of the variance in the target each one explains. In your case, if the label Duke always gets identified as obese, then for that specific dataset it is an extremely high-information feature and will be weighted as such.
I would mitigate this issue by adding a weight to that feature, thus minimizing the impact it has on the target. However, this would be a shame if it is an otherwise very informative feature for other instances.
An algorithm which could easily circumvent this problem is a random forest (decision trees): you can eliminate any rule that is based on this feature being Duke.
Be very careful about mapping this feature to numbers, as this will affect the importance most algorithms attribute to it.
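As an illustration, here is a minimal sketch (assuming scikit-learn and pandas; the data are invented) of how a one-off category such as title == 'Duke' can pick up a large logistic-regression weight, and how stronger L2 regularization (smaller C) damps it:

```python
# Minimal sketch (invented data): weight learned for a category seen only once.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
titles = ["Mr"] * 100 + ["Ms"] * 99 + ["Duke"]   # a single 'Duke' row
obese = rng.integers(0, 2, size=n)               # roughly balanced target
obese[-1] = 1                                    # the Duke happens to be obese

X = pd.get_dummies(pd.DataFrame({"title": titles}))
for C in (100.0, 1.0, 0.01):                     # weaker -> stronger regularization
    model = LogisticRegression(C=C).fit(X, obese)
    duke_w = model.coef_[0][list(X.columns).index("title_Duke")]
    print(f"C={C:>6}: weight on title_Duke = {duke_w:.3f}")
```

With weak regularization the single Duke row can dominate its own indicator column; shrinking C pulls that weight back toward zero, which is one concrete way to keep a rare level from being treated as a hard rule.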

How to deal with Qualitative Data in machine learning algorithms

Suppose I'm trying to use a neural network to predict how long my run will take. I have a lot of data from past runs. How many miles I plan on running, the total change in elevation (hills), the temperature, and the weather: sunny, overcast, raining, or snowing.
I'm confused about what to do with the last piece of data. Everything else I can input normally after standardizing, but I can't do that for the weather. My initial thought was just to have 4 extra variables, one for each type of weather, and input a 1 or a 0 depending on what it is.
Is this a good approach to the situation? Are there other approaches I should try?
You have a categorical variable that has four levels.
A very typical way of encoding such values is to use a separate indicator variable for each one. Or, more commonly, "n-1" coding, where one fewer flag is used (the fourth value is represented by all of them being 0).
n-1 coding is used for techniques that require numeric inputs, including logistic regression and neural networks. For large values of "n" it is a bad choice: it creates many sparse inputs, and sparse data is highly correlated. More inputs mean more degrees of freedom in the network, making the network harder to train.
In your case, you only have four values for this particular input, so splitting it into three variables is probably reasonable.
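A minimal sketch (assuming pandas; the toy runs are invented) of that n-1 "dummy" coding for the four weather levels might look like this:

```python
# Minimal sketch (invented data): n-1 dummy coding of the weather variable.
import pandas as pd

runs = pd.DataFrame({
    "miles": [3.1, 6.2, 10.0],
    "elevation_gain": [50, 200, 120],
    "temperature": [60, 45, 30],
    "weather": ["sunny", "overcast", "snowing"],
})

# drop_first=True keeps n-1 indicator columns; the dropped level is
# represented by all indicators being 0.
encoded = pd.get_dummies(runs, columns=["weather"], drop_first=True)
print(encoded.columns.tolist())
```

The numeric columns pass through unchanged and can still be standardized afterwards; only the weather column is expanded into indicators.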

Machine learning with a variable-sized real vector of inputs?

I have a collection of objects with properties that I measure. For each object, I obtain a vector of real numbers describing that object. The vector is always incomplete: there are usually numbers missing from the beginning or end of what would be the complete vector, and sometimes there is information missing in the middle. Hence, each object results in a vector of a different length. I also measure, say, the mass of each object, and I now want to relate the vector of things I've measured to the mass.
It's common in my field (astrophysics) to extract features from this vector of real numbers, e.g. take an average or some linear combinations of the values; and then use those extracted features to infer the mass (or whatever) using for example neural networks. It was recently shown, however, that a very complex combination of the elements of the vector result in a much better model of the mass.
There are still residuals in this model, however, even when working on simulated data. Presumably then there is a better way out there to manipulate these variable-length vectors in order to get a better model.
I am wondering if it is possible to do machine learning with real-valued input vectors of all different lengths. I know for text mining there are things like the bag-of-words approach, but it is unclear how such a method would work on real-valued vectors. I know recurrent neural networks work on sentences of variable length, but I'm not sure they work for real-valued vectors. I have also considered imputing the missing data; however, sometimes it is missing for physical reasons, i.e. a value in such-and-such place cannot exist, and so imputing it would violate the physicality of the situation.
Is there any research in this area?
Recurrent Neural Networks (RNNs) are capable of taking a variable-sized input vector of length n and producing a variable sized output vector of length m.
There are many ways to make RNNs work. The most common cell types are called Long short-term memory (LSTM) and Gated Recurrent Unit (GRU).
You might want to read:
The Unreasonable Effectiveness of Recurrent Neural Networks: nice for getting an idea of what RNNs are capable of, especially character predictors. It is easy to read, but not exactly what you're searching for.
Understanding LSTM Networks: More technical; very well written
Sepp Hochreiter, Jurgen Schmidhuber: LONG SHORT-TERM MEMORY
RNNs in TensorFlow
However, training RNNs takes a lot of training data. You might be better off computing a fixed-size feature vector from each sequence instead. But you never know until you try it ;-)
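To make the RNN route concrete, here is a minimal sketch (assuming PyTorch; the architecture and names are arbitrary, not a recommendation) of an LSTM that maps variable-length real-valued sequences to a single scalar target such as a mass:

```python
# Minimal sketch (assuming PyTorch): regress a scalar from variable-length
# real-valued sequences using an LSTM and packed padding.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_sequence

class SequenceRegressor(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, padded, lengths):
        # padded: (batch, max_len, 1); lengths: true length of each sequence
        packed = pack_padded_sequence(padded, lengths, batch_first=True,
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)         # final hidden state per sequence
        return self.head(h_n[-1]).squeeze(-1)   # one prediction per sequence

# Toy batch: three measurements of different lengths, zero-padded to equal length.
seqs = [torch.randn(5, 1), torch.randn(3, 1), torch.randn(7, 1)]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)
model = SequenceRegressor()
print(model(padded, lengths).shape)             # torch.Size([3])
```

Packing ensures the padding positions do not influence the final hidden state, so the network genuinely consumes sequences of different lengths rather than imputed values.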

Nominal valued dataset in machine learning

What's the best way to use nominal values, as opposed to real or boolean ones, in a subset of a feature vector for machine learning?
Should I map each nominal value to a real value?
For example, if I want my program to learn a predictive model for a web service's users, the input features may include
{ gender(boolean), age(real), job(nominal) }
where the dependent variable may be the number of web-site logins.
The variable job may be one of
{ PROGRAMMER, ARTIST, CIVIL SERVANT... }.
Should I map PROGRAMMER to 0, ARTIST to 1, and so on?
Do a one-hot encoding, if anything.
If your data has categorical attributes, it is recommended to use an algorithm that can deal with such data well without the hack of encoding, e.g. decision trees and random forests.
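For instance, a minimal sketch (assuming scikit-learn and pandas; the column and level names come from the question, the rows are invented) of one-hot encoding the job column:

```python
# Minimal sketch (invented rows): one-hot encoding of the nominal "job" feature.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "gender": [0, 1, 0],
    "age": [25.0, 41.0, 33.0],
    "job": ["PROGRAMMER", "ARTIST", "CIVIL SERVANT"],
})

# One indicator column per job level; job values unseen at prediction time
# are encoded as all zeros because of handle_unknown="ignore".
enc = OneHotEncoder(handle_unknown="ignore")
job_onehot = enc.fit_transform(df[["job"]]).toarray()
print(enc.categories_)    # the learned set of job levels
print(job_onehot.shape)   # (3, 3)
```

This avoids imposing an arbitrary order (PROGRAMMER = 0 < ARTIST = 1 < ...) that a purely numeric index would introduce.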
If you read the book "Machine Learning with Spark", the author wrote:
Categorical features
Categorical features cannot be used as input in their raw form, as they are not numbers; instead, they are members of a set of possible values that the variable can take. In the example mentioned earlier, user occupation is a categorical variable that can take the value of student, programmer, and so on.
:
To transform categorical variables into a numerical representation, we can use a common approach known as 1-of-k encoding. An approach such as 1-of-k encoding is required to represent nominal variables in a way that makes sense for machine learning tasks. Ordinal variables might be used in their raw form but are often encoded in the same way as nominal variables.
:
I had exactly the same thought.
I think that if there is a meaningful (well-designed) transformation function that maps categorical (nominal) values to real values, I may also use learning algorithms that only take numerical vectors.
I have actually done some projects that way, and no issue was raised concerning the performance of the learning system.

extrapolation with recurrent neural network

I wrote a simple recurrent neural network (7 neurons, each initially connected to all the neurons) and trained it using a genetic algorithm to learn "complicated", non-linear functions like 1/(1+x^2). As the training set, I used 20 values within the range [-5,5] (I tried using more than 20, but the results did not change dramatically).
The network can learn this range pretty well, and when given other points within this range it can predict the value of the function. However, it cannot extrapolate correctly and predict the values of the function outside the range [-5,5]. What are the reasons for that, and what can I do to improve its extrapolation abilities?
Thanks!
Neural networks are not extrapolation methods (recurrent or not); this is completely outside their capabilities. They are used to fit a function to the provided data, and they are completely free to build any model outside the subspace populated by the training points. So, in a loose sense, one should think of them as an interpolation method.
To make things clear, a neural network should be capable of generalizing the function inside the subspace spanned by the training samples, but not outside of it.
A neural network is trained only to be consistent with the training samples, while extrapolation is something completely different. A simple example from "H. Lohninger: Teach/Me Data Analysis, Springer-Verlag, Berlin-New York-Tokyo, 1999. ISBN 3-540-14743-8" shows how NNs behave in this context.
All of these networks are consistent with the training data, but can do anything outside of this subspace.
You should rather reconsider your problem's formulation, and if it can be expressed as a regression or classification problem then you can use NN, otherwise you should think about some completely different approach.
The only things which can be done to somehow "correct" what happens outside the training set are to:
add artificial training points in the desired subspace (but this simply grows the training set, and again, outside of this new set the network's behaviour is "random")
add strong regularization, which will force the network to create a very simple model; but a model's complexity does not guarantee any extrapolation strength, as two models of exactly the same complexity can have, for example, completely different limits at +/- infinity.
Combining the above two steps can help build a model which "extrapolates" to some extent, but, as stated before, this is not the purpose of a neural network.
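To see the point in practice, here is a minimal sketch (assuming scikit-learn; the feed-forward architecture and seed are arbitrary stand-ins for the original recurrent network) that fits a small network to 1/(1+x^2) on 20 points in [-5, 5] and then queries it outside that range; inside the range the fit is typically good, while outside it the predictions are essentially arbitrary:

```python
# Minimal sketch: a small network interpolates 1/(1+x^2) on [-5, 5]
# but its behaviour outside the training range is unconstrained.
import numpy as np
from sklearn.neural_network import MLPRegressor

def f(x):
    return 1.0 / (1.0 + x ** 2)

# 20 training points inside [-5, 5], as in the question.
x_train = np.linspace(-5, 5, 20).reshape(-1, 1)
model = MLPRegressor(hidden_layer_sizes=(20, 20), solver="lbfgs",
                     max_iter=5000, random_state=0)
model.fit(x_train, f(x_train).ravel())

for x in (0.0, 3.0, 8.0, 20.0):   # first two inside the range, last two outside
    pred = model.predict([[x]])[0]
    print(f"x={x:>5}: true={f(x):.4f}  predicted={pred:.4f}")
```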
As far as I know this is only possible with networks which have the echo property. See Echo State Networks on scholarpedia.org.
These networks are designed for arbitrary signal learning and are capable of remembering their behavior.
You can also take a look at this tutorial.
The nature of your post(s) suggests that what you're referring to as "extrapolation" would be more accurately described as "sequence recognition and reproduction." Training networks to recognize a data sequence, with or without a time step (dt), is pretty much the purpose of a recurrent neural network (RNN).
The training function shown in your post has outputs bounded between 0 and 1, and since it depends on x^2, x effectively behaves like abs(x). So, first things first, be certain your input layer can easily distinguish between negative and positive inputs (if it must).
Next, the number of neurons is not nearly as important as how they're layered and interconnected. How many of the 7 were used for the sequence inputs? What type of network was used and how was it configured? Network feedback will reveal the ratios, proportions, relationships, etc. and aid in adjusting the network weights to match the sequence. Feedback can also take the form of a forward feed, depending on the type of network used to create the RNN.
Producing an 'observable' network for the decay function 1/(1+x^2) should be a decent exercise to cut your teeth on RNNs. 'Observable' meaning the network is capable of producing results for any input value(s), even though its training data is (far) smaller than the set of all possible inputs. I can only assume that this was your actual objective, as opposed to "extrapolation."